Scoring HTML Links

The purpose of scoring HTML links is to ensure that a Honeyclient browser agent visits popular links first. By popular, we mean more likely to be clicked on by a human websurfer (HWS). When a web page is visited, the agent extracts the links and assigns a score to each. The Honeyclient will visit the highest-scoring link immediately following the current URL, and so on in order of decreasing score. This is part of a larger effort to imbue Honeyclients with the capability to browse the web in a more human-like manner. We do this to impede attackers from identifying Honeyclient activity using server log analysis. If attackers are able to detect Honeyclient activity on their malicious servers, they could move or alter the malicious code in order to avoid detection. From a practical standpoint, it also makes sense to organize the link visitation queue in some user-defined manner. We call this organization link scoring.

  1. Explanation of Scoring Methodology
    1. Length of link text
    2. Image presence
    3. Image size
    4. Good word presence
    5. Bad word presence
  2. Detailed Discussion of Link Scoring Code
  3. Final Thoughts

Explanation of Scoring Methodology

The idea of web crawlers visiting links in an organized fashion is not new. One technique was outlined by Billy Hoffman in his paper on Covert Crawling. We used these ideas as a starting point for our methodology. The Honeyclient link scoring implementation is meant to be a generic framework around which an advanced user can build a customized scoring system.

The following sections describe the default behavior of the Honeyclient agent. Each HTML link contains two pieces of data. The first item is the text of the link itself. Second, each link has surrounding context. The Honeyclient::Agent::Driver::Browser module evaluates each link and its context based on the following criteria:

  • length of the link text
  • image presence
  • image size
  • good word presence
  • bad word presence

Each link begins with a score of zero.
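
Whatever criteria produce the scores, the visitation order itself is just a descending sort. The following is a minimal sketch, not the module's actual queue code: the %scores hash (URL mapped to score), the example URLs, and the visit_order helper are all hypothetical names chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical link scores: URL => score (values are illustrative only).
my %scores = (
    'http://example.com/news'    => 62,
    'http://example.com/privacy' => -6,
    'http://example.com/latest'  => 18,
);

# Order the visitation queue from highest to lowest score.
# Ties are broken alphabetically so the order is deterministic.
sub visit_order {
    my ($scores) = @_;
    return sort { $scores->{$b} <=> $scores->{$a} or $a cmp $b } keys %$scores;
}

my @queue = visit_order(\%scores);
print "$_ ($scores{$_})\n" for @queue;
```

With the sample values above, the /news link is visited first and the /privacy link last.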

Length of link text

If the length of the text which represents the link is within a specified range, neither too long nor too short, points are added to its score. The reasoning behind this rule is that a very short link is harder to see on the page. For example, consider the following hyperlink: z. This is an odd way to represent a link because it gets lost in the surrounding text, so we assume that an HWS is more likely to overlook it. At the other extreme, the following link is odd as well: This is the longest link you are going to see in a long, long time because the link text is an entire paragraph which is one very, very lengthy run-on sentence with duplicated and unnecessary words and really makes no sense at all if you are trying to figure out why someone might want to write a link in this manner -- plus it is very uncommon to see this type of thing on the actual World Wide Web (a.k.a. the Internet). We interpret such long links as unusual, and likely to be viewed as undesirable by an HWS. Together, these upper and lower bounds reflect the assumption that medium-length link text makes a link more popular.

Image presence

If there is an image associated with a link, the link is much more likely to be clicked by an HWS. We can detect the presence of an image by looking for an <img> tag within the HTML between the link's opening and closing tags. When such a tag is present, we add a significant number of points to the link score.

Image size

Not all images are created equal. Using similar reasoning as in the section on the length of the link text, we can say that smaller images are harder for an HWS to see than large images. Large images usually indicate links to large subsections of a website or to advertisements. There are many ways to calculate the size of the image and score it accordingly. Hoffman suggested calculating the area of the image using the height= and width= attributes and then scoring it in proportion to the size of a 1024 X 768 display.
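
The proportional-area idea can be sketched as a small function. This is an illustration only: image_area_score is a hypothetical helper name, and the choice to express the result as an integer percentage of the display is an assumption made for the example.

```perl
use strict;
use warnings;

# Reference display area used for proportional scoring.
my $DISPLAY_AREA = 1024 * 768;

# Returns the percentage of the reference display that an image covers,
# truncated to an integer. image_area_score is a hypothetical helper name.
sub image_area_score {
    my ($width, $height) = @_;
    return int( ($width * $height) / $DISPLAY_AREA * 100 );
}

printf "%d\n", image_area_score(512, 384);   # a quarter of the display: 25
printf "%d\n", image_area_score(88, 31);     # small button-sized image: 0
```

Note that truncating to an integer means very small images contribute nothing at all, which matches the intuition that they are hard for an HWS to see.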

Good word presence

Certain words serve as indicators of desirable links. For example, the word "new" is often an indicator of something that might attract an HWS to that link. Other (usually) good words are "news," "latest," "main," "update," "sell," "buy," and "free." For each good word contained in the text associated with the link (including the alt= text), the score is increased.

Bad word presence

Analogous to the idea that there are good words which cause a link to be more popular, there are also bad words which usually indicate unpopular content. For example, the legal disclaimer included at the bottom of most commercial web sites is rarely clicked by an HWS. Other examples of (usually) bad words are "privacy," "copyright," "about," and "jobs." For each bad word contained in the text associated with the link (including the alt= text), the score is decreased. Note that this makes it possible to have links with negative scores.
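
As a rough sketch of how good and bad words combine into a net adjustment, consider the helper below. The name word_score and its argument order are hypothetical, and the per-word point value is passed in as a parameter rather than fixed. Matching here is a case-insensitive substring test, which is an assumption worth noticing: the word "new" also matches inside "news," so a link reading "Latest News" is credited for three good words.

```perl
use strict;
use warnings;

# Net word score for a piece of link text: +$value per good word found,
# -$value per bad word found. Matching is a case-insensitive substring
# test, so "new" also matches inside "news".
sub word_score {
    my ($text, $good, $bad, $value) = @_;
    my $score = 0;
    for my $word (@$good) { $score += $value if $text =~ /\Q$word\E/i }
    for my $word (@$bad)  { $score -= $value if $text =~ /\Q$word\E/i }
    return $score;
}

my @good = qw(new news latest main update sell buy free);
my @bad  = qw(privacy copyright about jobs);

print word_score('Latest News',    \@good, \@bad, 6), "\n";   # 18 (latest, news, new)
print word_score('Privacy Policy', \@good, \@bad, 6), "\n";   # -6
```

If whole-word matching is preferred, wrapping the pattern in \b anchors (for example /\b\Q$word\E\b/i) would stop "new" from matching "news," at the cost of changing the resulting scores.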

Detailed Discussion of Link Scoring Code

In the Honeyclient::Agent::Driver::Browser module, there is a subroutine called _scoreLinks() which takes three inputs:

  • $base
  • $content
  • %wordlists

The variable $base contains the base URL of the web page. Our $content variable contains the raw HTML of the web page we are analyzing. The hash variable %wordlists has two keys, good and bad. The values associated with these keys are arrays containing the good and bad words. These are passed to the agent module from the Honeyclient manager. Since different users will have different ideas for the good and bad word lists, the default words can be changed by editing the file etc/honeyclient.xml.
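
For illustration, the hash might look like the sketch below once it reaches the subroutine. The exact shape is an assumption (each key holding an array reference), the words are simply the examples given earlier, and the XML layout of etc/honeyclient.xml is not shown because it is not described here.

```perl
use strict;
use warnings;

# Assumed shape of %wordlists: two keys, each holding an array reference.
my %wordlists = (
    good => [qw(new news latest main update sell buy free)],
    bad  => [qw(privacy copyright about jobs)],
);

# Unpack into plain arrays; a missing key falls back to an empty list.
my @good_words = @{ $wordlists{good} || [] };
my @bad_words  = @{ $wordlists{bad}  || [] };

printf "%d good words, %d bad words\n", scalar @good_words, scalar @bad_words;
```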

In the generic case, we are only concerned with scoring <a href="..."> references within a web page. Other link-bearing elements, such as <link>, <map>, and <area>, are not considered under our basic framework. Although it is possible to score these link types, the current code does not implement it.

We use a regular expression, or regex, to find and parse each <a> tag in the HTML content. The following code demonstrates this:

	while ($content =~ m{<a\b([^>]+)>(.*?)</a>}igs) {
		my $attr = $1;
		my $text = $2;
		my $score = 0;

The regex m{<a\b([^>]+)>(.*?)</a>}igs allows us to grab the attributes and text associated with each <a> tag and store them in the variables $attr and $text, respectively. The /s modifier lets the dot match newlines, so a link whose text spans several lines of HTML is still found. We now check to see if the <a> tag contains an href using the following regex:

		if ($attr =~ m{
				\b HREF
				\s* = \s*
				(?:
				 "([^"]*)"
				 |
				 '([^']*)'
				 |
				 ([^'">\s]+)
				)
			  }xi)
		 {
		 	$url = $+;

For a more detailed explanation of how this regex operates, we refer the reader to Mastering Regular Expressions, 3rd edition by Jeffrey E.F. Friedl. Specifically, begin reading on page 200 under the heading "HTML-Related Examples."

Now that we have obtained an <a> tag which contains an href, we should check the protocol of the reference. It is possible for a web page to include mailto: links, ftp: links, and many others besides http:. For a Honeyclient browser agent, we are only interested in links beginning with http: or https:.
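
One simple way to express this check is a scheme test on the extracted URL. The sketch below is an assumption about how such a filter could look, not a copy of the module's code: is_web_url is a hypothetical helper, and it presumes the URL is already absolute. A relative link has no scheme at all and would first need to be resolved against $base.

```perl
use strict;
use warnings;

# True if the URL uses a scheme the browser agent should follow.
# is_web_url is a hypothetical helper; it assumes $url is already absolute.
sub is_web_url {
    my ($url) = @_;
    return $url =~ m{^\s*https?:}i ? 1 : 0;
}

for my $url ('http://example.com/', 'https://example.com/a',
             'mailto:someone@example.com', 'ftp://example.com/file') {
    printf "%-30s %s\n", $url, is_web_url($url) ? 'score' : 'skip';
}
```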

Having sifted the irrelevant links out of our scoring algorithm, we are ready to begin analyzing the $attr and $text variables. The complete scoring code is shown here:

			# Tunable scoring parameters
			my $min_text_length = 6;
			my $max_text_length = 20;
			my $image_bonus = 50;
			my $default_display_size = 1024 * 768;
			my $word_value = 6;

			# Score length of link text (only for links with no image)
			if ($text !~ /<img\b/i &&
				length($text) > $min_text_length &&
				length($text) < $max_text_length) {
				$score += length($text);
			}

			# Score the image content, if it exists
			# Image bonus: the link text contains an <img> tag
			if ($text =~ /<img\b/i) {
				$score += $image_bonus;
			}
			# Score image size
			# We score the size proportional to a 1024 X 768 display
			my $width;
			my $height;
			# The optional quote lets both width="100" and width=100 match
			if ($text =~ /\b WIDTH \s*=\s* ["']? (\d+)/xi) {
				$width = $1;
			}
			if ($text =~ /\b HEIGHT \s*=\s* ["']? (\d+)/xi) {
				$height = $1;
			}
			if ($width && $height) {
				$score += int(($width*$height)/($default_display_size)*100);
			}
			elsif ($width) {
				$score += int($width/10);
			}
			elsif ($height) {
				$score += int($height/10);
			}
			# Good word bonus
			foreach (@good_words) {
				if ($text =~ /$_/i) {
					$score += $word_value;
				}
			}
			# Bad word penalty
			foreach (@bad_words) {
				if ($text =~ /$_/i) {
					$score -= $word_value;
				}
			}

First, if the link contains no image, we increase the score by the length of the link text when that length is between 7 and 19 characters. These bounds are arbitrary; they were chosen by judgment rather than derived from any measurement.

Second, we look for the presence of an image, awarding 50 points if the link is associated with an image.

Next, we check for the height and width of the image. If both values are present, we increase the score by the percentage of space that the image would take up on a 1024 X 768 display. For example, an image taking up half the display area would get 50 points. If only one value of height or width is present, we add 10% of the given value to the score. An image with a height or width of 300 pixels would get 30 points.

Lastly, we add six points for each good word found in the text and subtract six points for each bad word.
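
To check the arithmetic of these four rules on concrete inputs, the rules can be wrapped in a standalone function. This is a sketch under stated assumptions rather than the module's own subroutine: score_link is a hypothetical name, an image is assumed to be recognized by an <img> tag in the link's inner HTML, words are escaped with \Q...\E before matching, and the sample links are invented.

```perl
use strict;
use warnings;

# Hypothetical wrapper that applies the scoring rules to one link's inner HTML.
sub score_link {
    my ($text, $good, $bad) = @_;
    my ($min_len, $max_len, $image_bonus, $word_value) = (6, 20, 50, 6);
    my $display = 1024 * 768;
    my $score   = 0;
    my $has_img = $text =~ /<img\b/i;

    # Text length: only for image-free links of 7 to 19 characters.
    my $len = length $text;
    $score += $len if !$has_img && $len > $min_len && $len < $max_len;

    # Image bonus plus size score.
    $score += $image_bonus if $has_img;
    my ($w) = $text =~ /\bwidth\s*=\s*["']?(\d+)/i;
    my ($h) = $text =~ /\bheight\s*=\s*["']?(\d+)/i;
    if    ($w && $h) { $score += int($w * $h / $display * 100) }
    elsif ($w)       { $score += int($w / 10) }
    elsif ($h)       { $score += int($h / 10) }

    # Good and bad words.
    for my $word (@$good) { $score += $word_value if $text =~ /\Q$word\E/i }
    for my $word (@$bad)  { $score -= $word_value if $text =~ /\Q$word\E/i }
    return $score;
}

my @good = qw(new news latest main update sell buy free);
my @bad  = qw(privacy copyright about jobs);

print score_link('Latest news', \@good, \@bad), "\n";   # 11 + 18 = 29
print score_link('<img src="ad.png" width="512" height="384" alt="Buy now">',
                 \@good, \@bad), "\n";                   # 50 + 25 + 6 = 81
print score_link('Privacy', \@good, \@bad), "\n";       # 7 - 6 = 1
```

The image link outranks the text links by a wide margin, which is the intended effect of the large image bonus.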

Final Thoughts

There are many possible ways to score the links on a web page. This method seeks to implement a simple heuristic which will direct the Honeyclient agent to links which are likely to be attractive to a human websurfer. More complex and complete schemes are possible, with customization of good words and bad words, analysis of image maps, changes to the point values assigned to each criterion, and so on. Testing has shown that our method does not always choose wisely, but as a general rule it does tend to steer the Honeyclient agent toward the "flashier" parts of a web page first.