Changeset 61
- Timestamp: 12/01/06 10:29:21
- Files:
honeyclient/branches/exp/stephenson-link_scoring/lib/HoneyClient/Agent/Driver/Browser.pm
r41 r61 17 17 # as published by the Free Software Foundation, using version 2 18 18 # of the License. 19 # 19 # 20 20 # This program is distributed in the hope that it will be useful, 21 21 # but WITHOUT ANY WARRANTY; without even the implied warranty of 22 22 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 23 23 # GNU General Public License for more details. 24 # 24 # 25 25 # You should have received a copy of the GNU General Public License 26 26 # along with this program; if not, write to the Free Software … … 55 55 'http://www.google.com' => 1, 56 56 'http://www.cnn.com' => 1, 57 }, 57 }, 58 58 ); 59 59 … … 76 76 print "Status:\n"; 77 77 print Dumper($browser->status()); 78 78 79 79 } 80 80 … … 90 90 91 91 This library allows the Agent module to drive an instance of any broswer, 92 running inside the HoneyClient VM. The purpose 92 running inside the HoneyClient VM. The purpose 93 93 of this module is to programmatically navigate the browser to different 94 94 websites, in order to become purposefully infected with new malware. 95 95 The module implements the logic necessary to decide the order in which 96 the 97 98 This module is object-oriented in design, retaining all state information 96 the 97 98 This module is object-oriented in design, retaining all state information 99 99 within itself for easy access. A specific browser class must inherit from 100 100 Browser. … … 116 116 external links in a random fashion. B<However>, this cannot be 117 117 guaranteed, as additional links from the same server may be found 118 later, after processing the contents of an external link. 118 later, after processing the contents of an external link. 119 119 120 120 As the browser driver navigates the browser to each link, it … … 122 122 visited (see L<links_visited>); when invalid links were found 123 123 (see L<links_ignored>); and when the browser attempted to visit 124 a link but the operation timed out (see L<links_timed_out>). 
124 a link but the operation timed out (see L<links_timed_out>). 125 125 By maintaining this internal history, the driver will B<never> 126 126 navigate the browser to the same link twice. … … 194 194 #if ($Config{osname} !~ /^MSWin32$/) { 195 195 # Carp::croak "Error: " . __PACKAGE__ . " will only run on Win32 platforms!\n"; 196 #} 196 #} 197 197 198 198 $SIG{PIPE} = 'IGNORE'; # Do not exit on broken pipes. … … 223 223 # TODO: Need unit testing. 224 224 use HoneyClient::Util::SOAP qw(getClientHandle); 225 225 226 226 # TODO: Need unit testing. 227 227 use LWP::UserAgent; … … 245 245 B<new()> function, as arguments. 246 246 247 Furthermore, as each parameter is initialized, each can be individually 247 Furthermore, as each parameter is initialized, each can be individually 248 248 retrieved and set at any time, using the following syntax: 249 249 … … 287 287 resource (i.e., "javascript:doNetDetect()"). 288 288 289 Specifically, each 'key' corresponds to an absolute URL and the 289 Specifically, each 'key' corresponds to an absolute URL and the 290 290 'value' is a string representing the date and time of when the link 291 291 was visited. … … 308 308 back into the B<links_to_visit> hashtable. 309 309 310 When driving to the next link, this hashtable is exhausted prior 310 When driving to the next link, this hashtable is exhausted prior 311 311 to the main B<links_to_visit> hashtable. This allows a 312 312 browser to navigate to all links hosted on the same server, prior … … 325 325 It is updated dynamically, any time $object->getNextLink() is called. 326 326 327 When the browser is ready to drive to the next link, B<next_link_to_visit> 327 When the browser is ready to drive to the next link, B<next_link_to_visit> 328 328 is checked first. If that value is B<undef>, then the B<relative_links_to_visit> 329 329 hashtable is checked next. If that hashtable is empty, then finally the … … 341 341 timing out. 
342 342 343 Specifically, each 'key' corresponds to an absolute URL and the 343 Specifically, each 'key' corresponds to an absolute URL and the 344 344 'value' is a string representing the date and time of when access to 345 the resource was attempted. 345 the resource was attempted. 346 346 347 347 B<Note>: See internal documentation of _getTimestamp() for the … … 385 385 =cut 386 386 387 my %PARAMS = ( 387 my %PARAMS = ( 388 388 389 389 # This is a hashtable of fully qualified URLs … … 396 396 # 'key' is the absolute URL and the 'value' is a string 397 397 # representing the date and time of when the link was visited. 398 # 398 # 399 399 # Note: See _getTimestamp() for the corresponding date/time 400 400 # format. … … 411 411 # The 'key' is the absolute URL and the 'value' is a string 412 412 # representing the date and time of when the link was visited. 413 # 413 # 414 414 # Note: See _getTimestamp() for the corresponding date/time 415 415 # format. … … 418 418 # This is a hashtable of fully qualified URLs 419 419 # that all share a common *hostname*. This hashtable should be 420 # initially empty. As the driver extracts and removes new URLs 421 # off the 'links_to_visit' hashtable, driving the browser to each URL, 420 # initially empty. As the driver extracts and removes new URLs 421 # off the 'links_to_visit' hashtable, driving the browser to each URL, 422 422 # any *relative* links found are added into this hashtable; any 423 423 # *external* links found are added back into the 'links_to_visit' 424 424 # hashtable. 425 425 # 426 # When navigating to the next link, this hashtable is exhausted prior 426 # When navigating to the next link, this hashtable is exhausted prior 427 427 # to the main 'links_to_visit' hashtable. This allows a 428 428 # browser to navigate to all links hosted on the same server, prior 429 429 # to contacting a different server. 430 # 430 # 431 431 # Specifically, the 'key' is the absolute URL and the 'value' 432 432 # is always 1. 
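The lookup order documented above (B<next_link_to_visit> first, then B<relative_links_to_visit>, then B<links_to_visit>) can be sketched in isolation as follows. This is a minimal illustration, not the module's code: the `%state` hash, `pop_any()` and `next_link()` are hypothetical names, and clearing `next_link_to_visit` once it has been read is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the driver's state. The field names mirror
# the documented parameters; the URLs are made up for illustration.
my %state = (
    next_link_to_visit      => undef,
    relative_links_to_visit => { 'http://a.example/page' => 1 },
    links_to_visit          => { 'http://b.example/'     => 1 },
);

# Remove and return one key from a hash reference, or undef if it is empty.
sub pop_any {
    my ($hash) = @_;
    my ($key) = keys %{$hash};
    delete $hash->{$key} if defined $key;
    return $key;
}

# Check next_link_to_visit first, then the same-host links,
# then the general queue. Returns undef when all three are empty.
sub next_link {
    my ($s) = @_;
    my $link = $s->{next_link_to_visit};
    $s->{next_link_to_visit} = undef;    # assumption: consumed once read
    $link = pop_any($s->{relative_links_to_visit}) unless defined $link;
    $link = pop_any($s->{links_to_visit})          unless defined $link;
    return $link;
}
```

Returning undef rather than an empty string for "nothing left" matches the convention the module's own comments describe for its link lookups.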
… … 448 448 # The 'key' is the absolute URL and the 'value' is a string 449 449 # representing the date and time of when the link was visited. 450 # 450 # 451 451 # Note: See _getTimestamp() for the corresponding date/time 452 452 # format. … … 477 477 # websites. 478 478 max_relative_links_to_visit => getVar(name => "max_relative_links_to_visit"), 479 479 480 480 ); 481 481 … … 491 491 # 492 492 # When getting the next link, 'next_link_to_visit' is checked first. 493 # If that value is undef, then the 'relative_links_to_visit' hashtable 494 # is checked next. If that hashtable is empty, then finally the 493 # If that value is undef, then the 'relative_links_to_visit' hashtable 494 # is checked next. If that hashtable is empty, then finally the 495 495 # 'links_to_visit' hashtable is checked. 496 496 # … … 501 501 # Get the object state. 502 502 my $self = shift; 503 504 # Set the link to find as undef, initially. 503 504 # Set the link to find as undef, initially. 505 505 # We use undef to signify that our URL *_links_to_visit hashtables 506 506 # are empty. If we were to use the empty string instead, as our … … 540 540 } 541 541 542 # Return the next link found. 542 # Return the next link found. 543 543 return $link; 544 544 } … … 556 556 $dt->hms(':') . "." . 557 557 $dt->nanosecond(); 558 } 558 } 559 559 560 560 # Helper function designed to "pop" a key off a given hashtable. 561 561 # When given a hashtable reference, this function will extract a valid key 562 # from the hashtable and delete the (key, value) pair from the 562 # from the hashtable and delete the (key, value) pair from the 563 563 # hashtable. The link with the highest score is returned. 564 564 # 565 # 565 # 566 566 # 567 567 # Inputs: hashref … … 575 575 my @array = sort {$$hash{$b} <=> $$hash{$a}} keys %{$hash}; 576 576 my $topkey = $array[0]; 577 577 578 578 # Delete the key from the hashtable. 579 579 if (defined($topkey)) { … … 603 603 } 604 604 605 # Get the URL supplied. 
605 # Get the URL supplied. 606 606 my $url = $arg . "/"; # Tack on an ending delimeter. 607 607 … … 631 631 # - If a link is new and "invalid", then it gets added to 632 632 # the 'links_ignored' hashtable. 633 # 633 # 634 634 # - If a link is old and "invalid", then it gets 635 635 # ignored. … … 638 638 # 639 639 # - If a link is new and "valid", then we check to see if 640 # the referring URL's hostname[:port] and the link's 640 # the referring URL's hostname[:port] and the link's 641 641 # hostname[:port] match. If they match, then the link 642 642 # is added to the 'relative_links_to_visit' hash. … … 655 655 # Get the referrer and the corresponding arrays of links and scores. 656 656 my ($referrer, %links) = @_; 657 657 658 658 foreach my $url (keys %links) { 659 659 my $score = $links{$url}; … … 676 676 # Link is new and valid; go ahead and add to the appropriate 677 677 # hashtable. 678 678 679 679 # Extract the core hostname of the URL to visit. 680 680 # If $url is undef, then this function will return an empty string. 681 681 my $hostname = _extractHostname($url); 682 682 683 683 # If the referrer's hostname and the URL's hostname match... 684 684 if ($hostname eq $referrer) { … … 692 692 } 693 693 } 694 694 695 695 # Return the modified object state. 696 696 return $self; … … 698 698 699 699 # Helper function designed to validate supplied links. 700 # 700 # 701 701 # When a link is provided as an argument: 702 702 # … … 708 708 # already exists within the history, then it is considered 709 709 # invalid. 710 # 710 # 711 711 # If the link is valid, then it is returned. Otherwise, undef 712 712 # is returned for all invalid links. Also, all invalid links … … 717 717 # Outputs: url if valid, empty string if invalid 718 718 sub _validateLink { 719 719 720 720 # Get the object state. 
721 721 my $self = shift; … … 759 759 (scalar(%{$self->links_ignored}) and 760 760 exists($self->links_ignored->{$link}))) { 761 761 762 762 # Link is valid but already visited, so return undef. 763 763 return; … … 785 785 my $stub = getClientHandle(address => 'localhost', 786 786 namespace => 'HoneyClient::Agent'); 787 787 788 788 my $som = $stub->killProcess($self->process_name); 789 789 … … 804 804 of these methods were implementations of the parent Driver interface. 805 805 806 As such, the following code descriptions pertain to this particular 806 As such, the following code descriptions pertain to this particular 807 807 Driver implementation. For further information about the generic 808 808 Driver interface, see the L<HoneyClient::Agent::Driver> documentation. … … 818 818 B<$param> is an optional parameter variable. 819 819 B<$value> is $param's corresponding value. 820 820 821 821 Note: If any $param(s) are supplied, then an equal number of 822 822 corresponding $value(s) B<must> also be specified. … … 904 904 905 905 B<Warning>: This method will B<croak> if the IE driver object is B<unable> 906 to navigate to a new link, because its list of links to visit is empty. 906 to navigate to a new link, because its list of links to visit is empty. 907 907 908 908 =back … … 947 947 # before registering attempt as a failure. 948 948 my $timeout : shared = $self->timeout(); 949 949 950 # Get the good words and bad words from config file 951 if ($Config{goodwords}) { 952 print "There are good words!"; 953 } 954 950 955 # Use LWP::UserAgent to get the desired $args{'url'} and associated content 951 # TODO: Analyze all the options LWP::UserAgent provides, in case we've 956 # TODO: Analyze all the options LWP::UserAgent provides, in case we've 952 957 # missed something useful. 953 958 # Create a new user agent. 
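The history check in the _validateLink hunk above (a link that already appears in one of the history hashtables is rejected with undef) can be sketched on its own. This is a simplified illustration under stated assumptions: `validate_link()` and the `%history` layout are hypothetical, the timestamp string is a placeholder rather than the _getTimestamp() format, and the http(s) test stands in for validity rules that this diff does not show in full.

```perl
use strict;
use warnings;

# Hypothetical history tables keyed by absolute URL.
# The timestamp value is a placeholder, not the module's real format.
my %history = (
    links_visited   => { 'http://a.example/' => 'placeholder-timestamp' },
    links_ignored   => {},
    links_timed_out => {},
);

# Return the link if it looks like an http(s) URL and has not been
# recorded in any history table; otherwise return undef.
sub validate_link {
    my ($h, $link) = @_;

    # Assumption: only http(s) URLs count as valid here.
    return unless defined $link && $link =~ m{^https?://}i;

    for my $table (qw(links_visited links_ignored links_timed_out)) {
        return if exists $h->{$table}{$link};
    }
    return $link;
}
```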
… … 967 972 $ua->max_size(1*1024*1024); # Don't get values larger than 1MB for testing 968 973 $ua->timeout($timeout); 969 974 970 975 my $response = $ua->request( 971 976 HTTP::Request->new( … … 982 987 my $content = $response->content; 983 988 my %scored_links; 984 989 985 990 # Get the current time. 986 991 my $timestamp = _getTimestamp(); 987 992 988 993 # Score the new links based on their surrounding HTML context 989 994 # If %scored_links is emtpy upon return, there are no links … … 992 997 %scored_links = _scoreLinks($base, $content); 993 998 } 994 999 995 1000 # Check to see if the request timed out. 996 1001 # TODO: Need better error detection. … … 1024 1029 $self->max_relative_links_to_visit; 1025 1030 } elsif ($self->_remaining_number_of_relative_links_to_visit > 1) { 1026 1031 1027 1032 # The counter is positive, so decrement it. 1028 1033 $self->{_remaining_number_of_relative_links_to_visit}--; … … 1062 1067 1063 1068 sub getNextLink { 1064 1069 1065 1070 # Get the object state. 1066 1071 my $self = shift; 1067 1072 1068 1073 # Sanity check: Make sure we've been fed an object. 1069 1074 unless (ref($self)) { … … 1072 1077 } 1073 1078 1074 # Set the link to find as undef, initially. 1079 # Set the link to find as undef, initially. 1075 1080 my $link = undef; 1076 1081 … … 1098 1103 1099 1104 Specifically, the returned data is a reference to a hashtable, containing 1100 detailed information about which resources, hostnames, IPs, protocols, and 1105 detailed information about which resources, hostnames, IPs, protocols, and 1101 1106 ports that the browser will contact upon the next drive() iteration. 1102 1107 … … 1104 1109 1105 1110 $hashref = { 1106 1111 1107 1112 # The set of servers that the driver will contact upon 1108 1113 # the next drive() operation. 
… … 1119 1124 'udp' => [ 53, 123 ], 1120 1125 }, 1121 1126 1122 1127 # Or, more generically: 1123 1128 'hostname_or_IP' => { … … 1133 1138 }; 1134 1139 1135 B<Note>: For this implementation of the Driver interface, 1140 B<Note>: For this implementation of the Driver interface, 1136 1141 unless getNextLink() returns undef, the returned hashtable 1137 1142 from this method will B<always> contain only B<one> hostname … … 1161 1166 # Get the object state. 1162 1167 my $self = shift; 1163 1168 1164 1169 # Sanity check: Make sure we've been fed an object. 1165 1170 unless (ref($self)) { … … 1203 1208 } 1204 1209 } 1205 1206 # Finally, construct the corresponding hash reference. 1210 1211 # Finally, construct the corresponding hash reference. 1207 1212 $nextSite = { 1208 1213 targets => { … … 1258 1263 my %links = (); 1259 1264 my $url; 1260 open(FILEH,">> scoring.txt") || die("Cannot Open File");1261 1265 open(FILEH,">>link_scores.txt") || die("Cannot Open File"); 1266 1262 1267 if (!$content) { 1263 1268 return %links; 1264 1269 } 1265 1266 # Begin to scour the HTML content for <a> tags 1270 1271 # Begin to scour the HTML content for <a> tags 1267 1272 while ($content =~ m{<a\b([^>]+)>(.*?)</a>}ig) { 1268 1273 my $attr = $1; 1269 1274 my $text = $2; 1270 1275 my $score = 0; 1271 1276 1272 1277 if ($attr =~ m{ 1273 1278 \b HREF … … 1283 1288 { 1284 1289 $url = $+; 1285 1290 1286 1291 # We have to make this an absolute url (if it's not) 1287 1292 # before using it as a key in the %links hash 1288 1293 $url = url($url, $base)->abs; 1289 1290 # The link must be an HREF and be a http(s) link 1294 1295 # The link must be an HREF and be a http(s) link 1291 1296 if ($url =~ /^http/i) { 1292 1297 # Image bonus 1293 1298 if ($text =~ /img/i) { 1294 1299 $score += 50; 1295 print FILEH "Image bonus!\n"; 1300 print FILEH "Image bonus!\n"; 1296 1301 } 1297 1302 # Score image size … … 1300 1305 my $width = $1; 1301 1306 $score += int($width/10); 1302 print FILEH "Image area bonus! 
$width\n"; 1307 print FILEH "Image area bonus! $width\n"; 1303 1308 } 1304 1309 if ($text =~ /\b HEIGHT\s*=\s*.(\d+)/xi) … … 1306 1311 my $height = $1; 1307 1312 $score += int($height/10); 1308 print FILEH "Image area bonus! $height\n"; 1313 print FILEH "Image area bonus! $height\n"; 1309 1314 } 1310 1315 # Score length of link text … … 1323 1328 print FILEH "Bad word penalty!\n"; 1324 1329 } 1325 1330 1326 1331 print FILEH "The attributes for $url are $attr\n" unless (!$attr); 1327 1332 print FILEH "The text for $url is $text\n" unless (!$text); 1328 1333 print FILEH "It scored $score\n"; 1329 1334 1330 1335 $links{$url} = $score; 1331 1336 $url = undef; … … 1334 1339 } 1335 1340 } 1336 1337 close(FILEH); 1341 1342 close(FILEH); 1338 1343 return %links; 1339 1344 } … … 1376 1381 # Get the object state. 1377 1382 my $self = shift; 1378 1383 1379 1384 # Sanity check: Make sure we've been fed an object. 1380 1385 unless (ref($self)) { … … 1388 1393 scalar(%{$self->relative_links_to_visit}) or 1389 1394 scalar(%{$self->links_to_visit}))) 1390 1395 1391 1396 } 1392 1397 … … 1411 1416 'relative_links_remaining' => 10, # Number of URLs left to 1412 1417 # process, at a given site. 1413 'links_remaining' => 56, # Number of URLs left to 1418 'links_remaining' => 56, # Number of URLs left to 1414 1419 # process, for all sites. 1415 1420 'links_processed' => 44, # Number of URLs processed. … … 1436 1441 1437 1442 sub status { 1438 1443 1439 1444 # Get the object state. 1440 1445 my $self = shift; 1441 1446 1442 1447 # Sanity check: Make sure we've been fed an object. 1443 1448 unless (ref($self)) { … … 1454 1459 scalar(keys(%{$self->links_ignored})); 1455 1460 1456 # Set the number of relative links to process. 1461 # Set the number of relative links to process. 1457 1462 $status->{relative_links_remaining} = scalar(keys(%{$self->relative_links_to_visit})); 1458 1463 1459 1464 # Figure out how many total links are left to process. 
1460 1465 $status->{links_remaining} = scalar(keys(%{$self->relative_links_to_visit})) + … … 1462 1467 1463 1468 # Set the total number of links in the object's state. 1464 $status->{links_total} = $status->{links_processed} + 1469 $status->{links_total} = $status->{links_processed} + 1465 1470 $status->{links_remaining}; 1466 1471 … … 1470 1475 $status->{links_total} = 1; 1471 1476 } 1472 $status->{percent_complete} = sprintf("%.2f%%", (($status->{links_processed} + 0.00) / 1477 $status->{percent_complete} = sprintf("%.2f%%", (($status->{links_processed} + 0.00) / 1473 1478 ($status->{links_total} + 0.00)) * 100.00); 1474 1479 … … 1519 1524 $object->drive() iteration. 1520 1525 1521 For example, if at one given point, the status of B<percent_complete> 1522 is 30% and then this value drops to 15% upon another iteration, then 1523 this means that the total number of links to drive to has greatly 1526 For example, if at one given point, the status of B<percent_complete> 1527 is 30% and then this value drops to 15% upon another iteration, then 1528 this means that the total number of links to drive to has greatly 1524 1529 increased. 1525 1530 … … 1560 1565 as published by the Free Software Foundation, using version 2 1561 1566 of the License. 1562 1567 1563 1568 This program is distributed in the hope that it will be useful, 1564 1569 but WITHOUT ANY WARRANTY; without even the implied warranty of 1565 1570 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 1566 1571 GNU General Public License for more details. 1567 1572 1568 1573 You should have received a copy of the GNU General Public License 1569 1574 along with this program; if not, write to the Free Software
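The two scoring-related pieces of this changeset, the _scoreLinks hunk and the score-ordered "pop" helper, can be sketched together as follows. This is a reduced illustration, not the committed code: `score_links()` and `pop_top()` are hypothetical names, only the image, WIDTH and HEIGHT bonuses are reproduced, hrefs are assumed to be absolute already (the original resolves them against a base URL), and nothing is written to a log file.

```perl
use strict;
use warnings;

# Score each absolute http(s) link found in $content.
# Returns a hash of url => score (empty if there is no content).
sub score_links {
    my ($content) = @_;
    my %links;
    return %links unless $content;

    while ($content =~ m{<a\b([^>]+)>(.*?)</a>}igs) {
        my ($attr, $text) = ($1, $2);

        # Assumption: href values are quoted and already absolute.
        next unless $attr =~ m{\bhref\s*=\s*["']([^"']+)["']}i;
        my $url = $1;
        next unless $url =~ m{^https?://}i;

        my $score = 0;
        $score += 50           if $text =~ /img/i;                          # image bonus
        $score += int($1 / 10) if $text =~ /\bwidth\s*=\s*["']?(\d+)/i;    # width bonus
        $score += int($1 / 10) if $text =~ /\bheight\s*=\s*["']?(\d+)/i;   # height bonus
        $links{$url} = $score;
    }
    return %links;
}

# Remove and return the highest-scoring key of a hash reference,
# or undef when the hash is empty.
sub pop_top {
    my ($hash) = @_;
    my ($top) = sort { $hash->{$b} <=> $hash->{$a} } keys %{$hash};
    delete $hash->{$top} if defined $top;
    return $top;
}
```

Sorting the full key list on every pop costs O(n log n) per call, as it does in the diff; for large queues a single pass that tracks the maximum would be cheaper.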
