Changeset 166
- Timestamp: 01/19/07 14:25:50 (2 years ago)
- Files:
- honeyclient/branches/exp/mbriggs-db/etc/honeyclient.xml (modified) (3 diffs)
- honeyclient/branches/exp/mbriggs-db/etc/honeyclient_log.conf (modified) (1 diff)
- honeyclient/branches/exp/mbriggs-db/lib/HoneyClient/Agent/Driver/Browser.pm (modified) (61 diffs)
- honeyclient/branches/exp/mbriggs-db/lib/HoneyClient/Agent/Driver/Browser/FF.pm (modified) (5 diffs)
- honeyclient/branches/exp/mbriggs-db/lib/HoneyClient/Manager.pm (modified) (6 diffs)
- honeyclient/branches/exp/mbriggs-db/lib/HoneyClient/Manager/FW.pm (modified) (54 diffs)
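The Browser.pm changes in this changeset replace the arbitrary-key `_pop` helper with one that returns the link with the highest score. A minimal standalone sketch of that selection follows; the helper name `pop_highest`, the example URLs, and the scores are hypothetical, and the tie-break on the key is an addition for deterministic output that the changeset itself does not include.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the score-ordered _pop helper:
# removes and returns the key with the highest numeric value.
sub pop_highest {
    my ($hash) = @_;

    # Sort keys by descending score; ties are broken by key name
    # (an assumption, added only to make the order deterministic).
    my ($top) = sort { $hash->{$b} <=> $hash->{$a} or $a cmp $b } keys %{$hash};

    # Delete the (key, value) pair, if the hashtable was not empty.
    delete $hash->{$top} if defined $top;

    # Returns undef when the hashtable is empty.
    return $top;
}

# Hypothetical queue of scored links.
my %queue = (
    'http://example.com/news'    => 12,
    'http://example.com/privacy' => -6,
    'http://example.com/'        => 0,
);

# Drain the queue, highest score first.
while (defined(my $url = pop_highest(\%queue))) {
    print "$url\n";
}
```

Sorting on every call costs O(n log n) per pop. That matches the approach in the diff and is simple, though a priority queue would scale better for large link sets.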
honeyclient/branches/exp/mbriggs-db/etc/honeyclient.xml
r149 r166
  69   69  <!-- TODO: Update this. -->
  70   70  <timeout description="How long the Driver waits during a drive operation, before timing out (in seconds)." default="60">
  71       5
       71  10
  72   72  </timeout>
  73   73  <Browser>
   …    …
  84   84  -1
  85   85  </max_relative_links_to_visit>
       86  <goodwords description="A comma-separated list of good words which will increase the score of links within a webpage." default="">
       87  news,new,big,latest,main,update,sell,free,buy
       88  </goodwords>
       89  <badwords description="A comma-separated list of bad words which will decrease the score of links within a webpage." default="">
       90  archive,privacy,legal,disclaim,about,contact,copyright,jobs,careers
       91  </badwords>
  86   92  <IE>
  87   93  <!-- HoneyClient::Agent::Driver::IE Options -->
   …    …
 174  180  </Agent>
 175  181  <Manager>
      182  <!-- TODO: Update this. -->
      183  <manager_state description="Upon termination, the Manager will attempt to save a complete copy of its state into this file, if specified." default="">
      184  Manager.dump
      185  </manager_state>
 176  186  <!-- TODO: Update this. -->
 177  187  <address description="The IP or hostname that all Manager modules should use, when accepting SOAP requests." default="localhost">
honeyclient/branches/exp/mbriggs-db/etc/honeyclient_log.conf
r149 r166
  60   60
  61   61  log4perl.rootLogger=INFO, Screen
       62  #log4perl.logger.HoneyClient.Agent.Integrity.Registry=DEBUG, Screen
  62   63  # Suppress Parser Debugging Messages
  63   64  #log4perl.logger.HoneyClient.Agent.Integrity.Registry.Parser=INFO, Screen
honeyclient/branches/exp/mbriggs-db/lib/HoneyClient/Agent/Driver/Browser.pm
r149 r166 17 17 # as published by the Free Software Foundation, using version 2 18 18 # of the License. 19 # 19 # 20 20 # This program is distributed in the hope that it will be useful, 21 21 # but WITHOUT ANY WARRANTY; without even the implied warranty of 22 22 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 23 23 # GNU General Public License for more details. 24 # 24 # 25 25 # You should have received a copy of the GNU General Public License 26 26 # along with this program; if not, write to the Free Software … … 55 55 'http://www.google.com' => 1, 56 56 'http://www.cnn.com' => 1, 57 }, 57 }, 58 58 ); 59 59 … … 76 76 print "Status:\n"; 77 77 print Dumper($browser->status()); 78 78 79 79 } 80 80 … … 94 94 become purposefully infected with new malware. 95 95 96 This module is object-oriented in design, retaining all state information 96 This module is object-oriented in design, retaining all state information 97 97 within itself for easy access. A specific browser class must inherit from 98 98 Browser. … … 114 114 external links in a random fashion. B<However>, this cannot be 115 115 guaranteed, as additional links from the same server may be found 116 later, after processing the contents of an external link. 116 later, after processing the contents of an external link. 117 117 118 118 As the browser driver navigates the browser to each link, it … … 120 120 visited (see L<links_visited>); when invalid links were found 121 121 (see L<links_ignored>); and when the browser attempted to visit 122 a link but the operation timed out (see L<links_timed_out>). 122 a link but the operation timed out (see L<links_timed_out>). 123 123 By maintaining this internal history, the driver will B<never> 124 124 navigate the browser to the same link twice. … … 192 192 #if ($Config{osname} !~ /^MSWin32$/) { 193 193 # Carp::croak "Error: " . __PACKAGE__ . 
" will only run on Win32 platforms!\n"; 194 #} 194 #} 195 195 196 196 $SIG{PIPE} = 'IGNORE'; # Do not exit on broken pipes. … … 222 222 # TODO: Need unit testing. 223 223 use HoneyClient::Util::SOAP qw(getClientHandle); 224 224 225 225 # TODO: Need unit testing. 226 226 use LWP::UserAgent; … … 244 244 B<new()> function, as arguments. 245 245 246 Furthermore, as each parameter is initialized, each can be individually 246 Furthermore, as each parameter is initialized, each can be individually 247 247 retrieved and set at any time, using the following syntax: 248 248 … … 286 286 resource (i.e., "javascript:doNetDetect()"). 287 287 288 Specifically, each 'key' corresponds to an absolute URL and the 288 Specifically, each 'key' corresponds to an absolute URL and the 289 289 'value' is a string representing the date and time of when the link 290 290 was visited. … … 307 307 back into the B<links_to_visit> hashtable. 308 308 309 When driving to the next link, this hashtable is exhausted prior 309 When driving to the next link, this hashtable is exhausted prior 310 310 to the main B<links_to_visit> hashtable. This allows a 311 311 browser to navigate to all links hosted on the same server, prior … … 324 324 It is updated dynamically, any time $object->getNextLink() is called. 325 325 326 When the browser is ready to drive to the next link, B<next_link_to_visit> 326 When the browser is ready to drive to the next link, B<next_link_to_visit> 327 327 is checked first. If that value is B<undef>, then the B<relative_links_to_visit> 328 328 hashtable is checked next. If that hashtable is empty, then finally the … … 340 340 timing out. 341 341 342 Specifically, each 'key' corresponds to an absolute URL and the 342 Specifically, each 'key' corresponds to an absolute URL and the 343 343 'value' is a string representing the date and time of when access to 344 the resource was attempted. 344 the resource was attempted. 
345 345 346 346 B<Note>: See internal documentation of _getTimestamp() for the … … 383 383 =cut 384 384 385 my %PARAMS = ( 385 my %PARAMS = ( 386 386 387 387 # This is a hashtable of fully qualified URLs … … 394 394 # 'key' is the absolute URL and the 'value' is a string 395 395 # representing the date and time of when the link was visited. 396 # 396 # 397 397 # Note: See _getTimestamp() for the corresponding date/time 398 398 # format. … … 409 409 # The 'key' is the absolute URL and the 'value' is a string 410 410 # representing the date and time of when the link was visited. 411 # 411 # 412 412 # Note: See _getTimestamp() for the corresponding date/time 413 413 # format. … … 416 416 # This is a hashtable of fully qualified URLs 417 417 # that all share a common *hostname*. This hashtable should be 418 # initially empty. As the driver extracts and removes new URLs 419 # off the 'links_to_visit' hashtable, driving the browser to each URL, 418 # initially empty. As the driver extracts and removes new URLs 419 # off the 'links_to_visit' hashtable, driving the browser to each URL, 420 420 # any *relative* links found are added into this hashtable; any 421 421 # *external* links found are added back into the 'links_to_visit' 422 422 # hashtable. 423 423 # 424 # When navigating to the next link, this hashtable is exhausted prior 424 # When navigating to the next link, this hashtable is exhausted prior 425 425 # to the main 'links_to_visit' hashtable. This allows a 426 426 # browser to navigate to all links hosted on the same server, prior 427 427 # to contacting a different server. 428 # 428 # 429 429 # Specifically, the 'key' is the absolute URL and the 'value' 430 430 # is always 1. … … 446 446 # The 'key' is the absolute URL and the 'value' is a string 447 447 # representing the date and time of when the link was visited. 448 # 448 # 449 449 # Note: See _getTimestamp() for the corresponding date/time 450 450 # format. … … 474 474 # websites. 
475 475 max_relative_links_to_visit => getVar(name => "max_relative_links_to_visit"), 476 476 477 # Comma-separated string containing the good words and bad words for link scoring purposes 478 goodwords => getVar(name => "goodwords", namespace => "HoneyClient::Agent::Driver::Browser"), 479 badwords => getVar(name => "badwords", namespace => "HoneyClient::Agent::Driver::Browser"), 480 477 481 ); 478 482 … … 488 492 # 489 493 # When getting the next link, 'next_link_to_visit' is checked first. 490 # If that value is undef, then the 'relative_links_to_visit' hashtable 491 # is checked next. If that hashtable is empty, then finally the 494 # If that value is undef, then the 'relative_links_to_visit' hashtable 495 # is checked next. If that hashtable is empty, then finally the 492 496 # 'links_to_visit' hashtable is checked. 493 497 # … … 498 502 # Get the object state. 499 503 my $self = shift; 500 501 # Set the link to find as undef, initially. 504 505 # Set the link to find as undef, initially. 502 506 # We use undef to signify that our URL *_links_to_visit hashtables 503 507 # are empty. If we were to use the empty string instead, as our … … 537 541 } 538 542 539 # Return the next link found. 543 # Return the next link found. 540 544 return $link; 541 545 } … … 553 557 $dt->hms(':') . "." . 554 558 $dt->nanosecond(); 555 } 559 } 556 560 557 561 # Helper function designed to "pop" a key off a given hashtable. 558 562 # When given a hashtable reference, this function will extract a valid key 559 # from the hashtable and delete the (key, value) pair from the 560 # hashtable. 561 # 562 # Note: There is no guaranteed order about how this function picks 563 # keys from the hashtable. 563 # from the hashtable and delete the (key, value) pair from the 564 # hashtable. The link with the highest score is returned. 
565 # 566 # 564 567 # 565 568 # Inputs: hashref … … 570 573 my $hash = shift; 571 574 572 # Get a new key.573 my @ keys = keys(%{$hash});574 my $ key = pop(@keys);575 575 # Get the highest score. 576 my @array = sort {$$hash{$b} <=> $$hash{$a}} keys %{$hash}; 577 my $topkey = $array[0]; 578 576 579 # Delete the key from the hashtable. 577 if (defined($ key)) {578 delete $hash->{$ key};580 if (defined($topkey)) { 581 delete $hash->{$topkey}; 579 582 } 580 583 581 584 # Return the key found. 582 return $key; 583 } 584 585 # This is the abstract function which actually fetches the web content using 586 # a specific browser implementation. Must be implemented by each browser class. 587 588 sub getContent { 589 590 } 591 592 # Helper function which parses the HTTP::Response from LWP::UserAgent 593 # and returns an array of the links contained in the response 594 # 595 # Inputs: HTTP::Response object 596 # Outputs: Array containing all href links within the response 597 598 sub _getAllLinks { 599 600 my $response = shift; 601 my $hostname = shift; 602 my @links = (); 603 my $thislink; 604 605 my $html = $response->content; 606 607 while( $html =~ m/<A HREF=\"(.*?)\"/gi ) { 608 $thislink = $1; 609 610 # For relative links, prepend the hostname 611 # TODO: Probably shouldn't assume the HTTP protocol... 612 if ($thislink =~ /^\//) { 613 $thislink = "http://" . $hostname . $thislink; 614 } 615 616 push @links, $thislink; 617 } 618 619 #Return the list of absolute links 620 return @links; 585 return $topkey; 621 586 } 622 587 … … 639 604 } 640 605 641 # Get the URL supplied. 606 # Get the URL supplied. 642 607 my $url = $arg . "/"; # Tack on an ending delimeter. 643 608 … … 652 617 # Helper function, designed to process all links found at a 653 618 # given URL, once the browser has been driven to that URL 654 # and has collected all corresponding links. 619 # and has collected all corresponding links. 
The links are 620 # sorted in increasing order as determined by their score. 655 621 # 656 622 # When supplied with the array of URL strings, … … 666 632 # - If a link is new and "invalid", then it gets added to 667 633 # the 'links_ignored' hashtable. 668 # 634 # 669 635 # - If a link is old and "invalid", then it gets 670 636 # ignored. … … 673 639 # 674 640 # - If a link is new and "valid", then we check to see if 675 # the referring URL's hostname[:port] and the link's 641 # the referring URL's hostname[:port] and the link's 676 642 # hostname[:port] match. If they match, then the link 677 643 # is added to the 'relative_links_to_visit' hash. … … 681 647 # Inputs: HoneyClient::Agent::Driver::Browser object, 682 648 # hostname[:port] of referring URL, 683 # array of URL strings649 # hash of URL strings and scores, the url is the key 684 650 # Outputs: HoneyClient::Agent::Driver::Browser object 685 651 sub _processLinks { … … 688 654 my $self = shift; 689 655 690 # Get the referrer and the corresponding array of links. 691 my ($referrer, @links) = @_; 692 693 foreach my $url (@links) { 656 # Get the referrer and the corresponding arrays of links and scores. 657 my ($referrer, %links) = @_; 658 659 foreach my $url (keys %links) { 660 my $score = $links{$url}; 694 661 695 662 # Skip over any undefined links. … … 710 677 # Link is new and valid; go ahead and add to the appropriate 711 678 # hashtable. 712 679 713 680 # Extract the core hostname of the URL to visit. 714 681 # If $url is undef, then this function will return an empty string. 715 682 my $hostname = _extractHostname($url); 716 683 717 684 # If the referrer's hostname and the URL's hostname match... 718 685 if ($hostname eq $referrer) { 719 686 # Then add the URL to the 'relative_links_to_visit' hashtable, 720 687 # since we're visiting links that share the same hostname. 
721 $self->relative_links_to_visit->{$url} = 1;688 $self->relative_links_to_visit->{$url} = $score; 722 689 } else { 723 690 # Else, add the URL to the 'links_to_visit' hashtable, 724 691 # since we're visiting links that do NOT share the same hostname. 725 $self->links_to_visit->{$url} = 1;692 $self->links_to_visit->{$url} = $score; 726 693 } 727 694 } 728 695 729 696 # Return the modified object state. 730 697 return $self; … … 732 699 733 700 # Helper function designed to validate supplied links. 734 # 701 # 735 702 # When a link is provided as an argument: 736 703 # … … 742 709 # already exists within the history, then it is considered 743 710 # invalid. 744 # 711 # 745 712 # If the link is valid, then it is returned. Otherwise, undef 746 713 # is returned for all invalid links. Also, all invalid links … … 751 718 # Outputs: url if valid, empty string if invalid 752 719 sub _validateLink { 753 720 754 721 # Get the object state. 755 722 my $self = shift; … … 793 760 (scalar(%{$self->links_ignored}) and 794 761 exists($self->links_ignored->{$link}))) { 795 762 796 763 # Link is valid but already visited, so return undef. 797 764 return; … … 819 786 my $stub = getClientHandle(address => 'localhost', 820 787 namespace => 'HoneyClient::Agent'); 821 788 822 789 my $som = $stub->killProcess($self->process_name); 823 790 … … 838 805 of these methods were implementations of the parent Driver interface. 839 806 840 As such, the following code descriptions pertain to this particular 807 As such, the following code descriptions pertain to this particular 841 808 Driver implementation. For further information about the generic 842 809 Driver interface, see the L<HoneyClient::Agent::Driver> documentation. … … 852 819 B<$param> is an optional parameter variable. 853 820 B<$value> is $param's corresponding value. 854 821 855 822 Note: If any $param(s) are supplied, then an equal number of 856 823 corresponding $value(s) B<must> also be specified. 
… … 941 908 942 909 B<Warning>: This method will B<croak> if the Browser driver object is B<unable> 943 to navigate to a new link, because its list of links to visit is empty. 910 to navigate to a new link, because its list of links to visit is empty. 944 911 945 912 =back … … 984 951 # before registering attempt as a failure. 985 952 my $timeout : shared = $self->timeout(); 986 953 987 954 # Use LWP::UserAgent to get the desired $args{'url'} and associated content 988 my @links = undef; 989 990 # TODO: Analyze all the options LWP::UserAgent provides, in case we've 955 # TODO: Analyze all the options LWP::UserAgent provides, in case we've 991 956 # missed something useful. 992 957 # Create a new user agent. 993 958 my $ua = LWP::UserAgent->new( 994 959 timeout => $timeout, # Fixed timeout. 995 max_redirect => 0, # Ignore redirects.960 #max_redirect => 0, # Ignore redirects. 996 961 protocols_allowed => [ 'http', 'https' ], # Allow only web protocols. 962 max_size => 1*1024*1024, # Don't get larger than 1MB for testing 997 963 ); 998 964 965 # TODO: Look at the content type "text/html" on the response, to make this 966 # a little better. 999 967 # TODO: Set the default headers, to mimic a regular browser (if need be). 1000 968 # I'm thinking this could be set by IE/FF and passed via $args{'default_headers'} 1001 969 # as a HTTP::Headers object. 1002 1003 # TODO: Look at the content type "text/html" on the response, to make this1004 # a little better.1005 970 $ua->default_header( 'Accept' => 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5' ); 1006 $ua->max_size(1*1024*1024); # Don't get values larger than 1MB for testing1007 $ua->timeout($timeout);1008 1009 # XXX: This is old code; delete eventually.1010 # my $response = $ua->get($args{'url'});1011 1012 # Get the links1013 # @links = _getAllLinks($response, _extractHostname($args{'url'}));1014 1015 # Make the parser. 
Unfortunately, we don't know the base yet1016 # (it might be diffent from $url)1017 #my $parser = HTML::LinkExtor->new(\&extractLinks);1018 my $parser = HTML::LinkExtor->new();1019 971 1020 972 my $response = $ua->request( … … 1025 977 'Accept' => 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5', 1026 978 ), 1027 ), 1028 sub { $parser->parse($_[0]) }, 979 ) 1029 980 ); 1030 1031 # Extract only the <a href ...> links, for now. 1032 # TODO: Handle other link types. 1033 foreach my $entry ($parser->links) { 1034 if ($entry->[0] eq 'a') { 1035 push(@links, $entry->[2]); 1036 } 1037 } 1038 1039 # Expand all relative links found to absolute ones. 981 982 # Get the base url from the response 1040 983 my $base = $response->base; 1041 @links = map { $_ = url($_, $base)->abs; } @links;984 my $content = $response->content; 1042 985 1043 986 # Get the current time. 1044 987 my $timestamp = _getTimestamp(); 1045 988 989 # Score the new links based on their surrounding HTML context 990 # If %scored_links is emtpy upon return, there are no links 991 # and we will not perform any of the following code 992 my %scored_links; 993 if ($content) { 994 # Extract the good word and bad word lists into arrays; 995 my @good_words = split /,/, $self->goodwords; 996 my @bad_words = split /,/, $self->badwords; 997 my %wordlists = ('good' => \@good_words, 'bad' => \@bad_words); 998 # Call the link scoring function 999 %scored_links = _scoreLinks($base, $content, %wordlists); 1000 } 1001 1046 1002 # Check to see if the request timed out. 1047 1003 # TODO: Need better error detection. 1048 if (! @links) {1004 if (!%scored_links) { 1049 1005 $self->links_timed_out->{$args{'url'}} = $timestamp; 1050 1006 … … 1059 1015 $self->links_visited->{$args{'url'}} = $timestamp; 1060 1016 1061 # Get all links found on this page.1017 # Add all links found on this page to our sorted queues. 
1062 1018 # This function modifies the $self object internally and its 1063 1019 # returned content does not need to be checked. 1064 $self->_processLinks(_extractHostname($args{'url'}), @links);1020 $self->_processLinks(_extractHostname($args{'url'}), %scored_links); 1065 1021 } 1066 1022 … … 1075 1031 $self->max_relative_links_to_visit; 1076 1032 } elsif ($self->_remaining_number_of_relative_links_to_visit > 1) { 1077 1033 1078 1034 # The counter is positive, so decrement it. 1079 1035 $self->{_remaining_number_of_relative_links_to_visit}--; … … 1112 1068 1113 1069 sub getNextLink { 1114 1070 1115 1071 # Get the object state. 1116 1072 my $self = shift; 1117 1073 1118 1074 # Sanity check: Make sure we've been fed an object. 1119 1075 unless (ref($self)) { … … 1122 1078 } 1123 1079 1124 # Set the link to find as undef, initially. 1080 # Set the link to find as undef, initially. 1125 1081 my $link = undef; 1126 1082 … … 1148 1104 1149 1105 Specifically, the returned data is a reference to a hashtable, containing 1150 detailed information about which resources, hostnames, IPs, protocols, and 1106 detailed information about which resources, hostnames, IPs, protocols, and 1151 1107 ports that the browser will contact upon the next drive() iteration. 1152 1108 … … 1154 1110 1155 1111 $hashref = { 1156 1112 1157 1113 # The set of servers that the driver will contact upon 1158 1114 # the next drive() operation. … … 1169 1125 'udp' => [ 53, 123 ], 1170 1126 }, 1171 1127 1172 1128 # Or, more generically: 1173 1129 'hostname_or_IP' => { … … 1183 1139 }; 1184 1140 1185 B<Note>: For this implementation of the Driver interface, 1141 B<Note>: For this implementation of the Driver interface, 1186 1142 unless getNextLink() returns undef, the returned hashtable 1187 1143 from this method will B<always> contain only B<one> hostname … … 1211 1167 # Get the object state. 1212 1168 my $self = shift; 1213 1169 1214 1170 # Sanity check: Make sure we've been fed an object. 
1215 1171 unless (ref($self)) { … … 1253 1209 } 1254 1210 } 1255 1256 # Finally, construct the corresponding hash reference. 1211 1212 # Finally, construct the corresponding hash reference. 1257 1213 $nextSite = { 1258 1214 targets => { … … 1271 1227 =pod 1272 1228 1229 =head2 _scoreLinks() 1230 1231 =over 4 1232 1233 The _scoreLinks helper function takes a scalar which is the base url for 1234 the web page, a scalar which holds the content of the page (HTML), and a 1235 hash which contain the good and bad words. 1236 1237 This function will calculate the "popularity" scores of the links. 1238 The function returns a hash which is keyed on the _absolute_ url 1239 and contains the value of the score. 1240 1241 I<Output>: The populated %scored_links hash if the page is not empty. An empty 1242 hash otherwise. 1243 1244 For example, if your raw HTML content is $content, and the base url is 1245 $base you would use the following call to this function. 1246 1247 if ($content) { 1248 # Extract the good word and bad word lists into arrays; 1249 my @good_words = split /,/, $self->goodwords; 1250 my @bad_words = split /,/, $self->badwords; 1251 my %wordlists = ('good' => \@good_words, 'bad' => \@bad_words); 1252 # Call the link scoring function 1253 %scored_links = _scoreLinks($base, $content, %wordlists); 1254 } 1255 1256 =back 1257 1258 =begin testing 1259 1260 # XXX: Test this. 
1261 1; 1262 1263 =end testing 1264 1265 =cut 1266 1267 sub _scoreLinks { 1268 my ($base, $content, %wordlists) = @_; 1269 my @good_words = @{$wordlists{good}}; 1270 my @bad_words = @{$wordlists{bad}}; 1271 my %links = (); 1272 my $url; 1273 1274 # If the page is blank, there is no point trying to parse it 1275 if (!$content) { 1276 return %links; 1277 } 1278 1279 # Begin to scour the HTML content for <a> tags, parsing attributes and text 1280 while ($content =~ m{<a\b([^>]+)>(.*?)</a>}ig) { 1281 my $attr = $1; 1282 my $text = $2; 1283 my $score = 0; 1284 1285 # Look for the link in the attribute data 1286 if ($attr =~ m{ 1287 \b HREF 1288 \s* = \s* 1289 (?: 1290 "([^"]*)" 1291 | 1292 '([^']*)' 1293 | 1294 {[^'">\s]+} 1295 ) 1296 }xi) 1297 { 1298 $url = $+; 1299 1300 # Some programmatic values 1301 my $min_text_length = 6; 1302 my $max_text_length = 20; 1303 my $image_bonus = 50; 1304 my $default_display_size = 1024 * 768; 1305 my $word_value = 6; 1306 1307 # We have to make this an absolute url (if it's not) 1308 # before using it as a key in the %links hash 1309 $url = url($url, $base)->abs; 1310 1311 # The link must be an HREF and be a http(s) link 1312 if ($url =~ /^http/i) { 1313 # Begin scoring the link based on surrounding context 1314 # This can be improved/customized in many different ways. 1315 # Our implementation is only one possible way to assign 1316 # values to the context elements. 1317 1318 # Score length of link text. These are arbitrary lengths, but 1319 # the reasoning is that really short text links are not too 1320 # visible (we are excluding image links from this criteria), 1321 # and really long text would be weird or abnormal to the human 1322 # web surfer. 
1323 if ($text !~ /img /i && 1324 length($text) > $min_text_length && 1325 length($text) < $max_text_length) { 1326 $score += length($text); 1327 } 1328 1329 # Score the image content, if it exists 1330 # We score the size proportional to a 1024 X 768 display 1331 # Image bonus 1332 if ($text =~ /img /i) { 1333 $score += $image_bonus; 1334 } 1335 # Score image size 1336 my $width; 1337 my $height; 1338 if ($text =~ /\b WIDTH\s*=\s*.(\d+)/xi) { 1339 $width = $1; 1340 } 1341 if ($text =~ /\b HEIGHT\s*=\s*.(\d+)/xi) { 1342 $height = $1; 1343 } 1344 if ($width && $height) { 1345 $score += int(($width*$height)/($default_display_size)*100); 1346 } 1347 elsif ($width) { 1348 $score += int($width/10); 1349 } 1350 elsif ($height) { 1351 $score += int($height/10); 1352 } 1353 1354 # Good word bonus 1355 foreach (@good_words) { 1356 if ($text =~ /$_/i) { 1357 $score += $word_value; 1358 } 1359 } 1360 1361 # Bad word penalty 1362 foreach (@bad_words) { 1363 if ($text =~ /$_/i) { 1364 $score -= $word_value; 1365 } 1366 } 1367 1368 # Put it in the return value hash and zero the score 1369 $links{$url} = $score; 1370 $url = undef; 1371 } 1372 } 1373 } 1374 return %links; 1375 } 1376 1377 =pod 1378 1273 1379 =head2 $object->isFinished() 1274 1380 … … 1306 1412 # Get the object state. 1307 1413 my $self = shift; 1308 1414 1309 1415 # Sanity check: Make sure we've been fed an object. 1310 1416 unless (ref($self)) { … … 1318 1424 scalar(%{$self->relative_links_to_visit}) or 1319 1425 scalar(%{$self->links_to_visit}))) 1320 1426 1321 1427 } 1322 1428 … … 1341 1447 'relative_links_remaining' => 10, # Number of URLs left to 1342 1448 # process, at a given site. 1343 'links_remaining' => 56, # Number of URLs left to 1449 'links_remaining' => 56, # Number of URLs left to 1344 1450 # process, for all sites. 1345 1451 'links_processed' => 44, # Number of URLs processed. … … 1366 1472 1367 1473 sub status { 1368 1474 1369 1475 # Get the object state. 
1370 1476 my $self = shift; 1371 1477 1372 1478 # Sanity check: Make sure we've been fed an object. 1373 1479 unless (ref($self)) { … … 1384 1490 scalar(keys(%{$self->links_ignored})); 1385 1491 1386 # Set the number of relative links to process. 1492 # Set the number of relative links to process. 1387 1493 $status->{relative_links_remaining} = scalar(keys(%{$self->relative_links_to_visit})); 1388 1494 1389 1495 # Figure out how many total links are left to process. 1390 1496 $status->{links_remaining} = scalar(keys(%{$self->relative_links_to_visit})) + … … 1392 1498 1393 1499 # Set the total number of links in the object's state. 1394 $status->{links_total} = $status->{links_processed} + 1500 $status->{links_total} = $status->{links_processed} + 1395 1501 $status->{links_remaining}; 1396 1502 … … 1400 1506 $status->{links_total} = 1; 1401 1507 } 1402 $status->{percent_complete} = sprintf("%.2f%%", (($status->{links_processed} + 0.00) / 1508 $status->{percent_complete} = sprintf("%.2f%%", (($status->{links_processed} + 0.00) / 1403 1509 ($status->{links_total} + 0.00)) * 100.00); 1404 1510 … … 1441 1547 $object->drive() iteration. 1442 1548 1443 For example, if at one given point, the status of B<percent_complete> 1444 is 30% and then this value drops to 15% upon another iteration, then 1445 this means that the total number of links to drive to has greatly 1549 For example, if at one given point, the status of B<percent_complete> 1550 is 30% and then this value drops to 15% upon another iteration, then 1551 this means that the total number of links to drive to has greatly 1446 1552 increased. 1447 1553 … … 1478 1584 as published by the Free Software Foundation, using version 2 1479 1585 of the License. 1480 1586 1481 1587 This program is distributed in the hope that it will be useful, 1482 1588 but WITHOUT ANY WARRANTY; without even the implied warranty of 1483 1589 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the 1484 1590 GNU General Public License for more details. 1485 1591 1486 1592 You should have received a copy of the GNU General Public License 1487 1593 along with this program; if not, write to the Free Software honeyclient/branches/exp/mbriggs-db/lib/HoneyClient/Agent/Driver/Browser/FF.pm
r149 r166 34 34 use strict; 35 35 use warnings; 36 use Config; 36 37 use Carp (); 37 use Config;38 use Win32::Job; #For starting browser39 use HTML::LinkExtor; #For extracting links from HTML40 use HTML::HeadParser; #For extracting the meta w/ URL that LinkExtor misses41 use LWP::UserAgent; #Perl-based "browser"42 use URI; #For absolutizing relative URLs43 #use Data::Dumper; #For Debugging44 38 45 39 # Traps signals, allowing END: blocks to perform cleanup. 46 40 use sigtrap qw(die untrapped normal-signals error-signals); 47 41 48 ####################################################################### ########49 # Module Initialization #50 ####################################################################### ########42 ####################################################################### 43 # Module Initialization # 44 ####################################################################### 51 45 52 46 BEGIN { … … 57 51 58 52 # Set our package version. 59 $VERSION = 0.9 2;53 $VERSION = 0.9; 60 54 61 55 # Define inherited modules. 62 use HoneyClient::Agent::Driver ;63 64 @ISA = qw(Exporter HoneyClient::Agent::Driver );56 use HoneyClient::Agent::Driver::Browser; 57 58 @ISA = qw(Exporter HoneyClient::Agent::Driver::Browser); 65 59 66 60 # Symbols to export on request … … 75 69 # Do not simply export all your public functions/methods/constants. 76 70 77 # This allows declaration use HoneyClient::Agent::Driver:: FF':all';71 # This allows declaration use HoneyClient::Agent::Driver::Browser::IE ':all'; 78 72 # If you do not need this, moving things directly into @EXPORT or @EXPORT_OK 79 73 # will save memory. … … 88 82 @EXPORT_OK = ( @{ $EXPORT_TAGS{'all'} } ); 89 83 84 # XXX: Fix this! 85 # Check to make sure our OS is Windows-based. 86 #if ($Config{osname} !~ /^MSWin32$/) { 87 # Carp::croak "Error: " . __PACKAGE__ . " will only run on Win32 platforms!\n"; 88 #} 89 90 90 $SIG{PIPE} = 'IGNORE'; # Do not exit on broken pipes. 
91 91 } 92 92 our ( @EXPORT_OK, $VERSION ); 93 93 94 ############################################################################### 94 #TODO: Rewrite the test module 95 96 =pod 97 98 =begin testing 99 100 =end testing 101 102 =cut 103 104 ####################################################################### 105 106 #TODO: Remove any of these use statements that aren't needed 95 107 96 108 # Include the Global Configuration Processing Library … … 100 112 use DateTime::HiRes; 101 113 114 # Use fractional second sleeping. 115 # TODO: Need unit testing. 116 use Time::HiRes qw(sleep); 117 102 118 # Use Storable Library 103 119 use Storable qw(dclone); 104 120 105 my %PARAMS = ( 106 107 # This is a hashtable of fully qualified URLs 108 # to visit by the browser. Specifically, the 'key' is 109 # the absolute URL and the 'value' is always 1. 110 links_to_visit => {}, 111 112 # This is a hashtable of fully qualified URLs that the 113 # browser has already visited. Specifically, the 114 # 'key' is the absolute URL and the 'value' is a string 115 # representing the date and time of when the link was visited. 116 # 117 # Note: See _getTimestamp() for the corresponding date/time 118 # format. 119 links_visited => {}, 120 121 # This is a hashtable of URLs that the browser has found 122 # during its traversal process, but the browser could not 123 # access the link. 124 # 125 # Links could be added to this list if access requires any type of 126 # authentication, or if the link points to a non-HTTP or HTTPS 127 # resource (i.e., "javascript:doNetDetect()"). 128 # 129 # The 'key' is the absolute URL and the 'value' is a string 130 # representing the date and time of when the link was visited. 131 # 132 # Note: See _getTimestamp() for the corresponding date/time 133 # format. 134 links_ignored => {}, 135 136 # This is a hashtable of fully qualified URLs 137 # that all share a common *hostname*. This hashtable should be 138 # initially empty. 
As the driver extracts and removes new URLs 139 # off the 'links_to_visit' hashtable, driving the browser to each URL, 140 # any *relative* links found are added into this hashtable; any 141 # *external* links
