Changeset 158

Timestamp:
01/10/07 15:39:26 (2 years ago)
Author:
kindlund
Message:

Resolved conflicts between trunk and stephenson-link_scoring branches.

Files:

  • honeyclient/trunk/etc/honeyclient.xml

    r146 r158  
    6969            <!-- TODO: Update this. --> 
    7070            <timeout description="How long the Driver waits during a drive operation, before timing out (in seconds)." default="60"> 
    71                 5  
     71                10  
    7272            </timeout> 
    7373            <Browser> 
     
    8484                    -1 
    8585                </max_relative_links_to_visit> 
     86                <goodwords description="A comma-separated list of good words which will increase the score of links within a webpage." default=""> 
     87                    news,new,big,latest,main,update,sell,free,buy 
     88                </goodwords> 
     89                <badwords description="A comma-separated list of bad words which will decrease the score of links within a webpage." default=""> 
     90                    archive,privacy,legal,disclaim,about,contact,copyright,jobs,careers 
     91                </badwords> 
    8692                <IE> 
    8793                    <!-- HoneyClient::Agent::Driver::IE Options --> 
     
    174180    </Agent> 
    175181    <Manager> 
     182        <!-- TODO: Update this. --> 
     183        <manager_state description="Upon termination, the Manager will attempt to save a complete copy of its state into this file, if specified." default=""> 
     184            Manager.dump 
     185        </manager_state> 
    176186        <!-- TODO: Update this. --> 
    177187        <address description="The IP or hostname that all Manager modules should use, when accepting SOAP requests." default="localhost"> 
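Review note: the goodwords and badwords values added above are stored as comma-separated text inside XML elements, so the raw value may carry surrounding whitespace or newlines, depending on how the configuration reader returns it. Whether getVar() already trims that whitespace is not shown in this changeset. The following is a minimal Perl sketch, under that assumption, of turning such a value into a clean word list; parse_word_list is a hypothetical helper and is not part of HoneyClient.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the changeset): split a comma-separated
# config value into a list of trimmed, non-empty words.
sub parse_word_list {
    my ($raw) = @_;
    return () unless defined $raw;

    my @words;
    foreach my $word (split /,/, $raw) {
        $word =~ s/^\s+|\s+$//g;              # Strip surrounding whitespace.
        push @words, $word if length $word;   # Skip empty entries.
    }
    return @words;
}

# Example value, padded the way it might appear inside the XML element.
my @good_words = parse_word_list("\n    news,new,big,latest \n");
print join('|', @good_words), "\n";
```

Trimming each entry keeps a padded first or last word from silently failing to match link text later on.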
  • honeyclient/trunk/etc/honeyclient_log.conf

    r131 r158  
    6060 
    6161log4perl.rootLogger=INFO, Screen 
     62#log4perl.logger.HoneyClient.Agent.Integrity.Registry=DEBUG, Screen 
    6263# Suppress Parser Debugging Messages 
    6364#log4perl.logger.HoneyClient.Agent.Integrity.Registry.Parser=INFO, Screen 
  • honeyclient/trunk/lib/HoneyClient/Agent/Driver/Browser.pm

    r136 r158  
    1717# as published by the Free Software Foundation, using version 2 
    1818# of the License. 
    19 #  
     19# 
    2020# This program is distributed in the hope that it will be useful, 
    2121# but WITHOUT ANY WARRANTY; without even the implied warranty of 
    2222# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the 
    2323# GNU General Public License for more details. 
    24 #  
     24# 
    2525# You should have received a copy of the GNU General Public License 
    2626# along with this program; if not, write to the Free Software 
     
    5555          'http://www.google.com'  => 1, 
    5656          'http://www.cnn.com'     => 1, 
    57       },  
     57      }, 
    5858  ); 
    5959 
     
    7676      print "Status:\n"; 
    7777      print Dumper($browser->status()); 
    78        
     78 
    7979  } 
    8080 
     
    9494become purposefully infected with new malware. 
    9595 
    96 This module is object-oriented in design, retaining all state information  
     96This module is object-oriented in design, retaining all state information 
    9797within itself for easy access.  A specific browser class must inherit from 
    9898Browser. 
     
    114114external links in a random fashion.  B<However>, this cannot be 
    115115guaranteed, as additional links from the same server may be found 
    116 later, after processing the contents of an external link.  
     116later, after processing the contents of an external link. 
    117117 
    118118As the browser driver navigates the browser to each link, it 
     
    120120visited (see L<links_visited>); when invalid links were found 
    121121(see L<links_ignored>); and when the browser attempted to visit 
    122 a link but the operation timed out (see L<links_timed_out>).  
     122a link but the operation timed out (see L<links_timed_out>). 
    123123By maintaining this internal history, the driver will B<never> 
    124124navigate the browser to the same link twice. 
     
    192192    #if ($Config{osname} !~ /^MSWin32$/) { 
    193193    #    Carp::croak "Error: " . __PACKAGE__ . " will only run on Win32 platforms!\n"; 
    194     #}     
     194    #} 
    195195 
    196196    $SIG{PIPE} = 'IGNORE'; # Do not exit on broken pipes. 
     
    222222# TODO: Need unit testing. 
    223223use HoneyClient::Util::SOAP qw(getClientHandle); 
    224      
     224 
    225225# TODO: Need unit testing. 
    226226use LWP::UserAgent; 
     
    244244B<new()> function, as arguments. 
    245245 
    246 Furthermore, as each parameter is initialized, each can be individually  
     246Furthermore, as each parameter is initialized, each can be individually 
    247247retrieved and set at any time, using the following syntax: 
    248248 
     
    286286resource (i.e., "javascript:doNetDetect()"). 
    287287 
    288 Specifically, each 'key' corresponds to an absolute URL and the  
     288Specifically, each 'key' corresponds to an absolute URL and the 
    289289'value' is a string representing the date and time of when the link 
    290290was visited. 
     
    307307back into the B<links_to_visit> hashtable. 
    308308 
    309 When driving to the next link, this hashtable is exhausted prior  
     309When driving to the next link, this hashtable is exhausted prior 
    310310to the main B<links_to_visit> hashtable.  This allows a 
    311311browser to navigate to all links hosted on the same server, prior 
     
    324324It is updated dynamically, any time $object->getNextLink() is called. 
    325325 
    326 When the browser is ready to drive to the next link, B<next_link_to_visit>  
     326When the browser is ready to drive to the next link, B<next_link_to_visit> 
    327327is checked first.  If that value is B<undef>, then the B<relative_links_to_visit> 
    328328hashtable is checked next.  If that hashtable is empty, then finally the 
     
    340340timing out. 
    341341 
    342 Specifically, each 'key' corresponds to an absolute URL and the  
     342Specifically, each 'key' corresponds to an absolute URL and the 
    343343'value' is a string representing the date and time of when access to 
    344 the resource was attempted.  
     344the resource was attempted. 
    345345 
    346346B<Note>: See internal documentation of _getTimestamp() for the 
     
    383383=cut 
    384384 
    385 my %PARAMS = (  
     385my %PARAMS = ( 
    386386 
    387387    # This is a hashtable of fully qualified URLs 
     
    394394    # 'key' is the absolute URL and the 'value' is a string 
    395395    # representing the date and time of when the link was visited. 
    396     #  
     396    # 
    397397    # Note: See _getTimestamp() for the corresponding date/time 
    398398    # format. 
     
    409409    # The 'key' is the absolute URL and the 'value' is a string 
    410410    # representing the date and time of when the link was visited. 
    411     #  
     411    # 
    412412    # Note: See _getTimestamp() for the corresponding date/time 
    413413    # format. 
     
    416416    # This is a hashtable of fully qualified URLs 
    417417    # that all share a common *hostname*.  This hashtable should be 
    418     # initially empty.  As the driver extracts and removes new URLs  
    419     # off the 'links_to_visit' hashtable, driving the browser to each URL,  
     418    # initially empty.  As the driver extracts and removes new URLs 
     419    # off the 'links_to_visit' hashtable, driving the browser to each URL, 
    420420    # any *relative* links found are added into this hashtable; any 
    421421    # *external* links found are added back into the 'links_to_visit' 
    422422    # hashtable. 
    423423    # 
    424     # When navigating to the next link, this hashtable is exhausted prior  
     424    # When navigating to the next link, this hashtable is exhausted prior 
    425425    # to the main 'links_to_visit' hashtable.  This allows a 
    426426    # browser to navigate to all links hosted on the same server, prior 
    427427    # to contacting a different server. 
    428     #    
     428    # 
    429429    # Specifically, the 'key' is the absolute URL and the 'value' 
    430430    # is always 1. 
     
    446446    # The 'key' is the absolute URL and the 'value' is a string 
    447447    # representing the date and time of when the link was visited. 
    448     #  
     448    # 
    449449    # Note: See _getTimestamp() for the corresponding date/time 
    450450    # format. 
     
    474474    # websites. 
    475475    max_relative_links_to_visit => getVar(name => "max_relative_links_to_visit"), 
    476      
     476 
     477    # Comma-separated string containing the good words and bad words for link scoring purposes 
     478    goodwords => getVar(name => "goodwords", namespace => "HoneyClient::Agent::Driver::Browser"), 
     479    badwords => getVar(name => "badwords", namespace => "HoneyClient::Agent::Driver::Browser"), 
     480 
    477481); 
    478482 
     
    488492# 
    489493# When getting the next link, 'next_link_to_visit' is checked first. 
    490 # If that value is undef, then the 'relative_links_to_visit' hashtable  
    491 # is checked next.  If that hashtable is empty, then finally the  
     494# If that value is undef, then the 'relative_links_to_visit' hashtable 
     495# is checked next.  If that hashtable is empty, then finally the 
    492496# 'links_to_visit' hashtable is checked. 
    493497# 
     
    498502    # Get the object state. 
    499503    my $self = shift; 
    500      
    501     # Set the link to find as undef, initially.  
     504 
     505    # Set the link to find as undef, initially. 
    502506    # We use undef to signify that our URL *_links_to_visit hashtables 
    503507    # are empty.  If we were to use the empty string instead, as our 
     
    537541    } 
    538542 
    539     # Return the next link found.  
     543    # Return the next link found. 
    540544    return $link; 
    541545} 
     
    553557           $dt->hms(':') . "." . 
    554558           $dt->nanosecond(); 
    555 }  
     559} 
    556560 
    557561# Helper function designed to "pop" a key off a given hashtable. 
    558562# When given a hashtable reference, this function will extract a valid key 
    559 # from the hashtable and delete the (key, value) pair from the  
    560 # hashtable. 
    561 
    562 # Note: There is no guaranteed order about how this function picks 
    563 # keys from the hashtable. 
     563# from the hashtable and delete the (key, value) pair from the 
     564# hashtable.  The link with the highest score is returned. 
     565
     566
    564567# 
    565568# Inputs: hashref 
     
    570573    my $hash = shift; 
    571574 
    572     # Get a new key
    573     my @keys = keys(%{$hash}); 
    574     my $key = pop(@keys); 
    575      
     575    # Get the highest score
     576    my @array = sort {$$hash{$b} <=> $$hash{$a}} keys %{$hash}; 
     577    my $topkey = $array[0]; 
     578 
    576579    # Delete the key from the hashtable. 
    577     if (defined($key)) { 
    578         delete $hash->{$key}; 
     580    if (defined($topkey)) { 
     581        delete $hash->{$topkey}; 
    579582    } 
    580583 
    581584    # Return the key found. 
    582     return $key; 
    583 
    584  
    585 # This is the abstract function which actually fetches the web content using 
    586 # a specific browser implementation.  Must be implemented by each browser class. 
    587  
    588 sub getContent { 
    589  
    590 
    591  
    592 # Helper function which parses the HTTP::Response from LWP::UserAgent 
    593 # and returns an array of the links contained in the response 
    594 
    595 # Inputs: HTTP::Response object 
    596 # Outputs: Array containing all href links within the response 
    597  
    598 sub _getAllLinks { 
    599      
    600     my $response = shift; 
    601     my $hostname = shift; 
    602     my @links = (); 
    603     my $thislink; 
    604      
    605     my $html = $response->content; 
    606      
    607     while( $html =~ m/<A HREF=\"(.*?)\"/gi ) { 
    608         $thislink = $1; 
    609  
    610         # For relative links, prepend the hostname 
    611         # TODO:  Probably shouldn't assume the HTTP protocol... 
    612         if ($thislink =~ /^\//) { 
    613             $thislink = "http://" . $hostname . $thislink; 
    614         } 
    615          
    616         push @links, $thislink; 
    617     } 
    618  
    619     #Return the list of absolute links 
    620     return @links; 
     585    return $topkey; 
    621586} 
    622587 
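Review note: the new _pop() body sorts every key on each call and then takes only the first one. As a hedged alternative sketch, a single pass over the hashtable can find the highest-scoring key without a full sort. The name pop_highest is hypothetical and does not appear in this changeset; ties are resolved arbitrarily, since hash order is not guaranteed.

```perl
use strict;
use warnings;

# Hypothetical alternative to the sort-based lookup: remove and return the
# key with the highest numeric value from a hash reference.
sub pop_highest {
    my ($hash) = @_;

    my ($top_key, $top_score);
    while (my ($key, $score) = each %{$hash}) {
        if (!defined $top_key || $score > $top_score) {
            ($top_key, $top_score) = ($key, $score);
        }
    }

    # Delete the winning entry, if the hashtable was not empty.
    delete $hash->{$top_key} if defined $top_key;
    return $top_key;
}

my %links = (
    'http://example.com/news'    => 12,
    'http://example.com/privacy' => -6,
    'http://example.com/'        => 0,
);
print pop_highest(\%links), "\n";
```

For the small per-page link sets this driver handles, the sort in the changeset is likely fast enough; the single-pass form only matters if the hashtables grow large.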
     
    639604    } 
    640605 
    641     # Get the URL supplied.  
     606    # Get the URL supplied. 
    642607    my $url = $arg . "/"; # Tack on an ending delimeter. 
    643608 
     
    652617# Helper function, designed to process all links found at a 
    653618# given URL, once the browser has been driven to that URL 
    654 # and has collected all corresponding links. 
     619# and has collected all corresponding links.  The links are 
     620# sorted in increasing order as determined by their score. 
    655621# 
    656622# When supplied with the array of URL strings, 
     
    666632# - If a link is new and "invalid", then it gets added to 
    667633#   the 'links_ignored' hashtable. 
    668 #    
     634# 
    669635# - If a link is old and "invalid", then it gets 
    670636#   ignored. 
     
    673639# 
    674640# - If a link is new and "valid", then we check to see if 
    675 #   the referring URL's hostname[:port] and the link's  
     641#   the referring URL's hostname[:port] and the link's 
    676642#   hostname[:port] match.  If they match, then the link 
    677643#   is added to the 'relative_links_to_visit' hash. 
     
    681647# Inputs: HoneyClient::Agent::Driver::Browser object, 
    682648#         hostname[:port] of referring URL, 
    683 #         array of URL strings 
     649#         hash of URL strings and scores, the url is the key 
    684650# Outputs: HoneyClient::Agent::Driver::Browser object 
    685651sub _processLinks { 
     
    688654    my $self = shift; 
    689655 
    690     # Get the referrer and the corresponding array of links. 
    691     my ($referrer, @links) = @_; 
    692      
    693     foreach my $url (@links) { 
     656    # Get the referrer and the corresponding arrays of links and scores. 
     657    my ($referrer, %links) = @_; 
     658 
     659    foreach my $url (keys %links) { 
     660        my $score = $links{$url}; 
    694661 
    695662        # Skip over any undefined links. 
     
    710677        # Link is new and valid; go ahead and add to the appropriate 
    711678        # hashtable. 
    712         
     679 
    713680        # Extract the core hostname of the URL to visit. 
    714681        # If $url is undef, then this function will return an empty string. 
    715682        my $hostname = _extractHostname($url); 
    716        
     683 
    717684        # If the referrer's hostname and the URL's hostname match... 
    718685        if ($hostname eq $referrer) { 
    719686            # Then add the URL to the 'relative_links_to_visit' hashtable, 
    720687            # since we're visiting links that share the same hostname. 
    721             $self->relative_links_to_visit->{$url} = 1; 
     688            $self->relative_links_to_visit->{$url} = $score; 
    722689        } else { 
    723690            # Else, add the URL to the 'links_to_visit' hashtable, 
    724691            # since we're visiting links that do NOT share the same hostname. 
    725             $self->links_to_visit->{$url} = 1; 
     692            $self->links_to_visit->{$url} = $score; 
    726693        } 
    727694    } 
    728      
     695 
    729696    # Return the modified object state. 
    730697    return $self; 
     
    732699 
    733700# Helper function designed to validate supplied links. 
    734 #  
     701# 
    735702# When a link is provided as an argument: 
    736703# 
     
    742709#    already exists within the history, then it is considered 
    743710#    invalid. 
    744 #  
     711# 
    745712# If the link is valid, then it is returned.  Otherwise, undef 
    746713# is returned for all invalid links.  Also, all invalid links 
     
    751718# Outputs: url if valid, empty string if invalid 
    752719sub _validateLink { 
    753      
     720 
    754721    # Get the object state. 
    755722    my $self = shift; 
     
    793760        (scalar(%{$self->links_ignored}) and 
    794761         exists($self->links_ignored->{$link}))) { 
    795          
     762 
    796763        # Link is valid but already visited, so return undef. 
    797764        return; 
     
    819786    my $stub = getClientHandle(address   => 'localhost', 
    820787                               namespace => 'HoneyClient::Agent'); 
    821             
     788 
    822789    my $som = $stub->killProcess($self->process_name); 
    823790 
     
    838805of these methods were implementations of the parent Driver interface. 
    839806 
    840 As such, the following code descriptions pertain to this particular  
     807As such, the following code descriptions pertain to this particular 
    841808Driver implementation.  For further information about the generic 
    842809Driver interface, see the L<HoneyClient::Agent::Driver> documentation. 
     
    852819 B<$param> is an optional parameter variable. 
    853820 B<$value> is $param's corresponding value. 
    854   
     821 
    855822Note: If any $param(s) are supplied, then an equal number of 
    856823corresponding $value(s) B<must> also be specified. 
     
    941908 
    942909B<Warning>: This method will B<croak> if the Browser driver object is B<unable> 
    943 to navigate to a new link, because its list of links to visit is empty.  
     910to navigate to a new link, because its list of links to visit is empty. 
    944911 
    945912=back 
     
    984951    # before registering attempt as a failure. 
    985952    my $timeout : shared = $self->timeout(); 
    986      
     953 
    987954    # Use LWP::UserAgent to get the desired $args{'url'} and associated content 
    988     my @links = undef;  
    989  
    990     # TODO: Analyze all the options LWP::UserAgent provides, in case we've  
     955    # TODO: Analyze all the options LWP::UserAgent provides, in case we've 
    991956    # missed something useful. 
    992957    # Create a new user agent. 
    993958    my $ua = LWP::UserAgent->new( 
    994959        timeout           => $timeout,            # Fixed timeout. 
    995         max_redirect      => 0,                   # Ignore redirects. 
     960        #max_redirect      => 0,                   # Ignore redirects. 
    996961        protocols_allowed => [ 'http', 'https' ], # Allow only web protocols. 
     962        max_size          => 1*1024*1024,         # Don't get larger than 1MB for testing 
    997963    ); 
    998964 
     965    # TODO: Look at the content type "text/html" on the response, to make this 
     966    # a little better. 
    999967    # TODO: Set the default headers, to mimic a regular browser (if need be). 
    1000968    # I'm thinking this could be set by IE/FF and passed via $args{'default_headers'} 
    1001969    # as a HTTP::Headers object. 
    1002  
    1003     # TODO: Look at the content type "text/html" on the response, to make this 
    1004     # a little better. 
    1005970    $ua->default_header( 'Accept' => 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5' ); 
    1006     $ua->max_size(1*1024*1024); # Don't get values larger than 1MB for testing 
    1007     $ua->timeout($timeout); 
    1008  
    1009     # XXX: This is old code; delete eventually. 
    1010 #   my $response = $ua->get($args{'url'}); 
    1011  
    1012     # Get the links 
    1013 #    @links = _getAllLinks($response, _extractHostname($args{'url'})); 
    1014  
    1015     # Make the parser.  Unfortunately, we don't know the base yet 
    1016     # (it might be diffent from $url) 
    1017     #my $parser = HTML::LinkExtor->new(\&extractLinks); 
    1018     my $parser = HTML::LinkExtor->new(); 
    1019971 
    1020972    my $response = $ua->request( 
     
    1025977                                'Accept' => 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5', 
    1026978                            ), 
    1027                         ), 
    1028                         sub { $parser->parse($_[0]) }, 
     979                        ) 
    1029980    ); 
    1030      
    1031     # Extract only the <a href ...> links, for now. 
    1032     # TODO: Handle other link types. 
    1033     foreach my $entry ($parser->links) { 
    1034         if ($entry->[0] eq 'a') { 
    1035             push(@links, $entry->[2]); 
    1036         } 
    1037     } 
    1038  
    1039     # Expand all relative links found to absolute ones. 
     981 
     982    # Get the base url from the response 
    1040983    my $base = $response->base; 
    1041     @links = map { $_ = url($_, $base)->abs; } @links; 
     984    my $content = $response->content; 
    1042985 
    1043986    # Get the current time. 
    1044987    my $timestamp = _getTimestamp(); 
    1045988 
     989    # Score the new links based on their surrounding HTML context 
     990    # If %scored_links is emtpy upon return, there are no links 
     991    # and we will not perform any of the following code 
     992    my %scored_links; 
     993    if ($content) { 
     994        # Extract the good word and bad word lists into arrays; 
     995        my @good_words = split /,/, $self->goodwords; 
     996        my @bad_words = split /,/, $self->badwords; 
     997        my %wordlists = ('good' => \@good_words, 'bad' => \@bad_words); 
     998        # Call the link scoring function 
     999        %scored_links = _scoreLinks($base, $content, %wordlists); 
     1000    } 
     1001 
    10461002    # Check to see if the request timed out. 
    10471003    # TODO: Need better error detection. 
    1048     if (!@links) { 
     1004    if (!%scored_links) { 
    10491005        $self->links_timed_out->{$args{'url'}} = $timestamp; 
    10501006 
     
    10591015        $self->links_visited->{$args{'url'}} = $timestamp; 
    10601016 
    1061         # Get all links found on this page. 
     1017        # Add all links found on this page to our sorted queues. 
    10621018        # This function modifies the $self object internally and its 
    10631019        # returned content does not need to be checked. 
    1064         $self->_processLinks(_extractHostname($args{'url'}), @links); 
     1020        $self->_processLinks(_extractHostname($args{'url'}), %scored_links); 
    10651021    } 
    10661022 
     
    10751031            $self->max_relative_links_to_visit; 
    10761032    } elsif ($self->_remaining_number_of_relative_links_to_visit > 1) { 
    1077              
     1033 
    10781034        # The counter is positive, so decrement it. 
    10791035        $self->{_remaining_number_of_relative_links_to_visit}--; 
     
    11121068 
    11131069sub getNextLink { 
    1114      
     1070 
    11151071    # Get the object state. 
    11161072    my $self = shift; 
    1117      
     1073 
    11181074    # Sanity check: Make sure we've been fed an object. 
    11191075    unless (ref($self)) { 
     
    11221078    } 
    11231079 
    1124     # Set the link to find as undef, initially.  
     1080    # Set the link to find as undef, initially. 
    11251081    my $link = undef; 
    11261082 
     
    11481104 
    11491105Specifically, the returned data is a reference to a hashtable, containing 
    1150 detailed information about which resources, hostnames, IPs, protocols, and  
     1106detailed information about which resources, hostnames, IPs, protocols, and 
    11511107ports that the browser will contact upon the next drive() iteration. 
    11521108 
     
    11541110 
    11551111  $hashref = { 
    1156    
     1112 
    11571113      # The set of servers that the driver will contact upon 
    11581114      # the next drive() operation. 
     
    11691125              'udp' => [ 53, 123 ], 
    11701126          }, 
    1171   
     1127 
    11721128          # Or, more generically: 
    11731129          'hostname_or_IP' => { 
     
    11831139  }; 
    11841140 
    1185 B<Note>: For this implementation of the Driver interface,  
     1141B<Note>: For this implementation of the Driver interface, 
    11861142unless getNextLink() returns undef, the returned hashtable 
    11871143from this method will B<always> contain only B<one> hostname 
     
    12111167    # Get the object state. 
    12121168    my $self = shift; 
    1213      
     1169 
    12141170    # Sanity check: Make sure we've been fed an object. 
    12151171    unless (ref($self)) { 
     
    12531209        } 
    12541210    } 
    1255     
    1256     # Finally, construct the corresponding hash reference.  
     1211 
     1212    # Finally, construct the corresponding hash reference. 
    12571213    $nextSite = { 
    12581214        targets => { 
     
    12711227=pod 
    12721228 
     1229=head2 _scoreLinks() 
     1230 
     1231=over 4 
     1232 
     1233The _scoreLinks helper function takes a scalar which is the base url for 
     1234the web page, a scalar which holds the content of the page (HTML), and a 
     1235hash which contain the good and bad words. 
     1236 
     1237This function will calculate the "popularity" scores of the links. 
     1238The function returns a hash which is keyed on the _absolute_ url 
     1239and contains the value of the score. 
     1240 
     1241I<Output>: The populated %scored_links hash if the page is not empty. An empty 
     1242hash otherwise. 
     1243 
     1244For example, if your raw HTML content is $content, and the base url is 
     1245$base you would use the following call to this function. 
     1246 
     1247if ($content) { 
     1248    # Extract the good word and bad word lists into arrays; 
     1249    my @good_words = split /,/, $self->goodwords; 
     1250    my @bad_words = split /,/, $self->badwords; 
     1251    my %wordlists = ('good' => \@good_words, 'bad' => \@bad_words); 
     1252    # Call the link scoring function 
     1253    %scored_links = _scoreLinks($base, $content, %wordlists); 
     1254} 
     1255 
     1256=back 
     1257 
     1258=begin testing 
     1259 
     1260# XXX: Test this. 
     12611; 
     1262 
     1263=end testing 
     1264 
     1265=cut 
     1266 
     1267sub _scoreLinks { 
     1268    my ($base, $content, %wordlists) = @_; 
     1269    my @good_words = @{$wordlists{good}}; 
     1270    my @bad_words = @{$wordlists{bad}}; 
     1271    my %links = (); 
     1272    my $url; 
     1273 
     1274    # If the page is blank, there is no point trying to parse it 
     1275    if (!$content) { 
     1276        return %links; 
     1277    } 
     1278 
     1279    # Begin to scour the HTML content for <a> tags, parsing attributes and text 
     1280    while ($content =~ m{<a\b([^>]+)>(.*?)</a>}ig) { 
     1281        my $attr = $1; 
     1282        my $text = $2; 
     1283        my $score = 0; 
     1284 
     1285        # Look for the link in the attribute data 
     1286        if ($attr =~ m{ 
     1287                        \b HREF 
     1288                        \s* = \s* 
     1289                        (?: 
     1290                          "([^"]*)" 
     1291                          | 
     1292                          '([^']*)' 
     1293                          | 
     1294                          {[^'">\s]+} 
     1295                        ) 
     1296                     }xi) 
     1297         { 
     1298            $url = $+; 
     1299 
     1300            # Some programmatic values 
     1301            my $min_text_length = 6; 
     1302            my $max_text_length = 20; 
     1303            my $image_bonus = 50; 
     1304            my $default_display_size = 1024 * 768; 
     1305            my $word_value = 6; 
     1306 
     1307            # We have to make this an absolute url (if it's not) 
     1308            # before using it as a key in the %links hash 
     1309            $url = url($url, $base)->abs; 
     1310 
     1311            # The link must be an HREF and be a http(s) link 
     1312            if ($url =~ /^http/i) { 
     1313                # Begin scoring the link based on surrounding context 
     1314                # This can be improved/customized in many different ways. 
     1315                # Our implementation is only one possible way to assign 
     1316                # values to the context elements. 
     1317 
     1318                # Score length of link text. These are arbitrary lengths, but 
     1319                # the reasoning is that really short text links are not too 
     1320                # visible (we are excluding image links from this criteria), 
     1321                # and really long text would be weird or abnormal to the human 
     1322                # web surfer. 
     1323                if ($text !~ /img /i && 
     1324                    length($text) > $min_text_length && 
     1325                    length($text) < $max_text_length) { 
     1326                    $score += length($text); 
     1327                } 
     1328 
     1329                # Score the image content, if it exists 
     1330                # We score the size proportional to a 1024 X 768 display 
     1331                # Image bonus 
     1332                if ($text =~ /img /i) { 
     1333                    $score += $image_bonus; 
     1334                } 
     1335                # Score image size 
     1336                my $width; 
     1337                my $height; 
     1338                if ($text =~ /\b WIDTH\s*=\s*.(\d+)/xi) { 
     1339                    $width = $1; 
     1340                } 
     1341                if ($text =~ /\b HEIGHT\s*=\s*.(\d+)/xi) { 
     1342                    $height = $1; 
     1343                } 
     1344                if ($width && $height) { 
     1345                    $score += int(($width*$height)/($default_display_size)*100); 
     1346                } 
     1347                elsif ($width) { 
     1348                    $score += int($width/10); 
     1349                } 
     1350                elsif ($height) { 
     1351                    $score += int($height/10); 
     1352                } 
     1353 
     1354                # Good word bonus 
     1355                foreach (@good_words) { 
     1356                    if ($text =~ /$_/i) { 
     1357                        $score += $word_value; 
     1358                    } 
     1359                } 
     1360 
     1361                # Bad word penalty 
     1362                foreach (@bad_words) { 
     1363                    if ($text =~ /$_/i) { 
     1364                        $score -= $word_value; 
     1365                    } 
     1366                } 
     1367 
     1368                # Put it in the return value hash and zero the score 
     1369                $links{$url} = $score; 
     1370                $url = undef; 
     1371            } 
     1372        } 
     1373    } 
     1374    return %links; 
     1375} 
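
Reviewer sketch (not part of the changeset): the scoring heuristics added above can be exercised in isolation with something like the following. The sub name score_link_text and the three constant values are hypothetical, since the diff does not show how $image_bonus, $word_value, or $default_display_size are initialised. Two deliberate deviations from the committed code: the quote before the WIDTH/HEIGHT digits is made optional, so an unquoted value such as width=100 does not lose its first digit, and configured words are wrapped in \Q...\E so they match literally.

```perl
use strict;
use warnings;

# Hypothetical values; the changeset does not show the real ones.
my $IMAGE_BONUS          = 50;
my $WORD_VALUE           = 10;
my $DEFAULT_DISPLAY_SIZE = 1024 * 768;

# Score one link's HTML fragment with the same heuristics as the diff:
# image bonus, image size relative to the display, good/bad word adjustments.
sub score_link_text {
    my ($text, $good_words, $bad_words) = @_;
    my $score = 0;

    # Flat bonus when the link wraps an image.
    $score += $IMAGE_BONUS if $text =~ /<img\b/i;

    # Optional quote before the digits, so width=100 and width="100" both parse.
    my ($width)  = $text =~ /\bwidth\s*=\s*["']?(\d+)/i;
    my ($height) = $text =~ /\bheight\s*=\s*["']?(\d+)/i;

    if ($width && $height) {
        # Area as a percentage of the reference display.
        $score += int(($width * $height) / $DEFAULT_DISPLAY_SIZE * 100);
    }
    elsif ($width)  { $score += int($width / 10); }
    elsif ($height) { $score += int($height / 10); }

    # \Q...\E keeps configured words from being read as regex syntax.
    foreach my $word (@$good_words) {
        $score += $WORD_VALUE if $text =~ /\Q$word\E/i;
    }
    foreach my $word (@$bad_words) {
        $score -= $WORD_VALUE if $text =~ /\Q$word\E/i;
    }
    return $score;
}
```

Matching is by substring, as in the committed loops, so a configured word such as "new" also matches inside "news" and both bonuses apply.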
     1376 
     1377=pod 
     1378 
    12731379=head2 $object->isFinished() 
    12741380 
     
    13061412    # Get the object state. 
    13071413    my $self = shift; 
    1308      
     1414 
    13091415    # Sanity check: Make sure we've been fed an object. 
    13101416    unless (ref($self)) { 
     
    13181424              scalar(%{$self->relative_links_to_visit}) or 
    13191425              scalar(%{$self->links_to_visit}))) 
    1320                              
     1426 
    13211427} 
    13221428 
     
    13411447      'relative_links_remaining' =>       10, # Number of URLs left to 
    13421448                                              # process, at a given site. 
    1343       'links_remaining'          =>       56, # Number of URLs left to  
     1449      'links_remaining'          =>       56, # Number of URLs left to 
    13441450                                              # process, for all sites. 
    13451451      'links_processed'          =>       44, # Number of URLs processed. 
     
    13661472 
    13671473sub status { 
    1368      
     1474 
    13691475    # Get the object state. 
    13701476    my $self = shift; 
    1371      
     1477 
    13721478    # Sanity check: Make sure we've been fed an object. 
    13731479    unless (ref($self)) { 
     
    13841490                                 scalar(keys(%{$self->links_ignored})); 
    13851491 
    1386     # Set the number of relative links to process.  
     1492    # Set the number of relative links to process. 
    13871493    $status->{relative_links_remaining} = scalar(keys(%{$self->relative_links_to_visit})); 
    1388      
     1494 
    13891495    # Figure out how many total links are left to process. 
    13901496    $status->{links_remaining} = scalar(keys(%{$self->relative_links_to_visit})) + 
     
    13921498 
    13931499    # Set the total number of links in the object's state. 
    1394     $status->{links_total} = $status->{links_processed} +  
     1500    $status->{links_total} = $status->{links_processed} + 
    13951501                             $status->{links_remaining}; 
    13961502 
     
    14001506        $status->{links_total} = 1; 
    14011507    } 
    1402     $status->{percent_complete} = sprintf("%.2f%%", (($status->{links_processed} + 0.00) /  
     1508    $status->{percent_complete} = sprintf("%.2f%%", (($status->{links_processed} + 0.00) / 
    14031509                                                     ($status->{links_total} + 0.00)) * 100.00); 
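
Reviewer sketch (not part of the changeset): the percent_complete arithmetic in status() reduces to the helper below. The sub name percent_complete is hypothetical, and the guard condition is an assumption, because the diff shows only the assignment of 1 to links_total and not the test that triggers it.

```perl
use strict;
use warnings;

# Format the share of processed links as a percentage string.
sub percent_complete {
    my ($processed, $remaining) = @_;
    my $total = $processed + $remaining;

    # Assumed guard: avoid dividing by zero when no links are known yet.
    $total = 1 if $total <= 0;

    return sprintf("%.2f%%", ($processed / $total) * 100);
}

# Values from the status example in the POD: 44 processed, 56 remaining.
print percent_complete(44, 56), "\n";
```

Because newly discovered links enlarge the total, the value can fall between iterations: 30 processed with 70 remaining gives 30.00%, and the same 30 processed with 170 remaining gives 15.00%.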
    14041510 
     
    14411547$object->drive() iteration. 
    14421548 
    1443 For example, if at one given point, the status of B<percent_complete>  
    1444 is 30% and then this value drops to 15% upon another iteration, then  
    1445 this means that the total number of links to drive to has greatly  
     1549For example, if at one given point, the status of B<percent_complete> 
     1550is 30% and then this value drops to 15% upon another iteration, then 
     1551this means that the total number of links to drive to has greatly 
    14461552increased. 
    14471553 
     
    14781584as published by the Free Software Foundation, using version 2 
    14791585of the License. 
    1480   
     1586 
    14811587This program is distributed in the hope that it will be useful, 
    14821588but WITHOUT ANY WARRANTY; without even the implied warranty of 
    14831589MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the 
    14841590GNU General Public License for more details. 
    1485   
     1591 
    14861592You should have received a copy of the GNU General Public License 
    14871593along with this program; if not, write to the Free Software 
  • honeyclient/trunk/lib/HoneyClient/Agent/Driver/Browser/FF.pm

    r136 r158  
    3434use strict; 
    3535use warnings; 
     36use Config; 
    3637use Carp (); 
    37 use Config; 
    38 use Win32::Job;          #For starting browser 
    39 use HTML::LinkExtor;     #For extracting links from HTML 
    40 use HTML::HeadParser;    #For extracting the meta w/ URL that LinkExtor misses 
    41 use LWP::UserAgent;      #Perl-based "browser" 
    42 use URI;                 #For absolutizing relative URLs 
    43 #use Data::Dumper;       #For Debugging 
    4438 
    4539# Traps signals, allowing END: blocks to perform cleanup. 
    4640use sigtrap qw(die untrapped normal-signals error-signals); 
    4741 
    48 ############################################################################### 
    49 # Module Initialization                                                      
    50 ############################################################################### 
     42####################################################################### 
     43# Module Initialization                                               
     44####################################################################### 
    5145 
    5246BEGIN { 
     
    5751 
    5852    # Set our package version. 
    59     $VERSION = 0.92;
     53    $VERSION = 0.9;
    6054 
    6155    # Define inherited modules. 
    62     use HoneyClient::Agent::Driver;
    63  
    64     @ISA = qw(Exporter HoneyClient::Agent::Driver); 
     56    use HoneyClient::Agent::Driver::Browser;
     57 
     58    @ISA = qw(Exporter HoneyClient::Agent::Driver::Browser); 
    6559 
    6660    # Symbols to export on request 
     
    7569    # Do not simply export all your public functions/methods/constants. 
    7670 
    77     # This allows declaration use HoneyClient::Agent::Driver::FF ':all'; 
     71    # This allows declaration use HoneyClient::Agent::Driver::Browser::IE ':all'; 
    7872    # If you do not need this, moving things directly into @EXPORT or @EXPORT_OK 
    7973    # will save memory. 
     
    8882    @EXPORT_OK = ( @{ $EXPORT_TAGS{'all'} } ); 
    8983 
     84# XXX: Fix this! 
     85# Check to make sure our OS is Windows-based. 
     86#if ($Config{osname} !~ /^MSWin32$/) { 
     87#    Carp::croak "Error: " . __PACKAGE__ . " will only run on Win32 platforms!\n"; 
     88#} 
     89 
    9090    $SIG{PIPE} = 'IGNORE';    # Do not exit on broken pipes. 
    9191} 
    9292our ( @EXPORT_OK, $VERSION ); 
    9393 
    94 ############################################################################### 
     94#TODO: Rewrite the test module 
     95 
     96=pod 
     97 
     98=begin testing 
     99 
     100=end testing 
     101 
     102=cut 
     103 
     104####################################################################### 
     105 
     106#TODO: Remove any of these use statements that aren't needed 
    95107 
    96108# Include the Global Configuration Processing Library 
     
    100112use DateTime::HiRes; 
    101113 
     114# Use fractional second sleeping. 
     115# TODO: Need unit testing. 
     116use Time::HiRes qw(sleep); 
     117 
    102118# Use Storable Library 
    103119use Storable qw(dclone); 
    104120 
    105 my %PARAMS = ( 
    106  
    107     # This is a hashtable of fully qualified URLs 
    108     # to visit by the browser.  Specifically, the 'key' is 
    109     # the absolute URL and the 'value' is always 1. 
    110     links_to_visit => {}, 
    111  
    112     # This is a hashtable of fully qualified URLs that the 
    113     # browser has already visited.  Specifically, the 
    114     # 'key' is the absolute URL and the 'value' is a string 
    115     # representing the date and time of when the link was visited. 
    116     # 
    117     # Note: See _getTimestamp() for the corresponding date/time 
    118     # format. 
    119     links_visited => {}, 
    120  
    121     # This is a hashtable of URLs that the browser has found 
    122     # during its traversal process, but the browser could not 
    123     # access the link. 
    124     # 
    125     # Links could be added to this list if access requires any type of 
    126     # authentication, or if the link points to a non-HTTP or HTTPS 
    127     # resource (i.e., "javascript:doNetDetect()"). 
    128     # 
    129     # The 'key' is the absolute URL and the 'value' is a string 
    130     # representing the date and time of when the link was visited. 
    131     # 
    132     # Note: See _getTimestamp() for the corresponding date/time 
    133     # format. 
    134     links_ignored => {}, 
    135  
    136     # This is a hashtable of fully qualified URLs 
    137     # that all share a common *hostname*.  This hashtable should be 
    138     # initially empty.  As the driver extracts and removes new URLs 
    139     # off the 'links_to_visit' hashtable, driving the browser to each URL, 
    140     # any *relative* links found are added into this hashtable; any 
    141     # *external* links found are added back into the 'links_to_visit' 
    142     # hashtable. 
    143     # 
    144     # When navigating to the next link, this hashtable is exhausted prior 
    145     # to the main 'links_to_visit' hashtable.  This allows a 
    146     # browser to navigate to all links hosted on the same server, prior 
    147     # to contacting a different server. 
    148     # 
    149     # Specifically, the 'key' is the absolute URL and the 'value' 
    150     # is always 1. 
    151