Changeset 61

Timestamp:
12/01/06 10:29:21
Author:
stephenson
Message:

Abstracting the goodwords and badwords for link scoring into the config xml file.
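
A minimal sketch of what this abstraction might look like, assuming the word lists arrive from the config file as comma-separated strings. The helper names parse_word_list and score_text, the sample word lists, and the weights are hypothetical and do not come from this changeset; the real module would presumably obtain the lists through the project's config accessor (getVar) rather than from literals.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for values that would be read from the config XML file.
my $goodwords_cfg = "news, download, free";
my $badwords_cfg  = "logout, privacy, terms";

# Split a comma-separated config string into a list of lowercase,
# whitespace-trimmed words.  Empty entries are dropped.
sub parse_word_list {
    my ($raw) = @_;
    return () unless defined $raw;
    my @words;
    foreach my $piece (split /,/, lc($raw)) {
        $piece =~ s/^\s+|\s+$//g;
        push @words, $piece if length $piece;
    }
    return @words;
}

# Score a link's text: add $bonus for each good word present and
# subtract $penalty for each bad word present (whole-word, case-insensitive).
sub score_text {
    my ($text, $good, $bad, $bonus, $penalty) = @_;
    my $score = 0;
    foreach my $word (@$good) {
        $score += $bonus if $text =~ /\b\Q$word\E\b/i;
    }
    foreach my $word (@$bad) {
        $score -= $penalty if $text =~ /\b\Q$word\E\b/i;
    }
    return $score;
}

my @good = parse_word_list($goodwords_cfg);
my @bad  = parse_word_list($badwords_cfg);

print score_text("Free download here", \@good, \@bad, 10, 10), "\n";  # prints 20
print score_text("Privacy policy",     \@good, \@bad, 10, 10), "\n";  # prints -10
```

Keeping the lists in configuration means the scoring vocabulary can be tuned without touching Browser.pm; the weights here are illustrative only.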

Files:

  • honeyclient/branches/exp/stephenson-link_scoring/lib/HoneyClient/Agent/Driver/Browser.pm

    r41 r61  
    1717# as published by the Free Software Foundation, using version 2 
    1818# of the License. 
    19 #  
     19# 
    2020# This program is distributed in the hope that it will be useful, 
    2121# but WITHOUT ANY WARRANTY; without even the implied warranty of 
    2222# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the 
    2323# GNU General Public License for more details. 
    24 #  
     24# 
    2525# You should have received a copy of the GNU General Public License 
    2626# along with this program; if not, write to the Free Software 
     
    5555          'http://www.google.com'  => 1, 
    5656          'http://www.cnn.com'     => 1, 
    57       },  
     57      }, 
    5858  ); 
    5959 
     
    7676      print "Status:\n"; 
    7777      print Dumper($browser->status()); 
    78        
     78 
    7979  } 
    8080 
     
    9090 
    9191This library allows the Agent module to drive an instance of any browser, 
    92 running inside the HoneyClient VM.  The purpose  
     92running inside the HoneyClient VM.  The purpose 
    9393of this module is to programmatically navigate the browser to different 
    9494websites, in order to become purposefully infected with new malware. 
    9595The module implements the logic necessary to decide the order in which 
    96 the  
    97  
    98 This module is object-oriented in design, retaining all state information  
     96the 
     97 
     98This module is object-oriented in design, retaining all state information 
    9999within itself for easy access.  A specific browser class must inherit from 
    100100Browser. 
     
    116116external links in a random fashion.  B<However>, this cannot be 
    117117guaranteed, as additional links from the same server may be found 
    118 later, after processing the contents of an external link.  
     118later, after processing the contents of an external link. 
    119119 
    120120As the browser driver navigates the browser to each link, it 
     
    122122visited (see L<links_visited>); when invalid links were found 
    123123(see L<links_ignored>); and when the browser attempted to visit 
    124 a link but the operation timed out (see L<links_timed_out>).  
     124a link but the operation timed out (see L<links_timed_out>). 
    125125By maintaining this internal history, the driver will B<never> 
    126126navigate the browser to the same link twice. 
     
    194194    #if ($Config{osname} !~ /^MSWin32$/) { 
    195195    #    Carp::croak "Error: " . __PACKAGE__ . " will only run on Win32 platforms!\n"; 
    196     #}     
     196    #} 
    197197 
    198198    $SIG{PIPE} = 'IGNORE'; # Do not exit on broken pipes. 
     
    223223# TODO: Need unit testing. 
    224224use HoneyClient::Util::SOAP qw(getClientHandle); 
    225      
     225 
    226226# TODO: Need unit testing. 
    227227use LWP::UserAgent; 
     
    245245B<new()> function, as arguments. 
    246246 
    247 Furthermore, as each parameter is initialized, each can be individually  
     247Furthermore, as each parameter is initialized, each can be individually 
    248248retrieved and set at any time, using the following syntax: 
    249249 
     
    287287resource (i.e., "javascript:doNetDetect()"). 
    288288 
    289 Specifically, each 'key' corresponds to an absolute URL and the  
     289Specifically, each 'key' corresponds to an absolute URL and the 
    290290'value' is a string representing the date and time of when the link 
    291291was visited. 
     
    308308back into the B<links_to_visit> hashtable. 
    309309 
    310 When driving to the next link, this hashtable is exhausted prior  
     310When driving to the next link, this hashtable is exhausted prior 
    311311to the main B<links_to_visit> hashtable.  This allows a 
    312312browser to navigate to all links hosted on the same server, prior 
     
    325325It is updated dynamically, any time $object->getNextLink() is called. 
    326326 
    327 When the browser is ready to drive to the next link, B<next_link_to_visit>  
     327When the browser is ready to drive to the next link, B<next_link_to_visit> 
    328328is checked first.  If that value is B<undef>, then the B<relative_links_to_visit> 
    329329hashtable is checked next.  If that hashtable is empty, then finally the 
     
    341341timing out. 
    342342 
    343 Specifically, each 'key' corresponds to an absolute URL and the  
     343Specifically, each 'key' corresponds to an absolute URL and the 
    344344'value' is a string representing the date and time of when access to 
    345 the resource was attempted.  
     345the resource was attempted. 
    346346 
    347347B<Note>: See internal documentation of _getTimestamp() for the 
     
    385385=cut 
    386386 
    387 my %PARAMS = (  
     387my %PARAMS = ( 
    388388 
    389389    # This is a hashtable of fully qualified URLs 
     
    396396    # 'key' is the absolute URL and the 'value' is a string 
    397397    # representing the date and time of when the link was visited. 
    398     #  
     398    # 
    399399    # Note: See _getTimestamp() for the corresponding date/time 
    400400    # format. 
     
    411411    # The 'key' is the absolute URL and the 'value' is a string 
    412412    # representing the date and time of when the link was visited. 
    413     #  
     413    # 
    414414    # Note: See _getTimestamp() for the corresponding date/time 
    415415    # format. 
     
    418418    # This is a hashtable of fully qualified URLs 
    419419    # that all share a common *hostname*.  This hashtable should be 
    420     # initially empty.  As the driver extracts and removes new URLs  
    421     # off the 'links_to_visit' hashtable, driving the browser to each URL,  
     420    # initially empty.  As the driver extracts and removes new URLs 
     421    # off the 'links_to_visit' hashtable, driving the browser to each URL, 
    422422    # any *relative* links found are added into this hashtable; any 
    423423    # *external* links found are added back into the 'links_to_visit' 
    424424    # hashtable. 
    425425    # 
    426     # When navigating to the next link, this hashtable is exhausted prior  
     426    # When navigating to the next link, this hashtable is exhausted prior 
    427427    # to the main 'links_to_visit' hashtable.  This allows a 
    428428    # browser to navigate to all links hosted on the same server, prior 
    429429    # to contacting a different server. 
    430     #    
     430    # 
    431431    # Specifically, the 'key' is the absolute URL and the 'value' 
    432432    # is always 1. 
     
    448448    # The 'key' is the absolute URL and the 'value' is a string 
    449449    # representing the date and time of when the link was visited. 
    450     #  
     450    # 
    451451    # Note: See _getTimestamp() for the corresponding date/time 
    452452    # format. 
     
    477477    # websites. 
    478478    max_relative_links_to_visit => getVar(name => "max_relative_links_to_visit"), 
    479      
     479 
    480480); 
    481481 
     
    491491# 
    492492# When getting the next link, 'next_link_to_visit' is checked first. 
    493 # If that value is undef, then the 'relative_links_to_visit' hashtable  
    494 # is checked next.  If that hashtable is empty, then finally the  
     493# If that value is undef, then the 'relative_links_to_visit' hashtable 
     494# is checked next.  If that hashtable is empty, then finally the 
    495495# 'links_to_visit' hashtable is checked. 
    496496# 
     
    501501    # Get the object state. 
    502502    my $self = shift; 
    503      
    504     # Set the link to find as undef, initially.  
     503 
     504    # Set the link to find as undef, initially. 
    505505    # We use undef to signify that our URL *_links_to_visit hashtables 
    506506    # are empty.  If we were to use the empty string instead, as our 
     
    540540    } 
    541541 
    542     # Return the next link found.  
     542    # Return the next link found. 
    543543    return $link; 
    544544} 
     
    556556           $dt->hms(':') . "." . 
    557557           $dt->nanosecond(); 
    558 }  
     558} 
    559559 
    560560# Helper function designed to "pop" a key off a given hashtable. 
    561561# When given a hashtable reference, this function will extract a valid key 
    562 # from the hashtable and delete the (key, value) pair from the  
     562# from the hashtable and delete the (key, value) pair from the 
    563563# hashtable.  The link with the highest score is returned. 
    564564# 
    565 #  
     565# 
    566566# 
    567567# Inputs: hashref 
     
    575575    my @array = sort {$$hash{$b} <=> $$hash{$a}} keys %{$hash}; 
    576576    my $topkey = $array[0]; 
    577      
     577 
    578578    # Delete the key from the hashtable. 
    579579    if (defined($topkey)) { 
     
    603603    } 
    604604 
    605     # Get the URL supplied.  
     605    # Get the URL supplied. 
    606606    my $url = $arg . "/"; # Tack on an ending delimiter. 
    607607 
     
    631631# - If a link is new and "invalid", then it gets added to 
    632632#   the 'links_ignored' hashtable. 
    633 #    
     633# 
    634634# - If a link is old and "invalid", then it gets 
    635635#   ignored. 
     
    638638# 
    639639# - If a link is new and "valid", then we check to see if 
    640 #   the referring URL's hostname[:port] and the link's  
     640#   the referring URL's hostname[:port] and the link's 
    641641#   hostname[:port] match.  If they match, then the link 
    642642#   is added to the 'relative_links_to_visit' hash. 
     
    655655    # Get the referrer and the corresponding arrays of links and scores. 
    656656    my ($referrer, %links) = @_; 
    657      
     657 
    658658    foreach my $url (keys %links) { 
    659659        my $score = $links{$url}; 
     
    676676        # Link is new and valid; go ahead and add to the appropriate 
    677677        # hashtable. 
    678         
     678 
    679679        # Extract the core hostname of the URL to visit. 
    680680        # If $url is undef, then this function will return an empty string. 
    681681        my $hostname = _extractHostname($url); 
    682        
     682 
    683683        # If the referrer's hostname and the URL's hostname match... 
    684684        if ($hostname eq $referrer) { 
     
    692692        } 
    693693    } 
    694          
     694 
    695695    # Return the modified object state. 
    696696    return $self; 
     
    698698 
    699699# Helper function designed to validate supplied links. 
    700 #  
     700# 
    701701# When a link is provided as an argument: 
    702702# 
     
    708708#    already exists within the history, then it is considered 
    709709#    invalid. 
    710 #  
     710# 
    711711# If the link is valid, then it is returned.  Otherwise, undef 
    712712# is returned for all invalid links.  Also, all invalid links 
     
    717717# Outputs: url if valid, empty string if invalid 
    718718sub _validateLink { 
    719      
     719 
    720720    # Get the object state. 
    721721    my $self = shift; 
     
    759759        (scalar(%{$self->links_ignored}) and 
    760760         exists($self->links_ignored->{$link}))) { 
    761          
     761 
    762762        # Link is valid but already visited, so return undef. 
    763763        return; 
     
    785785    my $stub = getClientHandle(address   => 'localhost', 
    786786                               namespace => 'HoneyClient::Agent'); 
    787             
     787 
    788788    my $som = $stub->killProcess($self->process_name); 
    789789 
     
    804804of these methods were implementations of the parent Driver interface. 
    805805 
    806 As such, the following code descriptions pertain to this particular  
     806As such, the following code descriptions pertain to this particular 
    807807Driver implementation.  For further information about the generic 
    808808Driver interface, see the L<HoneyClient::Agent::Driver> documentation. 
     
    818818 B<$param> is an optional parameter variable. 
    819819 B<$value> is $param's corresponding value. 
    820   
     820 
    821821Note: If any $param(s) are supplied, then an equal number of 
    822822corresponding $value(s) B<must> also be specified. 
     
    904904 
    905905B<Warning>: This method will B<croak> if the IE driver object is B<unable> 
    906 to navigate to a new link, because its list of links to visit is empty.  
     906to navigate to a new link, because its list of links to visit is empty. 
    907907 
    908908=back 
     
    947947    # before registering attempt as a failure. 
    948948    my $timeout : shared = $self->timeout(); 
    949      
     949 
     950    # Get the good words and bad words from config file 
     951    if ($Config{goodwords}) { 
     952        print "There are good words!"; 
     953    } 
     954 
    950955    # Use LWP::UserAgent to get the desired $args{'url'} and associated content 
    951     # TODO: Analyze all the options LWP::UserAgent provides, in case we've  
     956    # TODO: Analyze all the options LWP::UserAgent provides, in case we've 
    952957    # missed something useful. 
    953958    # Create a new user agent. 
     
    967972    $ua->max_size(1*1024*1024); # Don't get values larger than 1MB for testing 
    968973    $ua->timeout($timeout); 
    969      
     974 
    970975    my $response = $ua->request( 
    971976                        HTTP::Request->new( 
     
    982987    my $content = $response->content; 
    983988    my %scored_links; 
    984      
     989 
    985990    # Get the current time. 
    986991    my $timestamp = _getTimestamp(); 
    987      
     992 
    988993    # Score the new links based on their surrounding HTML context 
    989994    # If %scored_links is empty upon return, there are no links 
     
    992997        %scored_links = _scoreLinks($base, $content); 
    993998    } 
    994      
     999 
    9951000    # Check to see if the request timed out. 
    9961001    # TODO: Need better error detection. 
     
    10241029            $self->max_relative_links_to_visit; 
    10251030    } elsif ($self->_remaining_number_of_relative_links_to_visit > 1) { 
    1026              
     1031 
    10271032        # The counter is positive, so decrement it. 
    10281033        $self->{_remaining_number_of_relative_links_to_visit}--; 
     
    10621067 
    10631068sub getNextLink { 
    1064      
     1069 
    10651070    # Get the object state. 
    10661071    my $self = shift; 
    1067      
     1072 
    10681073    # Sanity check: Make sure we've been fed an object. 
    10691074    unless (ref($self)) { 
     
    10721077    } 
    10731078 
    1074     # Set the link to find as undef, initially.  
     1079    # Set the link to find as undef, initially. 
    10751080    my $link = undef; 
    10761081 
     
    10981103 
    10991104Specifically, the returned data is a reference to a hashtable, containing 
    1100 detailed information about which resources, hostnames, IPs, protocols, and  
     1105detailed information about which resources, hostnames, IPs, protocols, and 
    11011106ports that the browser will contact upon the next drive() iteration. 
    11021107 
     
    11041109 
    11051110  $hashref = { 
    1106    
     1111 
    11071112      # The set of servers that the driver will contact upon 
    11081113      # the next drive() operation. 
     
    11191124              'udp' => [ 53, 123 ], 
    11201125          }, 
    1121   
     1126 
    11221127          # Or, more generically: 
    11231128          'hostname_or_IP' => { 
     
    11331138  }; 
    11341139 
    1135 B<Note>: For this implementation of the Driver interface,  
     1140B<Note>: For this implementation of the Driver interface, 
    11361141unless getNextLink() returns undef, the returned hashtable 
    11371142from this method will B<always> contain only B<one> hostname 
     
    11611166    # Get the object state. 
    11621167    my $self = shift; 
    1163      
     1168 
    11641169    # Sanity check: Make sure we've been fed an object. 
    11651170    unless (ref($self)) { 
     
    12031208        } 
    12041209    } 
    1205     
    1206     # Finally, construct the corresponding hash reference.  
     1210 
     1211    # Finally, construct the corresponding hash reference. 
    12071212    $nextSite = { 
    12081213        targets => { 
     
    12581263    my %links = (); 
    12591264    my $url; 
    1260     open(FILEH,">>scoring.txt") || die("Cannot Open File"); 
    1261      
     1265    open(FILEH,">>link_scores.txt") || die("Cannot Open File"); 
     1266 
    12621267    if (!$content) { 
    12631268        return %links; 
    12641269    } 
    1265      
    1266     # Begin to scour the HTML content for <a> tags  
     1270 
     1271    # Begin to scour the HTML content for <a> tags 
    12671272    while ($content =~ m{<a\b([^>]+)>(.*?)</a>}ig) { 
    12681273        my $attr = $1; 
    12691274        my $text = $2; 
    12701275        my $score = 0; 
    1271      
     1276 
    12721277        if ($attr =~ m{ 
    12731278                        \b HREF 
     
    12831288         { 
    12841289            $url = $+; 
    1285              
     1290 
    12861291            # We have to make this an absolute url (if it's not) 
    12871292            # before using it as a key in the %links hash 
    12881293            $url = url($url, $base)->abs; 
    1289              
    1290             # The link must be an HREF and be a http(s) link    
     1294 
     1295            # The link must be an HREF and be a http(s) link 
    12911296            if ($url =~ /^http/i) { 
    12921297                # Image bonus 
    12931298                if ($text =~ /img/i) { 
    12941299                    $score += 50; 
    1295                     print FILEH "Image bonus!\n";  
     1300                    print FILEH "Image bonus!\n"; 
    12961301                } 
    12971302                # Score image size 
     
    13001305                    my $width = $1; 
    13011306                    $score += int($width/10); 
    1302                     print FILEH "Image area bonus! $width\n";  
     1307                    print FILEH "Image area bonus! $width\n"; 
    13031308                } 
    13041309                if ($text =~ /\b HEIGHT\s*=\s*.(\d+)/xi) 
     
    13061311                    my $height = $1; 
    13071312                    $score += int($height/10); 
    1308                     print FILEH "Image area bonus! $height\n";  
     1313                    print FILEH "Image area bonus! $height\n"; 
    13091314                } 
    13101315                # Score length of link text 
     
    13231328                    print FILEH "Bad word penalty!\n"; 
    13241329                } 
    1325      
     1330 
    13261331                print FILEH "The attributes for $url are $attr\n" unless (!$attr); 
    13271332                print FILEH "The text for $url is $text\n" unless (!$text); 
    13281333                print FILEH "It scored $score\n"; 
    1329                  
     1334 
    13301335                $links{$url} = $score; 
    13311336                $url = undef; 
     
    13341339        } 
    13351340    } 
    1336      
    1337     close(FILEH);       
     1341 
     1342    close(FILEH); 
    13381343    return %links; 
    13391344} 
     
    13761381    # Get the object state. 
    13771382    my $self = shift; 
    1378      
     1383 
    13791384    # Sanity check: Make sure we've been fed an object. 
    13801385    unless (ref($self)) { 
     
    13881393              scalar(%{$self->relative_links_to_visit}) or 
    13891394              scalar(%{$self->links_to_visit}))) 
    1390                              
     1395 
    13911396} 
    13921397 
     
    14111416      'relative_links_remaining' =>       10, # Number of URLs left to 
    14121417                                              # process, at a given site. 
    1413       'links_remaining'          =>       56, # Number of URLs left to  
     1418      'links_remaining'          =>       56, # Number of URLs left to 
    14141419                                              # process, for all sites. 
    14151420      'links_processed'          =>       44, # Number of URLs processed. 
     
    14361441 
    14371442sub status { 
    1438      
     1443 
    14391444    # Get the object state. 
    14401445    my $self = shift; 
    1441      
     1446 
    14421447    # Sanity check: Make sure we've been fed an object. 
    14431448    unless (ref($self)) { 
     
    14541459                                 scalar(keys(%{$self->links_ignored})); 
    14551460 
    1456     # Set the number of relative links to process.  
     1461    # Set the number of relative links to process. 
    14571462    $status->{relative_links_remaining} = scalar(keys(%{$self->relative_links_to_visit})); 
    1458      
     1463 
    14591464    # Figure out how many total links are left to process. 
    14601465    $status->{links_remaining} = scalar(keys(%{$self->relative_links_to_visit})) + 
     
    14621467 
    14631468    # Set the total number of links in the object's state. 
    1464     $status->{links_total} = $status->{links_processed} +  
     1469    $status->{links_total} = $status->{links_processed} + 
    14651470                             $status->{links_remaining}; 
    14661471 
     
    14701475        $status->{links_total} = 1; 
    14711476    } 
    1472     $status->{percent_complete} = sprintf("%.2f%%", (($status->{links_processed} + 0.00) /  
     1477    $status->{percent_complete} = sprintf("%.2f%%", (($status->{links_processed} + 0.00) / 
    14731478                                                     ($status->{links_total} + 0.00)) * 100.00); 
    14741479 
     
    15191524$object->drive() iteration. 
    15201525 
    1521 For example, if at one given point, the status of B<percent_complete>  
    1522 is 30% and then this value drops to 15% upon another iteration, then  
    1523 this means that the total number of links to drive to has greatly  
     1526For example, if at one given point, the status of B<percent_complete> 
     1527is 30% and then this value drops to 15% upon another iteration, then 
     1528this means that the total number of links to drive to has greatly 
    15241529increased. 
    15251530 
     
    15601565as published by the Free Software Foundation, using version 2 
    15611566of the License. 
    1562   
     1567 
    15631568This program is distributed in the hope that it will be useful, 
    15641569but WITHOUT ANY WARRANTY; without even the implied warranty of 
    15651570MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the 
    15661571GNU General Public License for more details. 
    1567   
     1572 
    15681573You should have received a copy of the GNU General Public License 
    15691574along with this program; if not, write to the Free Software