NOTE: This information is applicable only to HoneyClient versions .99 and below. 1.0 and newer use a kernel module integrity check methodology based on modified CaptureBAT code.

Parsing the Windows Registry

Parsing the Windows registry is hard, but doable. Most perl-based registry tools simply use complex regular expressions to handle this task. Unfortunately, this method is slow and may not be able to interpret every possible type of registry key/value pair. In this article, we explain how we used a native perl-based parsing engine, Parse::Yapp, in order to reduce the time to enumerate a (usually) 30MB registry hive HKEY_LOCAL_MACHINE —- from 10 minutes to (about) 30 seconds.

  1. Analysis
    1. Obtaining the Registry Data
    2. Initial Interpretation
  2. The Gotchas
    1. So, what's the problem?
    2. So, what do these "problematic" segments look like?
      1. Malformed DIR_NAMEs
      2. Weird KEY_NAMEs
      3. Sketchy String-based KEY_VALUEs
      4. Annoying Binary-based KEY_VALUEs
    3. So, what's the problem with using a simple perl regexp?
  3. The Parser
    1. The Grammar
    2. The Lexer
      1. KEY_NAME tokens
      2. KEY_VALUE tokens
    3. Optimizations Used
  4. Notes / Troubleshooting
    1. What about registry directory/key/values containing NULL characters?
    2. Why is the parsing code exceptionally slower, when I use it?

Analysis

In our Honeyclient implementation, one of the integrity checks we perform inside the Windows-based Honeyclient VM is to enumerate and compare any registry changes. While performing any type of integrity checks inside a (potentially) compromised environment is not ideal, we accept this risk and move on (for now). Regardless of where the registry comparison occurs, we still need to perform this comparison at some point.

Obtaining the Registry Data

Remember, our Agent code runs inside a Cygwin Perl environment, as stated in the definitions page. Initially, we tried accessing the Windows registry directly via the Win32::TieRegistry module; however, we found that the Cygwin port of libwin32 does not seem to reliably provide access to the Windows registry.

Note: If anyone has gotten Win32::TieRegistry to work in Cygwin, please email us with the details, as we would like to know how to get that to work.

In the meantime, we simply use the Microsoft-provided REGEDIT.EXE tool, to dump each registry hive out to disk, in a readable format. Here's an example:

regedit.exe /a "dump.reg" "HKEY_CLASSES_ROOT"

Let's take a look at small portion of an example registry dump:

REGEDIT4

[HKEY_CLASSES_ROOT\Directory_Name]
@="FOO"
"String_Key"="BAR"
"Binary_Key"=hex:f4,ff,ff,ff,00,00,00,00,00,00,00,00,00,00,00,00,bc,02,00,00,00,\
  00,00,00,00,00,00,00,54,00,61,00,68,00,6f,00,6d,00,61,00,00,00,f0,77,3f,00,\
  3f,00,3f,00,3f,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,78,00,1c,10,fc,\
  7f,22,14,fc,7f,b0,fe,12,00,00,00,00,00,00,00,00,00,98,23,eb,77

Seems simple enough, doesn't it? This format of the Windows registry is a modified form of the INI file format used by Microsoft, since the dark ages. However, there are a couple of nuances in this format.

Initial Interpretation

Note: By default, Windows text files use \r\n as line terminators (these show up in outputs of cat -e dump.reg as ^M characters). In all examples, we assume all \r characters are completely stripped out.

1. All valid registry dumps contain a HEADER that looks like:

REGEDIT4\n\n

2. Registry KEY_NAME and KEY_VALUE pairs are grouped together, in (what appears to be) contiguous blocks that look like the regular expression of:

[DIR_NAME]\n.*\n

Note: The (.*) segment represents 0 or more characters, spanning multiple lines.

In the previous example, the DIR_NAME was HKEY_CLASSES_ROOT\Directory_Name.

3. Within each directory group, 0 or more key/value pairs can appear, in one the following formats:

@="KEY_VALUE"
@=KEY_VALUE
"KEY_NAME"="KEY_VALUE"
"KEY_NAME"=KEY_VALUE

4. When the KEY_NAME is the at symbol (@), the corresponding KEY_VALUE reflects the default value associated with the corresponding DIR_NAME.

In the previous example, the default KEY_VALUE associated with HKEY_CLASSES_ROOT\Directory_Name was FOO.

5. When the KEY_NAME is specified, then the name is always enclosed within unescaped double-quotations (").

6. The KEY_VALUE can be represented as string data or binary data.

7. String-versions of KEY_VALUE are always enclosed within unescaped double-quotations ("). They may span multiple lines.

8. Binary-versions of KEY_VALUE start after the equals (=) delimiter. They may span multiple lines. If they do span multiple lines, then every continuation line will always end with (\, followed by \n). Thus, the end of this KEY_VALUE can be found at the current (or next subsequent) line that ends with (\n) and not (\, followed by \n).

The Gotchas

So, the format is still easy to deal with, right? You may be thinking, "someone has to have created a perl-based registry parser that I can already use", right? Then, you proceed to find various tools that all claim to read and manipulate data stored in INI file formats easily within Perl.

So, what's the problem?

Assuming your registry hives (especially HKEY_LOCAL_MACHINE) are fairly complex, you'll find that when you feed your registry dumps to these tools, either:

  • The tool successfully terminates, but misses some of the directory/key/value segments
  • Or, the tool terminates with failures, as it can't parse some of these segments.

So, what do these "problematic" segments look like?

Malformed DIR_NAMEs

Consider this example:

REGEDIT4

[HKEY_CURRENT_USER\]MalformedDir[]
@="Foo"

Yes, the DIR_NAME is:

HKEY_CURRENT_USER\]MalformedDir[

This example imports cleanly into the Windows registry. Directory names can have unescaped square brackets as valid portions of their name. Essentially, the assumptions are:

  • A valid DIR_NAME will never span multiple lines.
  • A valid DIR_NAME will always end with (], followed by \n).

Weird KEY_NAMEs

Consider this example:

REGEDIT4

[HKEY_CURRENT_USER\Test]
"\"really=annoying key\""="foo"

Yes, the KEY_NAME is:

"really=annoying key"

This example imports cleanly into the Windows registry. Key names can have escaped double quotes and unescaped equal signs as valid portions of their name. Essentially, the assumptions are:

  • A valid KEY_NAME will never span multiple lines.
  • A valid KEY_NAME will always be 255 characters or less (tested, via experimentation).
  • IF a valid KEY_NAME contains double quotes as a portion of its name, then ALL of those double quotes will be escaped.

Sketchy String-based KEY_VALUEs

Consider this example:

REGEDIT4

[HKEY_CURRENT_USER\Test]
"foo"="\"really=annoying value\""
"long"="this is a really
long value that spans
multiple lines and has
HTML code mixed in, like:
<HTML><BODY><!-- Comment -->
</BODY></HTML>
"

Yes, the first KEY_VALUE is:

"really=annoying key"

And the second KEY_VALUE is:

this is a really
long value that spans
multiple lines and has
HTML code mixed in, like:
<HTML><BODY><!-- Comment -->
</BODY></HTML>

Ironically, this example does NOT import cleanly into the Windows registry. For some reason, some Windows-based applications can create multi-line string-based values like this in the registry, and REGEDIT.EXE can properly dump these keys out to a file — but REGEDIT.EXE prevents such keys from being created, via the manual import process.

Note: This explains why simply exporting your entire registry hive out to disk and reimporting it back into your system (via REGEDIT.EXE) may not fully restore all registry content.

If you want to find an example of these types of keys, simply cat the HKEY_LOCAL_MACHINE hive dump and grep for (^"$) — as that usually provides an example of this type of value. String-based key values can have escaped double quotes and unescaped equal signs as valid portions of their name. Essentially, the assumptions are:

  • A valid string-based KEY_VALUE can span multiple lines.
  • IF a valid string-based KEY_VALUE contains double quotes as a portion of its name, then ALL of those double quotes will be escaped.

Annoying Binary-based KEY_VALUEs

Consider this example:

REGEDIT4

[HKEY_CURRENT_USER\Test]
"Foo"=hex:f4,ff,ff,ff,00,00,00,00,00,00,00,00,00,00,00,00,bc,02,00,00,00,\
  00,00,00,00,00,00,00,54,00,61,00,68,00,6f,00,6d,00,61,00,00,00,f0,77,3f,00,\
  3f,00,3f,00,3f,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,78,00,1c,10,fc,\
  7f,22,14,fc,7f,b0,fe,12,00,00,00,00,00,00,00,00,00,98,23,eb,77

Yes, the KEY_VALUE is:

hex:f4,ff,ff,ff,00,00,00,00,00,00,00,00,00,00,00,00,bc,02,00,00,00,\
  00,00,00,00,00,00,00,54,00,61,00,68,00,6f,00,6d,00,61,00,00,00,f0,77,3f,00,\
  3f,00,3f,00,3f,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,78,00,1c,10,fc,\
  7f,22,14,fc,7f,b0,fe,12,00,00,00,00,00,00,00,00,00,98,23,eb,77

This example imports cleanly into the Windows registry. Binary-based key values can have escaped newlines, in order for binary data to span multiple lines. Essentially, the assumptions are:

  • A valid binary-based KEY_VALUE can span multiple lines.
  • IF a valid binary-based KEY_VALUE spans multiple lines, then every continuation line will always end with (\, followed by \n). The final line ends with (\n) and not (\, followed by \n).

So, what's the problem with using a simple perl regexp?

Fundamentally, while you could code a complete perl regexp to handle these cases, it would be a horribly inefficient and probably segfault perl for large strings (especially when running in a brittle environment, like Cygwin Perl). To give you an idea about how bad this may look, consider the following code snippet from the regutils parsing library. This segment was (attempting) to identify an entire KEY_NAME and KEY_VALUE pair, and shove the detected fields into an array within a hashtable reference.

if (
/^(\@|\"([^\\"]|\\"|\\\\)+")(|=(|(|stringz:|ustringz:)"([^\\"]|\\"|\\\\)*"|dword:[\da-fA-F]+|hex(|\([0-9a-fA-F]+\)):[\da-fA-F,]*))(|\s+\((.+)\))\s*$/
)
{
    push(@{$info->{'entries'}},
         {'name'     => $1,
          'value'    => $3 eq '' ? undef: $4,});
    next;
}

Beautiful, isn't it? Anyway, the code they used is still useful — as a basis for developing a much more robust solution: a perl-based registry parser.

The Parser

Programs specifically designed for parsing have been around for ages. We'll bypass the explanation about why developing a parser is a good idea. Damian Conway put together an interesting tutorial, about the benefits of parsing. His library, Parse::RecDescent, is a decent perl-based parsing package. However, we chose to use Parse::Yapp, based upon performance analysis performed by Dr. Mingyi Liu.

The Grammar

Ironically, developing the grammar for a Parse::Yapp generated parser is not difficult. We have already identified all the main tokens, as described in the Initial Interpretation section.

  • DIR_NAME - The directory name.
  • KEY_NAME - The name of an entry's key.
  • KEY_VALUE - The value of an entry's key.
  • HEADER - The standard REGEDIT4\n header
  • NEWLINE - Self explanatory.

Here's how our grammar looks:

%token DIR_NAME
%token KEY_NAME
%token KEY_VALUE
%token HEADER
%token NEWLINE

%%

# A registry can be thought of as a HEADER, along with 1 or more groups.
# For robustness, we make the HEADER optional.
registry:
        groups
    |   HEADER groups
;

# Define 1 or more groups.
groups:
        NEWLINE group
    |   NEWLINE group NEWLINE
    |   NEWLINE group groups
;

# A group consists of a DIR_NAME and 0 or more entries.
group:
        DIR_NAME entries
    |   DIR_NAME
;

# Define 1 or more entries.
entries:
        entry
    |   entry entries
;

# Define an entry.
entry:
        KEY_NAME KEY_VALUE
;

%%

In reality, the grammar is only part of the parser; it basically indicates, that IF data can be (properly) identified as one of these tokens, then the grammar makes sure that the specified input is valid.

The Lexer

The other piece to this puzzle, is the lexer. The lexer is what ties chunks of data to tokens, which allows the parser to efficiently process large quantities of data. Fundamentally, the lexer is a function that consists of a series of relatively small regular expressions, which operate upon a given input stream. As the function encounters identifiable tokens, it returns those tokens back to the Parse::Yapp engine. In turn, the Parse::Yapp engine calls the lexer function multiple times, on various portions of the input stream.

In our implementation, we're still actively developing the lexer function, as a bug branch to resolve ticket #42. We'll forgo pasting the entire code here, but rather provide a link to view the source.

However, we will touch upon a few nuances in the code. Specifically, let's look at the regular expressions used to identify the KEY_NAME and KEY_VALUE tokens.

KEY_NAME tokens

# Identify KEY_NAME token.
if ($_[0]->YYData->{DATA} =~ m/\G\"(|[^\\]|.*(?:\\[^\\]|\\\\|[^\\][^\\]))\"(?==)/cg) {
    # ...
    return ("KEY_NAME", $1);
}

# Identify default KEY_NAME token (@).
if ($_[0]->YYData->{DATA} =~ m/\G\@(?==)/cg) {
    # ...
    return ("KEY_NAME", "@");
}

The first block attempts to detect KEY_NAMEs based upon the fact that the text will always be enclosed within unescaped double quotes, where the next character after the right-most unescaped double quote is an equals sign. The OR-ed regular expression inside the parenthesis signifies the following scenarios (when read from left-to-right):

  • The empty string: "".
  • A single character string, where the character is an backslash: "\". Technically, this valid KEY_NAME will never be represented with a single backslash inside a registry dump. If a KEY_NAME did appear as a single backslash within the registry editor, then its value dumped to disk will be escaped. Thus, this name would appear as "\\" inside the registry dump.
  • A string of length 2 or more, where the last character is a backslash: ".*\". Technically, this valid KEY_NAME will never end with a single backslash, when looking at its representation inside a registry dump. Worst-case scenario, this backslash will be escaped. Thus, this name would appear as ".*\\" inside the registry dump.

The second block handles KEY_NAMEs that are the default keys. A default key is 'always' represented by an at-symbol (@), followed by an equals sign.

KEY_VALUE tokens

# Identify string KEY_VALUE token.
if ($_[0]->YYData->{DATA} =~ m/\G=\"(|[^\\]|.*?(?:\\[^\\]|\\\\|[^\\][^\\]))\"\n/cgs) {
    $inValue = 0;
    return ("KEY_VALUE", $1);
}

# Identify binary KEY_VALUE token.
if ($_[0]->YYData->{DATA} =~ m/\G=(|.*?[^\\])\n/cgs) {
    $inValue = 0;
    return ("KEY_VALUE", $1);
}

The first block deals with string-based KEY_VALUEs in a similar fashion as the first block handling KEY_NAMEs. It is assumed that the text will always be enclosed within unescaped double quotes, where the next character after the right-most unescaped double quote is a newline symbol. The OR-ed regular expression inside the parenthesis signifies the following scenarios (when read from left-to-right):

  • The empty string: "".
  • A single character string, where the character is an backslash: "\". Technically, this valid KEY_VALUE will never be represented with a single backslash inside a registry dump. If a KEY_VALUE did appear as a single backslash within the registry editor, then its value dumped to disk will be escaped. Thus, this value would appear as "\\" inside the registry dump.
  • A string of length 2 or more, where the last character is a backslash: ".*\". Technically, this valid KEY_VALUE will never end with a single backslash, when looking at its representation inside a registry dump. Worst-case scenario, this backslash will be escaped. Thus, this value would appear as ".*\\" inside the registry dump.

The second block deals with binary-based KEY_VALUEs. Specifically, a binary-based KEY_VALUE could be completely empty or be 1 or more lines of freeform text, where end of the KEY_VALUE is represented by a newline symbol on the last line, where the newline symbol did not have a single backslash proceeding it.

Optimizations Used

In developing the parser (and especially the lexer function), we leveraged many of the recommended optimizations made by Dr. Mingyi Liu. Specifically, there was one crucial suggestion he echoed from the perl manual:

"I decided to profile the performance of the regex parser. The profiling soon 
made me zero in on one particular performance hit - the string replacement in 
the lexer functions! Since I started writing lexer for Parse::Yapp, I followed 
its calc.yp's lexer function style, which chews up the input string when 
processing it, each time returns the next token to the parser. While this is 
OK for parsing short string, when parsing long input strings, the s/// string 
replacement becomes incredibly costly because each time a very short token gets 
chewed off an input string hundred of KB in size, the allocation of memory, 
copying etc. is very costly."

"OK, the culprit is identified. How about cure? Well, we just need to replace 
the string replacement statements with string matching statements. But how to 
keep the next-token position on the string? Naturally, we'd have to use the 
\G + /g combination. Still, we need to throw in /c as well so that a failed 
match does not move the next-token position. Therefore the s/^$token// becomes 
m/\G$token/cg, and voila! - the speed of the regex-based parser was improved 
nearly 50 fold - now my regex-based parser only needs 0.5 second to parse 
record #4539! Although the m/\G/cg is not new trick (well documented in 
'Programming Perl'), it certainly did magic in this case."

Specifically, the lexer function keeps track of a single input stream handle and does not modify the input stream handle at all.

Notes / Troubleshooting

What about registry directory/key/values containing NULL characters?

If you are referring to the Hidden Registry Keys technique developed by Mark Russinovich (as part of the Rootkit Revealer), then yes, this is a known issue.

Fundamentally, the Microsoft-provided REGEDIT.EXE tool does NOT properly output/dump null encoded (hidden) registry keys, properly. Thus, this is one type of key that our current detection methodology does NOT detect.

However, you can build a custom version of the REGEDIT.EXE tool which does output/dump null encoded (hidden) registry keys, and replace the REGEDIT.EXE references within the HoneyClient::Agent::Integrity::Registry code, in order to improve its subsequent detection scheme.

Why is the parsing code exceptionally slower, when I use it?

This parsing logic requires the perl regular expression engine to be running in an optimal matter. Currently, in Perl v5.8.8, there are a couple of reasons (aka gotchas) as to why the perl regular expression engine may be performing poorly. Specifically, the perldoc perlre documentation states the following:

WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere in
the program, it has to provide them for every pattern match.  This may
substantially slow your program.  Perl uses the same mechanism to pro-
duce $1, $2, etc, so you also pay a price for each pattern that con-
tains capturing parentheses.  (To avoid this cost while retaining the
grouping behaviour, use the extended regular expression "(?: ... )"
instead.)  But if you never use $&, $` or $', then patterns without
capturing parentheses will not be penalized.  So avoid $&, $', and $`
if you can, but if you can't (and some algorithms really appreciate
them), once you've used them once, use them at will, because you've
already paid the price.  As of 5.005, $& is not so costly as the other
two.

Thus, if you happen to use $&, $`, or $' anywhere in your code, or if these 3 tokens are used in any linked library, then the perl regular expression engine will slow down significantly (by 10x). Incidentally, we noticed this issue within the XML::XPath library and have reported this bug to the author.

To fix this issue, see @LAST_MATCH_START in perldoc perlvar, in order to obtain equivalent optimal code, that replaces any use of $&, $`, or $'.
Specifically, the following recommendations are listed:

After a match against some variable $var:

$` is the same as "substr($var, 0, $-[0])"
$& is the same as "substr($var, $-[0], $+[0] - $-[0])"
$' is the same as "substr($var, $+[0])"
$1 is the same as "substr($var, $-[1], $+[1] - $-[1])"
$2 is the same as "substr($var, $-[2], $+[2] - $-[2])"
$3 is the same as "substr($var, $-[3], $+[3] - $-[3])"