Archive for the ‘Reverse Engineering’ Category

Why doesn’t the publicly available kernrate work on Windows x64? (and how to fix it)

Monday, June 4th, 2007

Previously, I wrote up an introduction to kernrate (the Windows kernel profiler). In that post, I erroneously stated that the KrView distribution includes a version of kernrate that works on x64. Actually, it supports IA64 and x86, but not x64. A week or two ago, I had a problem on an x64 box of mine that I wanted to help track down using kernrate. Unfortunately, I couldn’t actually find a working kernrate for Windows x64 (Srv03 / Vista).

It turns out that nowhere is there a published kernrate that works on any production version of Windows x64. KrView ships with an IA64 version, and the Srv03 resource kit only ships with an x86 version only. (While you can run the x86 version of kernrate in Wow64, you can’t use it to profile kernel mode; for that, you need a native x64 version.)

A bit of digging turned up a version of kernrate labeled ‘AMD64’ in the Srv03 SP0 DDK (the 3790.1830 / Srv03 SP1 DDK doesn’t include kernrate at all, only documentation for it, and the 6000 WDK omits even the documentation, which is really quite a shame). Unfortunately, that version of kernrate, while compiled as native x64 and theoretically capable of profiling the x64 kernel, doesn’t actually work. If you try to run it, you’ll get the following error:

NtQuerySystemInformation failed status c0000004
KERNRATE: Failed to get SYSTEM_BASIC_INFORMATION

So, no dice even with the Srv03 SP0 DDK version of kernrate, or so I thought. Ironically, the various flavors of x86 (including 3790.0) kernrate I could find all worked on Srv03 x64 SP1 (3790.1830) / Vista x64 SP0 (6000.0), but as I mentioned earlier, you can’t use the profiling APIs to profile kernel mode from a Wow64 program. So, I had a kernrate that worked but couldn’t profile kernel mode, and a kernrate that should have worked and been able to profile kernel mode except that it bombed out right away.

After running into that wall, I decided to try and get ahold of someone at Microsoft who I suspected might be able to help, Andrew Rogers from the Windows Serviceability group. After trading mails (and speculation) back and forth a bit, we eventually determined just what was going on with that version of kernrate.

To understand what the deal is with the 3790.0 DDK x64 version of kernrate, a little history lesson is in order. When the production (“RTM”) version of Windows Server 2003 was released, it was supported on two platforms: x86, and IA64 (“Windows Server 2003 64-bit”). This continued until the Srv03 Service Pack 1 timeframe, when support for another platform “went gold” – x64 (or AMD64). Now, this means that the production code base for Srv03 x64 RTM (3790.1830) is essentially comparable to Srv03 x86 SP1 (3790.1830). While normally, there aren’t “breaking changes” (or at least, these are tried to kept to a minimum) from service pack to service pack, Srv03 x64 is kind of a special case.

You see, there was no production / RTM release of Srv03 x64 3790.0, or a “Service Pack 0” for the x64 platform. As a result, in a special case, it was acceptable for there to be breaking changes from 3790.0/x64 to 3790.1830/x64, as pre-3790.1830 builds could essentially be considered beta/RC builds and not full production releases (and indeed, they were not generally publicly available as in a normal production release).

If you’re following me this far, you’re might be thinking “wait a minute, didn’t he say that the 3790.0 x86 build of kernrate worked on Srv03 3790.1830?” – and in fact, I did imply that (it does work). The breaking change in this case only relates to 64-bit-specific parts, which for the most part excludes things visible to 32-bit programs (such as the 3790.0 x86 kernrate).

In this particular case, it turns out that part of the SYSTEM_BASIC_INFORMATION structure was changed from the 3790.0 timeframe to the 3790.1830 timeframe, with respect to x64 platforms.

The 3790.0 structure, as viewed from x64 builds, is approximately like this:

typedef struct _SYSTEM_BASIC_INFORMATION_3790 {
    ULONG Reserved;
    ULONG TimerResolution;
    ULONG PageSize;
    ULONG_PTR NumberOfPhysicalPages;
    ULONG_PTR LowestPhysicalPageNumber;
    ULONG_PTR HighestPhysicalPageNumber;
    ULONG AllocationGranularity;
    ULONG_PTR MinimumUserModeAddress;
    ULONG_PTR MaximumUserModeAddress;
    KAFFINITY ActiveProcessorsAffinityMask;
    CCHAR NumberOfProcessors;
} SYSTEM_BASIC_INFORMATION_3790, *PSYSTEM_BASIC_INFORMATION_3790;

However, by the time 3790.1830 (the “RTM” version of Srv03 x64 / XP x64) was released, a subtle change had been made:

typedef struct _SYSTEM_BASIC_INFORMATION {
	ULONG Reserved;
	ULONG TimerResolution;
	ULONG PageSize;
	ULONG NumberOfPhysicalPages;
	ULONG LowestPhysicalPageNumber;
	ULONG HighestPhysicalPageNumber;
	ULONG AllocationGranularity;
	ULONG_PTR MinimumUserModeAddress;
	ULONG_PTR MaximumUserModeAddress;
	KAFFINITY ActiveProcessorsAffinityMask;
	CCHAR NumberOfProcessors;
} SYSTEM_BASIC_INFORMATION, *PSYSTEM_BASIC_INFORMATION;

Essentially, three ULONG_PTR fields were “contracted” from 64-bits to 32-bits (by changing the type to a fixed-length type, such as ULONG). The cause for this relates again to the fact that the first 64-bit RTM build of Srv03 was for IA64, and not x64 (in other words, “64-bit Windows” originally just meant “Windows for Itanium”, something that is to this day still unfortunately propagated in certain MSDN documentation).

According to Andrew, the reasoning behind the difference in the two structure versions is that those three fields were originally defined as either a pointer-sized data type (such as SIZE_T or ULONG_PTR – for 32-bit builds), or a 32-bit data type (such as ULONG – for 64-bit builds) in terms of the _IA64_ preprocessor constant, which indicates an Itanium build. However, 64-bit builds for x64 do not use the _IA64_ constant, which resulted in the structure fields being erroneously expanded to 64-bits for the 3790.0/x64 builds. By the time Srv03 x64 RTM (3790.1830) had been released, the page counts had been fixed to be defined in terms of the _WIN64 preprocessor constant (indicating any 64-bit build, not just an Itanium build). As a result, the fields that were originally 64-bits long for the prerelease 3790.0/x64 build became only 32-bits long for the RTM 3790.1830/x64 production build. Note that due to the fact the page counts are expressed in terms of a count of physical pages, which are at least 4096 bytes for x86 and x64 (and significantly more for large physical pages on those platforms), keeping them as a 32-bit quantity is not a terribly limiting factor on total physical memory given today’s technology, and that of the foreseeable future, at least relating to systems that NT-based kernels will operate on).

The end result of this change is that the SYSTEM_BASIC_INFORMATION structure that kernrate 3790.0/x64 tries to retrieve is incompatible with the 3790.1830/x64 (RTM) version of Srv03, hence the call failing with c0000004 (otherwise known as STATUS_INFO_LENGTH_MISMATCH). The culimination of this is that the 3790.0/x64 version of kernrate will abort on production builds of Srv03, as the SYSTEM_BASIC_INFORMATION structure format is incompatible with the version kernrate is expecting.

Normally, this would be pretty bad news; the program was compiled against a different layout of an OS-supplied structure. Nonetheless, I decided to crack open kernrate.exe with IDA and HIEW to see what I could find, in the hopes of fixing the binary to work on RTM builds of Srv03 x64. It turned out that I was in luck, and there was only one place that actually retrieved the SYSTEM_BASIC_INFORMATION structure, storing it in a global variable for future reference. Unfortunately, there were a great many references to that global; too many to be practical to fix to reflect the new structure layout.

However, since there was only one place where the SYSTEM_BASIC_INFORMATION structure was actually retrieved (and even more, it was in an isolated, dedicated function), there was another option: Patch in some code to retrieve the 3790.1830/x64 version of SYSTEM_BASIC_INFORMATION, and then convert it to appear to kernrate as if it were actually the 3790.0/x64 layout. Unfortunately, a change like this involves adding new code to an existing binary, which means that there needs to be a place to put it. Normally, the way you do this when patching a binary on-disk is to find some padding between functions, or at the end of a PE section that is marked as executable but otherwise unused, and place your code there. For large patches, this may involve “spreading” the patch code out among various different “slices” of padding, if there is no one contiguous block that is long enough to contain the entire patch.

In this case, however, due to the fact that the routine to retrieve the SYSTEM_BASIC_INFORMATION structure was a dedicated, isolated routine, and that it had some error handling code for “unlikely” situations (such as failing a memory allocation, or failing the NtQuerySystemInformation(…SystemBasicInformation…) call (although, it would appear the latter is not quite so “unlikely” in this case), a different option presented itself: Dispense with the error checking, most of which would rarely be used in a realistic situation, and use the extra space to write in some code to convert the structure layout to the version expected by kernrate 3790.0. Obviously, while not a completely “clean” solution per se, the idea does have its merits when you’re already into patch-the-binary-on-disk-land (which pretty much rules out the idea of a “clean” solution altogether at that point).

The structure conversion is fairly straightforward code, and after a bit of poking around, I had a version to try out. Lo and behold, it actually worked; it turns out that the only thing that was preventing the prerelease 3790.0/x64 kernrate from doing basic kernel profiling on production x64 kernels was the layout of the SYSTEM_BASIC_INFORMATION structure.

For those so inclined, the patch I created effectively rewrites the original version of the function, as shown below after translated to C:

PSYSTEM_BASIC_INFORMATION_3790
GetSystemBasicInformation(
 VOID
 )
{
 PSYSTEM_BASIC_INFORMATION_3790 Sbi;
 NTSTATUS                       Status;

 Sbi = (PSYSTEM_BASIC_INFORMATION_3790)malloc(
  sizeof(SYSTEM_BASIC_INFORMATION_3790));

 if (!Sbi)
 {
  fprintf(stderr, "Buffer allocation failed"
   " for SystemInformation in "
   "GetSystemBasicInformation\\n");
  exit(1);
 }

 if (!NT_SUCCESS((Status = NtQuerySystemInformation(
  SystemBasicInformation,
  Sbi,
  sizeof(SYSTEM_BASIC_INFORMATION_3790),
  0))))
 {
  fprintf(stderr, "NtQuerySystemInformation failed"
   " status %08lx\\n",
   Status);
  free(Sbi);

  Sbi = 0;
 }

 return Sbi;
}

Conceptually represented in C, the modified (patched) function would appear something like the following (note that the error checking code has been removed to make room for the structure conversion logic, as previously mentioned; the substantial new changes are colored in red):

(Note that the patch code was not compiler-generated and is not truly a function written in C; below is simply how it would look if it were translated from assembler to C.)

PSYSTEM_BASIC_INFORMATION_3790
GetSystemBasicInformationFixed(
 VOID
 )
{
 PSYSTEM_BASIC_INFORMATION_3790 Sbi;
 PSYSTEM_BASIC_INFORMATION      SbiReal;

 Sbi     = (PSYSTEM_BASIC_INFORMATION_3790)malloc(
  sizeof(SYSTEM_BASIC_INFORMATION_3790));
 SbiReal = (PSYSTEM_BASIC_INFORMATION)Sbi;

 NtQuerySystemInformation(
  SystemBasicInformation,
  SbiReal,
  sizeof(SYSTEM_BASIC_INFORMATION),
  0);

 Sbi->NumberOfProcessors           =
  SbiReal->NumberOfProcessors;
 Sbi->ActiveProcessorsAffinityMask =
  SbiReal->ActiveProcessorsAffinityMask;
 Sbi->MaximumUserModeAddress       =
  SbiReal->MaximumUserModeAddress;
 Sbi->MinimumUserModeAddress       =
  SbiReal->MinimumUserModeAddress;
 Sbi->AllocationGranularity        =
  SbiReal->AllocationGranularity;
 Sbi->HighestPhysicalPageNumber    =
  SbiReal->HighestPhysicalPageNumber;
 Sbi->LowestPhysicalPageNumber     =
  SbiReal->LowestPhysicalPageNumber;
 Sbi->NumberOfPhysicalPages        =
  SbiReal->NumberOfPhysicalPages;

 return Sbi;
}

(The structure can be converted in-place due to how the two versions are laid out in memory; the “expected” version is larger than the “real” version (and more specifically, the offsets of all the fields we care about are greater in the “expected” version than the “real” version), so structure fields can safely be copied from the tail of the “real” structure to the tail of the “expected” structure.)

If you find yourself in a similar bind, and need a working kernrate for x64 (until if and when Microsoft puts out a new version that is compatible with production x64 kernels), I’ve posted the patch (assembler, with opcode diffs) that I made to the “amd64” release of kernrate.exe from the 3790.0 DDK. Any conventional hex editor should be sufficient to apply it (as far as I know, third parties aren’t authorized to redistribute kernrate in its entirety, so I’m not posting the entire binary). Note that DDKs after the 3790.0 DDK don’t include any kernrate for x64 (even the prerelease version, broken as it was), so you’ll also need the original 3790.0 DDK to use the patch. Hopefully, we may see an update to kernrate at some point, but for now, the patch suffices in a pinch if you really need to profile a Windows x64 system.

Reverse engineer’s toolkit

Tuesday, February 13th, 2007

If you’re planning on reverse engineering something (or debugging unfaimilar code), then it’s important to have the right tools for the job. This is a short list of some of the various tools that I find useful for this line of work (some are free, others are not; they are good tools, though, so I would encourage you to support the authors); if I am going to be doing extensive reversing (or debugging) work on a system, I’ll typically install most (or all) of these tools:

  • WinDbg (Debugging Tools for Windows); the de-facto debugger for Windows NT-based systems (and it’s freely available). Although there are other good debuggers out there (such as OllyDbg), it’s hard to beat the fact that Microsoft develops the debugger, and the designers have a great deal of information about and experience with the internals of the operating system. WinDbg comes with a plethora of plugin modules (extensions) that automate and improve many common tasks, and it has a well-documented and powerful interface to allow third parties (such as yourself) to program new extension modules for customized tasks. Additionally, WinDbg supports a (relatively crude, but effective nevertheless) scripting language. WinDbg can be used on 32-bit and 64-bit (IA64 and x64) targets in both user mode and kernel mode, and is the only supported facilities for debugging problems at customer sites or debugging certain specialized scenarios (like certain infrastructure processes, or fullscreen processes). The Debugging Tools for Windows distribution also includes ntsd, cdb, and kd, which are different (command-line) front-ends to the same debugger engine used by WinDbg. It also includes various other useful utilities, such as Logger (an API call spy program).
  • IDA Pro, or the Interactive Disassembler. IDA is the most full-featured disassembler (by a huge margin) out there that I know of, and it supports a wide variety of target architectures and operating systems to boot (although its support for x86 is probably the most full-featured of all the target architectures that you can use it on). IDA automates many of the aspects of disassembly and reverse engineering that have historically been tedious and error-prone, such as stack variable tracking. The full version of IDA is relatively pricey (on the order of $500..$900 or so, depending on whether you need x64 support), but if you’re serious about reverse engineering, it’s a good investment. For the casual reverse engineer, DataRescue has released a previous version, IDA 4.30, for free. If you are just working with x86 or simply want to get your feet wet with reversing, the free version is a good place to start. Although tools like WinDbg include a built-in disassembler, IDA’s advanced code analysis and interactive aspects (such as allowing the user to describe and annotate types, names, and variables in a structured fashion) make it far superior for non-trivial reverse engineering projects.
  • HIEW (Hacker’s vIEW) is a powerful (console-based) hex editor / disassembler / assembler for 16-bit, 32-bit, and 64-bit (x64) targets. It understands the various Windows image formats, but it can also be used on raw binary files as well. In addition to serving all of your hex editor needs, HIEW is just the perfect tool to use when you need to make quick on-disk patches to executables (and the integrated disassembler and assembler makes creating and applying such patches on-the-fly a walk in the park compared to traditional hex editors, which wourld require you to manually build the opcodes yourself, a pain for non-trivial patches). HIEW includes some additional power-features as well, such as a way to create and run simple programs to decrypt sections in a file (very useful if you’re working on something that is self-decrypting, and you know how to decrypt it but don’t have a program to do so already). It also includes a fairly simple plugin extension interface to allow custom actions to be integrated with the HIEW user interface. HIEW isn’t free, although it is fairly reasonably priced (and there is a (limited) demo that you can play around with).
  • The Windows Vista SDK is an essential tool for many reverse engineering tasks. It includes extensive documentation (as well as headers) for all public Win32 APIs, and it also includes several useful utilities as well (such as link.exe /dump, otherwise known as dumpbin.exe, which can be used to quickly extract various bits of information from a binary (like a list of imports) without having to load it up into a full-featured disassembler tool). The Vista SDK also includes OleView, which can be useful for inspecting a third-party COM library, as it has integrated support for turning a type library into MSIL (which can be trivially converted to a C header as needed).
  • Process Monitor, from SysInternals, is a great utility for quickly inspecting what file and registry operations a program is making. Depending on what you are trying to figure out when analyzing a program, simply looking at its file and registry activity can often save you hours of wading through disassembly or working with a debugger. Process Monitor allows you to perform (potentially filtered) in-depth logging of low-level file and registry activity, including operations that fail and the data returned by successful operations in some circumstances. Process Monitor is freely available from Microsoft’s website.
  • Process Explorer, formerly known as HandleEx, is another freely available tool from SysInternals. It allows quick and easy searching of open handles and loaded modules within a process, which is handy for initial information gathering (such as finding which process uses a DLL that you might be interested in). The ability to quickly check active processes for a DLL is also quite handy if you are trying to track down where a shared DLL (like a Windows hook DLL) is getting loaded in as a precursor to in-depth analysis.
  • SDbgExt is a WinDbg extension that I’ve developed (shameless plug). It provides a number of features that are useful for analyzing unfamiliar targets, such as the ability to import map files and create symbolic names out of them (particularly useful in conjunction with IDA’s export map file feature, if you want to synchronize your process between IDA and WinDbg), support for identifying writable function pointers within a process address space, and support for locating and displaying x64 exception/unwind handlers. SDbgExt presently only supports the 32-bit version of WinDbg. It is freely available (see the link above), and requires that WinDbg and the VC2005 CRT be installed.

There are other useful utilities out there (this is hardly an all-inclusive list), but these are the ones that I use the most often in a wide variety of situations.

Compiler tricks in x86 assembly, part 2

Monday, February 12th, 2007

A common operation in many C programs is a division by a constant quantity. This sort of operation occurs all over the place; in explicit mathematical calculations, and in other, more innocuous situations such as where the programmer does something like take a size in bytes and convert it into a count of structures.

This fundamental operation (division by a constant) is also one that the compiler will tend to optimize in ways that may seem a bit strange until you know what is going on.

Consider this simple C program:

int
__cdecl
wmain(
	int ac,
	wchar_t** av)
{
	unsigned long x = rand();

	return x / 1518;
}

In assembly, one would expect to see the following operations emitted by the compiler:

  1. A call to rand, to populate the ‘x’ variable.
  2. A div instruction to perform the requested division of ‘x’ by 1518.

However, if optimizations are turned on and the compiler is not configured to prioritize small code over performance exclusively, then the result is not what we might expect:

; int __cdecl main(int argc,
  const char **argv,
  const char *envp)
_main proc near
call    ds:__imp__rand
mov     ecx, eax
mov     eax, 596179C3h
mul     ecx
mov     eax, ecx
sub     eax, edx
shr     eax, 1
add     eax, edx
shr     eax, 0Ah ; return value in eax
retn

The first time you encounter something like this, you’ll probably be wondering something along the lines of “what was the compiler thinking when it did that, and how does that set of instructions end up dividing out ‘x’ by 1518?”.

What has happened here is that the compiler has used a trick sometimes known as magic number division to improve performance. In x86 (and in most processor architectures, actually), it is typically much more expensive to perform a division than a multiplication (or most any other basic mathematical primative, for that matter). While ordinarily, it is typically not possible to make division any faster than what the ‘div’ instruction gives you performance-wise, under special circumstances it is possible to do a bit better. Specifically, if the compiler knows that one of the operands to the division operation is a constant (specifically, the divisor), then it can engage some clever tricks in order to eliminate the expensive ‘div’ instruction and turn it into a sequence of other instructions that, while initially longer and more complicated on the surface, are actually faster to execute.

In this instance, the compiler has taken one ‘div’ instruction and converted it into a multiply, subtract, add, and two right shift operations. The basic idea behind how the code the compiler produced actually gives the correct result is that it typically involves a multiplication with a large constant, where the low 32-bits are then discarded and then remaining high 32-bits (recall that the ‘mul’ instruction returns the result in edx:eax, so the high 32-bits of the result are stored in edx) have some transform applied, typically one that includes fast division by a power of 2 via a right shift.

Sure enough, we can demonstrate that the above code works by manually comparing the result to a known value, and then manually performing each of the individual operations. For example, assume that in the above program, the call to rand returns 10626 (otherwise known as 7 * 1518, for simplicity, although the optimized code above works equally well for non-multiples of 1518).

Taking the above set of instructions, we perform the following operations:

  1. Multiply 10626 by the magic constant 0x596179C3. This yields 0x00000E7E00001006.
  2. Discard the low 32-bits (eax), saving only the high 32-bits of the multiplication (edx; 0xE7E, or 3710 in decimal).
  3. Subtract 3710 from the original input value, 10626, yielding 6916.
  4. Divide 6916 by 2^1 (bit shift right by 1), yielding 3458.
  5. Add 3710 to 3458, yielding 7168.
  6. Divide 7168 by 2^10 (bit shift right by 10), yielding 7, our final answer. (10626 / 1518 == 7)

Depending on the values being divided, it is possible that there may only be a right shift after the large multiply. For instance, if we change the above program to divide by 1517 instead of 1518, then we might see the following code generated instead:

mov     eax, 5666F0A5h
mul     ecx
shr     edx, 9

Coming up with the “magical divide” constants in the first place involves a bit of math; interested readers can read up about it on DevDev.

If you’re reverse engineering something that has these characteristics (large multiply that discards the low 32-bits and performs some transforms on the high 32-bits), then I would recommend breaking up each instruction into the corresponding mathematical operation (be aware of tricks like shift right to divide by a power of 2, or shift left to multiply by a power of 2). As a quick “litmus test”, you can always just pick an input and do the operations, and then simply compare the result to the input and see if it makes sense in the context of a division operation.

More on other optimizer tricks in a future posting…

Compiler optimizer tricks in x86 assembly, part 1

Saturday, February 10th, 2007

The compiler is often very clever about speeding up some common operations in C (with how they might appear in assembler), in a way that might at first appear a bit non-obvious. With a bit of practice, you can train yourself to quickly identify these optimizations and see what they really do. These kinds of optimizations are very common with constant values.

With many of these little “assembly tricks”, the compiler is not simply taking into account instruction speed, but also instruction size, in order to reduce the total amount of opcode bytes required to do an optimization.

This post series attempts to provide a basic overview (and rationale) of several of the most common compiler optimizations that you might see while analyzing x86 assembly code. Knowing what these “assembly tricks” do is a great benefit if you are debugging or reverse engineering something, as it allows one to quickly recognize certain compiler patterns and gain clues as to what a program is doing.

Without further ado, here’s a brief overview of several of these such optimizations:

  1. Clearing a register value by xor reg, reg is a very common sequence in x86 code generated by a compiler. You will almost never see an instruction of the form mov reg, 0. Instead, the compiler will virtually always use the above-mentioned xor construct.

    The reasoning behind this is that the xor reg, reg construct is a very small instruction in terms of opcode bytes (2 bytes), whereas assigning a constant 32-bit value is typically much more expensive in terms of opcode length (say, 5 bytes).

    The gain with this optimization is reduced code size. Reducing code size is always good, and can lead to improved performance in that it improves the cacheability of a particular chunk of code (remember, most processor cache sizes are still very small compared to main system memory or hard drives). Also, if the compiler can shrink down the total image size by even one page (e.g. 4096 bytes) with optimizations like this, that’s one less page fault that needs to be taken when loading the program. This can be noticible even in lengths less than a page, if a function can be kept within one page instead of spilling over to a neighboring page (which might not be used for the current code path), which can eliminate “extraneous” page faults, where most of the data brought in is unnecessary.

    (It’s worth noting that this sort of optimization has come a long way recently, in the form of profile-guided BBT-style optimizations that reorder “hot” code paths to be on the same page in an attempt to make every byte that is in-paged from disk be as relevant as possible to the current execution path.)

  2. The constant zero register another very common optimization technique used by the compiler. Since the value zero appears so frequently in many functions (e.g. default parameter values are typically zero, and it is very common to compare values against zero or assign values to zero), the compiler will sometimes dedicate an entire register to contain the value zero throughout most of the function. Here’s an example, taken from nt!MmZeroPageThread:

    xor     esi, esi
    cmp     [ebp+var_14], esi
    [...]
    push    esi             ; WaitBlockArray
    push    esi             ; Timeout
    push    esi             ; Alertable
    push    esi             ; WaitMode
    push    8               ; WaitReason
    xor     ebx, ebx
    inc     ebx
    push    ebx             ; WaitType
    lea     eax, [ebp+Object]
    push    eax             ; Object
    push    2               ; Count
    call    _KeWaitForMultipleObjects@32

    Here, the compiler has dedicated the “esi” register to be the constant zero value for this function. It is used in many cases; in just a small part of nt!MmZeroPageThread, for instance, we can see that it is being used as both an argument to the “cmp” instruction for a test against constant zero, and we can also see it being used to fill many constant zero parameter values for a call to KeWaitForMultipleObjects.

    The gain from using a constant zero register is typically reduced code size. In x86, in many cases, it takes less opcode bytes to assign or compare a value to a register than to an immediate constant operand. The compiler takes advantage of this fact by only “paying the price” for referencing the value zero in an immediate operand for an instruction once, by storing it into a register. Then, the compiler can simply refer to that register for smaller overall code size if references to constant zero are common enough. For example, in the above code fragment, the compiler saves one opcode byte per push esi instruction over doing a push 0 instruction.

  3. Fast multiplication or addition with the lea instruction is another fairly common optimization that one is likely to run into frequently. The lea instruction (load effective address) was originally intended for use with pointer math, but it also turns out to be quite useful for doing fast constant math on non-pointer values as well, in part due to the wide range of operations that can be done with this instruction.

    Consider the following code fragment:

    mov     eax, [ebp+some_local]
    movzx   ecx, bx
    lea     eax, [eax+eax*4]
    lea     eax, [ecx+eax*2-30h]
    mov     [ebp+other_local], eax

    This instruction sequence may seem a bit convoluted at first, but it’s not too bad if we break it down into its constituent parts (and then combine it into one operation).

    We have the following operations:

    lea     eax, [eax+eax*4]

    This operation is equivalent to the following in C:

    other_local = some_local;
    other_local *= 5;

    Then, we’ve got the second lea operation:

    lea     eax, [ecx+eax*2-30h]

    In C, this might look like so:

    other_local = other_local * 2 + X - 0x30;

    …(where X corresponds to bx (and then ecx)).

    If we combine the two together, we get the following expression in C:

    other_local = some_local * 10 + X - 0x30;

    Now, the compiler could have used a combination of mul, add, and sub instructions to achieve the same effect. However, this would be more expensive in terms of instruction size, as those instructions are designed to work with values that are not known at runtime. By using the lea instruction, the compiler can take advantage of the fact that the lea instruction can perform multiple operations with one instruction in order to reduce code size.

  4. The lea instruction is also useful for scaling array indicies to their native type. Recall that in C, if you subscript an array, the subscript is performed on whole array elements; in the x86 instruction set, however, the processor has no way of magically knowing the size of an array element when the programmer subscripts an array.

    For example, consider the following structure:

    0:000> dt nt!_MMPFN
       +0x000 u1               : __unnamed
       +0x004 PteAddress       : Ptr32 _MMPTE
       +0x008 u2               : __unnamed
       +0x00c u3               : __unnamed
       +0x010 OriginalPte      : _MMPTE
       +0x010 AweReferenceCount : Int4B
       +0x014 u4               : __unnamed ; 4 bytes
    

    In this case, sizeof(nt!_MMPFN) == 24 (0x18). Consider an array of _MMPFN structures like so:

    _MMPFN MmPfnDatabase[ array_size ];

    If the programmer wants to index MmPfnDatabase (i.e. retrieve a pointer to a particular _MMPFN element within the array), then the compiler needs to convert an index into a pointer to an _MMPFN structure contained within the array.

    For example, the programmer might write the following C code:

    _MMPFN* Pfn = &MmPfnDatabase[ PfnIndex ];

    At the x86 instruction set level, though, the compiler needs to convert PfnIndex into a pointer. This requires two operations: First, PfnIndex needs to be scaled to the array size (or multipled by sizeof(_MMPFN). Second, the resultant value needs to be added to the base of MmPfnDatabase to form a complete pointer value to the requested _MMPFN element.

    In order to accomplish this, the compiler might emit the following instruction sequence:

    mov     eax, [ebp+PfnIndex]
    mov     ecx, _MmPfnDatabase
    push    ebx
    mov     ebx, [ebp+arg_4]
    lea     eax, [eax+eax*2]
    push    esi
    lea     esi, [ecx+eax*8]
    

    Here, the lea instruction is used to take the PfnIndex and MmPfnDatabase values and combine them into an _MMPFN pointer (stored in “esi”). If we break down the individual operations performed, what’s going on here isn’t all that difficult to understand:

    1. The initial LEA operation is equivalent to multiplying PfnIndex by 3 (PfnIndex is stored into “eax”).
    2. The final LEA operation multiplies the result of the first LEA operation by 8 (which can be simplified to say that PfnIndex has been multiplied by 24, which is conveniently equal to sizeof(_MMPFN).
    3. Finally (also part of the last LEA operation), “ecx” (which was loaded with the base address of MmPfnDatabase) is added to the intermediate result, which is then placed into “esi”, forming a completed _MMPFN pointer.

    Like with performing constant math for non-array indexing, the advantage of using lea over a series of mul and add operations is primarily code size, as lea allows for several distinct operations (e.g. multiply/add, or multiply/subtract) to be combined into a single compact operation. Most processors also provide very fast implementations of the lea instruction (as compared to the mul instruction for constant multiply operations).

    In order to be able to differentiate between an array indexing operation and plain constant math using lea, I would recommend checking to see whether any of the input values are treated as pointers or if the output value is treated as a pointer. Usually, it’s fairly easy to make this determination, though.

    As an aside, if you are reverse engineering something, constructs like array index operations are very handy as they will definitively tell you the size of the structure comprising an array.

The next post in this series will continue this discussion and examine several more common (and more complicated) optimizations and assembly tricks that the compiler may emit on x86. Stay tuned…

Don’t always trust the compiler… (or when reverse engineering comes in handy even when you’ve got source code)

Wednesday, January 10th, 2007

Usually, when a program breaks, you look for a bug in the program. On the rare occasion, however, compilers have been known to malfunction.

I ran into such a problem recently. At my apartment, I have a video streaming system setup, wherein I have a TV tuner plugged into a dedicated desktop box. That desktop box has been setup to run VLC noninteractively in order to stream (broadcast) TV from the TV tuner onto my apartment LAN. Then, if I want to watch TV, all I have to do is pull up VLC at a computer and tell it to display the MPEG stream I have configured to be broadcast on my local network.

This works fairly well (although VLC isn’t without it’s quirks), and it’s got the nice side effect of that I have a bit more flexibility as to where I want to watch TV at without having to invest in extra hardware (beyond a TV tuner). Furthermore, I can even do silly things like put TV up on multiple monitors if I really wanted to, something not normally doable if you just use a “plain” TV set (the old fashioned way!).

Recently, though, one of my computers ceased being able to play the MPEG stream I was running over my network. Investigation showed that other computers on the LAN weren’t having problems displaying the stream; only this one system in particular wouldn’t play the stream correctly. When I connected VLC to the stream, I’d get a blank black screen with no audio or video. I checked out the VLC debug message log and found numerous instances of this log message:

warning: received bufer in the future

Hmm. It seemed like VLC was having timing-related problems that were causing it to drop frames. My first reaction was that VLC had some broken-ness relating to the handling of large uptimes (this system in question had recently exceeded the “49.7 day boundary”, wherein the value returned by GetTickCount, a count in milliseconds of time elapsed since the system booted, wraps around to zero). I set out to prove this assumption by setting a breakpoint on kernel32!GetTickCount in the debugger and attaching VLC to the stream. While GetTickCount was occasionally called, it turned out that it wasn’t being used in the critical code path in question.

So, I set out to find that log message in the VLC source code (VLC is open source). It turned out to be coming from a function relating to audio decoding (aout_DecPlay). The relevant code turned out to be as follows (reformatting by me):

[...]
if ( p_buffer->start_date > mdate() +
     p_input->i_pts_delay           +
     AOUT_MAX_ADVANCE_TIME )

{
     msg_Warn( p_aout,
      "received buffer in the future ("I64Fd")",
       p_buffer->start_date - mdate());

[...]

After logging this warning, the function in question drops the frame with the assumption that it is probably bogus due to bad timing information.

Clearly, there was nothing wrong with the stream itself, as I could still play the stream fine on other computers. In fact, restarting VLC on the computer hosting the stream, or the computer hosting the VLC stream itself both did nothing to resolve the problem; other computers could play the stream, except for one system (with a high uptime) that would always fail due to bad timing information.

In this case, it turns out that the mdate function is an internal VLC function used all over the place for high resolution timing. It returns a microsecond-precision counter that is monotically incrementing since VLC started (or in the case of Win32 VLC, since Windows was started). I continued to suspect that something was wrong here (as the only system that was failing to play the stream had a fairly high uptime). Looking into the source for mdate, there were two code paths that could be taken on Win32; one that used GetTickCount for timing resolution (though this code path in question does handle tick count wraparound), and another path that utilizes QueryPerformanceCounter and QueryPerformanceFrequency for high resolution timing, if VLC thinks that the performance counter is slaved to the system timer clock. (Whether or not the latter is really a good thing to do period on Windows is debatable; I would say no, but it appears to work for VLC.)

As I had already ruled out GetTickCount as being used in the timing-critical parts of VLC, I ignored the GetTickCount-related code path in mdate. This left the following segment of code in the Win32 version of mdate:

/**
 * Return high precision date
 *
 * Uses the gettimeofday() function when
 *  possible (1 MHz resolution) or the
 * ftime() function (1 kHz resolution).
 */
mtime_t mdate( void )
{
 /* We don't need the real date, just the value of
    a high precision timer */
 static mtime_t freq = I64C(-1);

 if( freq == I64C(-1) )
 {
  /* Extract from the Tcl source code:
   * (http://www.cs.man.ac.uk/fellowsd-bin/TIP/7.html)
   *
   * Some hardware abstraction layers use the CPU clock
   * in place of the real-time clock as a performance counter
   * reference.  This results in:
   * - inconsistent results among the processors on
   *   multi-processor systems.
   * - unpredictable changes in performance counter frequency
   *   on "gearshift" processors such as Transmeta and
   *   SpeedStep.
   * There seems to be no way to test whether the performance
   * counter is reliable, but a useful heuristic is that
   * if its frequency is 1.193182 MHz or 3.579545 MHz, it's
   * derived from a colorburst crystal and is therefore
   * the RTC rather than the TSC.  If it's anything else, we
   * presume that the performance counter is unreliable.
   */

  freq = ( QueryPerformanceFrequency( (LARGE_INTEGER *)&freq )
      && (freq == I64C(1193182) || freq == I64C(3579545) ) )
      ? freq : 0;
 }

 if( freq != 0 )
 {
  LARGE_INTEGER counter;
  QueryPerformanceCounter (&counter);

  /* Convert to from (1/freq) to microsecond resolution */
  /* We need to split the division to avoid 63-bits
       overflow */
  lldiv_t d = lldiv (counter.QuadPart, freq);

  return (d.quot * 1000000)
    + ((d.rem * 1000000) / freq);
 }
[...]
}

This code isn’t all that hard to follow. The idea is that the first time around, mdate will check the performance counter frequency for the current system. If it is one of two magical values, then mdate will be configured to use the performance counter for timing. Otherwise, it is configured to use an alternate method (not shown here), which is based on GetTickCount. On the system in question, mdate was being set to use the performance counter and not GetTickCount.

Assuming that mdate has decided on using the system performance counter for timing purposes (which, again, I do not believe is a particularly good (portable) choice, though it does happen to work on my system), then mdate simply divides out the counter value by the frequency (count of counter units per second), adjusted to return a nanosecond value (hence the constant 1000000 vales). The reason why the original author split up the division into two parts is evident by the comment; it is an effort to avoid an integer overflow when performing math on large quantities (it avoids multiplying an already very large (64-bit) value by 1000000 before the divission, which might then exceed 64 bits in the resultant quantity). (In case you were wondering, lldiv is a 64-bit version of the standard C runtime function ldiv; that is, it performs an integral 64-bit division with remainder.)

Given this code, it would appear that mtime should be working fine. Just to be sure, though, I decided to double check what was going on the debugger. Although VLC was built with gcc (and thus doesn’t ship with WinDbg-compatible symbol files), mtime is a function exported by one of the core VLC DLLs (libvlc.dll), so there wasn’t any great difficulty in setting a breakpoint on it with the debugger.

What I found was that mdate was in fact returning a strange value (to be precise, a large negative value – mtime_t is a signed 64-bit integer). Given the expression used in the audio decoding function snippet I listed above, it’s no surprise why that would break if mdate returned a negative value (and it’s a good assumption that other code in VLC would similarly break).

The relevant code portions for the actual implementation of mdate that gcc built were as so:

libvlc!mdate+0xe0:
62e20aa0 8d442428     lea     eax,[esp+0x28]
62e20aa4 890424       mov     [esp],eax
;
; QueryPerformanceCounter(&counter)
;
62e20aa7 e874640800   call    QueryPerformanceCounter
62e20aac 83ec04       sub     esp,0x4
62e20aaf b940420f00   mov     ecx,0xf4240 ; 1000000
62e20ab4 8b742428     mov     esi,[esp+0x28]
62e20ab8 8b7c242c     mov     edi,[esp+0x2c]
62e20abc 89f0         mov     eax,esi
62e20abe f7e1         mul     ecx
62e20ac0 89c1         mov     ecx,eax
62e20ac2 69c740420f00 imul    eax,edi,0xf4240 ; 1000000
62e20ac8 890c24       mov     [esp],ecx
62e20acb 8b3dcc7a2763 mov     edi,[freq.HighPart]
62e20ad1 8d3402       lea     esi,[edx+eax]
62e20ad4 897c240c     mov     [esp+0xc],edi
62e20ad8 8b15c87a2763 mov     edx,[freq.LowPart]
62e20ade 89742404     mov     [esp+0x4],esi
62e20ae2 89542408     mov     [esp+0x8],edx
;
; lldiv(...)
;
62e20ae6 e815983e00   call    lldiv
62e20aeb 8b5c2430     mov     ebx,[esp+0x30]
62e20aef 8b742434     mov     esi,[esp+0x34]
62e20af3 8b7c2438     mov     edi,[esp+0x38]
62e20af7 83c43c       add     esp,0x3c
62e20afa c3           ret

This bit of code might look a bit daunting at first, but it’s not too bad. Translated into C, it looks approximately like so:

LARGE_INTEGER counter, tmp;

QueryPerformanceCounter(&counter);

tmp.LowPart  = Counter.LowPart  * 1000000;
tmp.HighPart = Counter.HighPart * 1000000 +
    (((unsigned __int64)counter.LowPart  * 1000000) >> 32);

d = lldiv(tmp.QuadPart, freq);

return d.quot;

This looks code looks a little bit weird, though. It’s not exactly the same thing that we see in the VLC source code, even counting for differences that might arise between original C source code and reverse enginereed C source code; in the compiled code, the expression in the return statement has been moved before the call to lldiv.

In fact, the code has been heavily optimized. The compiler (gcc, in this case) apparently assumed some knowledge about the inner workings of lldiv, and decided that it would be safe to pre-calculate an input value instead of perform post-calculations on the result of lldiv. The calculations do appear to be equivalent, at first; the compiler simply moved a multiply around relative to a division that used remainders. Basic algebra tells us that there isn’t anything wrong with doing this.

However, there’s one little complication: computers don’t really do “basic algebra”. Normally, in math, you typically assume an unlimited space for variables and intermediate values, but in computer-land, this isn’t really the case. Computers approximate the set of all integer values in a 32-bit (or 64-bit) number-space, and as a result, there is a cap on how large (or small) of an integer you can represent natively, at least without going to a lot of extra work to support truly arbitrarily large integers (as is often done in public key cryptography implementations).

Taking a closer look at this case, there is a problem; the optimizations done by gcc cause some of the intermediate values of this calculation to grow to be very large. While the end result might be equivalent in pure math, when dealing with computers, the rules change a bit due to the fact that we are dealing with an integer with a maximum size of 64 bits. Specifically, this ends up being a problem because the gcc-optimized version of mdate multiplies the raw value of “counter” by 1000000 (as opposed to multiplying the result of the first division by 1000000). Presumably, gcc has performed this optimization as multiply is fairly cheap as far as computers go (and division is fairly expensive in comparison).

Now, while one might naively assume that the original version of mdate and the instructions emitted by gcc are equivalent, with the above information in mind, it’s clear that this isn’t really the case for the entire range of values that might be returned by QueryPerformanceCounter. Specifically, if the counter value multiplied by 1000000 exceeds the range of a 64-bit integer, then the two versions of mdate will not return the same value, as in the second version, one of the intermediate values of this calculation will “wrap around” (and in fact, to make matters worse, mdate is dealing with signed 64-bit values here, which limits the size of an integer to 63 significant bits, with one bit reserved for the representation of the integer’s sign).

This can be experimentally confirmed in the debugger, as I previously alluded to. Stepping through mdate, around the call to lldiv specifically, we can see that the intermediate value has exceeded the limits of a 63-bit integer with sign bit:

0:007>
eax=d5ebcb40 ebx=00369e99 ecx=6f602e40
edx=00369e99 esi=d5f158ce edi=00000000
eip=62e20ae6 esp=01d0fbfc ebp=00b73068
iopl=0         ov up ei ng nz nape cy
cs=001b  ss=0023  ds=0023  es=0023
fs=003b  gs=0000  efl=00000a83
libvlc!mdate+0x126:
62e20ae6 e815983e00       call    lldiv
0:007> dd @esp
01d0fbfc  6f602e40 d5f158ce 00369e99 00000000
01d0fc0c  00000060 01d0ffa8 77e6b7d0 77e6bb00
01d0fc1c  ffffffff 77e6bafd 5d29c231 00000e05
01d0fc2c  00000000 00b73068 00000000 62e2a544
01d0fc3c  00000f08 ffffffff 01d0fd50 00b72fd8
01d0fc4c  03acc670 00a49530 01d0fd50 6b941bd2
01d0fc5c  00a49530 00000f30 00000000 00000003
01d0fc6c  00b24130 00b53e20 0000000f 00b244a8
0:007> p
eax=e1083396 ebx=00369e99 ecx=ffffffff
edx=ffffff3a esi=d5f158ce edi=00000000
eip=62e20aeb esp=01d0fbfc ebp=00b73068
iopl=0         nv up ei pl nz na po nc
cs=001b  ss=0023  ds=0023  es=0023
fs=003b  gs=0000  efl=00000206
libvlc!mdate+0x12b:
62e20aeb 8b5c2430         mov     ebx,[esp+0x30]

Using our knowledge of calling conventions, it’s easy to retrieve the arguments to lldiv from the stack: tmp.QuadPart is 0xd5f158ce6f602e40, and freq is 0x0000000000369e99

It’s clear that counter.QuadPart has overflowed here; considering the sign bit is set, it now holds a (very large) negative quantity. Since the remainder of the function does nothing that would influence the sign bit of the result, after the division, we get another (large, but closer to zero) negative value back (stored in edx:eax, the value 0xffffff3ae1083396). This is the final return value of mdate, which explains the problems I was experiencing with playing video streams; large negative values were being returned and causing sign-sensitive inequality tests on the return value of mdate (or a derivative thereof) to operate unexpectedly.

In this case, it turned out that VLC failing to play my video stream wasn’t really the fault of VLC; it ended up being a bug in gcc’s optimizer that caused it to make unsafe optimizations that introduce calculation errors. What’s particularly insidious about this mis-optimization is that it is invisible until the input values for the operations involve grow to a certain size, after which calculation results are wildly off. This explains why nobody else has run into this problem in VLC enough to get it fixed by now; unless you run VLC on Windows systems with a high uptime, where VLC is convinced that it can use the performance counter for timing, you would never know that the compiler had introduced a subtle (but serious) bug due to optimizations.

As to fixing this problem, there are a couple of approaches the VLC team could take. The first is to update to a more recent version of gcc (if newer gcc versions fix this problem; I don’t have a build environment that would let me compile all of VLC, and I haven’t really had much luck in generating a minimal repro for this problem, unfortunately). Alternatively, the function could be rewritten until gcc’s optimizer decided to stop trying to optimize the division (and thus introduce calculation errors).

A better solution would be to just drop the usage of QueryPerformanceCounter entirely, though. For VLC’s usage, GetTickCount should be close enough timing-wise, and you can even increase the resolution of GetTickCount up to around 1ms (with good hardware) using timeBeginTime. GetTickCount does have the infamous 49.7-day wraparound problem, but VLC does have a workaround that works. Furthermore, on Windows Vista and later, GetTickCount64 could be used, turning the 49.7-day limit into a virtual non-issue (at least in our lifetimes, anyway).

(Oh, and in case you’re wondering why I didn’t just fix this myself and submit a patch to VLC (after all, it’s open source, so why can’t I just “fix it myself”?): VLC’s source code distribution is ~100mb uncompressed, and I don’t really want to go spending a great deal of time to find a cygwin version that works correctly on Vista x64 with ASLR and NX enabled (cygwin’s fault, not Vista’s) so that I can get a build environment for VLC up and running so that I could test any potential fix I make (after debugging the inevitable build environment difficulties along the way for such a large project). I might still do this at some point, perhaps to see if recent gcc versions fix this optimizer bug, though.)

For now, I just patched my running VLC instance in-memory to use GetTickCount instead, using the debugger. Until I restart VLC, that will have to do for now.

Programming against the x64 exception handling support, part 4: Unwind internals (RtlUnwindEx implementation)

Monday, January 8th, 2007

In the previous article in this series, I discussed the external interface exposed by RtlUnwindEx (and some of how unwinding works at a high level). This posting continues that discussion, and aims to provide insight into the internal workings of RtlUnwindEx (and as such, the inner details of all of the different aspects of unwind support on x64 Windows).

As previously described, the main behavior of RtlUnwindEx is to systematically unwind call frames (with the help of RtlVirtualUnwind) until a specific call frame, which is identified by the TargetFrame argument, is reached. RtlUnwindEx is also responsible for all interactions with language exception handlers for purposes of unwind operations. Additionally, RtlUnwindEx also imposes various validations and restrictions on execution contexts being unwound, and on the behavior of exception handlers being called for an unwind operation.

The first order of business within RtlUnwindEx is to capture the execution context at the time of the call to RtlUnwindEx (specifically, the execution context inside RtlUnwindEx, not of the caller of RtlUnwindEx). This is done with the aid of two helper functions, RtlpGetStackLimits (which retrieves the bounds of the stack for the current thread from the NT_TIB region of the current threads’ TEB), and RtlCaptureContext (which records the complete execution context of its caller within a standard CONTEXT structure). Additionally, if an unwind table is supplied, a special flag is set in it that optimizes the behavior of subsequent calls to RtlLookupFunctionTable for lookups that are unwind-driven (this is a behavior new to Windows Vista, and is a further attempt to improve the performance of unwind support on x64).

If the caller did not supply an EXCEPTION_RECORD argument, RtlUnwindEx will create the default STATUS_UNWIND exception record at this time and substitute it for what would have otherwise been a caller-supplied EXCEPTION_RECORD block. The exception record is initialized with an ExceptionAddress pointing to the Rip value captured previously by RtlCaptureContext, and with no parameters. Additionally, an initial ExceptionFlags value of EXCEPTION_UNWINDING is set, to later indicate to any exception handlers that might be called that an unwind operation is in progress (the EXCEPTION_RECORD pointer, either caller supplied or locally allocated by RtlUnwindEx in the absence of a caller-supplied value, corresponds exactly to the EXCEPTION_RECORD argument passed to any LanguageHandler that is called during unwind processing).

In the event that the caller of RtlUnwindEx did not supply a TargetFrame argument (indicating that the requested unwind operation is an exit unwind), then the EXCEPTION_EXIT_UNWIND flag is set within RtlUnwindEx’s internal ExceptionFlags value. An exit unwind is a special form of unwind where the “target” of the unwind is unknown; in other words, the caller does not have a valid target frame pointer to supply to RtlUnwindEx. Initiating a target unwind is normally dangerous unless the caller has special knowledge of an unwind handler in the call stack that will halt the unwind operation prematurely (either by initiating a secondary unwind, which leads to what is called a collided unwind, or by exiting the thread entirely). The reason for this restriction is that as RtlUnwindEx doesn’t have a clear “stopping point” to halt the unwind cycle at, it will happily unwind past the end of the stack (typically resulting in an access violation) unless an unwind handler along the way does something to halt the unwind. Most unwind operations are not exit unwinds.

At this point, RtlUnwindEx is set up to enter the main loop of the unwind algorithm, which essentially involves repeated calls to RtlVirtualUnwind, and then to unwind handlers (if present). This main loop involves multiple steps:

  1. The RUNTIME_FUNCTION entry for the current frame (given by the Rip member of the context record captured above, and later updated in this loop) is located via RtlLookupFunctionEntry. If no function entry is present, then RtlUnwindEx will load Context->Rip with a ULONG64 value located at Context->Rsp, and then increment Context->Rsp by 8. The behavior when there is no RUNTIME_FUNCTION entry present accounts for leaf functions, for which unwind metadata is optional. If the current frame is a leaf function, then control skips forward to step 8.
  2. Assuming that a RUNTIME_FUNCTION was found, RtlUnwindEx makes a copy of the current execution context that will be unwound – something I call the “unwind context”. After duplicating the context (via the RtlpCopyContext helper function, which only duplicates the non-volatile context), RtlVirtualUnwind is called (with the unwind context), and requested to return the address any associated language handler that is marked for unwind support. RtlVirtualUnwind thus returns several useful pieces of information; a language handler supporting unwind (if any), an updated context describing the caller of the requested call frame, a language-handler-specific (i.e. C scope table) data pointer associated with the requested call frame (if any), and the stack pointer of the call frame being unwound (the establisher frame). These pieces of information are used later in communication with a returned exception handler with unwind support, if one exists.
  3. After calling RtlVirtualUnwind to establish the context of the next location on the stack frame (now contained within the “unwind context” location), RtlUnwindEx performs some validation of the returned EstablisherFrame value. Specifically, the EstablisherFrame value is ensured to be 8-byte aligned and within the stack limits of the current thread (in kernel mode, there is also special support for handling the case of an unwind occcuring within the context of a DPC, which may operate under a secondary stack). If either of these conditions does not hold true, a STATUS_BAD_STACK exception is raised, indicating that the stack pointer in the requested call frame is damaged or corrupted. Additionally, if a TargetFrame value is specified (that is, the unwind operation is not an exit unwind), then the TargetFrame value is validated to be greater than or equal to the EstablisherFrame value returned by RtlVirtualUnwind. This is, in effect, a sanity check designed to ensure that the unwind target actually refers to a previous call frame and not that one that has already be unwound. If this check fails, then a STATUS_BAD_STACK exception is raised.
  4. If a language handler was returned by RtlVirtualUnwind, then RtlUnwindEx sets up for a call to the language handler. This involves the initial setup of a DISPATCHER_CONTEXT structure created on the stack of RtlUnwindEx. The DISPATCHER_CONTEXT structure describes some internal state that RtlUnwindEx shares with all participants in the unwind process, such as language handlers being called for unwind. It contains all of the state information necessary to coordinate operation between RtlUnwindEx and any language handler. Furthermore, it is also instrumental in the processing of collided unwinds; more on that later. The newly initialized DISPATCHER_CONTEXT contains two fields of significance, initially; the TargetIp field (which is simply a copy of the TargetIp argument to RtlUnwindEx), and the ScopeIndex field (which is zero initialized). Both of these fields are unused by RtlUnwindEx itself, and are simply available for the conveniene of language handlers being called for an unwind operation. If no language handler was present for the requested call frame, then control skips forward to step 8.
  5. At this point, RtlUnwindEx is ready to make a call to an unwind handler. This first involves a quick check to determine whether the end of the unwind chain has been reached, through comparing the current frame’s EstablisherFrame value with the TargetFrame argument to RtlUnwindEx. If the two frame pointers match exactly, then the ExceptionFlags value passed in to the unwind handler has an additional bit set, EXCEPTION_TARGET_UNWIND. This flag bit lets the unwind handler know that it is the “last stop” in the unwind process (in other words, that there will be no further frame unwinds after the unwind handler finishes processing). At this point, the ReturnValue argument passed to RtlUnwindEx is copied into the Rax register image in the active context for the current frame (not the unwound context, which refers to the previous frame). Then, the last remaining fields of the DISPATCHER_CONTEXT structure are initialized based on the internal state of RtlUnwindEx; the image base, handler data, instruction pointer (ControlPc), function entry, establisher frame, and language handler values previously returned by RtlLookupFunctionEntry and RtlVirtualUnwind are copied into the DISPATCHER_CONTEXT structure, along with a pointer to the context record describing the execution state at the current frame. After the ExceptionFlags member of RtlUnwindEx’s EXCEPTION_RECORD structure is set, the stack-based exception flags image (from which the copy in the EXCEPTION_RECORD was copied from) has the EXCEPTION_TARGET_UNWIND and EXCEPTION_COLLIDED_UNWIND flags cleared, to ensure that these flags are not inadvertently passed to an exception routine unexpectedly in a future loop iteration.
  6. After preparing the DISPATCHER_CONTEXT for the unwind handler call, RtlUnwindEx makes a call to a small helper function, RtlpExecuteHandlerForUnwind. RtlpExecuteHandlerForUnwind is an assembly-language routine whose prototype matches that of the language specific handler, given below:
    typedef EXCEPTION_DISPOSITION (*PEXCEPTION_ROUTINE) (
        IN PEXCEPTION_RECORD               ExceptionRecord,
        IN ULONG64                         EstablisherFrame,
        IN OUT PCONTEXT                    ContextRecord,
        IN OUT struct _DISPATCHER_CONTEXT* DispatcherContext
    );

    RtlpExecuteHandlerForUnwind is fairly straightforward. All it does is store the DispatcherContext argument on the stack, and then make a call to the LanguageHandler member in the DISPATCHER_CONTEXT structure. RtlpExecuteHandler then returns the return value of the LanguageHandler itself.

    While this may seem like a rather useless helper routine at first, RtlpExecuteHandlerForUnwind actually does add some value, although it might not be immediately apparent unless one looks closely. Specifically, RtlpExecuteHandlerForUnwind registers an exception/unwind handler for itself (RtlpUnwindHandler). RtlpUnwindHandler does not go through _C_specific_handler; in other words, it is a raw exception handler registration. Like RtlpExecuteHandlerForUnwind, RtlpUnwindHandler is a raw assembly language routine. It, too, is fairly simple (and as a language-level exception handler routine, RtlpUnwindHandler is compatible with the LanguageHandler prototype described above); RtlpUnwindHandler uses the EstablisherFrame argument given to a LanguageHandler routine to locate the saved pointer to the DISPATCHER_CONTEXT on the stack of RtlpExecuteHandlerForUnwind, and then copies most of the DISPATCHER_CONTEXT structure passed to RtlpExecuteHandlerForUnwind over the DISPATCHER_CONTEXT structure that was passed to RtlpUnwindHandler itself (conspicuously omitted from the copy is the TargetIp member of the DISPATCHER_CONTEXT structure, for reasons that will become clear later). After performing the copy of the DISPATCHER_CONTEXT structure, RtlpUnwindHandler returns the manifest ExceptionCollidedUnwind constant. Although one might naively assume that all of this just leads up to protecting against the case of an unwind handler throwing an exception, it actually has a much more common (and significant) use; more on that later.

  7. After RtlpExecuteHandlerForUnwind returns, RtlUnwindEx decides what course of action to persue based on the return value. There are two legal return values from an exception handler called for unwind, ExceptionContinueSearch (the general “success”) return, and ExceptionCollidedUnwind. If any other value is returned, then RtlUnwindEx raises a STATUS_INVALID_DISPOSITION exception, indicating that an unwind handler has returned an illegal value (this is typically rarely seen in practice, as most unwind handlers are compiler generated, and therefore always get the return value correct). If ExceptionContinueSearch is returned, and the current EstablisherFrame doesn’t match the TargetFrame argument, then the unwind context and the context for the “current frame” are swapped (this positions the current frame context as referring to the context of the next function in the call chain, which will then be duplicated and unwound in the next loop iteration). If ExceptionCollidedUnwind is returned, then the execution path is a little bit more complicated. In the collided unwind case, all of the internal state information that RtlUnwindEx had previously copied into the DISPATCHER_CONTEXT structure passed to RtlpExecuteHandler back out of the DISPATCHER_CONTEXT structure. RtlVirtualUnwind is then executed to determine the next lowest call frame using the context copied out of the DISPATCHER_CONTEXT structure, the EXCEPTION_COLLIDED_UNWIND flag is set, and control is transferred to step 5. This step may initially seem strange, but it will become clear after it is explained later.
  8. If control reaches this point, then a frame has been successfully unwound, and any applicable unwind handler has been notified of the unwind operation. The next step is a re-validation of the EstablisherFrame value (as it may have changed in the collided unwind case). Assuming that EstablisherFrame is valid, if its value does not match the TargetFrame argument, then control is transferred to step 1. Otherwise, if there is a match, then the loop terminates. (If the EstablisherFrame is not valid, and is not the expected TargetFrame value, then either the unwind exception record is raised as an exception, or a STATUS_BAD_FUNCTION_TABLE exception is raised.)

At this point, RtlUnwindEx has arrived at its target frame, and all intermediary unwind handlers have been called. It is now time to transfer control to the unwind point. The ReturnValue argument is again loaded into the current frame’s context (Rax register), and if the exception code supplied by the RtlUnwindEx caller via the ExceptionRecord argument does not match STATUS_UNWIND_CONSOLIDATE, the Rip value in the current frame’s context is replaced with the TargetIp argument.

The final task is to realize the finalized context; this is done by calling RtlRestoreContext, passing it the current frame’s context and the ExceptionRecord argument (or the default exception record constructed if no ExceptionRecord argument was supplied). RtlRestoreContext will in most cases simply copy the given context into the currently active register set, although in two special cases (if a STATUS_LONGJUMP or STATUS_UNWIND_CONSOLIDATE exception code is set in the optional ExceptionRecord argument), this behavior deviates from the norm. In the long jump case (as previously documented), the ExceptionRecord argument is assumed to contain a pointer to a jmp_buf, which contains a nonvolatile register set to restore on top of the unwound context supplied by RtlUnwindEx. The unwind consolidate case is rather more complicated, and will be discussed in a future posting.

For reference, I have posted some annotated, reverse engineered C and assembler code describing the internal operations of RtlUnwindEx and several of its helper functions (such as RtlpUnwindHander). This C code is based off of the Windows Vista implementation of RtlUnwindEx, and as such takes advantage of new Windows Vista-specific optimizations to unwind handling. Specifically, the “Unwind” flag in the UNWIND_HISTORY_TABLE structure is new in Windows Vista (although the size of the structure has not changed; there used to be empty alignment padding at that offset in previous Windows versions). This flag is used as a hint to RtlLookupFunctionEntry, in order to expedite lookup of function entries for some commonly referenced functions in the unwind path. Between the provided comments and the above description of the overall functionality of RtlUnwindEx, the inner workings of it should begin to come clear. There are some aspects (in particular, collided unwind) that are a bit more complicated than one might initially imagine; I’ll discuss collided unwinds (and more) in the next posting in this series.

It would be best to call the system version of RtlUnwindEx instead of reimplementing it for general purpose use (which I have done so here primarily to illustrate how unwind works on x64 Windows). There have been improvements made to RtlUnwindEx between Windows Server 2003 SP1 x64 and Windows Vista x64, so it would be unwise to assume that RtlUnwindEx will remain devoid of new performance or feature additions forever.

Next up: Collided unwinds, and other things that go “bump” in the dark when you use compiler exception handling and unwind support.

Programming against the x64 exception handling support, part 3: Unwind internals (RtlUnwindEx interface)

Sunday, January 7th, 2007

Previously, I provided a brief overview of what each of the core APIs relating to x64’s extensive data-driven unwind support were, and when you might find them useful.

This post focuses on discussing the interface-level details of RtlUnwindEx, and how they relate to procedure unwinding on Windows (x64 versions, specifically, though most of the concepts apply to other architecture in principle).

The main workhorse of unwind support on x64 Windows is RtlUnwindEx. As previously described, this routine encapsulates all of the work necessary to restore execution context to a prior point in the call stack (relying on RtlVirtualUnwind for this task). RtlUnwindEx also implements all of the logic relating to interactions with unwind/exception handlers during the unwind process (which is essentially the value added by RtlUnwindEx on top of what RtlVirtualUnwind implements).

In order to understand the inner workings of how unwinding works, it is first necessary to understand the high level theory behind how RtlUnwindEx is used (as RtlUnwindEx is at the heart of unwind support on Windows). Although there have been previously posted articles that touch briefly on how unwind is implemented, none that I have seen include all of the details, which is something that this segment of the x64 exception handling series shall attempt to correct.

For the moment, it is simpler to just consider the unwind half of exception handling. The nitty-gritty, exhaustive details of how exceptions are handled and dispatched will be discussed in a future posting; for now, assume that we are only interested in the unwind code path.

When a procedure unwind is requested, by any place within the system, the first order of business is a call to RtlUnwindEx. The prototype for RtlUnwindEx was provided in a previous posting, but in an effort to ensure that everyone is on the same page with this discussion, here’s what it looks like for x64:

VOID
NTAPI
RtlUnwindEx(
   __in_opt ULONG64               TargetFrame,
   __in_opt ULONG64               TargetIp,
   __in_opt PEXCEPTION_RECORD     ExceptionRecord,
   __in     PVOID                 ReturnValue,
   __out    PCONTEXT              OriginalContext,
   __in_opt PUNWIND_HISTORY_TABLE HistoryTable
   );

These parameters deserve perhaps a bit more explanation.

  1. TargetFrame describes the stack pointer (rsp) value for the target of the unwind operation. In normal circumstances, this is always the EstablisherFrame argument to an exception handler that is handling an exception. In the context of an exception handler, EstablisherFrame refers to the stack pointer of the caller of the function that caused the exception being inspected. Likewise, in this context, TargetFrame refers to the stack pointer of the function that the call stack should be unwound to. Although given the fact that with data-driven unwind semantics, one might initially think that this argument is unnecessary (after all, one might assume that RtlUnwindEx could simply invoke RtlVirtualUnwind in order to determine the expected stack pointer value for the next function on the call stack), this argument is actually required. The reason is that RtlUnwindEx supports unwinding past multiple procedure frames; that is, RtlUnwindEx can be used to unwind to a function that is several levels down in the call stack, instead of the immediately lower function in the call stack. Note that the TargetFrame argument must match exactly the expected stack pointer value of the target function in the call stack.

    Observant readers may pick up on the SAL annotation describing the TargetFrame argument and notice that it is marked as optional. In general, TargetFrame is always supplied; it can be omitted in one specific circumstance, which is known as an exit unwind; more on that later.

  2. TargetIp serves a similar purpose as TargetFrame; it describes the instruction pointer value that execution should be unwound to. TargetIp must be an instruction in the same function on the call stack that corresponds to the target stack frame described by TargetFrame. This argument is supplied as a particular function may have multiple points that could be resumed in response to an exception (this typically the case if there are multiple try/except clauses).

    Like TargetFrame, the TargetIp argument is also optional (though in most cases, it will be present). Specifically, if a frame consolidation unwind operation is being executed, then the TargetIp argument will be ignored by RtlUnwindEx and may be set to zero if desired (it will, however, still be passed to unwind handlers for use as they see fit). This specialized unwind operation will be discussed later, along with C++ exception support.

  3. ExceptionRecord is an optional argument describing the reason for an unwind operation. This is typically the same exception record that was indicated as the cause of an exception (if the caller is an exception handler), although it does not strictly have to be as such. If no exception record is supplied, RtlUnwindEx constructs a default exception record to pass on to unwind handlers, with an exception code of STATUS_UNWIND and an exception address referring to an instruction within RtlUnwindEx itself.
  4. ReturnValue describes a pointer-sized value that is to be placed in the return value register at the completion of an unwind operation, just before control is transferred to the newly unwound context. The interpretation of this value is entirely up to the routine being unwound into. In practice, the Microsoft C/C++ compiler does not use the return value at all in typical cases. Usually, the Microsoft C/C++ compiler will indicate the exception code that caused the exception as the return value, but due to how unwinding across functions works with try/except, there is no language-level support for retrieving the return value of a function that has been unwound due to an exception. As a result, in most circumstances, the return value placed in the unwound execution context based on this argument is ignored.
  5. OriginalContext describes an out-only pointer to a context record that is updated with the execution context as procedure call frames are unwound. In practice, as RtlUnwindEx does not ever “return” to its caller, this value is typically only provided as a way for a caller to supply its own storage to be used as scratch space by RtlUnwindEx during the intermediate unwind operations comprimising an unwind to the target call frame. Typically, the context record passed in to an exception handler from the exception dispatcher is supplied. Because the initial contents of the OriginalContext argument are not used, however, this argument need not necessarily be the context record passed in from the exception dispatcher.
  6. HistoryTable describes a cache used to improve the performance of repeated function entry lookups via RtlLookupFunctionEntry. Under normal circumstances, this is the same history table passed in from the exception dispatcher to an exception handler, although it could also be a caller-allocated structure as well. This argument can also be safely omitted entirely, although if a non-trivial set of call frames are being unwound, passing in even a newly-initialized history table may improve performance.

Given all of the above information, RtlUnwindEx performs a procedure call unwind by performing a successive sequence of RtlVirtualUnwind calls (to determine the execution context of the next call frame in the call stack), followed by a call to the registered language handler for the call frame (if one exists and is marked for unwinding support). In most cases where there is a language unwind handler, it will point to _C_specific_handler, which internally searches all of the internal exception handling scopes (e.g. try/except or try/finally constructs), calling “finally” handlers as need be. There may also be internal unwind handlers that are present in the scope table for a particular function, such as for C++ destructor support (assuming asynchronous C++ exception handling has been enabled). Most users will thus interact with unwind handlers in the form of a “finally” handler in a try/finally construct in a function whose language handler refers to _C_specific_handler.

If RtlUnwindEx encounters a “leaf function” during the unwind process (a leaf function is a function that does not use the stack and calls no subfunctions), then it is possible that there will be no matching RUNTIME_FUNCTION entry for the current call frame returned by RtlLookupFunctionEntry. In this case, RtlUnwindEx assumes that the return address of the current call frame is at the current value of Rsp (and that the current call frame has no unwind or exception handlers). Because the x64 calling convention enforces hard rules as to what functions without RUNTIME_FUNCTION registrations can do with the stack, this is a valid assumption for RtlUnwindEx to make (and a necessary assumption, as there is no way to call RtlVirtualUnwind on a function with no matching RUNTIME_FUNCTION entry). The current call frame’s value of Rsp (in the context record describing the current call frame, not the register value of rsp itself within RtlUnwindEx) is dereferenced to locate the call frame’s return address (Rip value), and the saved Rsp value is then adjusted accordingly (increased by 8 bytes).

When RtlUnwindEx locates the endpoint frame of the unwind, a special flag (EXCEPTION_TARGET_UNWIND) is set in the ExceptionFlags member of the EXCEPTION_RECORD passed to the language handler. This flag indicates to the language handler (and possibly any C-language scope handlers) that the handler is being called as the “final destination” of the unwind operation. The Microsoft C/C++ compiler does not expose functionality to detect whether a “finally” handler is being called in the context of a target unwind or if the “finally” handler is simply being called as an intermediate step towards the unwind target.

After the last unwind handler (if applicable) has been called, RtlUnwindEx restores the execution context that has been continually updated by successive calls to RtlVirtualUnwind. This restoration is performed by a call to RtlRestoreContext (a documented, exported function), which simply transfers a given context record to the thread’s execution context (thus “realizing” it).

RtlUnwindEx does not return a value to its caller. In fact, it typically does not return to its caller at all; the only “return” path for RtlUnwindEx is in the case where the passed-in execution context is corrupted (typically due to a bogus stack pointer), or if an exception handler does something illegal (such as returning an unrecognized EXCEPTION_DISPOSITION) value. In these cases, RtlUnwindEx will raise a noncontinuable exception describing the problem (via RtlRaiseStatus). These error conditions are usually fatal (and are indicative of something being seriously corrupted in the process), and virtually always result in the process being terminated. As a result, it is atypical for a caller of RtlUnwindEx to attempt to handle these error cases with an exception handler block.

In the case where RtlUnwindEx performs the requested unwind successfully, a new execution context describing the state at the requested (unwound) call frame is directly realized, and as such RtlUnwindEx does not ever truly return in the success case.

Although RtlUnwindEx is principally used in conjunction with exception handling, there are other use cases implemented by the Microsoft C/C++ compiler which internally rely upon RtlUnwindEx in unrelated capacities. Specifically, RtlUnwindEx implements the core of the standard setjmp and longjmp routines (assuming the exception safe versions of these are enabled by use of the <setjmpex.h> header file) provided by the C runtime library in the Microsoft CRT.

In the exception-safe setjmp/longjmp case, the jmp_buf argument essentially contains an abridged version of the execution context (specifically, volatile register values are omitted). When longjmp is called, the routine constructs an EXCEPTION_RECORD with STATUS_LONGJUMP as the exception code, sets up one exception information parameter (which is a pointer to the jmp_buf), and passes control to RtlUnwindEx (for the curious, the x64 version of the jmp_buf structure is described as _JUMP_BUFFER in setjmp.h under the _M_AMD64_ section). In this particular instance, the ReturnValue argument of RtlUnwindEx is significant; it corresponds to the value that is seemingly returned by setjmp when control is being transferred to the saved setjmp context as part of a longjmp call (somewhat similar in principal as to how the UNIX fork system call indicates whether it is returning to the child process or the parent process). The internal operations of RtlUnwindEx are identical whether it is being used for the implementation of setjmp/longjmp, or for conventional exception-handler-based triggering of procedure call frame unwinding.

However, there are differences that appear when RtlUnwindEx restores the execution context via RtlRestoreContext. There is special support inside RtlRestoreContext for STATUS_LONGJUMP exceptions with one exception information parameter; if this situation is detected, then RtlRestoreContext internally reinitializes portions of the passed-in context record based on the jmp_buf pointer stored in the exception information parameter block of the exception record provided to RtlRestoreContext by RtlUnwindEx. After this special-case partial reinitialization of the context record is complete, RtlRestoreContext realizes the context record as normal (causing execution control to be transferred to the stored Rip value). This can be seen as a hack (and a violation of abstraction layers; there is intended to be a logical separation between operating system level SEH support, and language level SEH support; this special support in RtlRestoreContext blurs the distinction between the two for C language support with the Microsoft C/C++ compiler). This layering violation is not the most egregious in the x64 exception handling scene, however.

This concludes the basic overview of the interface provided by RtlUnwindEx. There are some things that I have not yet covered, such as exit unwinds, collided unwinds, or the deep integration and support for C++ try/catch, and some of the highly unsavory things done in the name of C++ exception support. Next time: A walkthrough of the complete internal implementation of RtlUnwindEx, including undocumented, never-before-seen (or barely documented) corner cases like exit unwinds or collided unwinds (the internals of C++ exception support from the perspective of RtlUnwindEx are reserved for a future posting, due to size considerations).

Use a custom symbol server in conjunction with IDA with Vladimir Scherbina’s IDA plugin

Tuesday, December 5th, 2006

Vladimir Scherbina has recently released a useful IDA plugin that enhances IDA’s now built-in support for loading symbols via the symbol server to allow custom symbol server paths. This is something I personally have been wanting for some time; IDA’s PDB loading mechanism overrides _NT_SYMBOL_PATH with a hardcoded value of the Microsoft symbol server. This breaks my little trick for injecting symbol server support into programs that do not already support it, which is fairly annoying. Now, with Vladimir’s plugin, you can have IDA use a custom symbol server without having to hack the PDB plugin and change its hardcoded string constant for the Microsoft symbol server path. (Plus, you can have IDA use your local downstream store cache as well – another disadvantage to how IDA normally loads symbols via PDB.)

Win32 calling conventions review

Friday, November 10th, 2006

Recently, I’ve posted about the Win32 calling conventions. Here’s a table of contents of the various different posts I’ve made.

  1. Win32 calling conventions: Concepts
  2. Win32 calling conventions: Usage cases
  3. Win32 calling conventions: __cdecl in assembler
  4. Win32 calling conventions: __stdcall in assembler
  5. Win32 calling conventions: __fastcall in assembler
  6. Win32 calling conventions: __thiscall in assembler

Remember that when picking a calling convention to use, there are a number of factors to consider. There is no one calling convention that fits all cases (however, __stdcall is a good default if you are not sure).

Hopefully, you’ll have found this series to be enlightening, useful, and practically applicable.

Debugger flow control: More on breakpoints (part 2)

Wednesday, November 8th, 2006

Last time, I discussed the two types of breakpoints that you’ll see in a debugger (hardware and sfotware) at a high level. I didn’t really explain when it was best to use one instead of the other, though, besides a couple hints relating to hardware breakpoints being limited in number and good for tracking down memory corruption issues at times.

By taking a closer look at how each of the two breakpoints work, we can get some idea as to when we’ll prefer one to another. Both types of breakpoints alter the target in some way, but to differing degrees.

The primary concern with software breakpoints is that they actually involve patching memory in the target to set the breakpoint. This is usually fine; the debugger uses it as its default breakpoint strategy when you give an end address to g, for instance. However, it begins to break down if the target both executes a region that you are setting a breakpoint on, and also reads that same region.

This particular concern is a real problem when you are dealing with self-modifying code, or certain protection schemes (such as code that attempts to checksum itself in memory). In these cases, you might accidentally break self-modifying code, or trip a protection scheme, simply by virtue of setting a breakpoint (since the act of setting a software breakpoint actually modifies the address space of the target).

In cases like this, hardware breakpoints can come to the rescue. Since setting an execute hardware breakpoint does not actually modify the underlying instruction, anything that reads the memory backing that instruction will not get back an 0xCC opcode instead of the real first byte of the instruction opcode. Granted, you can only have four enabled hardware breakpoints at a time, but usually you can get by with that many (or at least, a “sliding window” of hardware breakpoints, assuming you have breakpoints over a well-defined execution sequence. In this case, you could have breakpoints disable themselves and enable the next breakpoint, thus conserving the number of active breakpoints).

There is also the other obvious advantage to hardware breakpoints which I touched on earlier: the ability to set a breakpoint on a memory fetch for a particular address. This obviously has a great deal of uses, whether you’re reverse engineering something or are tracking down a corruption problem. Memory-access breakpoints are an excellent way to very quickly figure out which piece of code is modifying a variable, without having to trace through an arbitrarily large set of code to find the access that you were looking for. One thing to consider about memory-access breakpoints on x86 and x64, though, is that there is only support for setting memory-access breakpoints on regions that are 1) a power of 2 in length, and 2) have a length that is less than or equal to the native pointer size (8 for x64, or 4 for x86). (If you are lucky enough [or perhaps unlucky enough, as Itanium isn’t exactly the most friendly thing to view from an assembler perspective] to be debugging on an Itanium platform, this restriction does not exist; you can set a length of any power of 2 between 1 byte and two gigabytes). As a result, you’ll have to plan where to set your breakpoints carefully, as on x86, you can only cover at most 16 bytes with this kind of “memory guard” access. You might or might not be able to use the same kind of “sliding breakpoint window” idea I mentioned above, if the memory locations you are setting breakpoints on are accessed in a particular sequence (or at least, the accesses that you are interested in).

Hardware breakpoints are typically less invasive than software breakpoints, but there are still ways that they can be interfered with. The most common case for this happening is if you try to set a hardware breakpoint while DLL initializers are being called during process startup (such as at the initial create process breakpoint). If you try to do this, you’ll get a warning from the debugger advising you that your breakpoints won’t stick:

0:000> ba e1 kernel32!CreateFileA
        ^ Unable to set breakpoint error
The system resets thread contexts after the process
breakpoint so hardware breakpoints cannot be set.
Go to the executable's entry point and set it then.
 'ba e1 kernel32!CreateFileA'

The reason why this is the case is that there is a context set that occurs between the initial process breakpoint being hit and the requested thread start address / process start address being executed. I’ll go into just how this works at process startup in a future posting, but to keep it simple, the basic idea is that an APC is queued to the new usermode thread that runs the loader component in NTDLL. One of the arguments to the APC is a context record describing the register context that was requested for the new thread by CreateProcess, CreateThread, and soforth. The loader component runs process (or thread) DLL initializers, and then calls NtContinue to continue execution at the specified context record, which kicks off execution at the user requested thread start address. We can see this in action easily by looking at the arguments that the APC dispatcher supplies to the loader initializer APC:

0:000> kv
ChildEBP RetAddr  Args to Child              
0013fb1c 7c93edc0 7ffdf000 ntdll!DbgBreakPoint
0013fc94 7c921639 0013fd30 ntdll!LdrpInitializeProcess+0xffa
0013fd1c 7c90eac7 0013fd30 ntdll!_LdrpInitialize+0x183
00000000 00000000 00000000 ntdll!KiUserApcDispatcher+0x7
0:000> .cxr 0013fd30 
eax=4ad05056 ebx=7ffdd000 ecx=00f2faa8 edx=00090000
esi=7c9118f1 edi=00011970
eip=7c810665 esp=0013fffc ebp=7c910570 iopl=0
         nv up ei pl nz na po nc
cs=001b  ss=0023  ds=0023  es=0023  fs=0038  gs=0000
             efl=00000200
kernel32!BaseProcessStartThunk:
7c810665 33ed            xor     ebp,ebp

If you have been paying attention so far, it should be clear why hardware breakpoints set at the initial process breakpoint do not appear to work like you might expect when you set them at the initial process breakin: When the APC that runs loader initializers returns, it restores a previously saved register context image via NtContinue. Since hardware breakpoints are part of the register context, they are wiped away after the context is restored, and so your breakpoints would appear to simply disappear after DLL initializers were finished.

This limitation also implies that calling SetThreadContext on a thread can interfere with hardware breakpoints if care is not taken to preserve the value of the Dr series of registers. Indeed, some protection schemes utilize such a trick in an attempt to defeat hardware breakpoints.

Fortunately, it is easy to work around such limitations using the debugger. There is a little-used command called “.apply_dbp” that allows you to instruct the debugger that it should re-apply hardware breakpoints, either to the current register context, or a saved register context image in-memory (supplied by the /m Context argument). With the use of this command, you can quickly restore your hardware breakpoints even after something attempts to trash them. Combined with a conventional breakpoint on, say, kernel32!SetThreadContext, this can be used to quickly re-enable the use of hardware breakpoints on such cases. You can also use this trick to persist hardware breakpoints in the process startup case, by using .apply_dbp /m <address-of-context-record-argument-from-APC-dispatcher> to enforce any hardware breakpoints you set in the register context image that will eventually be restored by NtContinue. For instance, in the case of the example that I gave above, you might use the following to apply hardware breakpoints to the context that NtContinue will restore:

0:000> .apply_dbp /m 0013fd30 
Applied data breakpoint state

Next up, some more tricks that you can do to get the most out of controlling the target in the debugger.