Archive for February, 2007

Debugger internals: Why do ntoskrnl and ntdll have type information?

Wednesday, February 14th, 2007

Johan Johansson sent me a mail asking why nt and ntdll have partial type information included. (If you’re at all experienced with debugging on Windows, you’ll know that the public symbols Microsoft ships on its public symbol server don’t typically include type information; instead, one typically needs access to private symbols in order to view types.)

However, nt and ntdll are an exception to this rule, on Windows XP or later. Unlike all the other PDBs shipped by Microsoft, the ones corresponding to ntdll and ntoskrnl do include type information for a seemingly arbitrary mix of types, some publicly documented and some undocumented. There is, however, a method to the madness with respect to what symbols are included in the public nt/ntdll PDBs.

To understand what symbols are chosen and why, though, it’s necessary to know a bit of history.

Way back in the days when Windows NT was still called Windows NT (and not Windows 2000 or Windows XP), the debugger scene was a much less friendly place. In those days, “remote debugging” involved dialing up the remote computer with a modem, and if you wanted to kernel debug a target, you had to run a different kernel debugger program specific to the architecture of the target computer.

Additionally, you had to use a debugger version that was newer than the operating system on the target computer or things wouldn’t work out very well. Furthermore, one had to load architecture-specific extension dlls for many kernel debugging tasks. One of the reasons for these restrictions, among other things, is that for different architectures (and different OS releases), the size and layout of many internal structures used by the debugger (and debugger extension modules) to do their work varied. In other words, the size and layout of, say, EPROCESS might not be the same on Windows NT 3.1 x86 vs Windows NT 4.0 for the DEC Alpha.

When Windows 2000 was released, things became a little bit better: Windows 2000 only publicly supported x86, which reduced the number of different architectures that WinDbg needed to publicly support going forward. However, Windows XP and Windows Server 2003 reintroduced mainstream support for non-x86 architectures (first IA64 and then x64).

At some point on the road to Windows XP and Windows Server 2003, a decision was made to clean things up from the debugger perspective and introduce a more manageable way of supporting a large matrix of operating systems and target architectures.

Part of the solution devised involved providing a unified, future-compatible (where possible; obviously, new or radically redesigned OS functionality would require debugger extension changes, but things like simple structure size changes shouldn’t require such drastic measures) method for accessing data on the remote system. Since, in the common case, all that a debugger extension does is quickly reformat data on the target system into an easily human-readable format (such as the process list returned by !process), a unified way to communicate structure sizes and layouts to debugger extensions (and the debugger engine itself) would greatly reduce the headache of supporting the debugger on an ever-expanding set of platforms and operating systems. This solution involved putting the structures used by the debugger engine itself and many debugger extensions into the symbol files shipped with ntoskrnl and ntdll, and then providing a well-defined API for retrieving that type information from the perspective of a debugger extension.

Fast-forward to 2007. Now, there is a single, unified debugger engine for all target platforms and architectures (the i386kd.exe and ia64kd.exe that ship with the DTW distribution are essentially the same as the plain kd.exe and are vestigial remains of a bygone era; these files are simply retained for backwards compatibility with scripts and programs that drive the debugger), real remote debugging exists, and your debugger doesn’t break every time a service pack is released. All of this is made possible in part due to the symbol information available in the ntoskrnl and ntdll PDB files. This is why you can use WinDbg to debug Windows Vista RTM, despite the fact that WinDbg was released months before the final RTM build was shipped.

Symbol support is also a reason why there are no ‘srv03ext’, ‘srv03fre’, or ‘srv03chk’ extension directories under your Debugging Tools for Windows folder. The nt4chk/nt4fre/w2kchk/w2kfre directories contain debugger extensions specific to those Windows builds. Due to the new unified debugger architecture, there is no longer a need to tie a binary to a particular operating system build going forward. Because Windows 2000 and Windows NT 4.0 don’t include type data, however, the old directories still remain for backwards compatibility with those platforms.

So, to answer Johan’s question: all of the types in the ntoskrnl and ntdll PDBs are used by the debugger engine itself, or by some debugger extension, somewhere. To my knowledge, that is the final determining factor as to which types are exposed via those PDBs: whether a public debugger extension DLL (or the debugger engine itself) uses them.

Reverse engineer’s toolkit

Tuesday, February 13th, 2007

If you’re planning on reverse engineering something (or debugging unfamiliar code), then it’s important to have the right tools for the job. This is a short list of some of the various tools that I find useful for this line of work (some are free, others are not; they are good tools, though, so I would encourage you to support the authors); if I am going to be doing extensive reversing (or debugging) work on a system, I’ll typically install most (or all) of these tools:

  • WinDbg (Debugging Tools for Windows); the de-facto debugger for Windows NT-based systems (and it’s freely available). Although there are other good debuggers out there (such as OllyDbg), it’s hard to beat the fact that Microsoft develops the debugger, and the designers have a great deal of information about and experience with the internals of the operating system. WinDbg comes with a plethora of plugin modules (extensions) that automate and improve many common tasks, and it has a well-documented and powerful interface to allow third parties (such as yourself) to program new extension modules for customized tasks. Additionally, WinDbg supports a (relatively crude, but effective nevertheless) scripting language. WinDbg can be used on 32-bit and 64-bit (IA64 and x64) targets in both user mode and kernel mode, and is the only supported facility for debugging problems at customer sites or for debugging certain specialized scenarios (like certain infrastructure processes, or fullscreen processes). The Debugging Tools for Windows distribution also includes ntsd, cdb, and kd, which are different (command-line) front-ends to the same debugger engine used by WinDbg. It also includes various other useful utilities, such as Logger (an API call spy program).
  • IDA Pro, or the Interactive Disassembler. IDA is the most full-featured disassembler (by a huge margin) out there that I know of, and it supports a wide variety of target architectures and operating systems to boot (although its support for x86 is probably the most full-featured of all the target architectures that you can use it on). IDA automates many of the aspects of disassembly and reverse engineering that have historically been tedious and error-prone, such as stack variable tracking. The full version of IDA is relatively pricey (on the order of $500..$900 or so, depending on whether you need x64 support), but if you’re serious about reverse engineering, it’s a good investment. For the casual reverse engineer, DataRescue has released a previous version, IDA 4.30, for free. If you are just working with x86 or simply want to get your feet wet with reversing, the free version is a good place to start. Although tools like WinDbg include a built-in disassembler, IDA’s advanced code analysis and interactive aspects (such as allowing the user to describe and annotate types, names, and variables in a structured fashion) make it far superior for non-trivial reverse engineering projects.
  • HIEW (Hacker’s vIEW) is a powerful (console-based) hex editor / disassembler / assembler for 16-bit, 32-bit, and 64-bit (x64) targets. It understands the various Windows image formats, but it can also be used on raw binary files as well. In addition to serving all of your hex editor needs, HIEW is just the perfect tool to use when you need to make quick on-disk patches to executables (and the integrated disassembler and assembler makes creating and applying such patches on-the-fly a walk in the park compared to traditional hex editors, which would require you to manually build the opcodes yourself, a pain for non-trivial patches). HIEW includes some additional power-features as well, such as a way to create and run simple programs to decrypt sections in a file (very useful if you’re working on something that is self-decrypting, and you know how to decrypt it but don’t have a program to do so already). It also includes a fairly simple plugin extension interface to allow custom actions to be integrated with the HIEW user interface. HIEW isn’t free, although it is fairly reasonably priced (and there is a (limited) demo that you can play around with).
  • The Windows Vista SDK is an essential tool for many reverse engineering tasks. It includes extensive documentation (as well as headers) for all public Win32 APIs, and it also includes several useful utilities as well (such as link.exe /dump, otherwise known as dumpbin.exe, which can be used to quickly extract various bits of information from a binary (like a list of imports) without having to load it up into a full-featured disassembler tool). The Vista SDK also includes OleView, which can be useful for inspecting a third-party COM library, as it has integrated support for turning a type library into IDL (which can be trivially converted to a C header as needed).
  • Process Monitor, from SysInternals, is a great utility for quickly inspecting what file and registry operations a program is making. Depending on what you are trying to figure out when analyzing a program, simply looking at its file and registry activity can often save you hours of wading through disassembly or working with a debugger. Process Monitor allows you to perform (potentially filtered) in-depth logging of low-level file and registry activity, including operations that fail and the data returned by successful operations in some circumstances. Process Monitor is freely available from Microsoft’s website.
  • Process Explorer, formerly known as HandleEx, is another freely available tool from SysInternals. It allows quick and easy searching of open handles and loaded modules within a process, which is handy for initial information gathering (such as finding which process uses a DLL that you might be interested in). The ability to quickly check active processes for a DLL is also quite handy if you are trying to track down where a shared DLL (like a Windows hook DLL) is getting loaded in as a precursor to in-depth analysis.
  • SDbgExt is a WinDbg extension that I’ve developed (shameless plug). It provides a number of features that are useful for analyzing unfamiliar targets, such as the ability to import map files and create symbolic names out of them (particularly useful in conjunction with IDA’s export map file feature, if you want to synchronize your process between IDA and WinDbg), support for identifying writable function pointers within a process address space, and support for locating and displaying x64 exception/unwind handlers. SDbgExt presently only supports the 32-bit version of WinDbg. It is freely available (see the link above), and requires that WinDbg and the VC2005 CRT be installed.

There are other useful utilities out there (this is hardly an all-inclusive list), but these are the ones that I use the most often in a wide variety of situations.

Compiler tricks in x86 assembly, part 2

Monday, February 12th, 2007

A common operation in many C programs is a division by a constant quantity. This sort of operation occurs all over the place: in explicit mathematical calculations, and in other, more innocuous situations, such as when the programmer takes a size in bytes and converts it into a count of structures.

This fundamental operation (division by a constant) is also one that the compiler will tend to optimize in ways that may seem a bit strange until you know what is going on.

Consider this simple C program:

int __cdecl wmain(
	int ac,
	wchar_t** av)
{
	unsigned long x = rand();

	return x / 1518;
}

In assembly, one would expect to see the following operations emitted by the compiler:

  1. A call to rand, to populate the ‘x’ variable.
  2. A div instruction to perform the requested division of ‘x’ by 1518.

However, if optimizations are turned on and the compiler is not configured to prioritize small code over performance exclusively, then the result is not what we might expect:

; int __cdecl main(int argc,
  const char **argv,
  const char *envp)
_main proc near
call    ds:__imp__rand
mov     ecx, eax
mov     eax, 596179C3h
mul     ecx
mov     eax, ecx
sub     eax, edx
shr     eax, 1
add     eax, edx
shr     eax, 0Ah ; return value in eax

The first time you encounter something like this, you’ll probably be wondering something along the lines of “what was the compiler thinking when it did that, and how does that set of instructions end up dividing out ‘x’ by 1518?”.

What has happened here is that the compiler has used a trick sometimes known as magic number division to improve performance. In x86 (and in most processor architectures, actually), it is typically much more expensive to perform a division than a multiplication (or most any other basic mathematical primitive, for that matter). While it is ordinarily not possible to make division any faster than what the ‘div’ instruction gives you performance-wise, under special circumstances it is possible to do a bit better. Specifically, if the compiler knows that one of the operands to the division operation is a constant (specifically, the divisor), then it can engage some clever tricks in order to eliminate the expensive ‘div’ instruction and turn it into a sequence of other instructions that, while longer and more complicated on the surface, are actually faster to execute.

In this instance, the compiler has taken one ‘div’ instruction and converted it into a multiply, subtract, add, and two right shift operations. The basic idea behind how the generated code gives the correct result is that it typically involves a multiplication by a large constant, where the low 32 bits are discarded and the remaining high 32 bits (recall that the ‘mul’ instruction returns the result in edx:eax, so the high 32 bits of the result are stored in edx) have some transform applied, typically one that includes fast division by a power of 2 via a right shift.

Sure enough, we can demonstrate that the above code works by manually comparing the result to a known value, and then manually performing each of the individual operations. For example, assume that in the above program, the call to rand returns 10626 (otherwise known as 7 * 1518, for simplicity, although the optimized code above works equally well for non-multiples of 1518).

Taking the above set of instructions, we perform the following operations:

  1. Multiply 10626 by the magic constant 0x596179C3. This yields 0x00000E7E00001006.
  2. Discard the low 32-bits (eax), saving only the high 32-bits of the multiplication (edx; 0xE7E, or 3710 in decimal).
  3. Subtract 3710 from the original input value, 10626, yielding 6916.
  4. Divide 6916 by 2^1 (bit shift right by 1), yielding 3458.
  5. Add 3710 to 3458, yielding 7168.
  6. Divide 7168 by 2^10 (bit shift right by 10), yielding 7, our final answer. (10626 / 1518 == 7)

Depending on the values being divided, it is possible that there may only be a right shift after the large multiply. For instance, if we change the above program to divide by 1517 instead of 1518, then we might see the following code generated instead:

mov     eax, 5666F0A5h
mul     ecx
shr     edx, 9

Coming up with the “magical divide” constants in the first place involves a bit of math; interested readers can read up about it on DevDev.

If you’re reverse engineering something that has these characteristics (large multiply that discards the low 32-bits and performs some transforms on the high 32-bits), then I would recommend breaking up each instruction into the corresponding mathematical operation (be aware of tricks like shift right to divide by a power of 2, or shift left to multiply by a power of 2). As a quick “litmus test”, you can always just pick an input and do the operations, and then simply compare the result to the input and see if it makes sense in the context of a division operation.

More on other optimizer tricks in a future posting…

Compiler optimizer tricks in x86 assembly, part 1

Saturday, February 10th, 2007

The compiler is often very clever about speeding up some common operations in C, in ways that might at first appear a bit non-obvious when seen in the resulting assembler. With a bit of practice, you can train yourself to quickly identify these optimizations and see what they really do. These kinds of optimizations are very common with constant values.

With many of these little “assembly tricks”, the compiler is not simply taking into account instruction speed, but also instruction size, in order to reduce the total number of opcode bytes required to perform a given operation.

This post series attempts to provide a basic overview (and rationale) of several of the most common compiler optimizations that you might see while analyzing x86 assembly code. Knowing what these “assembly tricks” do is a great benefit if you are debugging or reverse engineering something, as it allows one to quickly recognize certain compiler patterns and gain clues as to what a program is doing.

Without further ado, here’s a brief overview of several of these optimizations:

  1. Clearing a register value by xor reg, reg is a very common sequence in x86 code generated by a compiler. You will almost never see an instruction of the form mov reg, 0. Instead, the compiler will virtually always use the above-mentioned xor construct.

    The reasoning behind this is that the xor reg, reg construct is a very small instruction in terms of opcode bytes (2 bytes), whereas assigning a constant 32-bit value is typically much more expensive in terms of opcode length (say, 5 bytes).

    The gain with this optimization is reduced code size. Reducing code size is always good, and can lead to improved performance in that it improves the cacheability of a particular chunk of code (remember, most processor cache sizes are still very small compared to main system memory or hard drives). Also, if the compiler can shrink down the total image size by even one page (e.g. 4096 bytes) with optimizations like this, that’s one less page fault that needs to be taken when loading the program. This can be noticeable even at lengths less than a page: if a function can be kept within one page instead of spilling over onto a neighboring page (which might not otherwise be used for the current code path), that can eliminate “extraneous” page faults, where most of the data brought in is unnecessary.

    (It’s worth noting that this sort of optimization has come a long way recently, in the form of profile-guided BBT-style optimizations that reorder “hot” code paths to be on the same page in an attempt to make every byte that is in-paged from disk be as relevant as possible to the current execution path.)

  2. The constant zero register is another very common optimization technique used by the compiler. Since the value zero appears so frequently in many functions (e.g. default parameter values are typically zero, and it is very common to compare values against zero or to assign zero to variables), the compiler will sometimes dedicate an entire register to contain the value zero throughout most of the function. Here’s an example, taken from nt!MmZeroPageThread:

    xor     esi, esi
    cmp     [ebp+var_14], esi
    push    esi             ; WaitBlockArray
    push    esi             ; Timeout
    push    esi             ; Alertable
    push    esi             ; WaitMode
    push    8               ; WaitReason
    xor     ebx, ebx
    inc     ebx
    push    ebx             ; WaitType
    lea     eax, [ebp+Object]
    push    eax             ; Object
    push    2               ; Count
    call    _KeWaitForMultipleObjects@32

    Here, the compiler has dedicated the “esi” register to be the constant zero value for this function. It is used in many cases; in just a small part of nt!MmZeroPageThread, for instance, we can see that it is being used as both an argument to the “cmp” instruction for a test against constant zero, and we can also see it being used to fill many constant zero parameter values for a call to KeWaitForMultipleObjects.

    The gain from using a constant zero register is typically reduced code size. In x86, in many cases, it takes less opcode bytes to assign or compare a value to a register than to an immediate constant operand. The compiler takes advantage of this fact by only “paying the price” for referencing the value zero in an immediate operand for an instruction once, by storing it into a register. Then, the compiler can simply refer to that register for smaller overall code size if references to constant zero are common enough. For example, in the above code fragment, the compiler saves one opcode byte per push esi instruction over doing a push 0 instruction.

  3. Fast multiplication or addition with the lea instruction is another fairly common optimization that one is likely to run into frequently. The lea instruction (load effective address) was originally intended for use with pointer math, but it also turns out to be quite useful for doing fast constant math on non-pointer values as well, in part due to the wide range of operations that can be done with this instruction.

    Consider the following code fragment:

    mov     eax, [ebp+some_local]
    movzx   ecx, bx
    lea     eax, [eax+eax*4]
    lea     eax, [ecx+eax*2-30h]
    mov     [ebp+other_local], eax

    This instruction sequence may seem a bit convoluted at first, but it’s not too bad if we break it down into its constituent parts (and then combine it into one operation).

    We have the following operations:

    lea     eax, [eax+eax*4]

    This operation is equivalent to the following in C:

    other_local = some_local;
    other_local *= 5;

    Then, we’ve got the second lea operation:

    lea     eax, [ecx+eax*2-30h]

    In C, this might look like so:

    other_local = other_local * 2 + X - 0x30;

    …(where X corresponds to bx (and then ecx)).

    If we combine the two together, we get the following expression in C:

    other_local = some_local * 10 + X - 0x30;

    Now, the compiler could have used a combination of mul, add, and sub instructions to achieve the same effect. However, this would be more expensive in terms of instruction size, as those instructions are designed to work with values that are not known until runtime. By using the lea instruction, the compiler can take advantage of the fact that lea can perform multiple operations with one instruction in order to reduce code size.

  4. The lea instruction is also useful for scaling array indices to their native type. Recall that in C, if you subscript an array, the subscript is performed on whole array elements; in the x86 instruction set, however, the processor has no way of magically knowing the size of an array element when the programmer subscripts an array.

    For example, consider the following structure:

    0:000> dt nt!_MMPFN
       +0x000 u1               : __unnamed
       +0x004 PteAddress       : Ptr32 _MMPTE
       +0x008 u2               : __unnamed
       +0x00c u3               : __unnamed
       +0x010 OriginalPte      : _MMPTE
       +0x010 AweReferenceCount : Int4B
       +0x014 u4               : __unnamed ; 4 bytes

    In this case, sizeof(nt!_MMPFN) == 24 (0x18). Consider an array of _MMPFN structures like so:

    _MMPFN MmPfnDatabase[ array_size ];

    If the programmer wants to index MmPfnDatabase (i.e. retrieve a pointer to a particular _MMPFN element within the array), then the compiler needs to convert an index into a pointer to an _MMPFN structure contained within the array.

    For example, the programmer might write the following C code:

    _MMPFN* Pfn = &MmPfnDatabase[ PfnIndex ];

    At the x86 instruction set level, though, the compiler needs to convert PfnIndex into a pointer. This requires two operations: first, PfnIndex needs to be scaled to the array element size (i.e. multiplied by sizeof(_MMPFN)); second, the resultant value needs to be added to the base of MmPfnDatabase to form a complete pointer value to the requested _MMPFN element.

    In order to accomplish this, the compiler might emit the following instruction sequence:

    mov     eax, [ebp+PfnIndex]
    mov     ecx, _MmPfnDatabase
    push    ebx
    mov     ebx, [ebp+arg_4]
    lea     eax, [eax+eax*2]
    push    esi
    lea     esi, [ecx+eax*8]

    Here, the lea instruction is used to take the PfnIndex and MmPfnDatabase values and combine them into an _MMPFN pointer (stored in “esi”). If we break down the individual operations performed, what’s going on here isn’t all that difficult to understand:

    1. The initial LEA operation is equivalent to multiplying PfnIndex by 3 (PfnIndex is stored into “eax”).
    2. The final LEA operation multiplies the result of the first LEA operation by 8 (which is to say that PfnIndex has now been multiplied by 24, conveniently equal to sizeof(_MMPFN)).
    3. Finally (also part of the last LEA operation), “ecx” (which was loaded with the base address of MmPfnDatabase) is added to the intermediate result, which is then placed into “esi”, forming a completed _MMPFN pointer.

    Like with performing constant math for non-array indexing, the advantage of using lea over a series of mul and add operations is primarily code size, as lea allows for several distinct operations (e.g. multiply/add, or multiply/subtract) to be combined into a single compact operation. Most processors also provide very fast implementations of the lea instruction (as compared to the mul instruction for constant multiply operations).

    In order to be able to differentiate between an array indexing operation and plain constant math using lea, I would recommend checking to see whether any of the input values are treated as pointers or if the output value is treated as a pointer. Usually, it’s fairly easy to make this determination, though.

    As an aside, if you are reverse engineering something, constructs like array index operations are very handy as they will definitively tell you the size of the structure comprising an array.

The next post in this series will continue this discussion and examine several more common (and more complicated) optimizations and assembly tricks that the compiler may emit on x86. Stay tuned…

Enabling the local kernel debugger on Vista RTM

Friday, February 9th, 2007

If you’re a kernel developer, and you’ve upgraded to Vista, then one of the changes that you may have noticed is that you can’t perform local kernel debugging anymore.

This is true even if you elevate WinDbg. If you try, you’ll get an error message stating that the debugger failed to get KD version information (error 5), which corresponds to the Win32 ERROR_ACCESS_DENIED error code.

This is due to a change from Vista RC2 to Vista RTM, where the kernel function responsible for supporting much of the local KD functionality in WinDbg (KdSystemDebugControl) was altered to require the system to be booted with /DEBUG. This is apparent if we compare RC2 to RTM.

RC2 has the following check in KdSystemDebugControl (one comparison against KdpBootedNodebug):

push    0F4h
push    81841938
call    nt!_SEH_prolog4
xor     ebx,ebx
mov     dword ptr [ebp-28h],ebx
mov     dword ptr [ebp-20h],ebx
mov     dword ptr [ebp-24h],ebx
cmp     byte ptr [nt!KdpBootedNodebug],bl
je      nt!KdSystemDebugControl+0x2c ; Success
mov     eax,0C0000022h ; STATUS_ACCESS_DENIED

On Vista RTM, two additional checks were added against nt!KdPitchDebugger and nt!KdDebuggerEnabled (disregard the fact that the RTM disassembly is from the x64 version; both the x86 and x64 Vista versions have the same checks):

mov     qword ptr [rsp+8],rbx
mov     qword ptr [rsp+10h],rdi
push    r12
sub     rsp,170h
mov     r10,rdx
and     dword ptr [rsp+44h],0
and     qword ptr [rsp+48h],0
and     qword ptr [rsp+50h],0
cmp     byte ptr [nt!KdpBootedNodebug],0
jne     nt!KdSystemDebugControl+0x8b7 ; Fail
cmp     byte ptr [nt!KdPitchDebugger],0
jne     nt!KdSystemDebugControl+0x8b7 ; Fail
cmp     byte ptr [nt!KdDebuggerEnabled],0
je      nt!KdSystemDebugControl+0x8b7 ; Fail

The essence of these checks is that you need to be booted with /DEBUG enabled in order for local kernel debugging to work.

There is a simple way to accomplish this, however, without the usual painful aspects of having a kernel debugger attached (e.g. breaking on user mode exceptions or breakpoints).

All you have to do is enable kernel debugging, and then disable user mode exception handling. This requires the following options to be set via BCDEdit.exe, the Vista boot configuration database manager:

  1. bcdedit /debug on. This enables kernel debugging for the booted OS configuration.
  2. bcdedit /dbgsettings <type> /start disable /noumex (where type corresponds to a usable KD type on your computer, such as 1394). This disables user mode exception handling for the kernel debugger. You should still be able to boot the system without a kernel debugger attached.

After setting these options, reboot, and you should be set. You’ll now be able to use local KD (you must still remember to elevate the debugger, though), but you won’t have user mode programs try to break into the kernel debugger when they crash without a user mode debugger attached.

Note, however, that you’ll still be able to break in to the system with a kernel debugger after boot if you choose these options (and if the box crashes in kernel mode, it’ll freeze waiting for a debugger to attach). However, at least you will not have to contend with errant user mode programs causing the system to break into the kernel debugger.

Programming against the x64 exception handling support, part 7: Putting it all together, or building a stack walk routine

Monday, February 5th, 2007

This is the final post in the x64 exception handling series, comprised of the following articles:

  1. Programming against the x64 exception handling support, part 1: Definitions for x64 versions of exception handling support
  2. Programming against the x64 exception handling support, part 2: A description of the new unwind APIs
  3. Programming against the x64 exception handling support, part 3: Unwind internals (RtlUnwindEx interface)
  4. Programming against the x64 exception handling support, part 4: Unwind internals (RtlUnwindEx implementation)
  5. Programming against the x64 exception handling support, part 5: Collided unwinds
  6. Programming against the x64 exception handling support, part 6: Frame consolidation unwinds
  7. Programming against the x64 exception handling support, part 7: Putting it all together, or building a stack walk routine

Armed with all of the information in the previous articles in this series, we can now do some fairly interesting things. There are a whole lot of possible applications for the information described here (some common, some esoteric); however, for simplicity’s sake, I am choosing to demonstrate a classical stack walk (albeit one that takes advantage of some of the newly available information in the unwind metadata).

Like unwinding on x86 where FPO is disabled, we are able to do simple tasks like determine frame return addresses and stack pointers throughout the call stack. However, we can expand on this a great deal on x64. Not only are our stack traces guaranteed to be accurate (due to the strict calling convention requirements on unwind metadata), but we can retrieve parts of the nonvolatile context of each caller with perfect reliability, without having to manually disassemble every function in the call stack. Furthermore, we can see (at a glance) which functions modify which non-volatile registers.

For the purpose of implementing a stack walk, it is best to use RtlVirtualUnwind instead of RtlUnwindEx, as RtlUnwindEx makes irreversible changes to the current execution state (while RtlVirtualUnwind operates on a virtual copy of the execution state that we can modify to our heart’s content, without disturbing normal program flow). With RtlVirtualUnwind, it’s fairly trivial to implement an unwind (as we’ve previously seen based on the internal workings of RtlUnwindEx). The basic algorithm is simply to retrieve the unwind metadata for the active function and execute a virtual unwind. If no unwind metadata is present, then the function is a leaf, and we can simply set the virtual Rip to the value at the virtual Rsp and increment the virtual Rsp by 8 (as opposed to invoking RtlVirtualUnwind). Since RtlVirtualUnwind does most of the hard work for us, the only thing left after that is to interpret and display the output (or save it away, if we are logging stack traces).

With the above information in mind, I have written an example x64 stack walking routine that implements a basic x64 stack walk, and displays the nonvolatile context as viewed by each successive frame in the stack. The example routine also makes use of the little-used ContextPointers argument to RtlVirtualUnwind in order to detect functions that have used a particular non-volatile register (other than the stack pointer, which is immediately obvious). If a function in the call stack writes to a non-volatile register, the stack walk routine takes note of this and displays the modified register, its original value, and the backing store location on the stack where the original value is stored. The example stack walk routine should work “as-is” in both user mode and kernel mode on x64.

As an aside, there is a whole lot of information that is being captured and displayed by the stack walk routine. Much of this information could be used to do very interesting things (like augmenting disassembly and code analysis by definitively identifying saved registers and parameter usage). Other possible uses are more esoteric, such as skipping function calls at run-time in a safe fashion, or altering the non-volatile execution context of called functions via modification of the backing store pointers returned by RtlVirtualUnwind in ContextPointers. The stack walk use case really only begins to scratch the surface of the many very interesting things that x64 unwind metadata allows you to do.

Comparing the output of the example stack walk routine to the debugger’s built-in stack walk, we can see that it is accurate (and in fact, even includes more information; the debugger does not, presently, have support for displaying non-volatile context for frames using unwind metadata (Update: Pavel Lebedinsky points out that the debugger does, in fact, have this capability with the “.frame /r” command)):

StackTrace64: Executing stack trace...
FRAME 00: Rip=00000000010022E9 Rsp=000000000012FEA0
r12=0000000000000000 r13=0000000000000000
rdi=0000000000000000 rsi=0000000000130000
rbp=0000000000000000 rsp=000000000012FEA0
 -> Saved register 'Rbx' on stack at 000000000012FEB8
   (=> 0000000000130000)
 -> Saved register 'Rbp' on stack at 000000000012FE90
   (=> 0000000000000000)
 -> Saved register 'Rsi' on stack at 000000000012FE88
   (=> 0000000000130000)
 -> Saved register 'Rdi' on stack at 000000000012FE80
   (=> 0000000000000000)

FRAME 01: Rip=0000000001002357 Rsp=000000000012FED0

FRAME 02: Rip=0000000001002452 Rsp=000000000012FF00

FRAME 03: Rip=0000000001002990 Rsp=000000000012FF30

FRAME 04: Rip=00000000777DCDCD Rsp=000000000012FF60
 -> Saved register 'Rbx' on stack at 000000000012FF60
   (=> 0000000000000000)
 -> Saved register 'Rsi' on stack at 000000000012FF68
   (=> 0000000000000000)
 -> Saved register 'Rdi' on stack at 000000000012FF50
   (=> 0000000000000000)

FRAME 05: Rip=000000007792C6E1 Rsp=000000000012FF90

0:000> k
Child-SP          RetAddr           Call Site
00000000`0012f778 00000000`010022c6 DbgBreakPoint
00000000`0012f780 00000000`010022e9 StackTrace64+0x1d6
00000000`0012fea0 00000000`01002357 FaultingSubfunction3+0x9
00000000`0012fed0 00000000`01002452 FaultingFunction3+0x17
00000000`0012ff00 00000000`01002990 wmain+0x82
00000000`0012ff30 00000000`777dcdcd __tmainCRTStartup+0x120
00000000`0012ff60 00000000`7792c6e1 BaseThreadInitThunk+0xd
00000000`0012ff90 00000000`00000000 RtlUserThreadStart+0x1d

(Note that the stack walk routine doesn’t include DbgBreakPoint or the StackTrace64 frame itself. Also, for brevity, I have snipped the verbose, unchanging parts of the nonvolatile context from all but the first frame.)

Other interesting uses for the unwind metadata include logging periodic stack traces at choice locations in your program for later analysis when a debugger is active. This is even more powerful on x64, especially when you are dealing with third party code, as even without special compiler settings you are guaranteed to get good data, with no invasive use of symbols. And, of course, a good knowledge of the fundamentals of how the exception/unwind metadata works is helpful for debugging failure reporting code (such as a custom unhandled exception filter) on x64.

Hopefully, you’ve found this series both interesting and enlightening. In a future series, I’m planning on running through how exception dispatching works (as opposed to unwind dispatching, which has been relatively thoroughly covered in this series). That’s a topic for another day, though.