Archive for the ‘Programming’ Category

Sometimes, a cheap hack does the trick after all (or how I worked around some mysterious USB/Bluetooth issues)

Monday, April 16th, 2007

My relatively new Dell XPS M1710 laptop (running Vista x64) has an annoying problem that I’ve recently tracked down.

It appears that when you connect a USB 1.1 device (such as a USB keyboard) to it, and then disconnect that device, Bluetooth and the built-in smart card reader tend to break the next time the computer goes through a suspend/resume cycle. This is fairly annoying for me, as I use a USB keyboard at work (and I also use Bluetooth extensively); it got to the point where pretty much every other time I went to the office and came home, Bluetooth and the smart card reader would break.

The internal Bluetooth transceiver and the built-in smart card reader are both connected to the system via an internal USB hub. It seems like this is becoming a popular thing nowadays among laptop manufacturers, connecting “internal” peripherals on laptops via USB instead of them being on-board and hardwired to the PCI bus or the like. Anyways, when the Bluetooth hub and smart card reader get into the broken state, I’ll typically get a “A USB device attached to the system is not functioning” notification, and the internal USB hub shows up as not started in device manager (with a problem code of CM_PROB_FAILED_START – code 10). Occasionally, the Bluetooth transceiver itself shows up with a CM_PROB_FAILED_START error, but the vast majority of the time it is the USB hub that fails.

I’ve done a bit of searching for a real fix for this problem, and closest I’ve found is this KB article which describes a problem that does sound like mine – Bluetooth breaks after suspending – but, the hotfix isn’t publicly available yet. I suppose I could have called PSS and tried to talk them into getting me a copy of that hotfix to try out, but I tend to try and avoid muddling through the various tiers of technical support until I really have no other options. Furthermore, there’s no guarantee that the hotfix would actually solve the underlying problem; the article doesn’t make any mention of internal USB hubs acting flaky in conjunction with a Bluetooth transceiver, only the Bluetooth module itself.

Until recently, to get out of this state, I typically either have to suspend/resume the laptop again (although this is dangerous*), reboot entirely, or remove and reinstall the devnode associated with the USB hub in device manager. (*Trying to suspend/resume in this case doesn’t always work. Sometimes, it won’t fix the problem, and more than a couple of times it has resulted in Vista hanging while trying to suspend, forcing a hard reboot.).

None of these solutions are particularly desirable; nobody likes having to shut everything down and reboot all the time with a laptop (kind of defeats the point of that nice sleep feature, doesn’t it?). Removing and reinstalling the devnode is less painful than a reboot, but only slightly; waiting for everything on the internal USB hub to get reinstalled after doing that takes up to a minute or two of disk thrashing while INFs are searched, and if things are going to take that long then it would almost be faster just to shut down and reboot anyway.

Eventually, though, I discovered that simply disabling and reenabling the devnode corresponding to the USB hub would resolve the problem. This is definitely a much nicer solution than any of the above; it’s (relatively) quick and less painful, and it doesn’t involve me sitting around and waiting for either a shutdown or reboot, or for a bunch of devices to get reinstalled by PnP.

Unfortunately, that workaround is still a rather manual process. It’s not too bad if I just keep an elevated Device Manager window open for easy access, but among other things this prevents me from, say, unlocking my computer with a smart card after an unsuspend (as the Bluetooth transceiver and smart card reader share the same USB hub that tends to flake out after a suspend/resume, after a USB 1.1 device is removed).

So, I set about seeing whether I could try and automate this process. I could have tried to track down the root cause a bit more, but as far as I can tell, the problem is either 1) a bug in Vista’s USB support (perhaps or perhaps not an x64 specific bug), or 2) a bug in firmware/hardware relating to the internal USB hub itself, or 3) a bug in firmware/hardware relating to the Bluetooth transceiver, or 4) a bug in the Bluetooth drivers for my laptop. None of those possibilities would be something that I could easily solve (as far as the root cause goes), even if I managed to track down the originating cause of the problem. As a result, I decided that the most time-effective solution would be to just try and automate the process of disabling and reenabling the devnode associated with the USB hub (if it breaks), or the Bluetooth transceiver (if it breaks instead). Restarting the devnode is comparatively fast when viewed in light of the other workarounds, and in any case it is fast enough to be a viable way to at least alleviate the symptoms of this problem.

After doing some brief research, it looked like the way to go here was a combination of the CM_Xxx APIs and the SetupDi APIs. Although there’s a relatively large amount of indirection you have to go through to restart an individual devnode (and the CM_Xxx APIs / setupapi are not particularly easy to use in the first place), there happens to be a WDK sample that has the capability to do just what I want – DevCon. DevCon is a console equivalent to the Device Manager MMC snapin; it’s capable of enumerating device nodes, installing/removing them, updating drivers, disabling/reenabling devnodes, and soforth.

Sure enough, I verified that DevCon’s `restart’ command was sufficient to restart the broken devnode (in a fashion similar to disabling and reenabling it in Device Manager), with the end result of causing the USB hub to start working again.

At this point, all I had to do was come up with a good way to locate the broken devnodes, a good way to know when the computer went in/out of sleep (as a sleep cycle causes the problem to occur), and then inject the DevCon code responsible for restarting a device into my program. To make a long story short, I ended up keying the program based on the hardware ids of the internal USB hub and Bluetooth transceiver such that it would check for all devnodes that 1) matched a hardware id in that list, and 2) were in a disabled state with a problem code of CM_PROB_FAILED_START. Detecting a sleep transition is fairly easy for a service (services receive notifications of PnP/Power events through the callback registered via RegisterServiceCtrlHandlerEx, and as I wanted the program to function continuously, a service seemed like the logical way to run it anyway.

An hour or two later and I had my workaround problem done. It’s certainly not pretty, and it doesn’t do anything to fix the root cause of the problem, but as far as treating the symtoms goes, it gets the job done. The basic way the program works is that every time the system resumes from suspend, it will poll all devnodes every second for 30 seconds, looking for a devnode that failed to start and is one of the known list of problematic hardware ids. If it finds such a device, it attempts to restart it (thereby working around the root cause of the problem and alleviating the symptoms). Polling for a fixed time after resume isn’t pretty by any means, but it can occasionally take a bit for one of the devnodes to show as broken when it’s not working, so this works around that.

While you could safely call it a giant hack, the program does get the job done; now, I can unlock my laptop via smart card or use Bluetooth almost immediately after a resume, even if the breakage-after-USB1-device-is-unplugged problem strikes, and all without having to manually futz around in device manager every time.

In case anyone’s interested, I’ve put the source code for the program (“Broken Device Bouncer”, or DevBouncer for short) up. It’s fairly hardcoded to be specific to my machine, so if you wanted to use it for some reason, you’d need to rebuild it.

Compiler tricks in x86 assembly, part 2

Monday, February 12th, 2007

A common operation in many C programs is a division by a constant quantity. This sort of operation occurs all over the place; in explicit mathematical calculations, and in other, more innocuous situations such as where the programmer does something like take a size in bytes and convert it into a count of structures.

This fundamental operation (division by a constant) is also one that the compiler will tend to optimize in ways that may seem a bit strange until you know what is going on.

Consider this simple C program:

int
__cdecl
wmain(
	int ac,
	wchar_t** av)
{
	unsigned long x = rand();

	return x / 1518;
}

In assembly, one would expect to see the following operations emitted by the compiler:

  1. A call to rand, to populate the ‘x’ variable.
  2. A div instruction to perform the requested division of ‘x’ by 1518.

However, if optimizations are turned on and the compiler is not configured to prioritize small code over performance exclusively, then the result is not what we might expect:

; int __cdecl main(int argc,
  const char **argv,
  const char *envp)
_main proc near
call    ds:__imp__rand
mov     ecx, eax
mov     eax, 596179C3h
mul     ecx
mov     eax, ecx
sub     eax, edx
shr     eax, 1
add     eax, edx
shr     eax, 0Ah ; return value in eax
retn

The first time you encounter something like this, you’ll probably be wondering something along the lines of “what was the compiler thinking when it did that, and how does that set of instructions end up dividing out ‘x’ by 1518?”.

What has happened here is that the compiler has used a trick sometimes known as magic number division to improve performance. In x86 (and in most processor architectures, actually), it is typically much more expensive to perform a division than a multiplication (or most any other basic mathematical primative, for that matter). While ordinarily, it is typically not possible to make division any faster than what the ‘div’ instruction gives you performance-wise, under special circumstances it is possible to do a bit better. Specifically, if the compiler knows that one of the operands to the division operation is a constant (specifically, the divisor), then it can engage some clever tricks in order to eliminate the expensive ‘div’ instruction and turn it into a sequence of other instructions that, while initially longer and more complicated on the surface, are actually faster to execute.

In this instance, the compiler has taken one ‘div’ instruction and converted it into a multiply, subtract, add, and two right shift operations. The basic idea behind how the code the compiler produced actually gives the correct result is that it typically involves a multiplication with a large constant, where the low 32-bits are then discarded and then remaining high 32-bits (recall that the ‘mul’ instruction returns the result in edx:eax, so the high 32-bits of the result are stored in edx) have some transform applied, typically one that includes fast division by a power of 2 via a right shift.

Sure enough, we can demonstrate that the above code works by manually comparing the result to a known value, and then manually performing each of the individual operations. For example, assume that in the above program, the call to rand returns 10626 (otherwise known as 7 * 1518, for simplicity, although the optimized code above works equally well for non-multiples of 1518).

Taking the above set of instructions, we perform the following operations:

  1. Multiply 10626 by the magic constant 0x596179C3. This yields 0x00000E7E00001006.
  2. Discard the low 32-bits (eax), saving only the high 32-bits of the multiplication (edx; 0xE7E, or 3710 in decimal).
  3. Subtract 3710 from the original input value, 10626, yielding 6916.
  4. Divide 6916 by 2^1 (bit shift right by 1), yielding 3458.
  5. Add 3710 to 3458, yielding 7168.
  6. Divide 7168 by 2^10 (bit shift right by 10), yielding 7, our final answer. (10626 / 1518 == 7)

Depending on the values being divided, it is possible that there may only be a right shift after the large multiply. For instance, if we change the above program to divide by 1517 instead of 1518, then we might see the following code generated instead:

mov     eax, 5666F0A5h
mul     ecx
shr     edx, 9

Coming up with the “magical divide” constants in the first place involves a bit of math; interested readers can read up about it on DevDev.

If you’re reverse engineering something that has these characteristics (large multiply that discards the low 32-bits and performs some transforms on the high 32-bits), then I would recommend breaking up each instruction into the corresponding mathematical operation (be aware of tricks like shift right to divide by a power of 2, or shift left to multiply by a power of 2). As a quick “litmus test”, you can always just pick an input and do the operations, and then simply compare the result to the input and see if it makes sense in the context of a division operation.

More on other optimizer tricks in a future posting…

Programming against the x64 exception handling support, part 7: Putting it all together, or building a stack walk routine

Monday, February 5th, 2007

This is the final post in the x64 exception handling series, comprised of the following articles:

  1. Programming against the x64 exception handling support, part 1: Definitions for x64 versions of exception handling support
  2. Programming against the x64 exception handling support, part 2: A description of the new unwind APIs
  3. Programming against the x64 exception handling support, part 3: Unwind internals (RtlUnwindEx interface)
  4. Programming against the x64 exception handling support, part 4: Unwind internals (RtlUnwindEx implementation)
  5. Programming against the x64 exception handling support, part 5: Collided unwinds
  6. Programming against the x64 exception handling support, part 6: Frame consolidation unwinds
  7. Programming against the x64 exception handling support, part 7: Putting it all together, or building a stack walk routine

Armed with all of the information in the previous articles in this series, we can now do some fairly interesting things. There are a whole lot of possible applications for the information described here (some common, some estoric); however, for simplicities sake, I am choosing to demonstrate a classical stack walk (albeit one that takes advantage of some of the newly available information in the unwind metadata).

Like unwinding on x86 where FPO is disabled, we are able to do simple tasks like determine frame return addresses and stack pointers throughout the call stack. However, we can expand on this a great deal on x64. Not only are our stack traces guaranteed to be accurate (due to the strict calling convention requirements on unwind metadata), but we can retrieve parts of the nonvolatile context of each caller with perfect reliability, without having to manually disassemble every function in the call stack. Furthermore, we can see (at a glance) which functions modify which non-volatile registers.

For the purpose of implementing a stack walk, it is best to use RtlVirtualUnwind instead of RtlUnwindEx, as the RtlUnwindEx will make irreversible changes to the current execution state (while RtlVirtualUnwind operates on a virtual copy of the execution state that we can modify to our heart’s content, without disturbing normal program flow). With RtlVirtualUnwind, it’s fairly trivial to implement an unwind (as we’ve previously seen based on the internal workings of RtlUnwindEx). The basic algorithm is simply to retrieve the current unwind metadata for the active function, and execute a virtual unwind. If no unwind metadata is present, then we can simply set the virtual Rip to the virtual *Rsp, and increment the virtual *Rsp by 8 (as opposed to invoking RlVirtualUnwind). Since RtlVirtualUnwind does most of the hard work for us, the only thing left after that is to interpret and display the output (or save it away, if we are logging stack traces).

With the above information in mind, I have written an example x64 stack walking routine that implements a basic x64 stack walk, and displays the nonvolatile context as viewed by each successive frame in the stack. The example routine also makes use of the little-used ContextPointers argument to RtlVirtualUnwind in order to detect functions that have used a particular non-volatile register (other than the stack pointer, which is immediately obvious). If a function in the call stack writes to a non-volatile register, the stack walk routine takes note of this and displays the modified register, its original value, and the backing store location on the stack where the original value is stored. The example stack walk routine should work “as-is” in both user mode and kernel mode on x64.

As an aside, there is a whole lot of information that is being captured and displayed by the stack walk routine. Much of this information could be used to do very interesting things (like augment disassembly and code analysis by defintiively identifying saved registers and parameter usage. Other possible uses are more estoric, such as skipping function calls at run-time in a safe fashion, or altering the non-volatile execution context of called functions via modification of the backing store pointers returned by RtlVirtualUnwind in ContextPointers. The stack walk use case, as such, really only begins to scratch the surface as it relates to some of the many very interesting things that x64 unwind metadata allows you to do.

Comparing the output of the example stack walk routine to the debugger’s built-in stack walk, we can see that it is accurate (and in fact, even includes more information; the debugger does not, presently, have support for displaying non-volatile context for frames using unwind metadata (Update: Pavel Lebedinsky points out that the debugger does, in fact, have this capability with the “.frame /r” command)):

StackTrace64: Executing stack trace...
FRAME 00: Rip=00000000010022E9 Rsp=000000000012FEA0
 Rbp=000000000012FEA0
r12=0000000000000000 r13=0000000000000000
 r14=0000000000000000
rdi=0000000000000000 rsi=0000000000130000
 rbx=0000000000130000
rbp=0000000000000000 rsp=000000000012FEA0
 -> Saved register 'Rbx' on stack at 000000000012FEB8
   (=> 0000000000130000)
 -> Saved register 'Rbp' on stack at 000000000012FE90
   (=> 0000000000000000)
 -> Saved register 'Rsi' on stack at 000000000012FE88
   (=> 0000000000130000)
 -> Saved register 'Rdi' on stack at 000000000012FE80
   (=> 0000000000000000)

FRAME 01: Rip=0000000001002357 Rsp=000000000012FED0
[...]

FRAME 02: Rip=0000000001002452 Rsp=000000000012FF00
[...]

FRAME 03: Rip=0000000001002990 Rsp=000000000012FF30
[...]

FRAME 04: Rip=00000000777DCDCD Rsp=000000000012FF60
[...]
 -> Saved register 'Rbx' on stack at 000000000012FF60
   (=> 0000000000000000)
 -> Saved register 'Rsi' on stack at 000000000012FF68
   (=> 0000000000000000)
 -> Saved register 'Rdi' on stack at 000000000012FF50
   (=> 0000000000000000)

FRAME 05: Rip=000000007792C6E1 Rsp=000000000012FF90
[...]


0:000> k
Child-SP          RetAddr           Call Site
00000000`0012f778 00000000`010022c6 DbgBreakPoint
00000000`0012f780 00000000`010022e9 StackTrace64+0x1d6
00000000`0012fea0 00000000`01002357 FaultingSubfunction3+0x9
00000000`0012fed0 00000000`01002452 FaultingFunction3+0x17
00000000`0012ff00 00000000`01002990 wmain+0x82
00000000`0012ff30 00000000`777dcdcd __tmainCRTStartup+0x120
00000000`0012ff60 00000000`7792c6e1 BaseThreadInitThunk+0xd
00000000`0012ff90 00000000`00000000 RtlUserThreadStart+0x1d

(Note that the stack walk routine doesn’t include DbgBreakPoint or the StackTrace64 frame itself. Also, for brevity, I have snipped the verbose, unchanging parts of the nonvolatile context from all but the first frame.)

Other interesting uses for the unwind metadata include logging periodic stack traces at choice locations in your program for later analysis when a debugger is active. This is even more powerful on x64, especially when you are dealing with third party code, as even without special compiler settings, you are guaranteed to get good data with no invasive use of symbols. And, of course, a good knowledge of the fundamentals of how the exception/unwind metadata works is helpful for debugging failure reporting code (such as custom unhandled exception filter) on x64.

Hopefully, you’ve found this series both interesting and enlightening. In a future series, I’m planning on running through how exception dispatching works (as opposed to unwind dispatching, which has been relatively thoroughly covered in this series). That’s a topic for another day, though.

Programming against the x64 exception handling support, part 6: Frame consolidation unwinds

Sunday, January 14th, 2007

In the last post in the programming x64 exception handling series, I described how collided unwinds were implemented, and just how they operate. That just about wraps up the guts of unwinding (finally), except for one last corner case: So-called frame consolidation unwinds.

Consolidation unwinds are a special form of unwind that is indicated to RtlUnwindEx with a special exception code, STATUS_UNWIND_CONSOLIDATE. This exception code changes RtlUnwindEx’s behavior slightly; it suppresses the behavior of substituting the TargetIp argument to RtlUnwindEx with the unwound context’s Rip value.

As far as RtlUnwindEx goes, that’s all that there is to consolidation unwinds. There’s a bit more that goes on with this special form of unwind, though. Specifically, as with longjump style unwinds, there is special logic contained within RtlRestoreContext (used by RtlUnwindEx to realize the final, unwound execution context) that detects the consolidation unwind case (by virtue of the ExceptionCode member of the ExceptionRecord argument), and enables a special code path. As it is currently implemented, some of the logic relating to unwind consolidation is also resident within the pseudofunction RcFrameConsolidation. This function is tightly coupled with RtlUnwindContext; it is only separated into another logical function for purposes of describing RtlRestoreContext and RcFrameConsolidation in the unwind metadata for ntdll (or ntoskrnl).

The gist of what RtlRestoreContext/RcFrameConsolidation does in this case is essentially the following:

  1. Make a local copy of the passed-in context record.
  2. Treating ExceptionRecord->ExceptionInformation[0] as a callback function, this callback function is called (given a single argument pointing to the ExceptionRecord provided to RtlRestoreContext).
  3. The return address of the callback function is treated as a new Rip value to place in the context that is about to be restored.
  4. The context is restored as normal after the Rip value is updated based on the callback’s decision.

The callback routine pointed to by ExceptionRecord->ExceptionInformation[0] has the following signature:

// Returns a new RIP value to be used in the restored context.
typedef ULONG64 (* PFRAME_CONSOLIDATION_ROUTINE)(
   __in PEXCEPTION_RECORD ExceptionRecord
   );

After the confusing interaction between multiple instances of the case function that is collided unwinds, frame consolidation unwinds may seem a little bit anti-climactic.

Frame consolidation unwinds are typically used in conjunction with C++ try/catch/throw support. Specifically, when a C++ exception is thrown, a consolidation unwind is executed within a function that contains a catch handler. The frame consolidation callback is used by the CRT to actually invoke the various catch filters and handlers (these do not necessarily directly correspond to standard C SEH scope levels). The C++ exception handling routines use the additional ExceptionInformation fields available in a consolidation unwind in order to pass information about the exception object itself; this usage of the ExceptionInformation fields does not have any special support in the OS-level unwind routines, however. Once the C++ exception is going to cross a function-level boundary, in my observations, it is converted into a normal exception for purposes of unwinding and exception handling. Then, consolidation unwinds are used again to invoke any catch handlers encountered within the next function (if applicable).

Essentially, consolidation unwinds can be thought as a normal unwind, with a conditionally assigned TargetIp whose value is not determined until after all unwind handlers have been called, and the specified context has been unwound. In most circumstances, this functionality is not particularly useful (or critical) when speaking in terms of programming the raw OS-level exception support directly. Nonetheless, knowing how it works is still useful, if only for debugging purposes.

I’m not covering the details of how all of the C++ exception handling framework is built on top of the unwind handling routines in this post series; there is no further OS-level “awareness” of C++ exceptions within the OS’s unwind related support routines, however.

Next up: Tying it all together, or using x64’s improved exception handling support to our gain in the real world.

Programming against the x64 exception handling support, part 5: Collided unwinds

Tuesday, January 9th, 2007

Previously, I discussed the internal workings of RtlUnwindEx. While that posting covered most of the inner details regarding unwind support, I didn’t fully cover some of the corner cases.

Specifically, I haven’t yet discussed just what a “collided unwind” really is, other than providing vague hints as to its existance. A collided unwind occurs when an unwind handler initiates a secondary unwind operation in the context of an unwind notification callback. In other words, a collided unwind is what occurs when, in the process of a stack unwind, one of the call frames changes the target of an unwind. This has several implications and requirements in order to operate as one might expect:

  1. Some unwind handlers that were on the original unwind path might no longer be called, depending on the new unwind target.
  2. The current unwind call stack leading into RtlUnwindEx will need to be interrupted.
  3. The new unwind operation should pick up where the old unwind operation left off. That is, the new unwind operation shouldn’t start unwinding the exception handler stack; instead, it must unwind the original stack, starting from the call frame after the unwind handler which initiated the new unwind operation.

Because of these conditions, the implementation of collided unwinds is a bit more complicated than one might expect. The main difficulty here is that the second unwind operation is initiated within the call stack of an existing unwind operation, but what the unwind handler “wants” to do is to unwind the stack that was already being unwound, except to a different target and with different parameters.

From an unwind handler’s perspective, all that needs to be done to accomplish this is to make a call to RtlUnwindEx in the context of an unwind handler callback for an unwind operation, and RtlUnwindEx magically takes care of all of the work necessary to make the collided unwind “just work”.

Allowing this sort of unwind operation to “just work” requires a bit of creative thinking from the perspective of RtlUnwindEx, however. The main difficult here is that RtlUnwindEx, when called from the unwind handler, somehow needs a way to recover the original context that was being unwound in order to “pick up” where the original call to RtlUnwindEx “left off” (when it called an unwind handler that initiated a collided unwind). Because there is no provision for passing a context record to RtlUnwindEx and indicating that RtlUnwindEx should use it as a starting point for an unwind operation (RtlUnwindEx always initiates the unwind in the current call stack), this poses a problem; how is RtlUnwindEx to recover the original unwind parameters from where it should initiate the “real” unwind?

The way that Microsoft decided to solve this problem is an elegant little hack of sorts. The solution all comes down to that mysterious exception handler around RtlpExecuteHandlerForUnwind: RtlpUnwindHandler. Recall from the previous article that RtlUnwindEx calls RtlpExecuteHandlerForUnwind in order to invoke an exception handler for unwind purposes, and that RtlpExecuteHandlerForUnwind sets up an exception handler (RtlpUnwindHandler) before calling the requested exception handler for unwind. At the time, these extra steps (the use of RtlpExecuteHandlerForUnwind, and its exception handler) probably looked a bit redundant, and in the process of a “conventional” unwind operation, the extra work that RtlUnwindEx goes through before calling an unwind handler doesn’t even come into play as adding any value.

That all changes when a collided unwind occurs, however. In the collided unwind case, RtlpExecuteHandlerForUnwind and RtlpUnwindHandler are critical to solving the problem of how to recover the original unwind parameters so that RtlUnwindEx can perform an unwind operation on the correct call stack. In order to understand just how RtlpUnwindHandler and friends come into play with a collided unwind, it’s necessary to take a closer look about just what RtlUnwindEx will do when it is called from the context of an unwind handler.

Since RtlUnwindEx always begins a call frame unwind from the currently active call stack, the second call to RtlUnwindEx will start unwinding the call stack of the unwind handler that called RtlUnwindEx. But wait, you might say – this isn’t what is supposed to happen! It turns out that unwinding the unwind handler’s call stack will actually lead up to the “right thing” happening, through a bit of clever use of how “conventional” unwind operations work. To better understand what I mean, it’s helpful to look at the stack of a secondary call to RtlUnwindEx (initiating a collided unwind operation). For this purpose, I’ve put together a small problem that initiates a collided unwind (more on how and why you might see a collided unwind in the “real world” later). I’ve set a breakpoint on RtlUnwindEx, and skipped forward until I encountered the nested call to RtlUnwindEx that was initiating a collided unwind operation:

0:000> k
Child-SP          Call Site
00000000`0012e058 ntdll!RtlUnwindEx
00000000`0012e060 ntdll!local_unwind+0x1c
00000000`0012e540 TestApp!`FaultingFunction2'::`1'::fin$2+0x34
00000000`0012e570 ntdll!_C_specific_handler+0x140
00000000`0012e5e0 ntdll!RtlpExecuteHandlerForUnwind+0xd
00000000`0012e610 ntdll!RtlUnwindEx+0x236
00000000`0012ec90 TestApp!UnwindExceptionHandler2+0xf8
00000000`0012f1b0 TestApp!`FaultingFunction2'::`1'::filt$1+0xe
[...]

At this point, given what we know about RtlUnwindEx, it will start unwinding the stack downward. Since the target of the collided unwind will by definition be lower in the stack than the unwind handler’s stack pointer itself, RtlUnwindEx will continue unwinding downward, calling unwind handlers (if any) for each successive frame. Taking a look at the call stack, we can determine that there are no frames with an exception handler marked for unwind (denotated by a [ U ] in the !fnseh output):

0:000> !fnseh ntdll!RtlUnwindEx
ntdll!RtlUnwindEx L295 22,0A [   ]  (none)
0:000> !fnseh ntdll!local_unwind
ntdll!local_unwind L24 07,02 [   ]  (none)
0:000> !fnseh 00000000`01001f04 
1001ed0 L3a 06,02 [   ]  (none)
0:000> !fnseh ntdll!_C_specific_handler+0x140
ntdll!_C_specific_handler L16a 20,0C [   ]  (none)

(Here, 00000000`01001f04 corresponds to TestApp!`FaultingFunction2′::`1′::fin$2+0x34).

Because none of these call frames have an exception handler marked for unwind callbacks, we can surmise that RtlUnwindEx will blissfully unwind past all of these call frames just as one might expect. At this point, RtlUnwindEx is still unwinding the “wrong” stack though; we’d like it to be unwinding the stack passed to the original call to RtlUnwindEx, and not the unwind/exception handler call stack.

Something that one might not immediately expect happens when RtlUnwindEx reaches the next frame, however. Remember that the current call frame is now _C_specific_handler – the C-language exception handler for the current function that was originally being unwound after an exception occured. This means that the next call frame will be the original RtlUnwindEx, or more precisely, RtlpExecuteHandlerForUnwind.

This is where RtlpExecuteHandlerForUnwind and RtlpUnwindHandler get to shine. If we take a look at the next call frame in the debugger, we see that it is indeed RtlpExecuteHandlerForUnwind, and that it also has (as expected) an exception handler marked for unwind support: RtlpUnwindHandler.

0:000> !fnseh ntdll!RtlpExecuteHandlerForUnwind+0xd
ntdll!RtlpExecuteHandlerForUnwind L13 04,01 [EU ]
  ntdll!RtlpUnwindHandler (assembler/unknown)

Because this call frame does have an exception handler that supports unwind callouts, it will be returned to RtlUnwindEx by RtlVirtualUnwind. This, in turn, will lead to RtlUnwindEx calling RtlpUnwindHandler, as registered by RtlpExecuteHandlerForUnwind in the original call stack (by RtlUnwindEx). We can verify this in the debugger:

0:000> bp ntdll!RtlpUnwindHandler
0:000> g
Breakpoint 1 hit
ntdll!RtlpUnwindHandler:
00000000`779507e0 488b4220  mov rax,qword ptr [rdx+20h]
0:000> k
Child-SP          Call Site
00000000`0012d9a8 ntdll!RtlpUnwindHandler
00000000`0012d9b0 ntdll!RtlpExecuteHandlerForUnwind+0xd
00000000`0012d9e0 ntdll!RtlUnwindEx+0x236
00000000`0012e060 ntdll!local_unwind+0x1c
00000000`0012e540 TestApp!`FaultingFunction2'::`1'::fin$2+0x34
00000000`0012e570 ntdll!_C_specific_handler+0x140
00000000`0012e5e0 ntdll!RtlpExecuteHandlerForUnwind+0xd
00000000`0012e610 ntdll!RtlUnwindEx+0x236
00000000`0012ec90 TestApp!UnwindExceptionHandler2+0xf8
00000000`0012f1b0 TestApp!`FaultingFunction2'::`1'::filt$1+0xe
[...]

This is where things start to get a little interesting. From the discussion in the previous article, we know that RtlpUnwindHandler essentially does the following:

  1. Retrieve the PDISPATCHER_CONTEXT argument that RtlpExecuteHandlerForUnwind (the original instance, from the original unwind operation initiated by the first call to RtlUnwindEx) saved on its stack. This is done via the use of the EstablisherFrame argument to RtlpUnwindHandler.
  2. Copy the contents of RtlpExecuteHandlerForUnwind’s DISPATCHER_CONTEXT over the DISPATCHER_CONTEXT of the current RtlUnwindEx instance, through the PDISPATCHER_CONTEXT argument provided to RtlpUnwindHandler. Note that the TargetIp member of the DISPATCHER_CONTEXT is not copied from RtlpExecuteHandlerForUnwind’s DISPATCHER_CONTEXT.
  3. Return the manifest ExceptionCollidedUnwind constant to the caller (RtlpExecuteHandlerForUnwind, which will in turn return this value to RtlUnwindEx).

After all this is done, control returns to RtlUnwindEx. Because RtlpExecuteHandlerForUnwind returned ExceptionCollidedUnwind, though, a previously unused code path is activated. This code path (as described previously) copies the contents of the DISPATCHER_CONTEXT structure whose address was passed to RtlpExecuteHandlerForUnwind back into the internal state of RtlUnwindEx (including the context record), and then attempts to re-start unwinding of the current stack frame.

If you’ve been paying attention so far, then you probably understand what is going to happen next.

Because of the fact that RtlpUnwindHandler copied the DISPATCHER_CONTEXT from the original call to RtlUnwindEx over the DISPATCHER_CONTEXT from the current (collided unwind) call to RtlUnwindEx, the current instance of RtlUnwindEx now has access to all of the state information that the original RtlUnwindEx instance had placed into the PDISPATCHER_CONTEXT passed to RtlpExecuteHandlerForUnwind. Most importantly, this includes access to the original context record descibing the call frame that the original instance of RtlUnwindEx was in the process of unwinding.

Since all of this information has now been copied over the current RtlUnwindEx instance’s internal state, in effect, the current instance of RtlUnwindEx will (for the next unwind iteration) start unwinding the stack where the original RtlUnwindEx instance stopped; in other words, the stack being unwound “jumps” from the currently active call stack to the exception (or other) call stack that was originally being unwound.

At this point, the second instance of RtlUnwindEx is all setup to unwind the call stack to the new unwind target frame (and target instruction pointer; remember that TargetIp was omitted from the copying performed on the PDISPATCHER_CONTEXT in RtlpUnwindHandler) like a “conventional” unwind. The rest is, as they say, history.

Now that we know how collided unwinds work, it is important to know when one would ever see such a thing (after all, interrupting an unwind in-progress is a fairly invasive and atypical operation).

It turns out that collided unwinds are not quite as far-fetched as they might seem; the easiest way to cause such an event is to do something sleazy like execute a return/goto/continue/break to transfer control out of a __finally block. This, in effect, requires that the compiler stop the current unwind operation and transfer control to the target location (which is usually within the function that contained the __finally that the programmer jumped out of). Nevertheless, the compiler still has to deal with the fact that it has been called in the context of an unwind operation, and as such it needs a way to “break out” of the unwind call stack. This is done by executing a “local unwind”, or an unwind to a location within the current function. In order to do this, the compiler calls a small, runtime-supplied helper function known as local_unwind. This function is described below, and is essentially an extremely thin wrapper around RtlUnwindEx that, in practice, adds no value other than providing some default argument values (and scratch space on the stack for RtlUnwindEx to use to store a CONTEXT structure):

0:000> uf ntdll!local_unwind
ntdll!local_unwind:
00000000`7796f580 4881ecd8040000 sub  rsp,4D8h
00000000`7796f587 4d33c0         xor  r8,r8
00000000`7796f58a 4d33c9         xor  r9,r9
00000000`7796f58d 4889642420     mov  qword ptr [rsp+20h],rsp
00000000`7796f592 4c89442428     mov  qword ptr [rsp+28h],r8
00000000`7796f597 e844b0fdff     call ntdll!RtlUnwindEx
00000000`7796f59c 4881c4d8040000 add  rsp,4D8h
00000000`7796f5a3 c3             ret

When the compiler calls local_unwind as a result of the programmer breaking out of a __finally block in some fashion, then execution will eventually end up in RtlUnwindEx. From there, RtlUnwindEx eventually detects the operation as a collided unwind, once it unwinds past the original call to the original unwind handler that started the new unwind operation via local_unwind.

As a result, breaking out of a __finally block instead of allowing it to run to completion (which may result in control being transferred out of the “current function”, from the programmer’s perspective, and “into” the next function in the call stack for unwind processing) is how every-day programs can end up causing a collided unwind.

Next time: More unwind estorica, including details on how RtlUnwindEx and RtlRestoreContext lay the groundwork used to build C++ exception handling support.

Programming against the x64 exception handling support, part 4: Unwind internals (RtlUnwindEx implementation)

Monday, January 8th, 2007

In the previous article in this series, I discussed the external interface exposed by RtlUnwindEx (and some of how unwinding works at a high level). This posting continues that discussion, and aims to provide insight into the internal workings of RtlUnwindEx (and as such, the inner details of all of the different aspects of unwind support on x64 Windows).

As previously described, the main behavior of RtlUnwindEx is to systematically unwind call frames (with the help of RtlVirtualUnwind) until a specific call frame, which is identified by the TargetFrame argument, is reached. RtlUnwindEx is also responsible for all interactions with language exception handlers for purposes of unwind operations. Additionally, RtlUnwindEx also imposes various validations and restrictions on execution contexts being unwound, and on the behavior of exception handlers being called for an unwind operation.

The first order of business within RtlUnwindEx is to capture the execution context at the time of the call to RtlUnwindEx (specifically, the execution context inside RtlUnwindEx, not of the caller of RtlUnwindEx). This is done with the aid of two helper functions, RtlpGetStackLimits (which retrieves the bounds of the stack for the current thread from the NT_TIB region of the current threads’ TEB), and RtlCaptureContext (which records the complete execution context of its caller within a standard CONTEXT structure). Additionally, if an unwind table is supplied, a special flag is set in it that optimizes the behavior of subsequent calls to RtlLookupFunctionTable for lookups that are unwind-driven (this is a behavior new to Windows Vista, and is a further attempt to improve the performance of unwind support on x64).

If the caller did not supply an EXCEPTION_RECORD argument, RtlUnwindEx will create the default STATUS_UNWIND exception record at this time and substitute it for what would have otherwise been a caller-supplied EXCEPTION_RECORD block. The exception record is initialized with an ExceptionAddress pointing to the Rip value captured previously by RtlCaptureContext, and with no parameters. Additionally, an initial ExceptionFlags value of EXCEPTION_UNWINDING is set, to later indicate to any exception handlers that might be called that an unwind operation is in progress (the EXCEPTION_RECORD pointer, either caller supplied or locally allocated by RtlUnwindEx in the absence of a caller-supplied value, corresponds exactly to the EXCEPTION_RECORD argument passed to any LanguageHandler that is called during unwind processing).

In the event that the caller of RtlUnwindEx did not supply a TargetFrame argument (indicating that the requested unwind operation is an exit unwind), then the EXCEPTION_EXIT_UNWIND flag is set within RtlUnwindEx’s internal ExceptionFlags value. An exit unwind is a special form of unwind where the “target” of the unwind is unknown; in other words, the caller does not have a valid target frame pointer to supply to RtlUnwindEx. Initiating a target unwind is normally dangerous unless the caller has special knowledge of an unwind handler in the call stack that will halt the unwind operation prematurely (either by initiating a secondary unwind, which leads to what is called a collided unwind, or by exiting the thread entirely). The reason for this restriction is that as RtlUnwindEx doesn’t have a clear “stopping point” to halt the unwind cycle at, it will happily unwind past the end of the stack (typically resulting in an access violation) unless an unwind handler along the way does something to halt the unwind. Most unwind operations are not exit unwinds.

At this point, RtlUnwindEx is set up to enter the main loop of the unwind algorithm, which essentially involves repeated calls to RtlVirtualUnwind, and then to unwind handlers (if present). This main loop involves multiple steps:

  1. The RUNTIME_FUNCTION entry for the current frame (given by the Rip member of the context record captured above, and later updated in this loop) is located via RtlLookupFunctionEntry. If no function entry is present, then RtlUnwindEx will load Context->Rip with a ULONG64 value located at Context->Rsp, and then increment Context->Rsp by 8. The behavior when there is no RUNTIME_FUNCTION entry present accounts for leaf functions, for which unwind metadata is optional. If the current frame is a leaf function, then control skips forward to step 8.
  2. Assuming that a RUNTIME_FUNCTION was found, RtlUnwindEx makes a copy of the current execution context that will be unwound – something I call the “unwind context”. After duplicating the context (via the RtlpCopyContext helper function, which only duplicates the non-volatile context), RtlVirtualUnwind is called (with the unwind context), and requested to return the address any associated language handler that is marked for unwind support. RtlVirtualUnwind thus returns several useful pieces of information; a language handler supporting unwind (if any), an updated context describing the caller of the requested call frame, a language-handler-specific (i.e. C scope table) data pointer associated with the requested call frame (if any), and the stack pointer of the call frame being unwound (the establisher frame). These pieces of information are used later in communication with a returned exception handler with unwind support, if one exists.
  3. After calling RtlVirtualUnwind to establish the context of the next location on the stack frame (now contained within the “unwind context” location), RtlUnwindEx performs some validation of the returned EstablisherFrame value. Specifically, the EstablisherFrame value is ensured to be 8-byte aligned and within the stack limits of the current thread (in kernel mode, there is also special support for handling the case of an unwind occcuring within the context of a DPC, which may operate under a secondary stack). If either of these conditions does not hold true, a STATUS_BAD_STACK exception is raised, indicating that the stack pointer in the requested call frame is damaged or corrupted. Additionally, if a TargetFrame value is specified (that is, the unwind operation is not an exit unwind), then the TargetFrame value is validated to be greater than or equal to the EstablisherFrame value returned by RtlVirtualUnwind. This is, in effect, a sanity check designed to ensure that the unwind target actually refers to a previous call frame and not that one that has already be unwound. If this check fails, then a STATUS_BAD_STACK exception is raised.
  4. If a language handler was returned by RtlVirtualUnwind, then RtlUnwindEx sets up for a call to the language handler. This involves the initial setup of a DISPATCHER_CONTEXT structure created on the stack of RtlUnwindEx. The DISPATCHER_CONTEXT structure describes some internal state that RtlUnwindEx shares with all participants in the unwind process, such as language handlers being called for unwind. It contains all of the state information necessary to coordinate operation between RtlUnwindEx and any language handler. Furthermore, it is also instrumental in the processing of collided unwinds; more on that later. The newly initialized DISPATCHER_CONTEXT contains two fields of significance, initially; the TargetIp field (which is simply a copy of the TargetIp argument to RtlUnwindEx), and the ScopeIndex field (which is zero initialized). Both of these fields are unused by RtlUnwindEx itself, and are simply available for the conveniene of language handlers being called for an unwind operation. If no language handler was present for the requested call frame, then control skips forward to step 8.
  5. At this point, RtlUnwindEx is ready to make a call to an unwind handler. This first involves a quick check to determine whether the end of the unwind chain has been reached, through comparing the current frame’s EstablisherFrame value with the TargetFrame argument to RtlUnwindEx. If the two frame pointers match exactly, then the ExceptionFlags value passed in to the unwind handler has an additional bit set, EXCEPTION_TARGET_UNWIND. This flag bit lets the unwind handler know that it is the “last stop” in the unwind process (in other words, that there will be no further frame unwinds after the unwind handler finishes processing). At this point, the ReturnValue argument passed to RtlUnwindEx is copied into the Rax register image in the active context for the current frame (not the unwound context, which refers to the previous frame). Then, the last remaining fields of the DISPATCHER_CONTEXT structure are initialized based on the internal state of RtlUnwindEx; the image base, handler data, instruction pointer (ControlPc), function entry, establisher frame, and language handler values previously returned by RtlLookupFunctionEntry and RtlVirtualUnwind are copied into the DISPATCHER_CONTEXT structure, along with a pointer to the context record describing the execution state at the current frame. After the ExceptionFlags member of RtlUnwindEx’s EXCEPTION_RECORD structure is set, the stack-based exception flags image (from which the copy in the EXCEPTION_RECORD was copied from) has the EXCEPTION_TARGET_UNWIND and EXCEPTION_COLLIDED_UNWIND flags cleared, to ensure that these flags are not inadvertently passed to an exception routine unexpectedly in a future loop iteration.
  6. After preparing the DISPATCHER_CONTEXT for the unwind handler call, RtlUnwindEx makes a call to a small helper function, RtlpExecuteHandlerForUnwind. RtlpExecuteHandlerForUnwind is an assembly-language routine whose prototype matches that of the language specific handler, given below:
    typedef EXCEPTION_DISPOSITION (*PEXCEPTION_ROUTINE) (
        IN PEXCEPTION_RECORD               ExceptionRecord,
        IN ULONG64                         EstablisherFrame,
        IN OUT PCONTEXT                    ContextRecord,
        IN OUT struct _DISPATCHER_CONTEXT* DispatcherContext
    );

    RtlpExecuteHandlerForUnwind is fairly straightforward. All it does is store the DispatcherContext argument on the stack, and then make a call to the LanguageHandler member in the DISPATCHER_CONTEXT structure. RtlpExecuteHandler then returns the return value of the LanguageHandler itself.

    While this may seem like a rather useless helper routine at first, RtlpExecuteHandlerForUnwind actually does add some value, although it might not be immediately apparent unless one looks closely. Specifically, RtlpExecuteHandlerForUnwind registers an exception/unwind handler for itself (RtlpUnwindHandler). RtlpUnwindHandler does not go through _C_specific_handler; in other words, it is a raw exception handler registration. Like RtlpExecuteHandlerForUnwind, RtlpUnwindHandler is a raw assembly language routine. It, too, is fairly simple (and as a language-level exception handler routine, RtlpUnwindHandler is compatible with the LanguageHandler prototype described above); RtlpUnwindHandler uses the EstablisherFrame argument given to a LanguageHandler routine to locate the saved pointer to the DISPATCHER_CONTEXT on the stack of RtlpExecuteHandlerForUnwind, and then copies most of the DISPATCHER_CONTEXT structure passed to RtlpExecuteHandlerForUnwind over the DISPATCHER_CONTEXT structure that was passed to RtlpUnwindHandler itself (conspicuously omitted from the copy is the TargetIp member of the DISPATCHER_CONTEXT structure, for reasons that will become clear later). After performing the copy of the DISPATCHER_CONTEXT structure, RtlpUnwindHandler returns the manifest ExceptionCollidedUnwind constant. Although one might naively assume that all of this just leads up to protecting against the case of an unwind handler throwing an exception, it actually has a much more common (and significant) use; more on that later.

  7. After RtlpExecuteHandlerForUnwind returns, RtlUnwindEx decides what course of action to persue based on the return value. There are two legal return values from an exception handler called for unwind, ExceptionContinueSearch (the general “success”) return, and ExceptionCollidedUnwind. If any other value is returned, then RtlUnwindEx raises a STATUS_INVALID_DISPOSITION exception, indicating that an unwind handler has returned an illegal value (this is typically rarely seen in practice, as most unwind handlers are compiler generated, and therefore always get the return value correct). If ExceptionContinueSearch is returned, and the current EstablisherFrame doesn’t match the TargetFrame argument, then the unwind context and the context for the “current frame” are swapped (this positions the current frame context as referring to the context of the next function in the call chain, which will then be duplicated and unwound in the next loop iteration). If ExceptionCollidedUnwind is returned, then the execution path is a little bit more complicated. In the collided unwind case, all of the internal state information that RtlUnwindEx had previously copied into the DISPATCHER_CONTEXT structure passed to RtlpExecuteHandler back out of the DISPATCHER_CONTEXT structure. RtlVirtualUnwind is then executed to determine the next lowest call frame using the context copied out of the DISPATCHER_CONTEXT structure, the EXCEPTION_COLLIDED_UNWIND flag is set, and control is transferred to step 5. This step may initially seem strange, but it will become clear after it is explained later.
  8. If control reaches this point, then a frame has been successfully unwound, and any applicable unwind handler has been notified of the unwind operation. The next step is a re-validation of the EstablisherFrame value (as it may have changed in the collided unwind case). Assuming that EstablisherFrame is valid, if its value does not match the TargetFrame argument, then control is transferred to step 1. Otherwise, if there is a match, then the loop terminates. (If the EstablisherFrame is not valid, and is not the expected TargetFrame value, then either the unwind exception record is raised as an exception, or a STATUS_BAD_FUNCTION_TABLE exception is raised.)

At this point, RtlUnwindEx has arrived at its target frame, and all intermediary unwind handlers have been called. It is now time to transfer control to the unwind point. The ReturnValue argument is again loaded into the current frame’s context (Rax register), and if the exception code supplied by the RtlUnwindEx caller via the ExceptionRecord argument does not match STATUS_UNWIND_CONSOLIDATE, the Rip value in the current frame’s context is replaced with the TargetIp argument.

The final task is to realize the finalized context; this is done by calling RtlRestoreContext, passing it the current frame’s context and the ExceptionRecord argument (or the default exception record constructed if no ExceptionRecord argument was supplied). RtlRestoreContext will in most cases simply copy the given context into the currently active register set, although in two special cases (if a STATUS_LONGJUMP or STATUS_UNWIND_CONSOLIDATE exception code is set in the optional ExceptionRecord argument), this behavior deviates from the norm. In the long jump case (as previously documented), the ExceptionRecord argument is assumed to contain a pointer to a jmp_buf, which contains a nonvolatile register set to restore on top of the unwound context supplied by RtlUnwindEx. The unwind consolidate case is rather more complicated, and will be discussed in a future posting.

For reference, I have posted some annotated, reverse engineered C and assembler code describing the internal operations of RtlUnwindEx and several of its helper functions (such as RtlpUnwindHander). This C code is based off of the Windows Vista implementation of RtlUnwindEx, and as such takes advantage of new Windows Vista-specific optimizations to unwind handling. Specifically, the “Unwind” flag in the UNWIND_HISTORY_TABLE structure is new in Windows Vista (although the size of the structure has not changed; there used to be empty alignment padding at that offset in previous Windows versions). This flag is used as a hint to RtlLookupFunctionEntry, in order to expedite lookup of function entries for some commonly referenced functions in the unwind path. Between the provided comments and the above description of the overall functionality of RtlUnwindEx, the inner workings of it should begin to come clear. There are some aspects (in particular, collided unwind) that are a bit more complicated than one might initially imagine; I’ll discuss collided unwinds (and more) in the next posting in this series.

It would be best to call the system version of RtlUnwindEx instead of reimplementing it for general purpose use (which I have done so here primarily to illustrate how unwind works on x64 Windows). There have been improvements made to RtlUnwindEx between Windows Server 2003 SP1 x64 and Windows Vista x64, so it would be unwise to assume that RtlUnwindEx will remain devoid of new performance or feature additions forever.

Next up: Collided unwinds, and other things that go “bump” in the dark when you use compiler exception handling and unwind support.

Programming against the x64 exception handling support, part 3: Unwind internals (RtlUnwindEx interface)

Sunday, January 7th, 2007

Previously, I provided a brief overview of what each of the core APIs relating to x64’s extensive data-driven unwind support were, and when you might find them useful.

This post focuses on discussing the interface-level details of RtlUnwindEx, and how they relate to procedure unwinding on Windows (x64 versions, specifically, though most of the concepts apply to other architecture in principle).

The main workhorse of unwind support on x64 Windows is RtlUnwindEx. As previously described, this routine encapsulates all of the work necessary to restore execution context to a prior point in the call stack (relying on RtlVirtualUnwind for this task). RtlUnwindEx also implements all of the logic relating to interactions with unwind/exception handlers during the unwind process (which is essentially the value added by RtlUnwindEx on top of what RtlVirtualUnwind implements).

In order to understand the inner workings of how unwinding works, it is first necessary to understand the high level theory behind how RtlUnwindEx is used (as RtlUnwindEx is at the heart of unwind support on Windows). Although there have been previously posted articles that touch briefly on how unwind is implemented, none that I have seen include all of the details, which is something that this segment of the x64 exception handling series shall attempt to correct.

For the moment, it is simpler to just consider the unwind half of exception handling. The nitty-gritty, exhaustive details of how exceptions are handled and dispatched will be discussed in a future posting; for now, assume that we are only interested in the unwind code path.

When a procedure unwind is requested, by any place within the system, the first order of business is a call to RtlUnwindEx. The prototype for RtlUnwindEx was provided in a previous posting, but in an effort to ensure that everyone is on the same page with this discussion, here’s what it looks like for x64:

VOID
NTAPI
RtlUnwindEx(
   __in_opt ULONG64               TargetFrame,
   __in_opt ULONG64               TargetIp,
   __in_opt PEXCEPTION_RECORD     ExceptionRecord,
   __in     PVOID                 ReturnValue,
   __out    PCONTEXT              OriginalContext,
   __in_opt PUNWIND_HISTORY_TABLE HistoryTable
   );

These parameters deserve perhaps a bit more explanation.

  1. TargetFrame describes the stack pointer (rsp) value for the target of the unwind operation. In normal circumstances, this is always the EstablisherFrame argument to an exception handler that is handling an exception. In the context of an exception handler, EstablisherFrame refers to the stack pointer of the caller of the function that caused the exception being inspected. Likewise, in this context, TargetFrame refers to the stack pointer of the function that the call stack should be unwound to. Although given the fact that with data-driven unwind semantics, one might initially think that this argument is unnecessary (after all, one might assume that RtlUnwindEx could simply invoke RtlVirtualUnwind in order to determine the expected stack pointer value for the next function on the call stack), this argument is actually required. The reason is that RtlUnwindEx supports unwinding past multiple procedure frames; that is, RtlUnwindEx can be used to unwind to a function that is several levels down in the call stack, instead of the immediately lower function in the call stack. Note that the TargetFrame argument must match exactly the expected stack pointer value of the target function in the call stack.

    Observant readers may pick up on the SAL annotation describing the TargetFrame argument and notice that it is marked as optional. In general, TargetFrame is always supplied; it can be omitted in one specific circumstance, which is known as an exit unwind; more on that later.

  2. TargetIp serves a similar purpose as TargetFrame; it describes the instruction pointer value that execution should be unwound to. TargetIp must be an instruction in the same function on the call stack that corresponds to the target stack frame described by TargetFrame. This argument is supplied as a particular function may have multiple points that could be resumed in response to an exception (this typically the case if there are multiple try/except clauses).

    Like TargetFrame, the TargetIp argument is also optional (though in most cases, it will be present). Specifically, if a frame consolidation unwind operation is being executed, then the TargetIp argument will be ignored by RtlUnwindEx and may be set to zero if desired (it will, however, still be passed to unwind handlers for use as they see fit). This specialized unwind operation will be discussed later, along with C++ exception support.

  3. ExceptionRecord is an optional argument describing the reason for an unwind operation. This is typically the same exception record that was indicated as the cause of an exception (if the caller is an exception handler), although it does not strictly have to be as such. If no exception record is supplied, RtlUnwindEx constructs a default exception record to pass on to unwind handlers, with an exception code of STATUS_UNWIND and an exception address referring to an instruction within RtlUnwindEx itself.
  4. ReturnValue describes a pointer-sized value that is to be placed in the return value register at the completion of an unwind operation, just before control is transferred to the newly unwound context. The interpretation of this value is entirely up to the routine being unwound into. In practice, the Microsoft C/C++ compiler does not use the return value at all in typical cases. Usually, the Microsoft C/C++ compiler will indicate the exception code that caused the exception as the return value, but due to how unwinding across functions works with try/except, there is no language-level support for retrieving the return value of a function that has been unwound due to an exception. As a result, in most circumstances, the return value placed in the unwound execution context based on this argument is ignored.
  5. OriginalContext describes an out-only pointer to a context record that is updated with the execution context as procedure call frames are unwound. In practice, as RtlUnwindEx does not ever “return” to its caller, this value is typically only provided as a way for a caller to supply its own storage to be used as scratch space by RtlUnwindEx during the intermediate unwind operations comprimising an unwind to the target call frame. Typically, the context record passed in to an exception handler from the exception dispatcher is supplied. Because the initial contents of the OriginalContext argument are not used, however, this argument need not necessarily be the context record passed in from the exception dispatcher.
  6. HistoryTable describes a cache used to improve the performance of repeated function entry lookups via RtlLookupFunctionEntry. Under normal circumstances, this is the same history table passed in from the exception dispatcher to an exception handler, although it could also be a caller-allocated structure as well. This argument can also be safely omitted entirely, although if a non-trivial set of call frames are being unwound, passing in even a newly-initialized history table may improve performance.

Given all of the above information, RtlUnwindEx performs a procedure call unwind by performing a successive sequence of RtlVirtualUnwind calls (to determine the execution context of the next call frame in the call stack), followed by a call to the registered language handler for the call frame (if one exists and is marked for unwinding support). In most cases where there is a language unwind handler, it will point to _C_specific_handler, which internally searches all of the internal exception handling scopes (e.g. try/except or try/finally constructs), calling “finally” handlers as need be. There may also be internal unwind handlers that are present in the scope table for a particular function, such as for C++ destructor support (assuming asynchronous C++ exception handling has been enabled). Most users will thus interact with unwind handlers in the form of a “finally” handler in a try/finally construct in a function whose language handler refers to _C_specific_handler.

If RtlUnwindEx encounters a “leaf function” during the unwind process (a leaf function is a function that does not use the stack and calls no subfunctions), then it is possible that there will be no matching RUNTIME_FUNCTION entry for the current call frame returned by RtlLookupFunctionEntry. In this case, RtlUnwindEx assumes that the return address of the current call frame is at the current value of Rsp (and that the current call frame has no unwind or exception handlers). Because the x64 calling convention enforces hard rules as to what functions without RUNTIME_FUNCTION registrations can do with the stack, this is a valid assumption for RtlUnwindEx to make (and a necessary assumption, as there is no way to call RtlVirtualUnwind on a function with no matching RUNTIME_FUNCTION entry). The current call frame’s value of Rsp (in the context record describing the current call frame, not the register value of rsp itself within RtlUnwindEx) is dereferenced to locate the call frame’s return address (Rip value), and the saved Rsp value is then adjusted accordingly (increased by 8 bytes).

When RtlUnwindEx locates the endpoint frame of the unwind, a special flag (EXCEPTION_TARGET_UNWIND) is set in the ExceptionFlags member of the EXCEPTION_RECORD passed to the language handler. This flag indicates to the language handler (and possibly any C-language scope handlers) that the handler is being called as the “final destination” of the unwind operation. The Microsoft C/C++ compiler does not expose functionality to detect whether a “finally” handler is being called in the context of a target unwind or if the “finally” handler is simply being called as an intermediate step towards the unwind target.

After the last unwind handler (if applicable) has been called, RtlUnwindEx restores the execution context that has been continually updated by successive calls to RtlVirtualUnwind. This restoration is performed by a call to RtlRestoreContext (a documented, exported function), which simply transfers a given context record to the thread’s execution context (thus “realizing” it).

RtlUnwindEx does not return a value to its caller. In fact, it typically does not return to its caller at all; the only “return” path for RtlUnwindEx is in the case where the passed-in execution context is corrupted (typically due to a bogus stack pointer), or if an exception handler does something illegal (such as returning an unrecognized EXCEPTION_DISPOSITION) value. In these cases, RtlUnwindEx will raise a noncontinuable exception describing the problem (via RtlRaiseStatus). These error conditions are usually fatal (and are indicative of something being seriously corrupted in the process), and virtually always result in the process being terminated. As a result, it is atypical for a caller of RtlUnwindEx to attempt to handle these error cases with an exception handler block.

In the case where RtlUnwindEx performs the requested unwind successfully, a new execution context describing the state at the requested (unwound) call frame is directly realized, and as such RtlUnwindEx does not ever truly return in the success case.

Although RtlUnwindEx is principally used in conjunction with exception handling, there are other use cases implemented by the Microsoft C/C++ compiler which internally rely upon RtlUnwindEx in unrelated capacities. Specifically, RtlUnwindEx implements the core of the standard setjmp and longjmp routines (assuming the exception safe versions of these are enabled by use of the <setjmpex.h> header file) provided by the C runtime library in the Microsoft CRT.

In the exception-safe setjmp/longjmp case, the jmp_buf argument essentially contains an abridged version of the execution context (specifically, volatile register values are omitted). When longjmp is called, the routine constructs an EXCEPTION_RECORD with STATUS_LONGJUMP as the exception code, sets up one exception information parameter (which is a pointer to the jmp_buf), and passes control to RtlUnwindEx (for the curious, the x64 version of the jmp_buf structure is described as _JUMP_BUFFER in setjmp.h under the _M_AMD64_ section). In this particular instance, the ReturnValue argument of RtlUnwindEx is significant; it corresponds to the value that is seemingly returned by setjmp when control is being transferred to the saved setjmp context as part of a longjmp call (somewhat similar in principal as to how the UNIX fork system call indicates whether it is returning to the child process or the parent process). The internal operations of RtlUnwindEx are identical whether it is being used for the implementation of setjmp/longjmp, or for conventional exception-handler-based triggering of procedure call frame unwinding.

However, there are differences that appear when RtlUnwindEx restores the execution context via RtlRestoreContext. There is special support inside RtlRestoreContext for STATUS_LONGJUMP exceptions with one exception information parameter; if this situation is detected, then RtlRestoreContext internally reinitializes portions of the passed-in context record based on the jmp_buf pointer stored in the exception information parameter block of the exception record provided to RtlRestoreContext by RtlUnwindEx. After this special-case partial reinitialization of the context record is complete, RtlRestoreContext realizes the context record as normal (causing execution control to be transferred to the stored Rip value). This can be seen as a hack (and a violation of abstraction layers; there is intended to be a logical separation between operating system level SEH support, and language level SEH support; this special support in RtlRestoreContext blurs the distinction between the two for C language support with the Microsoft C/C++ compiler). This layering violation is not the most egregious in the x64 exception handling scene, however.

This concludes the basic overview of the interface provided by RtlUnwindEx. There are some things that I have not yet covered, such as exit unwinds, collided unwinds, or the deep integration and support for C++ try/catch, and some of the highly unsavory things done in the name of C++ exception support. Next time: A walkthrough of the complete internal implementation of RtlUnwindEx, including undocumented, never-before-seen (or barely documented) corner cases like exit unwinds or collided unwinds (the internals of C++ exception support from the perspective of RtlUnwindEx are reserved for a future posting, due to size considerations).

Think before you optimize

Friday, December 29th, 2006

“Premature optimization is the root of all evil” is a famous quote in computer science, and it absolutely holds true. Before optimizing a problem, you must make sure that you are optimizing the bottleneck, and that your optimization doesn’t actually make things worse.

These rules may seem obvious, but not everyone adheres to them; you’d be surprised how many newsgroup postings I see where people are asking how to solve the wrong problem because they didn’t take the time to profile their program and locate their real bottleneck.

One example of this kind of premature (or perhaps just not all the way thought-through) optimization that bothers me on a daily basis is in the Microsoft Terminal Server client (mstsc.exe). Terminal Server is a remote windowing protocol, and as such it is designed to take great pains to improve responsiveness to users. In most cases, improving responsiveness over the network involves minimizing the amount of data sent from the server to the client. In this spirit, the designers of Terminal Server implemented an innocent-seeming optimization, wherein the Terminal Server client detects when it has been minimized. If this occurs, the Terminal Server client sends a special message to the server asking that it stop sending window updates to the client. When the user restores the Terminal Server client window, the server will resync with the client.

This may seem like a clever little optimization with no downsides at first, but it turns out that it actually worsens the user experience (at least in my opinion) when you look at things a little bit closer. First, there’s how Terminal Server resynchronizes with the client when the client requests that it again wants to receive windowing data. Windows follows the model of not saving data that can be recalculated on demand in its user interface design. In many ways, this is a perfectly valid model, and there are a number of valid reasons for it (especially given that as you open more windows, it starts to become non-trivially-expensive to cache bitmap data for every window on the screen – even more especially on the very low end systems that 16-bit Windows has to work on). As a result, when Windows wants to retrieve the contents of a window on screen, the typical course of action is that a WM_PAINT message is sent to the window. This message asks the window to draw itself into a device context, or a storage area where the bits can then be transferred to the screen, a printer, or any other visual display device.

If you’ve been paying attention, then you might be seeing where this starts to go wrong with Terminal Server. When you restore a minimized Terminal Server client window, the client asks the server to resynchronize. This is necessary because the server has since stopped sending updates to the client, which means that the client has to assume that its display data is now stale. In order to do this resynchronization, Terminal Server has to figure out what contents have changed on the overall desktop bitmap that describes the entire screen. Terminal Server is free to cache the entire contents of a session’s interactive desktop as a whole (and indeed this is necessary so that during resynchronization, the entire desktop doesn’t have to be transferred as a bitmap to the client). However, it still needs to compare the last copy of the bitmap that was sent to a client with the “current” view of the desktop. In order to do that, Terminal Server essentially does something along the lines of asking each visible window on the desktop to paint itself. Then, Terminal Server can update the client with new display data for each window.

The problem here is that many programs don’t repaint themselves in a very graceful fashion. Many programs have unpleasant tendencies like triggering multiple draw operations over the same region before the end result is achieved, something that manifests itself as a very slightly annoying flicker when a window repaints. Even Microsoft programs exhibit this problem; for instance, Visual Studio 2005 tends to do this, as does Internet Explorer when drawing pages with foreground images overlayed with background images.

Now, while this may be a minor annoyance when working locally, it turns out to be a big problem when a program is running over Terminal Server. What would otherwise be an innocuous flicker over the course of a couple of milliseconds on a “glass terminal” display turns into multiple draw commands being sent and realized over the network to the Terminal Server client. This translates to bandwidth waste as redundant draw commands are transmitted, and even worse, a lack of responsiveness when restoring a minimized Terminal Server client window (due to having to wait on the programs on Terminal Server to finish updating themselves in the resynchronization process). If you have several programs running on the Terminal Server, this can correspond to three or four seconds of waiting before the Terminal Server session is responsive to input from the client.

While this is annoying in and of itself, it may still not seem all that bad. After all, this problem only happens if you minimize and restore a window, and you generally don’t just minimize and restore windows all the time, right? It turns out that with the Terminal Server client, most people do just that, if they are working in fullscreen mode. Remember that fullscreen Terminal Server obscures the task bar on the physical client computer, and in many cases, results in task switching keystrokes such as Alt+Tab or the Windows key being sent to the remote Terminal Server session and not the physical client system. In order to switch to another program on the physical client computer, then, one needs to either minimize the Terminal Server window or (perhaps temporarily) take it out of fullscreen mode. At least for me, if I want to switch to a program on the physical client system, the logical choice is to hit the minimize button on the Terminal Server client info bar at the top of the fullscreen Terminal Server client window. Unfortunately, that little minimize button invokes the clever redraw optimization that stops the server from updating the client. This means that when I switch back to the Terminal Server session, I need to wait several seconds while programs running in the Terminal Server session finish redrawing themselves and transmitting the draw operations to the client (which is especially painful if you are dealing with bitmaps, such as Internet Explorer on a page with foreground images overlaying a background image).

As a result, thanks to somebody’s “clever optimization”, my Terminal Server sessions now take several seconds to come back when I switch away from them to work on something locally (perhaps to copy and paste some text from the remote system to my computer) and then switch back.

Now, Terminal Server is a great example of a highly optimized program on the whole (and it’s absolutely usable, don’t get me wrong about that). It beats the pants off of VNC and any of the other remote windowing systems that I have ever used any day of the week, for one. However, this just goes to show that even with the best of intentions, one little optimization can blow up in unintended (negative) ways if you are not careful.

Oh, and if you run into this little annoyance as frequently as I do, there is one thing that you may be able to do that alleviates it (at least looking to the future, anyway). When using the Windows Vista (or later) Terminal Server client to connect to a Windows Vista or Windows Server “Longhorn” Terminal Server (or Remote Desktop), you can prevent this lack of responsiveness when restoring minimized Terminal Server windows by enabling desktop composition on the Terminal Server connection. This may seem a bit counter-intuitive at first (enabling 3D transparent windows would sure make you think that a lot more data would need to be transferred, thus slowing down the experience as a whole), but if you are on a high-bandwidth, low-latency link to the target computer, it turns out that desktop composition improves responsiveness when restoring minimized Terminal Server windows. This is because with desktop composition enabled, Windows breaks from the traditional model of not saving data that you can recalculate. Instead, with desktop composition enabled, Windows will save the contents of all windows on the screen for future reference, so that if Windows needs to access the bits of a window, it doesn’t need to ask that window to redraw. (This allows all sorts of neat tricks, such as how you can have a window appearing to be drawn twice with the new Alt+Tab window on Windows Vista, with the live preview, without a major performance hit – try it out with a 3D game in windowed mode to see what I mean). Because of this caching of window data, when resynchronizing with the client after a minimize and restore operation, the server end of Terminal Server doesn’t need to ask every program to redraw itself; all it needs to do is fetch the bits out of the cache that is created for each window by desktop composition (and thus the differences sent to the client will only show “real” differences, not multiple layers of a redraw operation. Try this with an Internet Explorer window open on a page with foreground images overlaying background images, and the difference is immediately visible between Terminal Server with desktop composition enabled and Terminal Server without desktop composition.) This means that there are no more painful multi-step-redraw operations that are visible in real time on the client, at least when it comes to pathological bitmap drawing cases, such as Internet Explorer (and no annoying flicker in the less severe cases).

Programming against the x64 exception handling support, part 2: A description of the new unwind APIs

Tuesday, December 19th, 2006

Last time, I described many of the structures and prototypes necessary to program against the new x64 exception handling (EH) support. This posting continues that series, and describes how to manually initiate an unwind of procedure frames (and when and why you might want to do this).

Because x64 has built-in support for data-driven unwinding, there are a great many interesting things that you can do with unwinding functions at arbitrary points in execution. Unlike x86, you don’t have to either assume that all functions use a frame pointer (which is typically not the case in many programs), and you don’t need to call code with a certain register context setup in the correct way (with the right local variables at the right displacements from the stack pointer) in order to initiate an unwind of a function that had registered an unwind or exception handler.

If you’ve been reading some of my recent postings about performing stack traces on x86, then you one of the first things that might come to mind is designing an approach that can create a “perfect” call stack in all situations without symbols. There are other benefits to this data-driven unwind data approach, however, than simply being able to take accurate call stacks at arbitrary points in the execution process. For instance, there are particularly interesting benefits as far as instrumentation and code analysis go (such as an improved ability to detect most functions in an image programmatically with a great deal of certainty based on unwind data), and there are interesting implications for techniques such as function patching and modification on the fly as well.

First things first, however. The initial step is to get familiar with the new unwinding APIs that Win64 exposes on x64. Although these APIs can be manually duplicated by explicit parsing of all unwind information, I would recommend calling the APIs directly instead of doing all of the work to manually emulate unwinds yourself. The reason that I make that recommendation is that while the unwind metadata is documented, there is still a significant amount of work involved in reimplementing them from scratch, and the unwind APIs themselves are (mostly) documented on MSDN and thus unlikely to change.

There are several APIs in particular that you’ll frequently find yourself using for unwind support on x64. These APIs are available in both user mode and kernel mode (and aside for a lack of support for dynamically generated unwind data) the two operating environments use exactly the same semantics for unwinding. Thus, for the most part, you can interact with unwind metadata in the same fashion for both user mode and kernel mode.

  1. RtlLookupFunctionEntry: The first API that you’ll likely end up having to call for any unwind-related operation is RtlLookupFunctionEntry. This routine is the basis of all unwind operations in that it allows the caller to translate a raw 64-bit RIP value into two important values: An image base for any associated image in the address space of the caller, and a pointer to the RUNTIME_FUNCTION structure associated with the RIP value passed in. For virtually all cases on x64, you’ll be able to retrieve a valid RUNTIME_FUNCTION structure for the current RIP value. The exception to this rule relates to what are known as leaf functions, or functions that both make no direct modifications to the stack pointer (or any nonvolatile registers), and do not call any subfunctions. For these leaf functions only, the emission of unwind metadata is optional by the compiler. To handle this case, it is typical to read the first ULONG64 from the current RSP value (i.e. the return address of the current leaf function). This address can then be passed to RtlLookupFunctionEntry. Because leaf functions do not touch any nonvolatile registers or alter the stack pointer or call any subfunctions, they can be safely skipped in the unwind process in this fashion. (Virtually all functions in a given x64 binary are non-leaf functions (otherwise known as frame functions), or functions that do not meet the previously described three criteria. In either case, however, the restrictions on leaf functions mean that they do not impact the ability to perform complete unwinds despite the lack of unwind metadata associated with them.)

    The typical usage case for RtlLookupFunctionEntry is simply to retrieve the function entry for the currently executing function. (For leaf functions, it may be necessary to retrieve the function entry for the caller, if there is no unwind metadata for the current function, as described above.) Then, the PRUNTIME_FUNCTION returned is typically passed to one of the “high level” unwind support routines, although if necessary, it can be manually interpreted directly (this is typically not required, however).

  2. RtlVirtualUnwind: The RtlVirtualUnwind API is the core of the Win64 x64 unwind support. This API implements the lowest level interface exposed for interacting with unwind metadata through a RUNTIME_FUNCTION. In particular, it implements all of the code necessary to interpret UNWIND_CODEs and adjust the stack and nonvolatile register context according to the unwind information specified via a RUNTIME_FUNCTION. It also has logic to locate and execute exception or unwind handlers for a given function.

    RtlVirtualUnwind provides the infrastructure upon which higher level exception and unwind handling support is implemented. It exposes the concept of a virtual unwind (as one might guess, given the routine’s name). The virtual unwind concept is one that is entirely new to x64 (and IA64), and does not exist in any form on x86. This is due entirely to the fact that IA64 and x64 have data-driven unwind support, while x86 has code-driven unwind support.

    The distinction is important in that on x64 and IA64, it is possible to simulate an unwind, at an arbitrary point in time, without running code with potentially unknown side effects (or unknown entry conditions, as with x86 exception or unwind handlers that utilize local variables). This is accomplished by interpreting the unwind codes described by a RUNTIME_FUNCTION and associated UNWIND_INFO blocks. This is the essence of what a virtual unwind is; a simulated unwind operation that can operate on an arbitrary, isolated register context without affecting (or otherwise impacting) the actual realized state of the program. In its purest form, a virtual unwind can be accomplished by invoking RtlVirtualUnwind with a register context that you wish to have the unwind applied to, and the UNW_FLAG_NHANDLER flag value for the HandlerType parameter (which suppresses the invokation of any unwind or exception handlers registered by the function).

    This is a very powerful capability indeed, as it allows for a much more complete and thorough traversal of call frames than ever possible on x86. With the ability to describe and undo the changes to nonvolatile registers given an initial register context and stack, virtual unwinding allows programmatic, completely-reliable access to not only the return address, stack frame, and arguments of arbitrary functions at any point in an active call stack, but also access to nonvolatile register values at any point in a call stack. If you have ever debugged optimized code where parameter values and intermediate values are frequently only present in registers, then you can immediately see how valuable this particular benefit of virtual unwinding is to debugging (it is important to note that as volatile registers are not saved anywhere, it is not necessarily possible to reconstruct their values at any point in the call frame).

    It is also possible to use RtlVirtualUnwind to effect a “realized” unwind, and indeed, RtlVirtualUnwind is the cornerstore on which the rest of the unwinding architecture in Win64 x64 is built. By directing RtlVirtualUnwind to call unwind (or exception) handlers, as appropriate, and then further altering the returned context (such as by specifying a return value), it is possible to perform a complete “realized” unwind from a procedure at an arbitrary point in execution.

  3. RtlUnwindEx: RtlUnwindEx supplants the RtlUnwind API that exists on x86 for purposes of implementing a “hard unwind” that alters the realized execution state of the program. RtlUnwindEx is a natural extension of RtlUnwind that includes support for features new to 64-bit exception handling support. Unlike RtlUnwind, it can operate on a register context other than the current register context.

    RtlUnwindEx implements an unwind that calls all of the necessary unwind handlers necessary to unwind to a particular point. It also adjusts the register context based on the unwind metadata at the given procedure frame being unwound. Internally, RtlUnwindEx is essentially implemented as a wrapper that calls RtlVirtualUnwind and registered unwind handlers as necessary for each frame in between the active frame and the target frame. It also houses all of the logic necessary to deal with some of the other subtleties of unwinding, such as detection of a bogus stack pointer value in the passed in register context.

    RtlUnwindEx is useful if you are needing to execute a complete unwind (and only a complete unwind) of a particular procedure frame or set of procedure frames. In most cases where you would be doing this, it is usually sufficient to just be relying on the language-level exception handling support, so I consider RtlUnwindEx as relatively uninteresting (at least when compared to RtlVirtualUnwind). Many of the more interesting use cases for directly calling the x64 exception handling support thus require the use of RtlVirtualUnwind directly (although selectively unwinding past certain procedure frames with complete support for calling unwind handlers is made easier by direct usage of RtlUnwindEx).

  4. RtlCaptureStackBackTrace: The RtlCaptureStackBackTrace routine is essentially a high-level implementation of a stack walking routine that utilizes the lower level unwind support (in particular, RtlVirtualUnwind). Unlike StackWalk64, RtlCaptureStackBackTrace is very light-weight and does not use symbols (it is implemented entirely with the unwind metadata present on x64). As such, it does not exist on x86. It is, however, handy for quickly capturing stack traces (and can be used in both user mode and kernel mode in the same fashion). RtlCaptureStackBackTrace does not return non-volatile register contexts for each frame being traced, however, so if you require this functionality, then you would need to implement your own stack trace mechanism on top of RtlVirtualUnwind. (It is worth noting that this is sort of mechanism is essentially what functionality like handle tracing and page heap tracing are built on top of, to give you an idea of how useful it can be.) If you only need return addresses for each frame, however, then RtlCaptureStackBackTrace is an excellent API to consider for use if you need to log stack traces at periodic locations in your own programs for later analysis (especially since it doesn’t require anything as invasive as loading symbols).

That’s all there is in this posting. More details on how to use the new unwind support next time…

Vista ASLR is not on by default for image base addresses

Saturday, December 16th, 2006

This little tidbit seems to be missed in all of the press about Vista’s ASLR implementation: Vista ASLR (when speaking of randomizing image base addresses) does not apply to image bases by default. This is a sacrifice for application compatibility’s sake, in an effort to make fewer programs break “out of the box” on Vista. Most notably, this is the case even for images with base relocations.

Unfortunately, the mechanism to mark an executable image as “ASLR aware” (such that it can be freely rebased by Vista’s ASLR) is not at present documented. Furthermore, the linker version that is included with Visual Studio 2005 and the Windows Vista Platform SDK does not support the option necessary to mark as image as ASLR aware (though you could technically modify the image by hand with a hex editor or the like to enable it).

The WDK linker does support the new ASLR-enabling linker option, however (though it too does not appear to document it anywhere). You can find references to this new linker option in makefile.new:

!if defined(NO_DYNAMICBASE)
DYNAMICBASE_FLAG=
!else
! if $(_NT_TOOLS_VERSION) >= 0x800
DYNAMICBASE_FLAG=/dynamicbase
! else
DYNAMICBASE_FLAG=
! endif
!endif

Passing /dynamicbase to the WDK version of link.exe (8.00.50727.215) or later will set the 0x40 DllCharacteristics value in the PE header of the output binary. This corresponds to a newly-defined constant which is at present only documented briefly in the WDK version of ntimage.h:

#define IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE 0x0040
     // DLL can move.

If this flag is set, then the base address of an image can be randomized by Vista’s ASLR; if the flag is clear, however, then no ASLR-style randomizations are performed to the image base address of a particular image (in this case, however, it is important to note that heap and stack allocations are still randomized – it is only the image base address that does not become randomized).

Now, virtually all of the Microsoft PE images that ship with the operating system are built with /dynamicbase, so they will take full advantage of Vista’s ASLR with respect to image base randomization. However, third party (ISV)-built programs will not, by default, gain all the benefits of ASLR due to this application compatibility sacrifice. This is where the potential problem is, as effectively all existing third party PE images will need to be recompiled to enable ASLR on image base addresses. (Technically, you could use link /edit with the WDK linker to do this without a rebuild, or hex edit binaries, but this is not a real solution in my mind. In Microsoft’s defense, many third-party .exe files are often built without base relocations, which means that even if Microsoft had enabled ASLR by default, many third party programs would still not be getting the full benefit. This does not, however, mean that I fully agree with their decision…)

I can understand where Microsoft is coming from with an application compatibility perspective as far as ASLR’s impact on poorly written programs (of which there are an abundance of in the Windows world), but it is a bit unfortunate that there is no real way to administratively enable ASLR globally, or at least administratively make it an opt-out instead of opt-in setting.

So, if you are an ISV, here’s a heads up to be on the lookout for a link.exe version shipping with Visual Studio that supports /dynamicbase. When such becomes available, I would highly recommend enabling /dynamicbase for all of your projects (so long as you aren’t doing anything terribly stupid in your programs, enabling image base randomizations should be fairly harmless in most cases). You should also mark your .exe files as /FIXED:NO such that they contain a relocation section. This, when combined with /dynamicbase, will allow your .exe files to be randomized by ASLR (just the same as with DLLs that have relocation information and are built with /dynamicbase).

Update: Visual Studio 2005 SP1 has shipped. This update to Visual Studio includes a newer version of the linker, which supports the /dynamicbase option described above. So, be sure to rebuild your programs with /dynamicbase and /fixed:no with VS 2005 SP1 in order to take full advantage of ASLR on Vista.