Why does every heap trace in UMDH get stuck at “malloc”?

February 21st, 2008

One of the more useful tools for tracking down memory leaks in Windows is a utility called UMDH that ships with the WinDbg distribution. Although I’ve previously covered what UMDH does at a high level, and how it functions, the basic principle for it, in a nutshell, is that it uses special instrumentation in the heap manager that is designed to log stack traces when heap operations occur.

UMDH utilizes the heap manager’s stack trace instrumentation to associate call stacks with outstanding allocations. More specifically, UMDH is capable of taking a “snapshot” of the current state of all heaps in a process, associating like-sized allocations from like-sized callstacks, and aggregrating them in a useful form.

The general principle of operation is that UMDH is typically run two (or more times), once to capture a “baseline” snapshot of the process after it has finished initializing (as there are expected to always be a number of outstanding allocations while the process is running that would not be normally expected to be freed until process exit time, for example, any allocations used to build the command line parameter arrays provided to the main function of a C program, or any other application-derived allocations that would be expected to remain checked out for the lifetime of the program.

This first “baseline” snapshot is essentially intended to be a means to filter out all of these expected, long-running allocations that would otherwise show up as useless noise if one were to simply take a single snapshot of the heap after the process had leaked memory.

The second (and potentially subsequent) snapshots are intended to be taken after the process has leaked a noticeable amount of memory. UMDH is then run again in a special mode that is designed to essentially do a logical “diff” between the “baseline” snapshot and the “leaked” snapshot, filtering out any allocations that were present in both of them and returning a list of new, outstanding allocations, which would generally include any leaked heap blocks (although there may well be legitimate outstanding allocations as well, which is why it is important to ensure that the “leaked” snapshot is taken only after a non-trivial amount of memory has been leaked, if at all possible).

Now, this is all well and good, and while UMDH proves to be a very effective tool for tracking down memory leaks with this strategy, taking a “before” and “after” diff of a problem and analyzing the two to determine what’s gone wrong is hardly a new, ground-breaking concept.

While the theory behind UMDH is sound, however, there are some situations where it can work less than optimally. The most common failure case of UMDH in my experience is not actually so much related to UMDH itself, but rather the heap manager instrumentation code that is responsible for logging stack traces in the first place.

As I had previously discussed, the heap manager stack trace instrumentation logic does not have access to symbols, and on x86, “perfect” stack traces are not generally possible, as there is no metadata attached with a particular function (outside of debug symbols) that describes how to unwind past it.

The typical approach taken on x86 is to assume that all functions in the call stack do not use frame pointer omission (FPO) optimizations that allow the compiler to eliminate the usage of ebp for a function entirely, or even repurpose it for a scratch register.

Now, most of the libraries that ship with the operating system in recent OS releases have FPO explicitly turned off for x86 builds, with the sole intent of allowing the built-in stack trace instrumentation logic to be able to traverse through system-supplied library functions up through to application code (after all, if every heap stack trace dead-ended at kernel32!HeapAlloc, the whole concept of heap allocation traces would be fairly useless).

Unfortunately, there happens to be a notable exception to this rule, one that actually came around to bite me at work recently. I was attempting to track down a suspected leak with UMDH in one of our programs, and noticed that all of the allocations were grouped into a single stack trace that dead-ended in a rather spectacularly unhelpful way. Digging in a bit deeper, in the individual snapshot dumps from UMDH contained scores of allocations with the following backtrace logged:

00000488 bytes in 0x1 allocations
   (@ 0x00000428 + 0x00000018) by: BackTrace01786

        7C96D6DC : ntdll!RtlDebugAllocateHeap+000000E1
        7C949D18 : ntdll!RtlAllocateHeapSlowly+00000044
        7C91B298 : ntdll!RtlAllocateHeap+00000E64
        211A179A : program!malloc+0000007A

This particular outcome happened to be rather unfortunate, as in the specific case of the program I was debugging at work, virtually all memory allocations in the program (including the ones I suspected of leaking) happened to ultimately get funneled through malloc.

Obviously, getting told that “yes, every leaked memory allocation goes through malloc” isn’t really all that helpful if (most) every allocation in the program in question happened to go through malloc. The UMDH output begged the question, however, as to why exactly malloc was breaking the stack traces. Digging in a bit deeper, I discovered the following gem while disassembling the implementation of malloc:

0:011> u program!malloc
program!malloc
[f:\sp\vctools\crt_bld\self_x86\crt\src\malloc.c @ 155]:
211a1720 55              push    ebp
211a1721 8b6c2408        mov     ebp,dword ptr [esp+8]
211a1725 83fde0          cmp     ebp,0FFFFFFE0h
[...]

In particular, it would appear that the default malloc implementation on the static link CRT on Visual C++ 2005 not only doesn’t use a frame pointer, but it trashes ebp as a scratch register (here, using it as an alias register for the first parameter, the count in bytes of memory to allocate). Disassembling the DLL version of the CRT revealed the same problem; ebp was reused as a scratch register.

What does this all mean? Well, anything using malloc that’s built with Visual C++ 2005 won’t be diagnosable with UMDH or anything else that relies on ebp-based stack traces, at least not on x86 builds. Given that many things internally go through malloc, including operator new (at least in the default implementation), this means that in the default configuration, things get a whole lot harder to debug than they should be.

One workaround here would be to build your own copy of the CRT with /Oy- (force frame pointer usage), but I don’t really consider building the CRT a very viable option, as that’s a whole lot of manual work to do and get up and running correctly on every developer’s machine, not to mention all the headaches that service releases that will require rebuilds will bring with such an approach.

For operator new, it’s fortunately relatively doable to overload it in a relatively supported way to be implemented against a different allocation strategy. In the case of malloc, however, things don’t really have such a happy ending; one is either forced to re-alias the name using preprocessor macro hackery to a custom implementation that does not suffer from a lack of frame pointer usage, or otherwise change all references to malloc/free to refer to a custom allocator function (perhaps implemented against the process heap directly instead of the CRT heap a-la malloc).

So, the next time you use UMDH and get stuck scratching your head while trying to figure out why your stack traces are all dead-ending somewhere less than useful, keep in mind that the CRT itself may be to blame, especially if you’re relying on CRT allocators. Hopefully, in a future release of Visual Studio, the folks responsible for turning off FPO in the standard OS libraries can get in touch with the persons responsible for CRT builds and arrange for the same to be done, if not for the entire CRT, then at least for all the code paths in the standard heap routines. Until then, however, these CRT allocator routines remain roadblocks for effective leak diagnosis, at least when using the better tools available for the job (UMDH).

Power management observations with my XV6800

December 11th, 2007

As I previously mentioned, I recently switched to a full-blown Windows Mobile-based phone. One of the interesting conundrums that I’ve ran into along the way is feeling out what’s actually killer on the battery life for the device and what isn’t.

It’s actually kind of surprising how seemingly innocent little things can have a dramatic impact on battery life in a situation like an always-on handheld phone. For example, one of the main power saving features of modern CDMA cell networks and packet switched data connections is the ability for the device to go into dormant mode, a state in which the device relinquishes it’s active radio link resources and essentially goes into a standby mode that is in some respects akin to how the device operates when it is otherwise idle and waiting for a call. This is beneficial for the device’s battery life (because while in dormant mode, much of the logic related to maintaining the active high-rate radio link can be powered off while the network link is idle). Dormant mode is also beneficial for network operators, in that it allows the resources previously dedicated to a particular device to be used by other users that are actively sending or receiving data.

When data is ready to be sent on either end of the link, the device will “wake up” and exit dormant mode in order to send/receive data, keeping the link fully active until the session has been idle for enough time for the dormant idle timer to expire. (Entering and exiting dormant mode is mostly seamless, but may involve delays of 1-2 seconds, at least in my experience, so doing so on every packet would be highly undesirable.)

What all of this means is that there’s some significant gains to be had in terms of radio power consumption if network traffic can be limited to where it is really necessary. There are a couple of things that I had initially running on my device that didn’t really fit this criteria. For example, I have an always-active SSH connection to a screen session that, among other things, runs my console SILC client, which is based off of irssi (an IRC client). One gotcha that I encountered there is that in the default configuration, this client has a clock on the ncurses console UI. The clock is updated every minute with the current time, which is (normally) handy as you can at-a-glance compare the current timestamp with message timestamps in the backlog.

However, this becomes problematic in my case, because the clock forces the device out of dormant mode every minute in order to repaint that portion of the console UI. (Aside from the clock, the remainder of the SILC/irssi UI is static until there is actual chat activity, which would otherwise make it relatively well suited for this situation.) Fortunately, it turned out to be relatively easy to remove the clock from the client’s UI, which prevents the SSH session from causing network activity every minute while I’ve got the SILC client up.

Another network power management “gotcha” that I ran into was the built-in L2TP/IPsec VPN client. It turns out that while an L2TP/IPsec link is active, the device was, in my case, unable to enter dormant mode for any non-trivial length of time. I suspect that this is caused by the default behavior of L2TP to send keep-alive messages every minute to ensure that the session is still alive (at the very least, packet captures VPN-server-side did not seem to indicate any traffic destined to the device’s projected IP Address).

In any case, this unpleasant side effect of the L2TP/IPsec client ruled out using it for full network connectivity in an always-on fashion. Unfortunately, dialing the VPN link on-demand has it’s own share of pitfalls, as the device seems to make the VPN link the default gateway. When I configured IMAPv4 mail to be fetched every so often via demand-dialing the VPN link (which would alleviate the difficulty with dormant mode not operating as desired with the VPN link always on), my SSH session would often die if either side happened to try and send data while Pocket Outlook was polling IMAP mail.

My solution in this case was to fall back to using SSH port-forwards through PocketPuTTY for IMAP mail (after I fixed SSH port forwards to work reliably, that is, of course). Unlike the L2TP/IPsec link, the SSH link doesn’t inherently force the device out of dormant mode without there being real activity, which allows me to continue to leave the SSH connection always on without sacrificing my battery life.

These two changes seem to have fixed the worst offenders in terms of battery drain, but there’s still plenty of room for improvement. For example, PocketPuTTY will tell the remote end of the session that the window has resized when you switch the device from portrait to landscape mode (or vice versa), such as when pulling out the device’s built in keyboard. While this is a very handy feature if you’re actually using the SSH session (as it allows the remote console to resize itself appropriately, which is how SILC/irssi magically reshapes itself to fit the screen when the device keyboard is engaged), it does present the annoyance that pulling out the keyboard while PocketPuTTY is not active will still cause the packet radio link to be fully established to send the terminal resize notification and receive the resulting redraw data.

This behavior could, for instance, be further optimized to defer sending such notifications until the PocketPuTTY window becomes the foreground window (potentially aborting the resize notification entirely if the screen size is switched back to how it previously was before PocketPuTTY is activated).

One of the nice things about having a (relatively) easily end-user-programmable device instead of a locked-down, DRM’ified BREW-based handset is that there’s actually opportunity to change these sorts of things instead of just blithely accepting what’s offered with the device with no real hope of improving it.

How to not write code for a mobile device

December 10th, 2007

Earlier this week, I got a shiny, brand-new XV6800 to replace my aging (and rather limited in terms of running user-created programs) BREW-ified phone.

After setting up ActiveSync, IPsec, and all of the other usual required configuration settings, the next step was, of course, to install the baseline minimum set of applications on the device to get by. Close to the top of this list is an SSH client (being able to project arbitrary console programs over SSH and screen to the device is, at the very least, a workable enough solution in the interm for programs that are not feasible to run directly on the device, such as my SILC client). I’ve found PuTTY a fairly acceptable client for “big Windows”, and there just so happened to be a Windows Mobile port of it. Great, or so I thought…

While PocketPuTTY does technically work, I noticed a couple of deficiencies with it pretty quickly:

  1. The page up / page down keys are treated the same as the up arrow / down arrow keys when sent to the remote system. This sucks, because I can’t access my scrollback buffer in SILC easily without page up / page down. Other applications can tell the difference between the two keys (obviously, or there wouldn’t be much of a point to having the keys at all), so this one is clearly a PocketPuTTY bug.
  2. There is no way to restart a session once it is closed, or at least, not that I’ve found. Normally, on “big Windows”, there’s a “Restart Session” menu option in the window menu of the PuTTy session, but (as far as I can tell) there’s no such equivalent to the window menu on PocketPuTTY. There is a “Tools” menu, although it has some rather useless menu items (e.g. About) instead of some actually useful menu items, like “Restart Session”.
  3. Running PocketPuTTY seems to have a significantly negative effect on battery life. This is really unfortunate for me, since the expected use is to leave an SSH session to a terminal running screen for a long period of time. (Note that this was resolved, partially, by locating a slightly more maintained copy of PocketPuTTY.)
  4. SSH port forward support seems to be fairly broken, in that as soon as a socket is cleaned up, all receives in the process break untiil you restart it. This is annoying, but workable if one can go without SSH port forwards.

Most of these problems are actually not all that difficult to fix, and since the source code is available, I’m actually planning on trying my hand at doing just that, since I expect that this is an app that I’ll make fairly heavy use of.

The latter problem is one I really want to call attention to among these deficiencies, however. My intent here is not to bash the PocketPuTTY folks (and I’m certainly happy that they’ve at least gotten a (semi)-working port of PuTTY with source code out, so that other people can improve on it from there), but rather to point out some things that should really just not be done when you’re writing code that is intended to run on a mobile device (especially if the code is intended to run exclusively on a mobile device).

On a portable device, one of the things that most users really expect is long battery life. Though this particular point certainly holds true for laptops as well, it is arguably even more important that for converged mobile phone devices. After all, most people consider their phone an “always on” contact mechanism, and unexpectedly running out of battery life is extremely annoying in this aspect. Furthermore, if your mobile phone has the capability to run all sorts of useful programs on it, but doing so eats up your battery in a couple of hours, then there is really not that much point in having that capability at all, is there?

Returning to PocketPuTTY, one of the main problems (at least with the version I initially used) was, again, that PocketPuTTY would reduce battery life significantly. Looking around in the source code for possible causes, I noticed the following gem, which turned out to be the core of the network read loop for the program.

Yes, there really is a Sleep(1) spin loop in there, in a software port that is designed to run on battery powered devices. For starters, that blows any sort of processor power management completely out of the water. Mobile devices have a lot of different components vying for power, and the easiest (and most effective) way to save on power (and thus battery life) is to not turn those components on. Of course, it becomes difficult to do that for power hungry components like the 400MHz CPU in my XV6800 if there’s a program that has an always-ready-to-run thread…

Fortunately, there happened to be a newer revision of PocketPuTTY floating about with the issue fixed (although getting ahold of the source code for that version proved to be slightly more difficult, as the original maintainer of the project seems to have disappeared). I did eventually manage to get into contact with someone who had been maintaining their own set of improvements and grab a not-so-crusty-old source tree from them to do my own improvements, primarily for the purposes of fixing some of the annoyances I mentioned previously (thus beginning my initial forey into Windows CE development).

Mini USB charging for devices is a great idea

December 6th, 2007

One recent innovation that I recently stumbled across is the use of mini USB ports for power. I actually originally encountered a device doing this with the Morotola HT820 Bluetooth headphones I got some months ago, although at the time, I didn’t realize that the power brick that came with the device was mini USB and not yet another proprietary power cable connection.

For those unaware, mini USB is a small form factor of USB plug (otherwise essentially identical to standard USB). As with regular USB, mini USB devices can be off the bus. Now, standardizing on mini USB for power connectors has some really nice advantages:

  1. Everyone’s power bricks are now compatible. Normally, each gadget you own is going to have its own (incompatible) power transformer brick, which means that if you’re travelling, you’ll quite likely have to lug around several such power bricks (in addition to your laptop AC adapter), even if every device doesn’t really need to be plugged into “wall power” at the same time.
  2. You can charge these devices off of a laptop USB port. Of course, you’ll need a USB to mini USB cable, assuming your laptop doesn’t have any mini USB ports (I haven’t seen any that do). The really cool thing about being able to do this is that for all your mini USB, battery powered devices, you only need to bring one cable for all of them while on the move, since you can just plug the device into your laptop and charge it off of that.

Furthermore, because the connection is still USB, devices can use the same port to charge and transfer data to a computer at the same time.

Now, I didn’t really put all the pieces together about my HT820 being mini USB until I had to go and buy a replacement set of headphones (a different model this time, as Best Buy didn’t carry the HT820 anymore). The volume button on my set had died, which was rather inconvenient, to say the least.

Anyways, I noticed that the new headphones, from a different manufacturer, seemed to have a compatible power cable port from the AC adapter brick to the device’s charging port. This time, the power cable even had the ubiquitous USB logo on it (the Motorola’s charging cable didn’t, for one reason or another). Sure enough, I could use the Motorola charger with the new set of headphones, even though they weren’t of a Motorola make. Cool.

Recently, I finally ditched my old cell phone for a proper smart phone (an XV6800). This, too, is mini USB powered (and it can use the mini USB connection for data as well, at the same time). There’s a lot of other neat things about the XV6800, but the fact that I don’t need to (ever) worry about spare chargers or anything of that sort is really just the icing on the cake.

Thanks to this little advancement, I can now ditch two additional power bricks (one for the cell phone and one for the Bluetooth headset) when traveling, and just charge both of devices off of my laptop. I’m actually kind of surprised that nobody implemented this earlier, given the immediate advantages of standardizing on a uniform power source (especially one that is as easily multiplexed as USB is).

So, the next time you’re looking for a new gadget of some sort or another, look for one chargeable via mini USB and circumvent the gadget charger nightmare (or at least, begin to do so, one gadget at a time).

A catalog of NTDLL kernel mode to user mode callbacks, part 6: LdrInitializeThunk

November 28th, 2007

Previously, I described the mechanism by which the kernel mode to user mode callback dispatcher (KiUserCallbackDispatcher) operates, and how it is utilized by win32k.sys for various window manager related operations.

The next special NTDLL kernel mode to user mode “callback” up on the list is LdrInitializeThunk, which is not so much a callback as the entry point at which all user mode threads begin their execution system-wide. Although the Win32 CreateThread API (and even the NtCreateThread system service that is used to implement the Win32 CreateThread) provide the illusion that a thread begins its execution at a specified start routine (or instruction pointer, in the case of NtCreateThread), this is not truly the case.

CreateThread internally layers in a special kernel32 stub routine (BaseThreadStart or BaseProcessStart) in between the specified thread routine and the actual initial instruction of the thread. The kernel32 stub routine wrappers the call to the user-supplied thread start routine to provide services such a “top-level” SEH frame for the support of UnhandledExceptionFilter.

However, there exists yet another layer of indirection before a thread begins its execution at the specified thread start routine, even beyond the kernel32 stub routine at the start of all Win32 threads (the way the kernel32 stub routine works changes slightly with Windows Vista, though that is outside the scope of this discussion). The presence of this extra layer of indirection can be inferred by examining the documentation on MSDN for DllMain, which states that all threads call out to the DllMain routine at some point during thread start-up. The kernel32 stub routine is not involved in this process, and obviously the user-supplied thread entry point does not have to explicitly attempt to call DllMain for every loaded DLL with DLL_THREAD_ATTACH. This leaves us with the question of who actually arranges for these DllMain calls to happen when a thread begins.

The answer to this question is, of course, the feature routine of this article, LdrInitializeThunk. When a user mode thread is readied to begin initial execution after being created, the initial context that is realized is not the context value supplied to NtCreateThread (which would eventually end up in the user-supplied thread entry point). Instead, execution really begins at LdrInitializeThunk within NTDLL, which is supplied a CONTEXT record that describes the initially requested state of the thread (as supplied to NtCreateThread, or as created by NtCreateThreadEx on Windows Vista). This context record is provided as an argument to LdrInitializeThunk, in order to allow for control to (eventually) be transferred to the user-supplied thread entry point.

When invoked by a new thread, LdrInitializeThunk invokes LdrpInitialize to perform the remainder of the initialization tasks, and then calls upon the NtContinue system service to restore the supplied context record. I have made available a C-like representation of this process for illustration purposes.

LdrpInitialize makes a determination as to whether the process has already been initialized (for the purposes of NTDLL). This step is necessary as LdrInitializeThunk (and by extension, LdrpInitialize) is not only the actual entry point for a new thread in an already initialized process, but it is also the entry point for the initial thread in a completely new process (making it the first piece of code that is run in user mode in a new process). If the process has not already been initialized, then LdrpInitialize performs process initialization tasks by invoking LdrpInitializeProcess (some process initialization is performed inline by LdrpInitialize in the process initialization case as well). If the process is a Wow64 process, then the Wow64 NTDLL is loaded and invoked to initialize the 32-bit NTDLL’s per-process state.

Otherwise, if the process is already initialized, then LdrpInitialize invokes LdrpInitializeThread to perform per-thread initialization, which primarily involves invoking DllMain and TLS callbacks for loaded modules while the loader lock is held. (This is the reason why it is not supported to wait on a new thread to run its thread initialization routine from DllMain, because the new thread will immediately become blocked upon the loader lock, waiting for the thread already in DllMain to complete its processing.) If the process is a Wow64 process, then there is again support for making a call to the 32-bit NTDLL for purposes of running 32-bit per-thread initialization code.

When the required process or thread initialization tasks have been completed, LdrpInitialize returns to LdrInitializeThunk, which then realizes the user-supplied thread start context with a call to the NtContinue system service.

One consequence of this architecture for process and thread initialization is that it becomes (slightly) more difficult to step through process initialization, because the thread that initializes the process is not the first thread created in the process, but rather the first thread that executes LdrInitializeThunk. This means that one cannot simply create a process as suspended and attach a debugger to the process in order to step through process initialization, as the debugger break-in thread will run the very process initialization code that one wishes to step through before executing the break-in thread breakpoint instruction!

There is, fortunately, support for debugging this scenario built in to the Windows debugger package. By setting the debugger to break on the ‘create process’ event, it is possible to manually set a breakpoint on a new process (created by the debugger) or a child process (if child process debugging is enabled) before LdrInitializeThunk is run. This support is activated by breaking on the cpr (Create Process) event. For example, using the following ntsd command line, it is possible to examine the target and set breakpoints before any user mode code runs:

ntsd.exe -xe cpr C:\windows\system32\cmd.exe

Note that as the loaded module list will not have been initialized at this point, symbols will not be functional, so it is necessary to resolve the offset of LdrInitializeThunk manually in order to set a breakpoint. The process can be continued with the typical “g” command.

Update: Pavel Labedinsky points out that you can use “-xe ld:ntdll.dll” to achieve the same effect, but with the benefit that symbols for NTDLL are available. This is a better option as it alleviates the necessity to manually resolve the addresses of locations that you wish to set a breakpoint on..

Next up: Examining RtlUserThreadStart (present on Windows Vista and beyond).

A catalog of NTDLL kernel mode to user mode callbacks, part 5: KiUserCallbackDispatcher

November 21st, 2007

Last time, I briefly outlined the operation of KiRaiseUserExceptionDispatcher, and how it is used by the NtClose system service to report certain classes of handle misuse under the debugger.

All of the NTDLL kernel mode to user mode “callbacks” that I have covered thus far have been, for the most part fairly “passive” in nature. By this, I mean that the kernel does not explicitly call any of these callbacks, at least in the usual nature of making a function call. Instead, all of the routines that we have discussed thus far are only invoked instead of the normal return procedure for a system call or interrupt, under certain conditions. (Conceptually, this is similar in some respects to returning to a different location using longjmp.)

In contrast to the other routines that we have discussed thus far, KiUserCallbackDispatcher breaks completely out of the passive callback model. The user mode callback dispatcher is, as the name implies, a trampoline that is used to make full-fledged calls to user mode, from kernel mode. (It is complemented by the NtCallbackReturn system service, which resumes execution in kernel mode following a user mode callback’s completion. Note that this means that a user mode callback can make auxiliary calls into the kernel without “returning” back to the original kernel mode caller.)

Calling user mode to kernel mode is a very non-traditional approach in the Windows world, and for good reason. Such calls are typically dangerous and need to be implemented very carefully in order to avoid creating any number of system reliability or integrity issues. Beyond simply validating any data returned to kernel mode from user mode, there are a far greater number of concerns with a direct kernel mode to user mode call model as supported by KiUserCallbackDispatcher. For example, a thread running in user mode can be freely suspended, delayed for a very long period of time due to a high priority user thread, or even terminated. These actions mean that any code spanning a call out to user mode must not hold locks, have acquired memory or other resources that might need to be released, or soforth.

From a kernel mode perspective, the way a user mode callback using KiUserCallbackDispatcher works is that the kernel saves the current processor state on the current kernel stack, alters the view of the top of the current kernel stack to point after the saved register state, sets a field in the current thread (CallbackStack) to point to the stack frame containing the saved register state (the previous CallbackStack value is saved to allow for recursive callbacks), and then executes a return to user mode using the standard return mechanism.

The user mode return address is, of course, set to the feature NTDLL routine of this article, KiUserCallbackDispatcher. The way the user mode callback dispatcher operates is fairly simple. First, it indexes into an array stored in the PEB with an argument to the callback dispatcher that is used to select the function to be invoked. Then, the callback routine located in the array is invoked, and provided with a single pointer-sized argument from kernel mode (this argument is typically a structure pointer containing several parameters packaged up into one contiguous block of memory). The actual implementation of KiUserCallbackDispatcher is fairly simple, and I have posted a C representation of it.

In Win32, kernel mode to user mode callbacks are used exclusively by User32 for windowing related aspects, such as calling a window procedure to send a WM_NCCREATE message during the creation of a new window on behalf of a user mode caller that has invoked NtUserCreateWindowEx. For example, during window creation processing, if we set a breakpoint on KiUserCallbackDispatcher, we might see the following:

Breakpoint 1 hit
ntdll!KiUserCallbackDispatch:
00000000`77691ff7 488b4c2420  mov rcx,qword ptr [rsp+20h]
0:000> k
RetAddr           Call Site
00000000`775851ca ntdll!KiUserCallbackDispatch
00000000`7758514a USER32!ZwUserCreateWindowEx+0xa
00000000`775853f4 USER32!VerNtUserCreateWindowEx+0×27c
00000000`77585550 USER32!CreateWindowEx+0×3fe
000007fe`fddfa5b5 USER32!CreateWindowExW+0×70
000007fe`fde221d3 ole32!InitMainThreadWnd+0×65
000007fe`fde2150c ole32!wCoInitializeEx+0xfa
00000000`ff7e6db0 ole32!CoInitializeEx+0×18c
00000000`ff7ecf8b notepad!WinMain+0×5c
00000000`7746cdcd notepad!IsTextUTF8+0×24f
00000000`7768c6e1 kernel32!BaseThreadInitThunk+0xd
00000000`00000000 ntdll!RtlUserThreadStart+0×1d

If we step through this call a bit more, we’ll see that it eventually ends up in a function by the name of user32!_fnINLPCREATESTRUCT, which eventually calls user32!DispatchClientMessage with the WM_NCCREATE window message, allowing the window procedure of the new window to participate in the window creation process, despite the fact that win32k.sys handles the creation of a window in kernel mode.

Callbacks are, as previously mentioned, permitted to be nested (or even recursively made) as well. For example, after watching calls to KiUserCallbackDispatcher for a time, we’ll probably see something akin to the following:

Breakpoint 1 hit
ntdll!KiUserCallbackDispatch:
00000000`77691ff7 488b4c2420  mov rcx,qword ptr [rsp+20h]
0:000> k
RetAddr           Call Site
00000000`7758b45a ntdll!KiUserCallbackDispatch
00000000`7758b4a4 USER32!NtUserMessageCall+0xa
00000000`7758e55a USER32!RealDefWindowProcWorker+0xb1
000007fe`fca62118 USER32!RealDefWindowProcW+0×5a
000007fe`fca61fa1 uxtheme!_ThemeDefWindowProc+0×298
00000000`7758b992 uxtheme!ThemeDefWindowProcW+0×11
00000000`ff7e69ef USER32!DefWindowProcW+0xe6
00000000`7758e25a notepad!NPWndProc+0×217
00000000`7758cbaf USER32!UserCallWinProcCheckWow+0×1ad
00000000`77584e1c USER32!DispatchClientMessage+0xc3
00000000`77692016 USER32!_fnINOUTNCCALCSIZE+0×3c
00000000`775851ca ntdll!KiUserCallbackDispatcherContinue
00000000`7758514a USER32!ZwUserCreateWindowEx+0xa
00000000`775853f4 USER32!VerNtUserCreateWindowEx+0×27c
00000000`77585550 USER32!CreateWindowEx+0×3fe
00000000`ff7e9525 USER32!CreateWindowExW+0×70
00000000`ff7e6e12 notepad!NPInit+0×1f9
00000000`ff7ecf8b notepad!WinMain+0xbe
00000000`7746cdcd notepad!IsTextUTF8+0×24f
00000000`7768c6e1 kernel32!BaseThreadInitThunk+0xd

This support for recursive callbacks is a large factor in why threads that talk to win32k.sys often have so-called “large kernel stacks”. The kernel mode dispatcher for user mode calls will attempt to convert the thread to a large kernel stack when a call is made, as the typical sized kernel stack is not large enough to support the number of recursive kernel mode to user mode calls present in a many complicated window messaging calls.

If the process is a Wow64 process, then the callback array in the PEB is prepointed to an array of conversion functions inside the Wow64 layer, which map the callback argument to a version compatible with the 32-bit user32.dll, as appropriate.

Next up: Taking a look at LdrInitializeThunk, where all user mode threads really begin their execution.

A catalog of NTDLL kernel mode to user mode callbacks, part 4: KiRaiseUserExceptionDispatcher

November 20th, 2007

The previous post in this series outlined how KiUserApcDispatcher operates for the purposes of enabling user mode APCs. Unlike KiUserExceptionDispatcher, which is expected to modify the return information from the context of an interrupt (or exception) in kernel mode, KiUserApcDispatcher is intended to operate on the return context of an active system call.

In this vein, the third kernel mode to user mode NTDLL “callback” that we shall be investigating, KiRaiseUserExceptionDispatcher, is fairly similar. Although in some respects akin to KiUserExceptionDispatcher, at least relating to raising an exception in user mode on the behalf of a kernel mode caller, the KiRaiseUserExceptionDispatcher “callback” really has more in common with KiUserApcDispatcher from a usage and implementation standpoint. It also so happens that the implementation of this callback routine, which is quite simple, is completely representable in C, as a standard calling convention is used.

KiRaiseUserExceptionDispatcher is used if a system call wishes to raise an exception in user mode instead of simply return an NTSTATUS, as is the standard convention. It simply constructs a standard exception record using a supplied status code (which must be written to a well-known location in the current thread’s TEB beforehand), and passes the exception to RtlRaiseException (the same routine that is used internally by the Win32 RaiseException API).

This is a fairly atypical scenario, as most errors (including bad pointer parameters) will simply result in an appropriate status code being returned by the system call, such as STATUS_ACCESS_VIOLATION.

The one place the does currently use the services of KiRaiseUserExceptionDispatcher is NtClose, the system service responsible for closing handles (which implements the Win32 CloseHandle API). When a debugger is attached to a process and a protected handle (as set by a call to SetHandleInformation with HANDLE_FLAG_PROTECT_FROM_CLOSE) is passed to NtClose, then a STATUS_HANDLE_NOT_CLOSABLE exception is raised to user mode via KiRaiseUserExceptionDispatcher. For example, one might see the following when trying to close a protected handle with a debugger attached:

(2600.1994): Unknown exception - code c0000235 (first chance)
(2600.1994): Unknown exception - code c0000235 (second chance)
ntdll!KiRaiseUserExceptionDispatcher+0x3a:
00000000`776920e7 8b8424c0000000  mov eax,dword ptr [rsp+0C0h]
0:000> !error c0000235
Error code: (NTSTATUS) 0xc0000235 (3221226037) - NtClose was
called on a handle that was protected from close via
NtSetInformationObject.
0:000> k
RetAddr           Call Site
00000000`7746dadd ntdll!KiRaiseUserExceptionDispatcher+0×3a
00000000`01001955 kernel32!CloseHandle+0×29
00000000`01001e60 TestApp!wmain+0×35
00000000`7746cdcd TestApp!__tmainCRTStartup+0×120
00000000`7768c6e1 kernel32!BaseThreadInitThunk+0xd
00000000`00000000 ntdll!RtlUserThreadStart+0×1d

The exception can be manually continued in a debugger to allow the program to operate after this occurs, though such a bad handle reference is typically indicative of a serious bug.

A similar behavior is activated if a program is being debugged under Application Verifier and an invalid handle is closed. This behavior of raising an exception on a bad handle closure attempt is intended as a debugging aid, since most programs do not bother to check the return value of CloseHandle.

Incidentally, bad kernel handle closure references like this in kernel mode will result in a bugcheck instead of simply raising an exception that can be caught. Drivers do not have the “luxury” of continuing when they touch a bad handle, except when probing a user handle, of course. (Old versions of Process Explorer used to bring down the box with a bugcheck due to this, for instance, if you tried to close a protected handle. This bug has fortunately since been fixed.)

Next time, we’ll take a look at KiUserCallbackDispatcher, which is used to a great degree by win32k.sys.

A catalog of NTDLL kernel mode to user mode callbacks, part 3: KiUserApcDispatcher

November 19th, 2007

I previously described the behavior of the kernel mode to user mode exception dispatcher (KiUserExceptionDispatcher). While exceptions are arguably the most commonly seen of the kernel mode to user mode “callbacks” in NTDLL, they are not the only such event.

Much like exceptions that traverse into user mode, user mode APCs are all (with the exception of initial thread startup) funneled through a single dispatcher routine inside NTDLL, KiUserApcDispatcher. KiUserApcDispatcher is responsible for invoking the actual APC routine, re-testing for APC delivery (to allow for the entire APC queue to be drained at once), and returning to the return point of the alertable system call that was interrupted to deliver the user mode APC. From the perspective of the user mode APC dispatcher, the stack is logically arranged as if the APC dispatcher would “return” to the instruction immediately following the syscall instruction in the system call that began the alertable wait.

For example, if we breakpoint on KiUserApcDispatcher and examine the stack (in this case, after an alertable WaitForSingleObjectEx call that has been interrupted to deliver a user mode APC), we might see the following:

Breakpoint 0 hit
ntdll!KiUserApcDispatcher:
00000000`77691f40 488b0c24  mov     rcx,qword ptr [rsp]
0:000> k
RetAddr           Call Site
00000000`776902ba ntdll!KiUserApcDispatcher
00000000`7746d820 ntdll!NtWaitForSingleObject+0xa
00000000`01001994 kernel32!WaitForSingleObjectEx+0×9c
00000000`01001eb0 TestApp!wmain+0×64
00000000`7746cdcd TestApp!__tmainCRTStartup+0×120
00000000`7768c6e1 kernel32!BaseThreadInitThunk+0xd
00000000`00000000 ntdll!RtlUserThreadStart+0×1d

From an implementation standpoint, KiUserApcDispatcher is conceptually fairly simple. I’ve posted a translation of the assembler for those interested. Keep in mind, however, that as with the other kernel mode to user mode callbacks, the routine is actually written in assembler and utilizes constructs not expressible through C, such as a custom calling convention.

Note that the APC routine invoked by KiUserApcDispatcher corresponds roughly to a standard native APC routine, specifically a PKNORMAL_ROUTINE (except on x64, but more on that later). This is not compatible with the Win32 view of an APC routine, which takes a single parameter, as opposed to be three that a KNORMAL_ROUTINE takes. As a result, there is a kernel32 function, BaseDispatchAPC that wrappers all Win32 APCs, providing an SEH frame around the call (and activating the appropriate activation context, if necessary). BaseDispatchAPC also converts from the native APC calling convention into the Win32 APC calling convention.

For Win32 I/O completion routines, a similar wrapper routine (BasepIoCompletion) serves the purpose of converting from the standard NT APC calling convention to the Win32 I/O completion callback calling convention (which primarily includes unpackaging any I/O completion parameters from the IO_STATUS_BLOCK).

With Windows x64, the behavior of KiUserApcDispatcher changes slightly. Specifically, the APC routine invoked has four parameters instead of the standard set of three parameters for a PKNORMAL_ROUTINE. This is still compatible with standard NT APC routines due to a quirk of the x64 calling convention, whereby the first four arguments are passed by register. (This means that internally, any routines with zero through four arguments that fit within the confines of a standard pointer-sized argument slot are “compatible”, at least from a calling convention perspective.)

The fourth parameter added by KiUserApcDispatcher in the x64 case is a pointer to the context record that is to be resumed when the APC dispatching process is finished. This additional argument is used by Wow64.dll if the process is a Wow64 process, as Wow64.dll wrappers all 32-bit APC routines around a thunk routine (Wow64ApcRoutine). Wow64ApcRoutine internally uses this extra PCONTEXT argument to take control of resuming execution after the real APC routine is invoked. Thus in the 64-bit NTDLL Wow64 case, the NtContinue call following the call to the user APC routine never occurs.

Next time, we’ll take a look at the other kernel mode to user mode exception “callback”, KiRaiseUserExceptionDispatcher.

A catalog of NTDLL kernel mode to user mode callbacks, part 2: KiUserExceptionDispatcher

November 16th, 2007

Yesterday, I listed the set of kernel mode to user mode callback entrypoints (as of Windows Server 2008). Although some of the callbacks share certain similarities in their modes of operation, there remain significant differences between each of them, in terms of both calling convention and what functionality they perform.

KiUserExceptionDispatcher is the routine responsible for calling the user mode portion of the SEH dispatcher. When an exception occurs, and it is an exception that would generate an SEH event, the kernel checks to see whether the exception occurred while running user mode code. If so, then the kernel alters the trap frame on the stack, such that when the kernel returns from the interrupt or exception, execution resumes at KiUserExceptionDispatcher instead of the instruction that raised the fault. The kernel also arranges for several parameters (a PCONTEXT and a PEXCEPTION_RECORD) that describe the state of the machine when the exception occurred to be passed to KiUserExceptionDispatcher upon the return to user mode. (This model of changing the return address for a return from kernel mode to user mode is a common idiom in the Windows kernel for several user mode event notification mechanisms.)

Once the kernel mode stack unwinds and control is transferred to KiUserExceptionDispatcher in user mode, the exception is processed locally via a call to RtlDispatchException, which is the core of the user mode exception dispatcher logic. If the exception was successfully dispatched (that is, an exception handler handled it), the final user mode context is realized with a call to RtlRestoreContext, which simply loads the registers in the given context into the processor’s architectural execution state.

Otherwise, the exception is “re-thrown” to kernel mode for last chance processing via a call to NtRaiseException. This gives the user mode debugger (if any) a final shot at handling the exception, before the kernel terminates the process. (The kernel internally provides the user mode debugger and kernel debugger a first chance shot at such exceptions before arranging for KiUserExceptionDispatcher to be run.)

I have posted a pseudo-C representation of KiUserExceptionDispatcher for the curious. Note that it uses a specialized custom calling convention, and by virtue of this, is actually an assembler function. However, a C representation is often easier to understand.

Not all user mode exceptions originate from kernel mode in this fashion; in many cases (such as with the RaiseException API), the exception dispatching process is originated entirely from user mode and KiUserExceptionDispatcher is not involved.

Incidentally, if one has been following along with some of the recent postings, the reason why the invalid parameter reporting mechanism of the Visual Studio 2005 CRT doesn’t properly break into into an already attached debugger should start to become clear now, given some additional knowledge of how the exception dispatching process works.

Because the VS2005 CRT simulates an exception by building a context and exception record, and passing these to UnhandledExceptionFilter, the normal exception dispatcher logic is not involved. This means that nobody makes a call to the NtRaiseException system service (as would normally be the case if UnhandledExceptionFilter were called as a part of normal exception dispatching), and thus there is no notification sent to the user mode debugger asking it to pre-process the simulated STATUS_INVALID_PARAMETER exception.

Update: Posted representation of KiUserExceptionDispatcher.

Next up: Taking a look at the user mode APC dispatcher (KiUserApcDispatcher).

A catalog of NTDLL kernel mode to user mode callbacks, part 1: Overview

November 15th, 2007

As I previously mentioned, NTDLL maintains a set of special entrypoints that are used by the kernel to invoke certain functionality on the behalf of user mode.

In general, the functionality offered by these entrypoints is fairly simple, although having an understanding of how each are used provides useful insight into how certain features (such as user mode APCs) really work “under the hood”.

For the purposes of this discussion, the following are the NTDLL exported entrypoints that the kernel uses to communicate to user mode:

  1. KiUserExceptionDispatcher
  2. KiUserApcDispatcher
  3. KiRaiseUserExceptionDispatcher
  4. KiUserCallbackDispatcher
  5. LdrInitializeThunk
  6. RtlUserThreadStart
  7. EtwpNotificationThread
  8. LdrHotPatchRoutine

(There are other NTDLL exports used by the kernel, but not for direct user mode communication.)

These routines are generally used to inform user mode of a particular event occurring, though the specifics of how each routine is called vary somewhat.

KiUserExceptionDispatcher, KiUserApcDispatcher, and KiRaiseUserExceptionDispatcher are exclusively used when user mode has entered kernel mode, either implicitly, due to a processor interrupt (say, a page fault that will eventually trigger an access violation), or explicitly, due to a system call (such as NtWaitForSingleObject). The mechanism that the kernel uses to invoke these entrypoints is to alter the context that will be realized upon return from kernel mode to user mode. The return to user mode context information (a KTRAP_FRAME) is modified such that when kernel mode returns, instead of returning to the point upon which user mode invoked a kernel mode transition, control is transferred to one of the three dispatcher routines. Additional arguments are supplied to these dispatcher routines as necessary.

KiUserCallbackDispatcher is used to explicitly call out to user mode from kernel mode. This inverted mode of operation is typically discouraged in favor of models such as the pending IRP “inverted call model”. For historical design reasons, however, the Win32 subsystem (win32k.sys) uses this for a number of tasks (such as calling a user mode window procedure in response to a kernel mode window message operation). The user mode callout mechanism is not extensible to support arbitrary user mode callback destinations.

LdrInitializeThunk is the first instruction that any user mode thread runs, before the “actual” thread entrypoint. As such, it is the address at which every user mode thread system-wide begins execution.

RtlUserThreadStart is used on Windows Vista and Windows Server 2008 (and later OS’s) to form the initial entrypoint context for a thread started with NtCreateThreadEx (this is a marked departure from the approach taken by NtCreateThread, wherein the user mode caller supplies the initial thread context).

EtwpNotificationThread and LdrHotPatchRoutine both correspond to a standard thread entrypoint routine. These entrypoints are referenced by user mode threads that are created in the context of a particular process to carry out certain tasks on behalf of the kernel. As the latter two routines are generally only rarely encountered, this series does not describe them in detail.

Despite (or perhaps in spite of) the more or less completely unrealized promise of less reboots for hotfixes with Windows Server 2003, I think I’ve seen a grand total of one or two hotfixes in the entire lifetime of the OS that supported hotpatching. Knowing the effort that must have gone into developing hotpatching support, it is depressing to see security bulletin after security bulletin state “No, this hotfix does not support hotpatching. A reboot will be required.”. That is, however, a topic for another day.

Next up: Examining the operation of KiUserExceptionDispatcher in more detail.