Archive for the ‘Windows’ Category

Handy debugger tricks: Setting osloader options on a per-boot basis

Friday, July 30th, 2010

Sometimes, you’ll find yourself wishing that you could edit the boot options for a particular Windows instance just for a single boot (perhaps to enable debugging with non-default parameters, above and beyond what the F8 menu allows).

While the F8 boot menu has a lot of options, it’s sometimes the case that you need greater flexibility than what it provides (say, to try and enable USB2 debugging on the fly for a single boot — perhaps you need to use the debugger to rescue a system that won’t boot after a change you’ve made, for instance).

It turns out that there’s actually a (perhaps little-known) way to do this in Windows Vista and later; at boot time (whenever you could enter the F8 menu), you can strike F10 and find yourself at a prompt that allows you to directly edit the osloader options for the current boot. (Remember that the settings you enter here aren’t persisted across reboots.)

From here, you can enable debugging or change any other osloader setting which you could permanently configure through bcdedit (but only for this boot). This capability is especially helpful if you need to debug setup with non-standard debugger connection settings, which otherwise presents a painful problem.

Remember that osloader options are the old-style options that you used to set in boot.ini (the debugger documentation and MSDN outline these). Don’t use the new-style bcdedit names, as those are only recognized by bcdedit; internally, the options passed on the osloader command line continue to be their old forms (i.e. /DEBUG /DEBUGPORT=1394 /CHANNEL=0).

Debugger tricks: Find all probable CONTEXT records in a crash dump

Monday, March 30th, 2009

If you’ve debugged crash dumps for a while, then you’ve probably run into a situation where the initial dump context provided by the debugger corresponds to a secondary exception that happened while processing an initial exception that’s likely closer to the original underlying problem in the issue you’re investigating.

This can be annoying, as the “.ecxr” command will point you at the site of the secondary failure exception, and not the original exception context itself. However, in most cases the original, primary exception context is still there on the stack; one just needs to know how to find it.

There are a couple of ways to go about this:

  • For hardware generated exceptions (such as access violations), one can look for ntdll!KiUserExceptionDispatcher on the stack, which takes a PCONTEXT and a PEXCEPTION_RECORD as arguments.
  • For software-generated exceptions (such as C++ exceptions), things get a bit dicier. One can look for ntdll!RtlDispatchException being called on the stack, and from there, grab the PCONTEXT parameter.

This can be a bit tedious if stack unwinding fails, or you’re dealing with one of those dumps where exceptions occur on multiple threads at the same time, typically due to crash dump writing going haywire (I’m looking at you, Outlook…). It would be nice if the debugger could automate this process a little bit.

Fortunately, it’s actually not hard to do this with a little bit of a brute-force approach. Specifically, just a plain old “dumb” memory scan for something common to almost all CONTEXT records. It’s not exactly a finesse approach, but it’s usually a lot faster than manually groveling through the stack, especially if multiple threads or multiple nested exceptions are involved. While there may be false positives, it’s usually immediately obvious which matches plausibly belong to a live exception and which do not. Sometimes, quick-and-dirty brute-force solutions really do end up doing the trick.

In order to find CONTEXT records based on a memory search, though, we need some common data points that are typically the same for all CONTEXT structures, and, preferably, contiguous (for ease of use with the “s” command, the debugger’s memory search support). Fortunately, it turns out that this exists in the form of the segment registers of a CONTEXT structure:

0:000> dt ntdll!_CONTEXT
+0x000 ContextFlags : Uint4B
[…]

+0x08c SegGs : Uint4B
+0x090 SegFs : Uint4B
+0x094 SegEs : Uint4B
+0x098 SegDs : Uint4B

[…]

Now, it turns out that all threads in a given process will almost always have the same segment selector values, excluding exotic and highly out of the ordinary cases like VDM processes. (The same goes for the segment selector values on x64 as well.) Four non-zero 32-bit values (actually, 16-bit values with zero padding to 32 bits) are enough to be able to reasonably pull a search off without being buried in false positives. Here’s how to do it with the infamous WinDbg debugger script (also applicable to other DbgEng-enabled programs, such as kd):

.foreach ( CxrPtr { s -[w1]d 0 l?ffffffff @gs @fs @es @ds } ) { .cxr CxrPtr - 8c }

This is a bit of a long-winded command, so let’s break it down into the individual components. First, we have a “.foreach” construct, which according to the debugger documentation, follows this convention:

.foreach [Options] ( Variable { InCommands } ) { OutCommands }

The .foreach command (actually one of the more versatile debugger-scripting commands, once one gets used to using it) basically takes a series of input strings generated by an input command (InCommands) and invokes a command to process that output (OutCommands), with the results of the input command being substituted in as a macro specified by the Variable argument. It’s ugly and operates based on text parsing (and there’s support for skipping every X inputs, among other things; see the debugger documentation), but it gets the job done.

The next part of this operation is the s command, which instructs the debugger to search for a pattern in-memory in the target. The arguments supplied here instruct the debugger to search only writable memory (w), output only the address of each match (1), and scan for DWORD (4-byte) sized quantities (d) in the lower 4GB of the address space (0 l?ffffffff); in this case, we’re assuming that the target is a 32-bit process (which might be hosted on Wow64, hence 4GB instead of 3GB used). The remainder of the command specifies the search pattern to look for: the segment register values of the current thread. The “s” command sports a plethora of other options (with a rather unwieldy and convoluted syntax, unfortunately); the debugger documentation runs through the gamut of the other capabilities.

The final part of this command string is the output command, which simply directs the debugger to set the current context record to the address held in the replacement macro, minus 0x8c. (If one recalls, 0x8c is the offset from the start of a struct _CONTEXT to the SegGs member, which is the first value we search for; as a result, the addresses returned by the “s” command will be the address of the SegGs member, and subtracting 0x8c yields the start of the containing structure.) Remember that we restricted the output of the “s” command to just being the address itself, which lets us easily pass that on to a different command (which might give one the idea that the “s” and “.foreach” commands were made to work together).

Putting the command string together, it directs the debugger to search for a sequence of four 32-bit values (the gs, fs, es, and ds segment selector values for the current thread) in contiguous memory, and display the containing CONTEXT structure for each match.
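
To make the mechanics of the search concrete, here is a minimal Python sketch of the same idea, run against a plain byte buffer rather than a live target. The function name find_context_candidates, the selector values, and the synthetic "stack page" are all assumptions made up for illustration; only the 0x8c offset of SegGs comes from the dt output above. Restricting matches to DWORD-aligned addresses is a simplification on my part, not a statement about how the debugger steps through memory.

```python
import struct

SEG_GS_OFFSET = 0x8C  # offset of SegGs within the x86 _CONTEXT (from the dt output)


def find_context_candidates(memory, base, gs, fs, es, ds):
    """Return candidate CONTEXT addresses found in a raw memory image.

    memory is a bytes object and base is the address of its first byte.
    Only DWORD-aligned matches are kept (a simplification in this sketch).
    """
    pattern = struct.pack("<4I", gs, fs, es, ds)
    hits = []
    pos = memory.find(pattern)
    while pos != -1:
        if (base + pos) % 4 == 0:
            # The match is the address of SegGs; back up to the structure start.
            hits.append(base + pos - SEG_GS_OFFSET)
        pos = memory.find(pattern, pos + 1)
    return hits


# Hypothetical selector values and a synthetic 4 KB page of memory.
GS, FS, ES, DS = 0x2B, 0x53, 0x2B, 0x2B
page = bytearray(0x1000)
fake_context = 0x200  # offset of a pretend CONTEXT record inside the page
struct.pack_into("<4I", page, fake_context + SEG_GS_OFFSET, GS, FS, ES, DS)

candidates = find_context_candidates(bytes(page), 0x0012E000, GS, FS, ES, DS)
print([hex(address) for address in candidates])
```

Each address in the result plays the role of the value handed to “.cxr” in the debugger command.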

You may find some other CONTEXT records aside from exception-related ones while executing this command (in particular, initial thread contexts are common), but the ones related to a fault are usually pretty obvious and self-evident. Of course, this method isn’t foolproof, but it lets the debugger do some of the hard work for you (which beats manually groveling in corrupted stacks across multiple threads just to pull out some CONTEXT records).

Naturally, there are a number of other uses for both the “.foreach” and “s” commands; don’t be afraid to experiment with them. There are other helpers for automating certain tasks (!for_each_frame, !for_each_local, !for_each_module, !for_each_process, and !for_each_thread, to name a few) too, aside from the general “.foreach”. The debugger scripting support might not be the prettiest to look at, but it can be quite handy at speeding up common, repetitive tasks.

One parting tip with “.foreach” (well, actually two parting tips): The variable replacement macro only works if you separate it from other symbols with a space. This can be a problem in some cases (where you need to perform some arithmetic on the resulting expanded macro in particular, such as subtracting 0x8c in this case), however, as the spaces remain when the macro symbol is expanded. Some commands, such as “dt”, don’t follow the standard expression parsing rules (much to my eternal irritation), and choke if they’re given arguments with spaces.

All is not lost for these commands, however; one way to work around this issue is to store the macro replacement into a pseudo-register (say, “r @$t0 = ReplacementVariableMacro - 0x8c”) and use that pseudo-register in the actual output command, as you can issue multiple, semicolon-delimited commands in the output commands section.

Examining kernel stacks on Vista/Srv08 using kdbgctrl -td, even when you haven’t booted /DEBUG

Friday, March 13th, 2009

Starting with Vista/Srv08, local kernel debugging support has been locked down to require that the system was booted /DEBUG before-hand.

For me, this has been the source of great annoyance, as if something strange happens to a Vista/Srv08 box that requires peering into kernel mode (usually to get a stack trace), then one tends to hit a brick wall if the box wasn’t booted with /DEBUG. (The only supported option usually available would be to manually bugcheck the box and hope that the crash dump, if the system was configured to write one, has useful data. This is, of course, not a particularly great option; especially as in the default configuration for dumps on Vista/Srv08, it’s likely that user mode memory won’t be captured.)

However, it turns out that there’s limited support in the operating system to capture some kernel mode data for debugging purposes, even if you haven’t booted /DEBUG. Specifically, kdbgctrl.exe (a tool shipping with the WinDbg distribution) supports the notion of capturing a “kernel triage dump”, which basically takes a miniature snapshot of a given process and its active user mode threads. To use this feature, you need to have SeDebugPrivilege (i.e. you need to be an administrative-equivalently-privileged user).

Unlike conventional local KD support, triage dump writing doesn’t give unrestricted access to kernel mode memory on a live system. Instead, it instructs the kernel to create a small dump file that contains a limited, pre-set amount of information (mostly, just the process and associated threads, and if possible, the stacks of said threads). As a result, you can’t use it for general kernel memory examination. However, it’s sometimes enough to do the trick if you just need to capture what the kernel mode side stack trace of a particular thread is.

To use this feature, you can invoke “kdbgctrl.exe -td pid dump-filename”, which takes a snapshot of the process identified by a pid, and writes it out to a dump file named by dump-filename. This support is very tersely documented if you invoke kdbgctrl.exe on the command line with no arguments:

kdbgctrl
Usage: kdbgctrl <options>
Options:
[...]
-td <pid> <file> - Get a kernel triage dump

Now, the kernel’s support for writing out a triage dump isn’t by any means directly comparable to the power afforded by kd.exe -kl. As previously mentioned, the dump is extremely minimalistic, only containing information about a process object and its associated threads and the bare minimums allowing the debugger to function enough to pull this information from the dump. Just about all that you can reasonably expect to do with it is examine the stacks of the process that was captured, via the “!process -1 1f” command. (In addition, some extra information, such as the set of pending APCs attached to each thread, is also saved – enough to allow “!apc thread” to function.) Sometimes, however, it’s just enough to be able to figure out what something is blocking on kernel mode side (especially when troubleshooting deadlocks), and the triage dump functionality can aid with that.

However, there are some limitations with triage dumps, even above the relatively terse amount of data they convey. The triage dump support needs to be careful not to crash the system when writing out the dump, so all of this information is captured within the normal confines of safe kernel mode operation. In particular, this means that the triage dump writing logic doesn’t just blithely walk the thread or process lists like kd -kl would, or perform other potentially unsafe operations.

Instead, the triage dump creation process operates with the cooperation of all threads involved. This is immediately apparent when taking a look at the thread stacks captured in a triage dump:

BugCheck 69696969, {0, 0, 0, 0}

Probably caused by : ntkrnlmp.exe ( nt!IopKernSnapAPCMiniDump+55 )

Followup: MachineOwner
---------

2: kd> k
Call Site
nt!IopKernSnapAPCMiniDump+0x55
nt!IopKernSnapSpecialApc+0x50
nt!KiDeliverApc+0x1e2
nt!KiSwapThread+0x491
nt!KeWaitForSingleObject+0x2da
nt!KiSuspendThread+0x29
nt!KiDeliverApc+0x420
nt!KiSwapThread+0x491
nt!KeWaitForMultipleObjects+0x2d6
nt!ObpWaitForMultipleObjects+0x26e
nt!NtWaitForMultipleObjects+0xe2
nt!KiSystemServiceCopyEnd+0x13
0x7741602a

(Note that the system doesn’t actually bugcheck as a result of creating a triage dump. Kernel dumps implicitly need a bug check parameters record, however, so one is synthesized for the dump file.)

As is immediately obvious from the call stack, thread stack collection in triage dumps operates (in Vista/Srv08) by tripping a kernel APC on the target thread, which then (as it’s in-thread-context) safely collects data about that thread’s stack.

This, of course, presents a limitation: If a thread is sufficiently wedged such that it can’t run kernel APCs, then a triage dump won’t be able to capture full information about that thread. However, for some classes of scenarios where simply capturing a kernel mode stack is sufficient, triage dumps can sometimes fill in the gap in conjunction with user mode debugging, even if the system wasn’t booted with /DEBUG.

Recovering a process from a hung debugger

Saturday, February 21st, 2009

One of the more annoying things that can happen while debugging processes that deal with network traffic is happening to attach to something that is in the “critical path” for accessing the debugger’s active symbol path.

In such a scenario, the debugger will usually deadlock itself trying to request symbols, which requires going through some code path that involves the debuggee, which being frozen by the debugger, never gets to run. Usually, one would think that this means a lost repro (and, depending on the criticality of the process attached to the debugger, possibly a forced reboot as well), neither of which happen to be particularly fun outcomes.

It turns out that you’re not actually necessarily hosed if this happens, though. If you can still start a new process on the computer, then there’s actually a way to steal the process back from the debugger (on Windows XP and later), with the new-fangled fancy kernel debug object-based debugging support. Here’s what you need to do:

  1. Attach a new debugger to the process, with symbol support disabled. (You will be attaching to the debuggee and not the debugger.)

    Normally, you can’t attach a debugger to a process while it’s already being debugged. However, there’s an option in windbg (and ntsd/cdb as well) that allows you to do this: the “-pe” debugger command-line parameter (documented in the debugger documentation), which forcibly attaches to the target process despite the presence of the hung debugger.

    Of course, force-attaching to the process won’t do any good if the new debugger process will just deadlock right away. As a result, you should make sure that the debugger won’t try and do any symbol-loading activity that might engender a deadlock. This is the command line that I usually use for that purpose, which disables _NT_SYMBOL_PATH appending (“-sins”), disables CodeView pdb pointer following (“-sicv”), and resets the symbol path to a known good value (“-y .”):

    ntsd -sicv -sins -y . -pe -p hung_debuggee_pid

    I recommend using ntsd and not WinDbg for this purpose in order to reduce the chance that a symbol path stored in a WinDbg workspace causes the debugger to deadlock itself again.

  2. Kill the hung debugger.

    After successfully attaching with “-pe”, you can safely kill the hung debugger (by whatever means necessary) without causing the former debuggee to get terminated along with it.

  3. Resume all threads in the target.

    The suspend count of most threads in the target is likely to be wrong. You can correct this by issuing the “~*M” command set several times (which resumes all threads in the process with the “~M” command).

    To determine the suspend count of all threads in the process, you can use the “~” command. For example, you might see the following:

    0:001> ~
       0  Id: 18c4.16f0 Suspend: 2 Teb: 7efdd000 Unfrozen
    .  1  Id: 18c4.1a04 Suspend: 2 Teb: 7efda000 Unfrozen
    

    You should issue the “~*M” command enough times to bring the suspend count of all threads down to zero. (Don’t worry if you need to resume a thread more times than it is suspended.) Typically, this would be two times, for the common case, but by checking the suspend count of active threads, you can be certain of the number of times that you need to resume all threads in the process.

  4. Detach the new debugger from the debuggee.

    After resuming all threads in the target, use the “qd” command to detach the debugger. Do not attempt to resume the debuggee with the “g” command (as it will stay suspended), or quit the debugger without detaching (as that would cause the debuggee to get terminated).

    If you needed to keep a particular thread in the debuggee suspended so that you can re-attach the debugger without losing your place, you can leave that thread with its suspend count above zero.

Voila, the debuggee should return back to life. Now, you should be able to re-attach a debugger (hopefully, with a safe symbol path this time), or not, as desired.
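
For step 3, the number of resume passes needed is simply the largest suspend count that “~” reports. As a rough illustration, here is a small Python sketch that pulls that number out of pasted “~” output. The helper name resumes_needed and the regular expression are my own assumptions about the line layout, based only on the sample output shown above.

```python
import re

# Matches lines such as "   0  Id: 18c4.16f0 Suspend: 2 Teb: 7efdd000 Unfrozen",
# optionally prefixed by the "." or "#" thread marker.
THREAD_LINE = re.compile(r"^\s*[.#]?\s*(\d+)\s+Id:\s+\S+\s+Suspend:\s+(\d+)")


def resumes_needed(tilde_output):
    """Return the highest suspend count seen in the output of "~"."""
    counts = []
    for line in tilde_output.splitlines():
        match = THREAD_LINE.match(line)
        if match:
            counts.append(int(match.group(2)))
    return max(counts, default=0)


sample = """\
   0  Id: 18c4.16f0 Suspend: 2 Teb: 7efdd000 Unfrozen
.  1  Id: 18c4.1a04 Suspend: 2 Teb: 7efda000 Unfrozen
"""
print(resumes_needed(sample))
```

If you intend to leave one thread suspended (as described in step 4), exclude its line before computing the maximum.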

Hotpatching MS08-067

Friday, October 24th, 2008

If you have been watching the Microsoft security bulletins lately, then you’ve likely noticed yesterday’s bulletin, MS08-067. This is a particularly nasty bug, as it doesn’t require authentication to exploit in the default configuration for Windows Server 2003 and earlier systems (assuming that an attacker can talk over port 139 or port 445 to your box).

The usual mitigation for this particular vulnerability is to block off TCP/139 (NetBIOS) and TCP/445 (Direct hosted SMB), thus cutting off remote access to the srvsvc pipe, a prerequisite for exploiting the vulnerability in question. In my case, however, I had a box that I really didn’t want to reboot immediately. In addition, for the box in question, I did not wish to leave SMB blocked off remotely.

Given that I didn’t want to assume that there’d be no possible way for an untrusted user to be able to establish a TCP/139 or TCP/445 connection, this left me with limited options; either I could simply hope that the box wouldn’t get compromised before it was convenient to reboot, or I could see if I could come up with some form of alternative mitigation on my own. After all, a debugger is the software development equivalent of a Swiss Army knife and duct tape; I figured that it would be worth a try seeing if I could cobble together some sort of mitigation by manually patching the vulnerable netapi32.dll. To do this, however, it would be necessary to gain enough information about the flaw in question in order to discern what the fix was, in the hope of creating some form of alternative countermeasure for the vulnerability.

The first stop for gaining more information about the bug in question would be the Microsoft advisory. As usual, however, the bulletin released for the MS08-067 issue lacked the technical detail needed to understand the flaw down to the level of which functions were patched, aside from the fact that the vulnerability resided somewhere in netapi32.dll (the Microsoft rationale behind this policy is that providing that level of technical detail would simply aid the creation of exploits). However, as Pusscat presented at Blue Hat Fall ’07, reverse engineering most present-day Microsoft security patches is not particularly insurmountable.

The usual approach to the patch reverse engineering process is to use a program called bindiff (an IDA plugin) that analyzes two binaries in order to discover the differences between the two. In my case, however, I didn’t have a copy of bindiff handy (it’s fairly pricey). Fortunately (or unfortunately, depending on your persuasion), there already existed a public exploit for this bug, as well as some limited public information from persons who had already reverse engineered the patch to a degree. To this end, I had a particular function in the affected module (netapi32!NetpwPathCanonicalize) which I knew was related to the vulnerability in some form.

At this point, I brought up a copy of the unpatched netapi32.dll in IDA, as well as a patched copy of netapi32.dll, then started looking through and comparing disassembly one page at a time until an interesting difference popped up in a subfunction of netapi32!NetpwPathCanonicalize:

Unpatched code:

.text:000007FF7737AF90 movzx   eax, word ptr [rcx]
.text:000007FF7737AF93 xor     r10d, r10d
.text:000007FF7737AF96 xor     r9d, r9d
.text:000007FF7737AF99 cmp     ax, 5Ch
.text:000007FF7737AF9D mov     r8, rcx
.text:000007FF7737AFA0 jz      loc_7FF7737515E

Patched code:

.text:000007FF7737AFA0 mov     r8, rcx
.text:000007FF7737AFA3 xor     eax, eax
.text:000007FF7737AFA5 mov     [rsp+arg_10], rbx
.text:000007FF7737AFAA mov     [rsp+arg_18], rdi
.text:000007FF7737AFAF jmp     loc_7FF7738E5D6

.text:000007FF7738E5D6 mov     rcx, 0FFFFFFFFFFFFFFFFh
.text:000007FF7738E5E0 mov     rdi, r8
.text:000007FF7738E5E3 repne scasw
.text:000007FF7738E5E6 movzx   eax, word ptr [r8]
.text:000007FF7738E5EA xor     r11d, r11d
.text:000007FF7738E5ED not     rcx
.text:000007FF7738E5F0 xor     r10d, r10d
.text:000007FF7738E5F3 dec     rcx
.text:000007FF7738E5F6 cmp     ax, 5Ch
.text:000007FF7738E5FA lea     rbx, [r8+rcx*2+2]
.text:000007FF7738E5FF jnz     loc_7FF7737AFB4

Now, without even really understanding what’s going on here in the function as a whole, it’s pretty obvious that here’s where (at least one) modification is being made; the new code involves the addition of an inline wcslen call. Typically, security fixes for buffer overrun conditions involve the creation of previously missing boundary checks, so a new call to a string-length function such as wcslen is a fairly reliable indicator that one’s found the site of the fix for the vulnerability in question.

(The repne scasw instruction scans a memory region two bytes at a time until a particular value (in rax) is reached, or the maximum count (in rcx, typically initialized to (size_t)-1) is exhausted. Since we’re scanning two bytes at a time, and we’ve initialized rax to zero, we’re looking for an 0x0000 value in a string of two-byte quantities; in other words, an array of WCHARs (or a null terminated Unicode string). The resultant value in rcx after executing the repne scasw can be used to derive the length of the string, as it will have been decremented based on the number of WCHARs encountered before the 0x0000 WCHAR.)
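
To see why the not/dec pair that follows the scan yields the string length, here is a small Python model of the register arithmetic in the patched code. It is only a sketch: the function name inline_wcslen is mine, the sample path is hypothetical, and the model ignores the case where rcx runs out before a terminator is found.

```python
MASK64 = (1 << 64) - 1


def inline_wcslen(words):
    """Mimic the xor eax,eax / repne scasw / not rcx / dec rcx sequence.

    words is a sequence of 16-bit values that contains a 0x0000 terminator.
    """
    ax = 0                # xor eax, eax
    rcx = MASK64          # mov rcx, 0FFFFFFFFFFFFFFFFh
    for word in words:    # repne scasw
        rcx = (rcx - 1) & MASK64  # the count drops once per WCHAR compared
        if word == ax:            # stop after the terminator has been compared
            break
    rcx = ~rcx & MASK64           # not rcx
    rcx = (rcx - 1) & MASK64      # dec rcx
    return rcx


# Hypothetical input: the WCHARs of a short path plus the terminator.
path = [ord(c) for c in "\\share\\dir"] + [0]
print(inline_wcslen(path))
```

After n characters plus the terminator have been compared, rcx holds -1 - (n + 1); inverting that gives n + 1, and the decrement removes the terminator from the count.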

My initial plan was, assuming that the fix was trivial, to simply perform a small opcode patch on the unpatched version of netapi32.dll in the Server service process on the box in question. In this particular instance, however, there were a number of other changes throughout the patched function that made use of the additional length check. As a result, a small opcode patch wasn’t ideal, as large parts of the function would need to be rewritten to take advantage of the extra length check.

Thus, plan B evolved, wherein the Microsoft-supplied patched version of netapi32.dll would be injected into an already-running Server service process. From there, the plan was to detour buggy_netapi32!NetpwPathCanonicalize to fixed_netapi32!NetpwPathCanonicalize.

As it turns out, netapi32!NetpwPathCanonicalize and all of its subfunctions are stateless with respect to global netapi32 variables (aside from the /GS cookie), which made this approach feasible. If the call tree involved a dependency on netapi32 global state, then simply detouring the entire call tree wouldn’t have been a valid option, as the globals in the fixed netapi32.dll would be distinct from the globals in the buggy netapi32.dll.

This approach also makes the assumption that the only fixes made for the patch were in netapi32!NetpwPathCanonicalize and its call tree; as far as I know, this is the case, but this method is (of course) completely unsupported by Microsoft. Furthermore, as x64 binaries are built without hotpatch nop stubs at their prologue, the possibility for atomic patching in of a detour appeared to be out, so this approach has a chance of failing in the (unlikely) scenario where the first few instructions of netapi32!NetpwPathCanonicalize were being executed at the time of the detour.

Nonetheless, the worst case scenario would be that the box went down, in which case I’d be rebooting now instead of later. As the whole point of this exercise was to try and delay rebooting the system in question, I decided that this was an acceptable risk in my scenario, and elected to proceed. For the first step, I needed a program to inject a DLL into the target process (SDbgExt does not support !loaddll on 64-bit targets, sadly). The program that I came up with is certainly quick’n’dirty, as it fudges the thread start routine in terms of using kernel32!LoadLibraryA as the initial start address (which is a close enough analogue to LPTHREAD_START_ROUTINE to work), but it does the trick in this particular case.

The next step was to actually load the app into the svchost instance containing the Server service instance. To determine which svchost process this happens to be, one can use “tasklist /svc” from a cmd.exe console, which provides a nice formatted view of which services are hosted in which processes:

C:\WINDOWS\ms08-067-hotpatch>tasklist /svc
[…]
svchost.exe 840 AeLookupSvc, AudioSrv, BITS, Browser,
CryptSvc, dmserver, EventSystem, helpsvc,
HidServ, IAS, lanmanserver,
[…]
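
If you would rather not eyeball the listing, the lookup can be scripted. The following Python sketch parses text laid out like the excerpt above (a process line followed by continuation lines of service names) and returns the PID hosting a given service. The function name find_service_pid and the second svchost line in the sample are hypothetical, and the parsing is an assumption about this flattened layout rather than a robust parser for every tasklist output format.

```python
import re

# "image-name  pid  services..." starts a new process entry.
PROCESS_LINE = re.compile(r"^(?P<image>\S.*?)\s+(?P<pid>\d+)\s+(?P<services>.*)$")


def find_service_pid(tasklist_output, service):
    """Return the PID of the process hosting service, or None if not found."""
    current_pid = None
    for line in tasklist_output.splitlines():
        match = PROCESS_LINE.match(line)
        if match:
            current_pid = int(match.group("pid"))
            names = match.group("services")
        else:
            names = line  # continuation line: more services for the same PID
        hosted = [name.strip().lower() for name in names.split(",")]
        if current_pid is not None and service.lower() in hosted:
            return current_pid
    return None


# The first three lines come from the excerpt above; the last is made up.
sample = """\
svchost.exe 840 AeLookupSvc, AudioSrv, BITS, Browser,
CryptSvc, dmserver, EventSystem, helpsvc,
HidServ, IAS, lanmanserver,
svchost.exe 1180 Dnscache
"""
print(find_service_pid(sample, "lanmanserver"))
```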

That being done, the next step was to inject the DLL into the process. Unfortunately, the default security descriptor on svchost.exe instances doesn’t allow Administrators the access required to inject a thread. One way to solve this problem would have been to write the code to enable the debug privilege in the injector app, but I elected to simply use the age-old trick of using the scheduler service (at.exe) to launch the program in question as LocalSystem (this, naturally, requires that you already be an administrator in order to succeed):

C:\WINDOWS\ms08-067-hotpatch>at 21:32 C:\windows\ms08-067-hotpatch\testapp.exe 840 C:\windows\ms08-067-hotpatch\netapi32.dll
Added a new job with job ID = 1

(21:32 was one minute from the time when I entered that command, minutes being the minimum granularity for the scheduler service.)

Roughly one minute later, the debugger (attached to the appropriate svchost instance) confirmed that the patched DLL was loaded successfully:

ModLoad: 00000000`04ff0000 00000000`05089000   C:\windows\ms08-067-hotpatch\netapi32.dll

Stepping back for a moment, attaching WinDbg to an svchost instance containing services in the symbol server lookup code path is risky business, as you can easily deadlock the debugger. Proceed with care!

Now that the patched netapi32.dll was loaded, it was time to detour the old netapi32.dll to refer to the new netapi32.dll. Unfortunately, WinDbg doesn’t support assembling amd64 instructions very well (64-bit addresses and references to the extended registers don’t work properly), so I had to use a separate assembler (HIEW, [Hacker’s vIEW]) and manually patch in the opcode bytes for the detour sequence (mov rax, <absolute address> ; jmp rax):

0:074> eb NETAPI32!NetpwPathCanonicalize 48 C7 C0 40 AD FF 04 FF E0
0:074> u NETAPI32!NetpwPathCanonicalize
NETAPI32!NetpwPathCanonicalize:
000007ff`7737ad30 48c7c040adff04  mov     rax,offset netapi32_4ff0000!NetpwPathCanonicalize
000007ff`7737ad37 ffe0            jmp     rax
0:076> bp NETAPI32!NetpwPathCanonicalize
0:076> g

This said and done, all that remained was to set a breakpoint on netapi32!NetpwPathCanonicalize and give the proof of concept exploit a try against my hotpatched system (it survived). Mission accomplished!
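
For reference, the nine bytes written with “eb” above can be derived by hand from the instruction encodings, and the following Python sketch does exactly that. The function name encode_detour is hypothetical. The short 7-byte mov form only works because the replacement DLL happened to load below 2GB (at 04ff0000, per the ModLoad line), so that the 32-bit immediate sign-extends to the right address; the fallback to the 10-byte mov rax, imm64 form is my addition and was not needed in the session shown.

```python
import struct


def encode_detour(target):
    """Encode "mov rax, target ; jmp rax" as raw x64 opcode bytes."""
    if 0 <= target < 0x80000000:
        # REX.W C7 /0 id: mov rax, imm32 (sign-extended to 64 bits), 7 bytes
        mov = b"\x48\xC7\xC0" + struct.pack("<I", target)
    else:
        # REX.W B8+rd io: mov rax, imm64, 10 bytes
        mov = b"\x48\xB8" + struct.pack("<Q", target)
    return mov + b"\xFF\xE0"  # FF /4: jmp rax


# 0x04FFAD40 is the target address encoded in the "eb" command above.
patch = encode_detour(0x04FFAD40)
print(" ".join(f"{byte:02X}" for byte in patch))
```

Note that either form overwrites more than one instruction of the original prologue, which is why the write is not atomic with respect to threads already executing there.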

The obvious disclaimer: This whole procedure is a complete hack, and not recommended for production use, for reasons that should be relatively obvious. Additionally, MS08-067 did not come built as officially “hotpatch-enabled” (i.e. using the Microsoft supported hotpatch mechanism); “hotpatch-enabled” patches do not entail such a sequence of hacks in the deployment process.

(Thanks to hdm for double-checking some assumptions for me.)

Why does every heap trace in UMDH get stuck at “malloc”?

Thursday, February 21st, 2008

One of the more useful tools for tracking down memory leaks in Windows is a utility called UMDH that ships with the WinDbg distribution. Although I’ve previously covered what UMDH does at a high level, and how it functions, the basic principle for it, in a nutshell, is that it uses special instrumentation in the heap manager that is designed to log stack traces when heap operations occur.

UMDH utilizes the heap manager’s stack trace instrumentation to associate call stacks with outstanding allocations. More specifically, UMDH is capable of taking a “snapshot” of the current state of all heaps in a process, associating like-sized allocations from like-sized callstacks, and aggregating them in a useful form.

The general principle of operation is that UMDH is typically run two (or more) times, once to capture a “baseline” snapshot of the process after it has finished initializing (as there are expected to always be a number of outstanding allocations while the process is running that would not normally be expected to be freed until process exit time; for example, any allocations used to build the command line parameter arrays provided to the main function of a C program, or any other application-derived allocations that would be expected to remain checked out for the lifetime of the program).

This first “baseline” snapshot is essentially intended to be a means to filter out all of these expected, long-running allocations that would otherwise show up as useless noise if one were to simply take a single snapshot of the heap after the process had leaked memory.

The second (and potentially subsequent) snapshots are intended to be taken after the process has leaked a noticeable amount of memory. UMDH is then run again in a special mode that is designed to essentially do a logical “diff” between the “baseline” snapshot and the “leaked” snapshot, filtering out any allocations that were present in both of them and returning a list of new, outstanding allocations, which would generally include any leaked heap blocks (although there may well be legitimate outstanding allocations as well, which is why it is important to ensure that the “leaked” snapshot is taken only after a non-trivial amount of memory has been leaked, if at all possible).
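
The “diff” step is conceptually simple. Here is a minimal Python sketch of the idea, assuming each snapshot has already been reduced to a mapping from call stack to total outstanding bytes. The function name diff_snapshots, the data layout, and the sample stacks are all invented for illustration; this is not UMDH’s actual file format or algorithm.

```python
from collections import Counter


def diff_snapshots(baseline, leaked):
    """Return the bytes gained per stack trace between two snapshots.

    Each snapshot maps a stack trace (a tuple of frame names) to the total
    bytes outstanding from that call stack. Stacks that did not grow are
    filtered out; the rest are sorted with the largest growth first.
    """
    growth = Counter()
    for stack, size in leaked.items():
        delta = size - baseline.get(stack, 0)
        if delta > 0:
            growth[stack] = delta
    return growth.most_common()


# Hypothetical snapshots: one long-lived startup allocation and one leak.
baseline = {
    ("main", "parse_args", "malloc"): 512,
    ("worker", "read_packet", "malloc"): 4096,
}
leaked = {
    ("main", "parse_args", "malloc"): 512,
    ("worker", "read_packet", "malloc"): 1052672,
}

for stack, delta in diff_snapshots(baseline, leaked):
    print(delta, " <- ".join(reversed(stack)))
```

The startup allocation appears in both snapshots at the same size and drops out, leaving only the stack that grew.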

Now, this is all well and good, and while UMDH proves to be a very effective tool for tracking down memory leaks with this strategy, taking a “before” and “after” diff of a problem and analyzing the two to determine what’s gone wrong is hardly a new, ground-breaking concept.

While the theory behind UMDH is sound, however, there are some situations where it can work less than optimally. The most common failure case of UMDH in my experience is not actually so much related to UMDH itself, but rather the heap manager instrumentation code that is responsible for logging stack traces in the first place.

As I had previously discussed, the heap manager stack trace instrumentation logic does not have access to symbols, and on x86, “perfect” stack traces are not generally possible, as there is no metadata attached to a particular function (outside of debug symbols) that describes how to unwind past it.

The typical approach taken on x86 is to assume that all functions in the call stack do not use frame pointer omission (FPO) optimizations that allow the compiler to eliminate the usage of ebp for a function entirely, or even repurpose it for a scratch register.
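To make the frame pointer assumption concrete, here is a small Python model of an ebp-chain walk over a simulated stack. It is a sketch under stated assumptions rather than the heap manager's real code: the `memory` dictionary, the addresses, and the function name are invented for illustration. The only x86 facts it relies on are that, with a standard frame, `[ebp]` holds the caller's ebp and `[ebp+4]` holds the return address.

```python
def walk_ebp_chain(memory, ebp, max_frames=32):
    """Follow saved-ebp links and collect return addresses.

    `memory` maps an address to the 32-bit value stored there.
    """
    trace = []
    while len(trace) < max_frames and ebp in memory and ebp + 4 in memory:
        trace.append(memory[ebp + 4])      # return address of this frame
        next_ebp = memory[ebp]             # caller's saved frame pointer
        if next_ebp <= ebp:                # the stack grows down, so callers sit higher
            break
        ebp = next_ebp
    return trace

# Hypothetical stack with three well-formed frames (leaf frame at 0x0E00).
good = {
    0x0E00: 0x0F00, 0x0E04: 0x401200,   # leaf, returns into the middle function
    0x0F00: 0x1000, 0x0F04: 0x401100,   # middle function, returns into main
    0x1000: 0x0000, 0x1004: 0x007700,   # main, returns into startup code
}

# Same stack, but the middle function used ebp as a scratch register holding
# 0x488 when it called the leaf, so the leaf saved garbage instead of a link.
broken = dict(good)
broken[0x0E00] = 0x488

print([hex(a) for a in walk_ebp_chain(good, 0x0E00)])
print([hex(a) for a in walk_ebp_chain(broken, 0x0E00)])
```

In the second case the walker still records the return address into the middle function, but it cannot get any further, because the value it would follow next is not a frame pointer at all.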

Now, most of the libraries that ship with the operating system in recent OS releases have FPO explicitly turned off for x86 builds, with the sole intent of allowing the built-in stack trace instrumentation logic to be able to traverse through system-supplied library functions up through to application code (after all, if every heap stack trace dead-ended at kernel32!HeapAlloc, the whole concept of heap allocation traces would be fairly useless).

Unfortunately, there happens to be a notable exception to this rule, one that actually came around to bite me at work recently. I was attempting to track down a suspected leak with UMDH in one of our programs, and noticed that all of the allocations were grouped into a single stack trace that dead-ended in a rather spectacularly unhelpful way. Digging in a bit deeper, the individual snapshot dumps from UMDH contained scores of allocations with the following backtrace logged:

00000488 bytes in 0x1 allocations
   (@ 0x00000428 + 0x00000018) by: BackTrace01786

        7C96D6DC : ntdll!RtlDebugAllocateHeap+000000E1
        7C949D18 : ntdll!RtlAllocateHeapSlowly+00000044
        7C91B298 : ntdll!RtlAllocateHeap+00000E64
        211A179A : program!malloc+0000007A

This particular outcome happened to be rather unfortunate, as in the specific case of the program I was debugging at work, virtually all memory allocations in the program (including the ones I suspected of leaking) happened to ultimately get funneled through malloc.

Obviously, getting told that “yes, every leaked memory allocation goes through malloc” isn’t really all that helpful if (most) every allocation in the program in question happened to go through malloc. The UMDH output raised the question, however, as to why exactly malloc was breaking the stack traces. Digging in a bit deeper, I discovered the following gem while disassembling the implementation of malloc:

0:011> u program!malloc
program!malloc
[f:\sp\vctools\crt_bld\self_x86\crt\src\malloc.c @ 155]:
211a1720 55              push    ebp
211a1721 8b6c2408        mov     ebp,dword ptr [esp+8]
211a1725 83fde0          cmp     ebp,0FFFFFFE0h
[...]

In particular, it would appear that the default malloc implementation on the static link CRT on Visual C++ 2005 not only doesn’t use a frame pointer, but it trashes ebp as a scratch register (here, using it as an alias register for the first parameter, the count in bytes of memory to allocate). Disassembling the DLL version of the CRT revealed the same problem; ebp was reused as a scratch register.

What does this all mean? Well, anything using malloc that’s built with Visual C++ 2005 won’t be diagnosable with UMDH or anything else that relies on ebp-based stack traces, at least not on x86 builds. Given that many things internally go through malloc, including operator new (at least in the default implementation), this means that in the default configuration, things get a whole lot harder to debug than they should be.

One workaround here would be to build your own copy of the CRT with /Oy- (force frame pointer usage), but I don’t really consider building the CRT a very viable option, as that’s a whole lot of manual work to do and get up and running correctly on every developer’s machine, not to mention the headaches that every service release requiring a rebuild would bring with such an approach.

For operator new, it’s fortunately fairly doable to replace it in a reasonably supported way so that it is implemented against a different allocation strategy. In the case of malloc, however, things don’t really have such a happy ending; one is either forced to re-alias the name using preprocessor macro hackery to a custom implementation that does not suffer from a lack of frame pointer usage, or otherwise change all references to malloc/free to refer to a custom allocator function (perhaps implemented against the process heap directly instead of the CRT heap, à la malloc).

So, the next time you use UMDH and get stuck scratching your head while trying to figure out why your stack traces are all dead-ending somewhere less than useful, keep in mind that the CRT itself may be to blame, especially if you’re relying on CRT allocators. Hopefully, in a future release of Visual Studio, the folks responsible for turning off FPO in the standard OS libraries can get in touch with the persons responsible for CRT builds and arrange for the same to be done, if not for the entire CRT, then at least for all the code paths in the standard heap routines. Until then, however, these CRT allocator routines remain roadblocks for effective leak diagnosis, at least when using the better tools available for the job (UMDH).

Power management observations with my XV6800

Tuesday, December 11th, 2007

As I previously mentioned, I recently switched to a full-blown Windows Mobile-based phone. One of the interesting conundrums that I’ve run into along the way is feeling out what’s actually killer on the battery life for the device and what isn’t.

It’s actually kind of surprising how seemingly innocent little things can have a dramatic impact on battery life in a situation like an always-on handheld phone. For example, one of the main power saving features of modern CDMA cell networks and packet switched data connections is the ability for the device to go into dormant mode, a state in which the device relinquishes its active radio link resources and essentially goes into a standby mode that is in some respects akin to how the device operates when it is otherwise idle and waiting for a call. This is beneficial for the device’s battery life (because while in dormant mode, much of the logic related to maintaining the active high-rate radio link can be powered off while the network link is idle). Dormant mode is also beneficial for network operators, in that it allows the resources previously dedicated to a particular device to be used by other users that are actively sending or receiving data.

When data is ready to be sent on either end of the link, the device will “wake up” and exit dormant mode in order to send/receive data, keeping the link fully active until the session has been idle for enough time for the dormant idle timer to expire. (Entering and exiting dormant mode is mostly seamless, but may involve delays of 1-2 seconds, at least in my experience, so doing so on every packet would be highly undesirable.)

What all of this means is that there are some significant gains to be had in terms of radio power consumption if network traffic can be limited to where it is really necessary. There are a couple of things that I had initially running on my device that didn’t really fit this criterion. For example, I have an always-active SSH connection to a screen session that, among other things, runs my console SILC client, which is based off of irssi (an IRC client). One gotcha that I encountered there is that in the default configuration, this client has a clock on the ncurses console UI. The clock is updated every minute with the current time, which is (normally) handy as you can at-a-glance compare the current timestamp with message timestamps in the backlog.

However, this becomes problematic in my case, because the clock forces the device out of dormant mode every minute in order to repaint that portion of the console UI. (Aside from the clock, the remainder of the SILC/irssi UI is static until there is actual chat activity, which would otherwise make it relatively well suited for this situation.) Fortunately, it turned out to be relatively easy to remove the clock from the client’s UI, which prevents the SSH session from causing network activity every minute while I’ve got the SILC client up.

Another network power management “gotcha” that I ran into was the built-in L2TP/IPsec VPN client. It turns out that while an L2TP/IPsec link is active, the device was, in my case, unable to enter dormant mode for any non-trivial length of time. I suspect that this is caused by the default behavior of L2TP to send keep-alive messages every minute to ensure that the session is still alive (at the very least, packet captures VPN-server-side did not seem to indicate any traffic destined to the device’s projected IP Address).

In any case, this unpleasant side effect of the L2TP/IPsec client ruled out using it for full network connectivity in an always-on fashion. Unfortunately, dialing the VPN link on-demand has its own share of pitfalls, as the device seems to make the VPN link the default gateway. When I configured IMAPv4 mail to be fetched every so often via demand-dialing the VPN link (which would alleviate the difficulty with dormant mode not operating as desired with the VPN link always on), my SSH session would often die if either side happened to try and send data while Pocket Outlook was polling IMAP mail.

My solution in this case was to fall back to using SSH port-forwards through PocketPuTTY for IMAP mail (after I fixed SSH port forwards to work reliably, that is, of course). Unlike the L2TP/IPsec link, the SSH link doesn’t inherently force the device out of dormant mode without there being real activity, which allows me to continue to leave the SSH connection always on without sacrificing my battery life.

These two changes seem to have fixed the worst offenders in terms of battery drain, but there’s still plenty of room for improvement. For example, PocketPuTTY will tell the remote end of the session that the window has resized when you switch the device from portrait to landscape mode (or vice versa), such as when pulling out the device’s built-in keyboard. While this is a very handy feature if you’re actually using the SSH session (as it allows the remote console to resize itself appropriately, which is how SILC/irssi magically reshapes itself to fit the screen when the device keyboard is engaged), it does present the annoyance that pulling out the keyboard while PocketPuTTY is not active will still cause the packet radio link to be fully established to send the terminal resize notification and receive the resulting redraw data.

This behavior could, for instance, be further optimized to defer sending such notifications until the PocketPuTTY window becomes the foreground window (potentially aborting the resize notification entirely if the screen size is switched back to how it previously was before PocketPuTTY is activated).

One of the nice things about having a (relatively) easily end-user-programmable device instead of a locked-down, DRM’ified BREW-based handset is that there’s actually opportunity to change these sorts of things instead of just blithely accepting what’s offered with the device with no real hope of improving it.

How to not write code for a mobile device

Monday, December 10th, 2007

Earlier this week, I got a shiny, brand-new XV6800 to replace my aging (and rather limited in terms of running user-created programs) BREW-ified phone.

After setting up ActiveSync, IPsec, and all of the other usual required configuration settings, the next step was, of course, to install the baseline minimum set of applications on the device to get by. Close to the top of this list is an SSH client (being able to project arbitrary console programs over SSH and screen to the device is, at the very least, a workable enough solution in the interim for programs that are not feasible to run directly on the device, such as my SILC client). I’ve found PuTTY a fairly acceptable client for “big Windows”, and there just so happened to be a Windows Mobile port of it. Great, or so I thought…

While PocketPuTTY does technically work, I noticed a couple of deficiencies with it pretty quickly:

  1. The page up / page down keys are treated the same as the up arrow / down arrow keys when sent to the remote system. This sucks, because I can’t access my scrollback buffer in SILC easily without page up / page down. Other applications can tell the difference between the two keys (obviously, or there wouldn’t be much of a point to having the keys at all), so this one is clearly a PocketPuTTY bug.
  2. There is no way to restart a session once it is closed, or at least, not that I’ve found. Normally, on “big Windows”, there’s a “Restart Session” menu option in the window menu of the PuTTY session, but (as far as I can tell) there’s no such equivalent to the window menu on PocketPuTTY. There is a “Tools” menu, although it has some rather useless menu items (e.g. About) instead of some actually useful menu items, like “Restart Session”.
  3. Running PocketPuTTY seems to have a significantly negative effect on battery life. This is really unfortunate for me, since the expected use is to leave an SSH session to a terminal running screen for a long period of time. (Note that this was resolved, partially, by locating a slightly more maintained copy of PocketPuTTY.)
  4. SSH port forward support seems to be fairly broken, in that as soon as a socket is cleaned up, all receives in the process break until you restart it. This is annoying, but workable if one can go without SSH port forwards.

Most of these problems are actually not all that difficult to fix, and since the source code is available, I’m actually planning on trying my hand at doing just that, since I expect that this is an app that I’ll make fairly heavy use of.

The battery life problem is one I really want to call attention to among these deficiencies, however. My intent here is not to bash the PocketPuTTY folks (and I’m certainly happy that they’ve at least gotten a (semi)-working port of PuTTY with source code out, so that other people can improve on it from there), but rather to point out some things that should really just not be done when you’re writing code that is intended to run on a mobile device (especially if the code is intended to run exclusively on a mobile device).

On a portable device, one of the things that most users really expect is long battery life. Though this particular point certainly holds true for laptops as well, it is arguably even more important for converged mobile phone devices. After all, most people consider their phone an “always on” contact mechanism, and unexpectedly running out of battery life is extremely annoying in this aspect. Furthermore, if your mobile phone has the capability to run all sorts of useful programs on it, but doing so eats up your battery in a couple of hours, then there is really not that much point in having that capability at all, is there?

Returning to PocketPuTTY, one of the main problems (at least with the version I initially used) was, again, that PocketPuTTY would reduce battery life significantly. Looking around in the source code for possible causes, I noticed the following gem, which turned out to be the core of the network read loop for the program.

Yes, there really is a Sleep(1) spin loop in there, in a software port that is designed to run on battery powered devices. For starters, that blows any sort of processor power management completely out of the water. Mobile devices have a lot of different components vying for power, and the easiest (and most effective) way to save on power (and thus battery life) is to not turn those components on. Of course, it becomes difficult to do that for power hungry components like the 400MHz CPU in my XV6800 if there’s a program that has an always-ready-to-run thread…
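The difference between a short-sleep polling loop and a blocking wait can be sketched in Python. To be clear, this is not the PocketPuTTY code, which is not reproduced here; it is an analogue under assumptions of my own (a `queue.Queue` standing in for the socket, and the function names) that simply counts how often the reading thread has to wake up before data arrives.

```python
import queue
import threading
import time

def polling_read(q, poll_interval=0.001):
    """Spin with a short sleep until data arrives; returns (item, wakeups)."""
    wakeups = 0
    while True:
        wakeups += 1
        try:
            return q.get_nowait(), wakeups
        except queue.Empty:
            time.sleep(poll_interval)   # runnable again roughly a millisecond later

def blocking_read(q):
    """Block until data arrives; counted as one wakeup because the call
    only returns once an item is available."""
    return q.get(), 1

def run(reader, delay=0.2):
    """Deliver one item after `delay` seconds and return what the reader saw."""
    q = queue.Queue()
    threading.Timer(delay, q.put, args=["data"]).start()
    return reader(q)

polled_item, polled_wakeups = run(polling_read)
blocked_item, blocked_wakeups = run(blocking_read)
print(f"polling:  {polled_wakeups} wakeups before {polled_item!r} arrived")
print(f"blocking: {blocked_wakeups} wakeup before {blocked_item!r} arrived")
```

Both readers end up with the same data, but the polling version keeps its thread waking up the whole time it waits, which is the kind of behavior that stops a CPU from idling.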

Fortunately, there happened to be a newer revision of PocketPuTTY floating about with the issue fixed (although getting ahold of the source code for that version proved to be slightly more difficult, as the original maintainer of the project seems to have disappeared). I did eventually manage to get into contact with someone who had been maintaining their own set of improvements and grab a not-so-crusty-old source tree from them to do my own improvements, primarily for the purposes of fixing some of the annoyances I mentioned previously (thus beginning my initial foray into Windows CE development).

A catalog of NTDLL kernel mode to user mode callbacks, part 6: LdrInitializeThunk

Wednesday, November 28th, 2007

Previously, I described the mechanism by which the kernel mode to user mode callback dispatcher (KiUserCallbackDispatcher) operates, and how it is utilized by win32k.sys for various window manager related operations.

The next special NTDLL kernel mode to user mode “callback” up on the list is LdrInitializeThunk, which is not so much a callback as the entry point at which all user mode threads begin their execution system-wide. Although the Win32 CreateThread API (and even the NtCreateThread system service that is used to implement the Win32 CreateThread) provide the illusion that a thread begins its execution at a specified start routine (or instruction pointer, in the case of NtCreateThread), this is not truly the case.

CreateThread internally layers in a special kernel32 stub routine (BaseThreadStart or BaseProcessStart) in between the specified thread routine and the actual initial instruction of the thread. The kernel32 stub routine wraps the call to the user-supplied thread start routine to provide services such as a “top-level” SEH frame for the support of UnhandledExceptionFilter.

However, there exists yet another layer of indirection before a thread begins its execution at the specified thread start routine, even beyond the kernel32 stub routine at the start of all Win32 threads (the way the kernel32 stub routine works changes slightly with Windows Vista, though that is outside the scope of this discussion). The presence of this extra layer of indirection can be inferred by examining the documentation on MSDN for DllMain, which states that all threads call out to the DllMain routine at some point during thread start-up. The kernel32 stub routine is not involved in this process, and obviously the user-supplied thread entry point does not have to explicitly attempt to call DllMain for every loaded DLL with DLL_THREAD_ATTACH. This leaves us with the question of who actually arranges for these DllMain calls to happen when a thread begins.

The answer to this question is, of course, the feature routine of this article, LdrInitializeThunk. When a user mode thread is readied to begin initial execution after being created, the initial context that is realized is not the context value supplied to NtCreateThread (which would eventually end up in the user-supplied thread entry point). Instead, execution really begins at LdrInitializeThunk within NTDLL, which is supplied a CONTEXT record that describes the initially requested state of the thread (as supplied to NtCreateThread, or as created by NtCreateThreadEx on Windows Vista). This context record is provided as an argument to LdrInitializeThunk, in order to allow for control to (eventually) be transferred to the user-supplied thread entry point.

When invoked by a new thread, LdrInitializeThunk invokes LdrpInitialize to perform the remainder of the initialization tasks, and then calls upon the NtContinue system service to restore the supplied context record. I have made available a C-like representation of this process for illustration purposes.

LdrpInitialize makes a determination as to whether the process has already been initialized (for the purposes of NTDLL). This step is necessary as LdrInitializeThunk (and by extension, LdrpInitialize) is not only the actual entry point for a new thread in an already initialized process, but it is also the entry point for the initial thread in a completely new process (making it the first piece of code that is run in user mode in a new process). If the process has not already been initialized, then LdrpInitialize performs process initialization tasks by invoking LdrpInitializeProcess (some process initialization is performed inline by LdrpInitialize in the process initialization case as well). If the process is a Wow64 process, then the Wow64 NTDLL is loaded and invoked to initialize the 32-bit NTDLL’s per-process state.

Otherwise, if the process is already initialized, then LdrpInitialize invokes LdrpInitializeThread to perform per-thread initialization, which primarily involves invoking DllMain and TLS callbacks for loaded modules while the loader lock is held. (This is the reason why it is not supported to wait on a new thread to run its thread initialization routine from DllMain, because the new thread will immediately become blocked upon the loader lock, waiting for the thread already in DllMain to complete its processing.) If the process is a Wow64 process, then there is again support for making a call to the 32-bit NTDLL for purposes of running 32-bit per-thread initialization code.

When the required process or thread initialization tasks have been completed, LdrpInitialize returns to LdrInitializeThunk, which then realizes the user-supplied thread start context with a call to the NtContinue system service.
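The overall control flow described above can be modeled with a toy Python class. This is a deliberately simplified sketch, not NTDLL's implementation: the class, the log strings, and the dictionary used as a "context" are assumptions for illustration, and details such as the loader lock, TLS callbacks, and Wow64 handling are left out.

```python
class Loader:
    """Toy model of the per-process loader state kept by NTDLL."""

    def __init__(self, modules):
        self.process_initialized = False
        self.modules = modules
        self.log = []

    def ldr_initialize_thunk(self, context):
        # Every new user mode thread starts here, not at its start routine.
        self.ldrp_initialize()
        return self.nt_continue(context)

    def ldrp_initialize(self):
        if not self.process_initialized:
            # The first thread to arrive performs process-wide initialization.
            self.log.append("LdrpInitializeProcess")
            self.process_initialized = True
        else:
            # Later threads only run per-thread initialization.
            self.log.append("LdrpInitializeThread")
            for module in self.modules:
                self.log.append(f"{module}!DllMain(DLL_THREAD_ATTACH)")

    def nt_continue(self, context):
        # Stand-in for NtContinue: resume at the originally requested context.
        self.log.append(f"NtContinue -> {context['start']}")
        return context["start"]

loader = Loader(["kernel32", "user32"])
loader.ldr_initialize_thunk({"start": "BaseProcessStart"})
loader.ldr_initialize_thunk({"start": "BaseThreadStart"})
print("\n".join(loader.log))
```

The point of the model is the ordering: whichever thread reaches the thunk first initializes the process, and only after initialization completes is the requested start context actually resumed.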

One consequence of this architecture for process and thread initialization is that it becomes (slightly) more difficult to step through process initialization, because the thread that initializes the process is not the first thread created in the process, but rather the first thread that executes LdrInitializeThunk. This means that one cannot simply create a process as suspended and attach a debugger to the process in order to step through process initialization, as the debugger break-in thread will run the very process initialization code that one wishes to step through before executing the break-in thread breakpoint instruction!

There is, fortunately, support for debugging this scenario built in to the Windows debugger package. By setting the debugger to break on the ‘create process’ event, it is possible to manually set a breakpoint on a new process (created by the debugger) or a child process (if child process debugging is enabled) before LdrInitializeThunk is run. This support is activated by breaking on the cpr (Create Process) event. For example, using the following ntsd command line, it is possible to examine the target and set breakpoints before any user mode code runs:

ntsd.exe -xe cpr C:\windows\system32\cmd.exe

Note that as the loaded module list will not have been initialized at this point, symbols will not be functional, so it is necessary to resolve the offset of LdrInitializeThunk manually in order to set a breakpoint. The process can be continued with the typical “g” command.

Update: Pavel Labedinsky points out that you can use “-xe ld:ntdll.dll” to achieve the same effect, but with the benefit that symbols for NTDLL are available. This is a better option as it alleviates the necessity to manually resolve the addresses of locations that you wish to set a breakpoint on.

Next up: Examining RtlUserThreadStart (present on Windows Vista and beyond).

A catalog of NTDLL kernel mode to user mode callbacks, part 5: KiUserCallbackDispatcher

Wednesday, November 21st, 2007

Last time, I briefly outlined the operation of KiRaiseUserExceptionDispatcher, and how it is used by the NtClose system service to report certain classes of handle misuse under the debugger.

All of the NTDLL kernel mode to user mode “callbacks” that I have covered thus far have been, for the most part, fairly “passive” in nature. By this, I mean that the kernel does not explicitly call any of these callbacks, at least in the usual nature of making a function call. Instead, all of the routines that we have discussed thus far are only invoked instead of the normal return procedure for a system call or interrupt, under certain conditions. (Conceptually, this is similar in some respects to returning to a different location using longjmp.)

In contrast to the other routines that we have discussed thus far, KiUserCallbackDispatcher breaks completely out of the passive callback model. The user mode callback dispatcher is, as the name implies, a trampoline that is used to make full-fledged calls to user mode, from kernel mode. (It is complemented by the NtCallbackReturn system service, which resumes execution in kernel mode following a user mode callback’s completion. Note that this means that a user mode callback can make auxiliary calls into the kernel without “returning” back to the original kernel mode caller.)

Calling user mode from kernel mode is a very non-traditional approach in the Windows world, and for good reason. Such calls are typically dangerous and need to be implemented very carefully in order to avoid creating any number of system reliability or integrity issues. Beyond simply validating any data returned to kernel mode from user mode, there are a far greater number of concerns with a direct kernel mode to user mode call model as supported by KiUserCallbackDispatcher. For example, a thread running in user mode can be freely suspended, delayed for a very long period of time due to a high priority user thread, or even terminated. These actions mean that any code spanning a call out to user mode must not hold locks, have acquired memory or other resources that might need to be released, and so forth.

From a kernel mode perspective, the way a user mode callback using KiUserCallbackDispatcher works is that the kernel saves the current processor state on the current kernel stack, alters the view of the top of the current kernel stack to point after the saved register state, sets a field in the current thread (CallbackStack) to point to the stack frame containing the saved register state (the previous CallbackStack value is saved to allow for recursive callbacks), and then executes a return to user mode using the standard return mechanism.
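The save-and-link bookkeeping that allows callbacks to nest can be illustrated with a toy Python model. It is only a sketch of the linked-frame idea: the class, the dictionary used as a saved frame, and the recursive `window_proc` helper are hypothetical, and nothing here reflects the real kernel data structures beyond the notion of a per-thread CallbackStack value that is saved and restored.

```python
class KernelThread:
    """Toy model of the per-thread bookkeeping for user mode callbacks."""

    def __init__(self):
        self.callback_stack = None   # models the thread's CallbackStack field
        self.max_depth = 0

    def call_user_mode(self, user_function, argument, depth=1):
        # Save "register state" in a frame linked to the previous CallbackStack.
        frame = {"saved_state": f"kernel state at depth {depth}",
                 "previous": self.callback_stack}
        self.callback_stack = frame
        self.max_depth = max(self.max_depth, depth)
        try:
            return user_function(self, argument, depth)
        finally:
            # Return path: unlink the frame, restoring the previous value.
            self.callback_stack = frame["previous"]

def window_proc(thread, remaining, depth):
    """Hypothetical user mode callback that re-enters the kernel."""
    if remaining == 0:
        return depth
    # The nested system call makes yet another callback out to user mode.
    return thread.call_user_mode(window_proc, remaining - 1, depth + 1)

thread = KernelThread()
deepest = thread.call_user_mode(window_proc, 2)
print(f"deepest nesting: {deepest}, frames left linked: {thread.callback_stack}")
```

Each nested callback consumes another saved frame on the same kernel stack, which gives some intuition for why deep nesting puts pressure on kernel stack space.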

The user mode return address is, of course, set to the feature NTDLL routine of this article, KiUserCallbackDispatcher. The way the user mode callback dispatcher operates is fairly simple. First, it indexes into an array stored in the PEB with an argument to the callback dispatcher that is used to select the function to be invoked. Then, the callback routine located in the array is invoked, and provided with a single pointer-sized argument from kernel mode (this argument is typically a structure pointer containing several parameters packaged up into one contiguous block of memory). The actual implementation of KiUserCallbackDispatcher is fairly simple, and I have posted a C representation of it.
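The index-then-call behavior of the dispatcher can be modeled in a few lines of Python. This is an illustrative sketch and not the C representation referred to above: the table contents, the indices, the handler names, and the tuple returned by the NtCallbackReturn stand-in are all invented for the example.

```python
def fn_create(args):
    """Hypothetical handler for a window creation callback."""
    return f"WM_NCCREATE for {args['window']}"

def fn_calcsize(args):
    """Hypothetical handler for a size calculation callback."""
    return f"WM_NCCALCSIZE for {args['window']}"

# Stand-in for the callback array referenced from the PEB.
callback_table = [fn_create, fn_calcsize]

def nt_callback_return(result):
    """Stand-in for NtCallbackReturn: hands the result back to kernel mode."""
    return ("returned to kernel", result)

def ki_user_callback_dispatcher(api_index, argument):
    handler = callback_table[api_index]   # select the routine by index
    result = handler(argument)            # a single packaged argument block
    return nt_callback_return(result)

print(ki_user_callback_dispatcher(0, {"window": "Notepad"}))
print(ki_user_callback_dispatcher(1, {"window": "Notepad"}))
```

Packaging all parameters into one argument block is what lets a single dispatcher signature serve every callback in the table.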

In Win32, kernel mode to user mode callbacks are used exclusively by User32 for windowing related aspects, such as calling a window procedure to send a WM_NCCREATE message during the creation of a new window on behalf of a user mode caller that has invoked NtUserCreateWindowEx. For example, during window creation processing, if we set a breakpoint on KiUserCallbackDispatcher, we might see the following:

Breakpoint 1 hit
ntdll!KiUserCallbackDispatch:
00000000`77691ff7 488b4c2420  mov rcx,qword ptr [rsp+20h]
0:000> k
RetAddr           Call Site
00000000`775851ca ntdll!KiUserCallbackDispatch
00000000`7758514a USER32!ZwUserCreateWindowEx+0xa
00000000`775853f4 USER32!VerNtUserCreateWindowEx+0x27c
00000000`77585550 USER32!CreateWindowEx+0x3fe
000007fe`fddfa5b5 USER32!CreateWindowExW+0x70
000007fe`fde221d3 ole32!InitMainThreadWnd+0x65
000007fe`fde2150c ole32!wCoInitializeEx+0xfa
00000000`ff7e6db0 ole32!CoInitializeEx+0x18c
00000000`ff7ecf8b notepad!WinMain+0x5c
00000000`7746cdcd notepad!IsTextUTF8+0x24f
00000000`7768c6e1 kernel32!BaseThreadInitThunk+0xd
00000000`00000000 ntdll!RtlUserThreadStart+0x1d

If we step through this call a bit more, we’ll see that it eventually ends up in a function by the name of user32!_fnINLPCREATESTRUCT, which eventually calls user32!DispatchClientMessage with the WM_NCCREATE window message, allowing the window procedure of the new window to participate in the window creation process, despite the fact that win32k.sys handles the creation of a window in kernel mode.

Callbacks are, as previously mentioned, permitted to be nested (or even recursively made) as well. For example, after watching calls to KiUserCallbackDispatcher for a time, we’ll probably see something akin to the following:

Breakpoint 1 hit
ntdll!KiUserCallbackDispatch:
00000000`77691ff7 488b4c2420  mov rcx,qword ptr [rsp+20h]
0:000> k
RetAddr           Call Site
00000000`7758b45a ntdll!KiUserCallbackDispatch
00000000`7758b4a4 USER32!NtUserMessageCall+0xa
00000000`7758e55a USER32!RealDefWindowProcWorker+0xb1
000007fe`fca62118 USER32!RealDefWindowProcW+0x5a
000007fe`fca61fa1 uxtheme!_ThemeDefWindowProc+0x298
00000000`7758b992 uxtheme!ThemeDefWindowProcW+0x11
00000000`ff7e69ef USER32!DefWindowProcW+0xe6
00000000`7758e25a notepad!NPWndProc+0x217
00000000`7758cbaf USER32!UserCallWinProcCheckWow+0x1ad
00000000`77584e1c USER32!DispatchClientMessage+0xc3
00000000`77692016 USER32!_fnINOUTNCCALCSIZE+0x3c
00000000`775851ca ntdll!KiUserCallbackDispatcherContinue
00000000`7758514a USER32!ZwUserCreateWindowEx+0xa
00000000`775853f4 USER32!VerNtUserCreateWindowEx+0x27c
00000000`77585550 USER32!CreateWindowEx+0x3fe
00000000`ff7e9525 USER32!CreateWindowExW+0x70
00000000`ff7e6e12 notepad!NPInit+0x1f9
00000000`ff7ecf8b notepad!WinMain+0xbe
00000000`7746cdcd notepad!IsTextUTF8+0x24f
00000000`7768c6e1 kernel32!BaseThreadInitThunk+0xd

This support for recursive callbacks is a large factor in why threads that talk to win32k.sys often have so-called “large kernel stacks”. The kernel mode dispatcher for user mode calls will attempt to convert the thread to a large kernel stack when a call is made, as the typical sized kernel stack is not large enough to support the number of recursive kernel mode to user mode calls present in many complicated window messaging calls.

If the process is a Wow64 process, then the callback array in the PEB is repointed to an array of conversion functions inside the Wow64 layer, which map the callback argument to a version compatible with the 32-bit user32.dll, as appropriate.

Next up: Taking a look at LdrInitializeThunk, where all user mode threads really begin their execution.