Recovering a process from a hung debugger

February 21st, 2009

One of the more annoying things that can happen while debugging processes that deal with network traffic is attaching to a process that sits in the “critical path” for accessing the debugger’s active symbol path.

In such a scenario, the debugger will usually deadlock itself trying to request symbols, which requires going through some code path that involves the debuggee, which, being frozen by the debugger, never gets to run. Usually, one would think this means a lost repro (and, depending on the criticality of the process attached to the debugger, possibly a forced reboot as well), neither of which is a particularly fun outcome.

It turns out that you’re not necessarily hosed if this happens, though. If you can still start a new process on the computer, then there’s actually a way to steal the process back from the debugger (on Windows XP and later), thanks to the new-fangled kernel debug object-based debugging support. Here’s what you need to do:

  1. Attach a new debugger to the process, with symbol support disabled. (You will be attaching to the debuggee and not the debugger.)

    Normally, you can’t attach a debugger to a process while it’s already being debugged. However, there’s an option in windbg (and ntsd/cdb as well) that allows you to do this: the “-pe” debugger command-line parameter (documented in the debugger documentation), which forcibly attaches to the target process despite the presence of the hung debugger.

    Of course, force-attaching to the process won’t do any good if the new debugger process will just deadlock right away. As a result, you should make sure that the debugger won’t try and do any symbol-loading activity that might engender a deadlock. This is the command line that I usually use for that purpose, which disables _NT_SYMBOL_PATH appending (“-sins“), disables CodeView pdb pointer following (“-sicv“), and resets the symbol path to a known good value (“-y .“):

    ntsd -sicv -sins -y . -pe -p hung_debuggee_pid

    I recommend using ntsd and not WinDbg for this purpose in order to reduce the chance of a symbol path that might be stored in a WinDbg workspace from causing the debugger to deadlock itself again.

  2. Kill the hung debugger.

    After successfully attaching with “-pe“, you can safely kill the hung debugger (by whatever means necessary) without causing the former debuggee to get terminated along with it.

  3. Resume all threads in the target.

    The suspend count of most threads in the target is likely to be wrong. You can correct this by issuing the “~*M” command several times (“~*M” applies the “~M” resume command to all threads in the process).

    To determine the suspend count of all threads in the process, you can use the “~” command. For example, you might see the following:

    0:001> ~
       0  Id: 18c4.16f0 Suspend: 2 Teb: 7efdd000 Unfrozen
    .  1  Id: 18c4.1a04 Suspend: 2 Teb: 7efda000 Unfrozen
    

    You should issue the “~*M” command enough times to bring the suspend count of all threads down to zero. (Don’t worry if you resume a thread more times than it is suspended.) Typically, this would be two times for the common case, but by checking the suspend count of active threads, you can be certain of the number of times that you need to resume all threads in the process.

  4. Detach the new debugger from the debuggee.

    After resuming all threads in the target, use the “qd” command to detach the debugger. Do not attempt to resume the debugger with the “g” command (as it will stay suspended), or quit the debugger without attaching (as that would cause the debuggee to get terminated).

    If you need to keep a particular thread in the debuggee suspended so that you can re-attach the debugger without losing your place, you can leave that thread’s suspend count above zero.
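As a toy model of step 3, the suspend-count bookkeeping can be sketched in C. This is purely illustrative (the helper names are made up; a real implementation would call the Windows ResumeThread API, which similarly does nothing harmful to a thread that is already running):

```c
#include <stddef.h>

/* Models one "~*M" pass: every suspended thread's suspend count drops
   by one; resuming an already-running thread is harmless (clamped). */
static void resume_once(int *suspend_counts, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (suspend_counts[i] > 0)
            suspend_counts[i]--;
}

/* True once every thread's suspend count has reached zero. */
static int all_running(const int *suspend_counts, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (suspend_counts[i] != 0)
            return 0;
    return 1;
}

/* Issue passes until every thread is resumed; returns the number of
   passes needed (two, in the common case described above). */
static int resume_all(int *suspend_counts, size_t n)
{
    int passes = 0;
    while (!all_running(suspend_counts, n)) {
        resume_once(suspend_counts, n);
        passes++;
    }
    return passes;
}
```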

Voila, the debuggee should return back to life. Now, you should be able to re-attach a debugger (hopefully, with a safe symbol path this time), or not, as desired.

Hotpatching MS08-067

October 24th, 2008

If you have been watching the Microsoft security bulletins lately, then you’ve likely noticed yesterday’s bulletin, MS08-067. This is a particularly nasty bug, as it doesn’t require authentication to exploit in the default configuration for Windows Server 2003 and earlier systems (assuming that an attacker can talk over port 139 or port 445 to your box).

The usual mitigation for this particular vulnerability is to block off TCP/139 (NetBIOS) and TCP/445 (Direct hosted SMB), thus cutting off remote access to the srvsvc pipe, a prerequisite for exploiting the vulnerability in question. In my case, however, I had a box that I really didn’t want to reboot immediately. In addition, for the box in question, I did not wish to leave SMB blocked off remotely.

Given that I didn’t want to assume that there’d be no possible way for an untrusted user to establish a TCP/139 or TCP/445 connection, this left me with limited options: either I could simply hope that there wasn’t a chance for the box to get compromised before it became convenient to reboot, or I could see if I could come up with some form of alternative mitigation on my own. After all, a debugger is the software development equivalent of a swiss army knife and duct tape; I figured that it would be worth a try to see if I could cobble together some sort of mitigation by manually patching the vulnerable netapi32.dll. To do this, however, it would be necessary to gain enough information about the flaw in question to discern what the fix was, in the hope of creating some form of alternative countermeasure for the vulnerability.

The first stop for gaining more information about the bug in question would be the Microsoft advisory. As usual, however, the bulletin released for the MS08-067 issue lacked the technical detail required to fully understand the flaw, down to the level of which functions were patched, aside from the fact that the vulnerability resided somewhere in netapi32.dll (the Microsoft rationale behind this policy being that providing that level of technical detail would simply aid the creation of exploits). However, as Pusscat presented at Blue Hat Fall ’07, reverse engineering most present-day Microsoft security patches is not particularly insurmountable.

The usual approach to the patch reverse engineering process is to use a program called bindiff (an IDA plugin) that analyzes two binaries in order to discover the differences between the two. In my case, however, I didn’t have a copy of bindiff handy (it’s fairly pricey). Fortunately (or unfortunately, depending on your persuasion), there already existed a public exploit for this bug, as well as some limited public information from persons who had already reverse engineered the patch to a degree. From this, I knew of a particular function in the affected module (netapi32!NetpwPathCanonicalize) that was related to the vulnerability in some form.

At this point, I brought up a copy of the unpatched netapi32.dll in IDA, as well as a patched copy of netapi32.dll, then started looking through and comparing disassembly one page at a time until an interesting difference popped up in a subfunction of netapi32!NetpwPathCanonicalize:

Unpatched code:

.text:000007FF7737AF90 movzx   eax, word ptr [rcx]
.text:000007FF7737AF93 xor     r10d, r10d
.text:000007FF7737AF96 xor     r9d, r9d
.text:000007FF7737AF99 cmp     ax, 5Ch
.text:000007FF7737AF9D mov     r8, rcx
.text:000007FF7737AFA0 jz      loc_7FF7737515E

Patched code:

.text:000007FF7737AFA0 mov     r8, rcx
.text:000007FF7737AFA3 xor     eax, eax
.text:000007FF7737AFA5 mov     [rsp+arg_10], rbx
.text:000007FF7737AFAA mov     [rsp+arg_18], rdi
.text:000007FF7737AFAF jmp     loc_7FF7738E5D6

.text:000007FF7738E5D6 mov     rcx, 0FFFFFFFFFFFFFFFFh
.text:000007FF7738E5E0 mov     rdi, r8
.text:000007FF7738E5E3 repne scasw
.text:000007FF7738E5E6 movzx   eax, word ptr [r8]
.text:000007FF7738E5EA xor     r11d, r11d
.text:000007FF7738E5ED not     rcx
.text:000007FF7738E5F0 xor     r10d, r10d
.text:000007FF7738E5F3 dec     rcx
.text:000007FF7738E5F6 cmp     ax, 5Ch
.text:000007FF7738E5FA lea     rbx, [r8+rcx*2+2]
.text:000007FF7738E5FF jnz     loc_7FF7737AFB4

Now, without even really understanding what’s going on in the function as a whole, it’s pretty obvious that here’s where (at least one) modification is being made; the new code involves the addition of an inline wcslen call. Typically, security fixes for buffer overrun conditions involve the creation of previously missing boundary checks, so a new call to a string-length function such as wcslen is a fairly reliable indicator that one’s found the site of the fix for the vulnerability in question.

(The repne scasw instruction scans a memory region two bytes at a time until the value in ax is matched, or the maximum count (in rcx, typically initialized to (size_t)-1) is exhausted. Since we’re scanning two bytes at a time, and we’ve zeroed eax, we’re looking for a 0x0000 value in a string of two-byte quantities; in other words, an array of WCHARs (or a null-terminated Unicode string). The resulting value in rcx after executing the repne scasw can be used to derive the length of the string, as it will have been decremented once for each WCHAR encountered, including the terminating 0x0000 WCHAR.)
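For the curious, the counter arithmetic can be replayed in C. This sketch mirrors the instruction sequence above (the decrement-per-word scan, followed by not rcx / dec rcx); it is an illustration of the math only, not the actual compiler output:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t WCHAR_T;   /* stand-in for a 2-byte WCHAR */

/* Models the inline wcslen in the patched code: repne scasw decrements
   the counter (initialized to (size_t)-1) once per WCHAR scanned,
   including the 0x0000 terminator; "not rcx" then yields characters
   plus terminator, and "dec rcx" yields the character count. */
static size_t scasw_wcslen(const WCHAR_T *s)
{
    size_t counter = (size_t)-1;      /* mov rcx, 0FFFFFFFFFFFFFFFFh */

    /* repne scasw: one decrement per word, stopping after the 0 word */
    do {
        counter--;
    } while (*s++ != 0);

    counter = ~counter;               /* not rcx -> chars + terminator */
    counter--;                        /* dec rcx -> chars only         */
    return counter;
}
```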

My initial plan was, assuming that the fix was trivial, to simply perform a small opcode patch on the unpatched version of netapi32.dll in the Server service process on the box in question. In this particular instance, however, there were a number of other changes throughout the patched function that made use of the additional length check. As a result, a small opcode patch wasn’t ideal, as large parts of the function would need to be rewritten to take advantage of the extra length check.

Thus, plan B evolved, wherein the Microsoft-supplied patched version of netapi32.dll would be injected into an already-running Server service process. From there, the plan was to detour buggy_netapi32!NetpwPathCanonicalize to fixed_netapi32!NetpwPathCanonicalize.

As it turns out, netapi32!NetpwPathCanonicalize and all of its subfunctions are stateless with respect to global netapi32 variables (aside from the /GS cookie), which made this approach feasible. If the call tree involved a dependency on netapi32 global state, then simply detouring the entire call tree wouldn’t have been a valid option, as the globals in the fixed netapi32.dll would be distinct from the globals in the buggy netapi32.dll.

This approach also makes the assumption that the only fixes made for the patch were in netapi32!NetpwPathCanonicalize and its call tree; as far as I know, this is the case, but this method is (of course) completely unsupported by Microsoft. Furthermore, as x64 binaries are built without hotpatch nop stubs at their prologue, the possibility of atomically patching in a detour appeared to be out, so this approach has a chance of failing in the (unlikely) scenario where the first few instructions of netapi32!NetpwPathCanonicalize were being executed at the time of the detour.

Nonetheless, the worst case scenario would be that the box went down, in which case I’d be rebooting now instead of later. As the whole point of this exercise was to try and delay rebooting the system in question, I decided that this was an acceptable risk in my scenario, and elected to proceed. For the first step, I needed a program to inject a DLL into the target process (SDbgExt does not support !loaddll on 64-bit targets, sadly). The program that I came up with is certainly quick’n’dirty, as it fudges the thread start routine in terms of using kernel32!LoadLibraryA as the initial start address (which is a close enough analogue to LPTHREAD_START_ROUTINE to work), but it does the trick in this particular case.

The next step was to actually load the app into the svchost instance containing the Server service instance. To determine which svchost process this happens to be, one can use “tasklist /svc” from a cmd.exe console, which provides a nice formatted view of which services are hosted in which processes:

C:\WINDOWS\ms08-067-hotpatch>tasklist /svc
[…]
svchost.exe 840 AeLookupSvc, AudioSrv, BITS, Browser,
CryptSvc, dmserver, EventSystem, helpsvc,
HidServ, IAS, lanmanserver,
[…]

That being done, the next step was to inject the DLL into the process. Unfortunately, the default security descriptor on svchost.exe instances doesn’t allow Administrators the access required to inject a thread. One way to solve this problem would have been to write the code to enable the debug privilege in the injector app, but I elected to simply use the age-old trick of using the scheduler service (at.exe) to launch the program in question as LocalSystem (this, naturally, requires that you already be an administrator in order to succeed):

C:\WINDOWS\ms08-067-hotpatch>at 21:32 C:\windows\ms08-067-hotpatch\testapp.exe 840 C:\windows\ms08-067-hotpatch\netapi32.dll
Added a new job with job ID = 1

(21:32 was one minute from the time when I entered that command, minutes being the minimum granularity for the scheduler service.)

Roughly one minute later, the debugger (attached to the appropriate svchost instance) confirmed that the patched DLL was loaded successfully:

ModLoad: 00000000`04ff0000 00000000`05089000 
C:\windows\ms08-067-hotpatch\netapi32.dll

Stepping back for a moment, attaching WinDbg to an svchost instance containing services in the symbol server lookup code path is risky business, as you can easily deadlock the debugger. Proceed with care!

Now that the patched netapi32.dll was loaded, it was time to detour the old netapi32.dll to refer to the new netapi32.dll. Unfortunately, WinDbg doesn’t support assembling amd64 instructions very well (64-bit addresses and references to the extended registers don’t work properly), so I had to use a separate assembler (HIEW, “Hacker’s vIEW”) and manually patch in the opcode bytes for the detour sequence (mov rax, <absolute address> ; jmp rax):

0:074> eb NETAPI32!NetpwPathCanonicalize 48 C7 C0 40 AD FF 04 FF E0
0:074> u NETAPI32!NetpwPathCanonicalize
NETAPI32!NetpwPathCanonicalize:
000007ff`7737ad30 48c7c040adff04
mov rax,offset netapi32_4ff0000!NetpwPathCanonicalize
000007ff`7737ad37 ffe0            jmp     rax
0:076> bp NETAPI32!NetpwPathCanonicalize
0:076> g

This said and done, all that remained was to set a breakpoint on netapi32!NetpwPathCanonicalize and give the proof of concept exploit a try against my hotpatched system (it survived). Mission accomplished!
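For reference, the detour bytes entered with “eb” above can be constructed mechanically. The sketch below builds the same mov rax, imm32 / jmp rax sequence for an arbitrary 32-bit target address; note that the 48 C7 C0 encoding sign-extends its immediate, which is fine for a DLL loaded at 0x04ff0000 but would not reach an arbitrary 64-bit address:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Builds the 9-byte detour used above:
     48 C7 C0 imm32   mov rax, imm32 (sign-extended to 64 bits)
     FF E0            jmp rax
   Returns the number of bytes written (always 9). */
static size_t build_detour(uint8_t *buf, uint32_t target)
{
    static const uint8_t mov_rax[] = { 0x48, 0xC7, 0xC0 };
    static const uint8_t jmp_rax[] = { 0xFF, 0xE0 };

    memcpy(buf, mov_rax, sizeof mov_rax);
    buf[3] = (uint8_t)(target);          /* imm32, little-endian */
    buf[4] = (uint8_t)(target >> 8);
    buf[5] = (uint8_t)(target >> 16);
    buf[6] = (uint8_t)(target >> 24);
    memcpy(buf + 7, jmp_rax, sizeof jmp_rax);
    return sizeof mov_rax + 4 + sizeof jmp_rax;
}
```

For the address encoded in the “eb” command (0x04FFAD40, the patched DLL’s NetpwPathCanonicalize), this produces exactly the byte string 48 C7 C0 40 AD FF 04 FF E0 shown above.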

The obvious disclaimer: This whole procedure is a complete hack, and not recommended for production use, for reasons that should be relatively obvious. Additionally, MS08-067 was not built as an officially “hotpatch-enabled” patch (i.e., using the Microsoft-supported hotpatch mechanism); “hotpatch-enabled” patches do not entail such a sequence of hacks in the deployment process.

(Thanks to hdm for double-checking some assumptions for me.)

Why hooking system services is more difficult (and dangerous) than it looks

July 26th, 2008

System service hooking (and kernel mode hooking in general) is one of those near-and-dear things to me which I’ve got mixed feelings about. (System service hooking refers to intercepting system calls, like NtCreateFile, in kernel mode using a custom driver.)

On one hand, hooking things like system services can be extremely tricky at best, and outright dangerous at worst. There are a number of ways to write code that is almost right, but will fail in rare edge cases. Furthermore, there are even more ways to write code that will behave correctly as long as system service callers “play nicely” and do the right thing, but sneaks in security holes that only become visible once somebody in user mode tries to “bend the rules” a bit.

On the other hand, though, there are certainly some interesting things that one can do in kernel mode, for which there really aren’t any supported or well-defined extensibility interfaces without “getting one’s hands dirty” with a little hooking, here or there.

Microsoft’s policy on hooking in kernel mode (whether patching system services or otherwise) is pretty clear: They would very much rather that nobody even thought about doing anything remotely like code patching or hooking. Having seen virtually every anti-virus product under the sun employ hooking in one way or another at some point during their lifetime (and subsequently fail to do it correctly, usually with catastrophic consequences), I can certainly understand where they are coming from. Then again, I have also been on the other side of the fence once or twice, having been blocked from implementing a particular useful new capability due to a certain anti-patching system on recent Windows versions that shall remain nameless, at least for this posting.

Additionally, in further defense of Microsoft’s position, the vast majority of code I have seen in the field that has performed some sort of kernel mode hooking has done so out of a lack of understanding of the available and supported kernel mode extensibility interfaces. There are, furthermore, a number of cases where hooking system services won’t even get the desired result (even if it weren’t otherwise fraught with problems); for instance, attempting to catch file I/O by hooking NtReadFile/NtReadFileScatter/NtWriteFile/NtWriteFileGather will cause one to completely miss any I/O that is performed using a mapped section view.

Nonetheless, system service hooking seems to be an ever-increasing trend, one that doesn’t seem to be likely to go away any time soon (at least for 32-bit Windows). There are also, of course, bits of malicious code out there which attempt to do system service hooking as well, though in my experience, once again, the worst (and most common) offenders have been off-the-shelf software that was likely written by someone who didn’t really understand the consequences of what they were doing.

Rather than weigh in on the side of “Microsoft should allow all kernel mode hooking”, or “All code hooking is bad”, however, I thought that a different viewpoint might be worth considering on this subject. At the risk of incurring Don Burn’s wrath by even mentioning the subject (an occurrence that is quite common on the NTDEV list, if one happens to follow that), I figured that it might be instructive to provide an example of just how easy it is to slip up in a seemingly innocent way while writing a system service hook. Of course, “innocent slip-ups” in kernel mode often equate to bugchecks at best, or security holes at worst.

Thus, the following, an example of how a perfectly honest attempt at hooking a system service can go horribly wrong. This is not intended to be a tutorial in how to hook system services, but rather an illustration of how one can easily create all sorts of nasty bugs even while trying to be careful.

The following code assumes that the system service in question (NtCreateSection) has had a hook “safely” installed (BuggyDriverHookNtCreateSection). This posting doesn’t even touch on the many and varied problems of safely trying to hook (or even worse, unhook) a system service, which are also typically done wrong in my experience. Even discounting the diciness of those problems, there’s plenty that can go wrong here.

I’ll post a discussion of some of the worst bugs in this code later on, after people have had a chance to mull it over for a little while. This routine is a fairly standard attempt to post-process a system service by doing some additional work after it completes. Feel free to post a comment if you think you have found one of the problems (there are several). Bonus points if you can identify why something is broken and not just that it is broken (e.g. provide a scenario where the existing code breaks). Even more bonus points if you can explain how to fix the problem you have found without introducing yet another problem or otherwise making things worse. In asking that, I want to show that there are a great many subtleties at play here, rather than simply to show how to operate an NtCreateSection hook correctly. Oh, and there’s certainly more than one bug here to be found, as well.

N.B. I haven’t run any of this through a compiler, so excuse any syntax errors that I missed.

Without further ado, here’s the code:

//
// Note: This code has bugs. Please don't actually try this at home!
//

NTSTATUS
NTAPI
BuggyDriverNtCreateSection(
 OUT PHANDLE SectionHandle,
 IN ACCESS_MASK DesiredAccess,
 IN POBJECT_ATTRIBUTES ObjectAttributes,
 IN PLARGE_INTEGER SectionSize OPTIONAL,
 IN ULONG Protect,
 IN ULONG Attributes,
 IN HANDLE FileHandle
 )
{
 NTSTATUS Status;
 HANDLE Section;
 SECTION_IMAGE_INFORMATION ImageInfo;
 ULONG ReturnLength;
 PVOID SectionObject;
 PVOID FileObject;
 BOOLEAN SavedSectionObject;
 HANDLE SectionKernelHandle;
 LARGE_INTEGER LocalSectionSize;

 SavedSectionObject = FALSE;

 //
 // Let's call the original NtCreateSection, as we only care about successful
 // calls with valid parameters.
 //

 Status = RealNtCreateSection(
  SectionHandle,
  DesiredAccess,
  ObjectAttributes,
  SectionSize,
  Protect,
  Attributes,
  FileHandle
  );

 //
 // Failed? We'll bail out now.
 //

 if (!NT_SUCCESS( Status ))
  return Status;

 //
 // Okay, we've got a successful call made, let's do our work.
 // First, capture the returned section handle. Note that we do not need to
 // do a probe as that was already done by NtCreateSection, but we still do
 // need to use SEH.
 //

 __try
 {
  Section = *SectionHandle;
 }
 __except( EXCEPTION_EXECUTE_HANDLER )
 {
  Status = (NTSTATUS)GetExceptionCode();
 }

 //
 // The user unmapped our buffer, let's bail out.
 //

 if (!NT_SUCCESS( Status ))
  return Status;

 //
 // We need a pointer to the section object for our work. Let's grab it now.
 //

 Status = ObReferenceObjectByHandle(
  Section,
  0,
  NULL,
  KernelMode,
  &SectionObject,
  NULL
  );

 if (!NT_SUCCESS( Status ))
  return Status;

 //
 // Just for fun, let's check if the section was an image section, and if so,
 // we'll do special work there.
 //

 Status = ZwQuerySection(
  Section,
  SectionImageInformation,
  &ImageInfo,
  sizeof( SECTION_IMAGE_INFORMATION ),
  &ReturnLength
  );

 //
 // If we are an image section, then let's save away a pointer to the
 // section object for our own use later.
 //

 
 if (NT_SUCCESS(Status))
 {
  //
  // Save pointer away for something that we might do with it later. We might
  // want to care about the section image information for some unspecified
  // reason, so we will copy that and save it in our tracking list. For
  // example, maybe we want to map a view of the section into the initial
  // system process from a worker thread.
  //

  Status = SaveImageSectionObjectInList(
   SectionObject,
   &ImageInfo
   );

  if (!NT_SUCCESS( Status ))
  {
   ObDereferenceObject( SectionObject );
   return Status;
  }

  SavedSectionObject = TRUE;
 }

 //
 // Let's also grab a kernel handle for the file object so that we can do some
 // sort of work with it later on.
 //

 Status = ObReferenceObjectByHandle(
  Section,
  0,
  NULL,
  KernelMode,
  &FileObject,
  NULL
  );

 if (!NT_SUCCESS( Status ))
 {
  if (SavedSectionObject)
   DeleteImageSectionObjectInList( SectionObject );

  ObDereferenceObject( SectionObject );

  return Status;
 }

 //
 // Save the file object away, as well as maximum size of the section object.
 // We need the size of the section object for a length check when accessing
 // the section later.
 //

 if (SectionSize)
 {
  __try
  {
   LocalSectionSize = *SectionSize;
  }
  __except( EXCEPTION_EXECUTE_HANDLER )
  {
   Status = (NTSTATUS)GetExceptionCode();

   ObDereferenceObject( FileObject );

   if (SavedSectionObject)
    DeleteImageSectionObjectInList( SectionObject );

   ObDereferenceObject( SectionObject );

   return Status;
  }
 }
 else
 {
  //
  // Ask the file object for its length; this could be done by any of the
  // usual means to do that.
  //

  Status = QueryAllocationLengthFromFileObject(
   FileObject,
   &LocalSectionSize
   );

  if (!NT_SUCCESS( Status ))
  {
   ObDereferenceObject( FileObject );

   if (SavedSectionObject)
    DeleteImageSectionObjectInList( SectionObject );

   ObDereferenceObject( SectionObject );

   return Status;
  }
 }

 //
 // Save the file object + section object + section length away for future
 // reference.
 //

 Status = SaveSectionFileInfoInList(
  FileObject,
  SectionObject,
  &LocalSectionSize
  );

 if (!NT_SUCCESS( Status ))
 {
  ObDereferenceObject( FileObject );

  if (SavedSectionObject)
   DeleteImageSectionObjectInList( SectionObject );

  ObDereferenceObject( SectionObject );

  return Status;
 }

 //
 // All done. Lose our references now. Assume that the Save*InList routines
 // took their own references on the objects in question. Return to the caller
 // successfully.
 //

 ObDereferenceObject( FileObject );
 ObDereferenceObject( SectionObject );

 return STATUS_SUCCESS;
}

Why does every heap trace in UMDH get stuck at “malloc”?

February 21st, 2008

One of the more useful tools for tracking down memory leaks in Windows is a utility called UMDH that ships with the WinDbg distribution. Although I’ve previously covered what UMDH does at a high level, and how it functions, the basic principle for it, in a nutshell, is that it uses special instrumentation in the heap manager that is designed to log stack traces when heap operations occur.

UMDH utilizes the heap manager’s stack trace instrumentation to associate call stacks with outstanding allocations. More specifically, UMDH is capable of taking a “snapshot” of the current state of all heaps in a process, associating like-sized allocations from like-sized callstacks, and aggregating them in a useful form.

The general principle of operation is that UMDH is typically run two (or more) times: once to capture a “baseline” snapshot of the process after it has finished initializing (as there are expected to be a number of outstanding allocations while the process is running that would not normally be freed until process exit time, for example, allocations used to build the command line parameter arrays provided to the main function of a C program, or any other application-derived allocations that would be expected to remain checked out for the lifetime of the program).

This first “baseline” snapshot is essentially intended to be a means to filter out all of these expected, long-running allocations that would otherwise show up as useless noise if one were to simply take a single snapshot of the heap after the process had leaked memory.

The second (and potentially subsequent) snapshots are intended to be taken after the process has leaked a noticeable amount of memory. UMDH is then run again in a special mode that is designed to essentially do a logical “diff” between the “baseline” snapshot and the “leaked” snapshot, filtering out any allocations that were present in both of them and returning a list of new, outstanding allocations, which would generally include any leaked heap blocks (although there may well be legitimate outstanding allocations as well, which is why it is important to ensure that the “leaked” snapshot is taken only after a non-trivial amount of memory has been leaked, if at all possible).
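The “logical diff” can be sketched in C: group outstanding bytes by backtrace identifier and report only the groups that grew between the two snapshots. The structures and numbers here are invented for illustration; UMDH’s actual log format and matching rules are more involved:

```c
#include <stddef.h>

/* One aggregated entry: all outstanding allocations sharing a
   backtrace (a stand-in for UMDH's "BackTraceNNNN" groups). */
struct trace_total {
    int    trace_id;
    size_t bytes;      /* outstanding bytes for that backtrace */
};

/* Look up a backtrace's outstanding byte total in a snapshot;
   absent traces count as zero bytes. */
static size_t find_bytes(const struct trace_total *s, size_t n, int id)
{
    for (size_t i = 0; i < n; i++)
        if (s[i].trace_id == id)
            return s[i].bytes;
    return 0;
}

/* Report traces from the later snapshot whose totals exceed the
   baseline; these are the leak suspects. Returns the report count. */
static size_t diff_snapshots(const struct trace_total *base, size_t nb,
                             const struct trace_total *later, size_t nl,
                             struct trace_total *grown)
{
    size_t n = 0;
    for (size_t i = 0; i < nl; i++) {
        size_t before = find_bytes(base, nb, later[i].trace_id);
        if (later[i].bytes > before) {
            grown[n].trace_id = later[i].trace_id;
            grown[n].bytes = later[i].bytes - before;
            n++;
        }
    }
    return n;
}
```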

Now, this is all well and good, and while UMDH proves to be a very effective tool for tracking down memory leaks with this strategy, taking a “before” and “after” diff of a problem and analyzing the two to determine what’s gone wrong is hardly a new, ground-breaking concept.

While the theory behind UMDH is sound, however, there are some situations where it can work less than optimally. The most common failure case of UMDH in my experience is not actually so much related to UMDH itself, but rather the heap manager instrumentation code that is responsible for logging stack traces in the first place.

As I had previously discussed, the heap manager stack trace instrumentation logic does not have access to symbols, and on x86, “perfect” stack traces are not generally possible, as there is no metadata attached with a particular function (outside of debug symbols) that describes how to unwind past it.

The typical approach taken on x86 is to assume that no function in the call stack uses frame pointer omission (FPO), an optimization that allows the compiler to eliminate the use of ebp for a function entirely, or even repurpose it as a scratch register.
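In other words, the tracer assumes each frame begins with the standard push ebp / mov ebp, esp pair, so the stack looks like a linked list of [saved ebp, return address] records. A toy walker over fabricated frames (a real one would read the live stack, with bounds and sanity checks) looks something like this:

```c
#include <stddef.h>

/* Layout a standard x86 prologue leaves at [ebp]: the caller's saved
   ebp, then the return address pushed by the call instruction. A
   function that omits or repurposes ebp breaks this chain. */
struct frame {
    struct frame *saved_ebp;   /* what "push ebp / mov ebp, esp" saved */
    void         *ret_addr;    /* caller's return address              */
};

/* Follow saved-ebp links, collecting return addresses until the chain
   ends (NULL here) or the trace buffer fills. */
static size_t walk_stack(const struct frame *ebp, void **trace, size_t max)
{
    size_t n = 0;
    while (ebp != NULL && n < max) {
        trace[n++] = ebp->ret_addr;
        ebp = ebp->saved_ebp;
    }
    return n;
}
```

If a function in the middle of the chain has stored, say, an allocation size in ebp, the walker dereferences garbage and the trace dead-ends there, exactly the failure mode described below.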

Now, most of the libraries that ship with the operating system in recent OS releases have FPO explicitly turned off for x86 builds, with the sole intent of allowing the built-in stack trace instrumentation logic to be able to traverse through system-supplied library functions up through to application code (after all, if every heap stack trace dead-ended at kernel32!HeapAlloc, the whole concept of heap allocation traces would be fairly useless).

Unfortunately, there happens to be a notable exception to this rule, one that actually came around to bite me at work recently. I was attempting to track down a suspected leak with UMDH in one of our programs, and noticed that all of the allocations were grouped into a single stack trace that dead-ended in a rather spectacularly unhelpful way. Digging in a bit deeper, the individual snapshot dumps from UMDH contained scores of allocations with the following backtrace logged:

00000488 bytes in 0x1 allocations
   (@ 0x00000428 + 0x00000018) by: BackTrace01786

        7C96D6DC : ntdll!RtlDebugAllocateHeap+000000E1
        7C949D18 : ntdll!RtlAllocateHeapSlowly+00000044
        7C91B298 : ntdll!RtlAllocateHeap+00000E64
        211A179A : program!malloc+0000007A

This particular outcome happened to be rather unfortunate, as in the specific case of the program I was debugging at work, virtually all memory allocations in the program (including the ones I suspected of leaking) happened to ultimately get funneled through malloc.

Obviously, getting told that “yes, every leaked memory allocation goes through malloc” isn’t really all that helpful if (most) every allocation in the program in question happened to go through malloc. The UMDH output raised the question, however, of why exactly malloc was breaking the stack traces. Digging in a bit deeper, I discovered the following gem while disassembling the implementation of malloc:

0:011> u program!malloc
program!malloc
[f:\sp\vctools\crt_bld\self_x86\crt\src\malloc.c @ 155]:
211a1720 55              push    ebp
211a1721 8b6c2408        mov     ebp,dword ptr [esp+8]
211a1725 83fde0          cmp     ebp,0FFFFFFE0h
[...]

In particular, it would appear that the default malloc implementation on the static link CRT on Visual C++ 2005 not only doesn’t use a frame pointer, but it trashes ebp as a scratch register (here, using it as an alias register for the first parameter, the count in bytes of memory to allocate). Disassembling the DLL version of the CRT revealed the same problem; ebp was reused as a scratch register.

What does this all mean? Well, anything using malloc that’s built with Visual C++ 2005 won’t be diagnosable with UMDH or anything else that relies on ebp-based stack traces, at least not on x86 builds. Given that many things internally go through malloc, including operator new (at least in the default implementation), this means that in the default configuration, things get a whole lot harder to debug than they should be.

One workaround here would be to build your own copy of the CRT with /Oy- (force frame pointer usage), but I don’t really consider building the CRT a very viable option, as that’s a whole lot of manual work to get up and running correctly on every developer’s machine, not to mention all the headaches that service releases requiring rebuilds would bring with such an approach.

For operator new, it’s fortunately fairly straightforward to overload it in a supported way so that it is implemented against a different allocation strategy. In the case of malloc, however, things don’t have such a happy ending; one is either forced to re-alias the name to a custom implementation (one that doesn’t omit the frame pointer) using preprocessor macro hackery, or to change all references to malloc/free to refer to a custom allocator function (perhaps implemented directly against the process heap instead of the CRT heap, à la malloc).

So, the next time you use UMDH and get stuck scratching your head while trying to figure out why your stack traces are all dead-ending somewhere less than useful, keep in mind that the CRT itself may be to blame, especially if you’re relying on CRT allocators. Hopefully, in a future release of Visual Studio, the folks responsible for turning off FPO in the standard OS libraries can get in touch with the persons responsible for CRT builds and arrange for the same to be done, if not for the entire CRT, then at least for all the code paths in the standard heap routines. Until then, however, these CRT allocator routines remain roadblocks for effective leak diagnosis, at least when using the better tools available for the job (UMDH).

Power management observations with my XV6800

December 11th, 2007

As I previously mentioned, I recently switched to a full-blown Windows Mobile-based phone. One of the interesting conundrums that I’ve run into along the way is working out what actually kills the battery life of the device and what doesn’t.

It’s actually kind of surprising how seemingly innocent little things can have a dramatic impact on battery life in a situation like an always-on handheld phone. For example, one of the main power saving features of modern CDMA cell networks and packet switched data connections is the ability for the device to go into dormant mode, a state in which the device relinquishes its active radio link resources and essentially goes into a standby mode that is in some respects akin to how the device operates when it is otherwise idle and waiting for a call. This is beneficial for the device’s battery life (because while in dormant mode, much of the logic related to maintaining the active high-rate radio link can be powered off while the network link is idle). Dormant mode is also beneficial for network operators, in that it allows the resources previously dedicated to a particular device to be used by other users that are actively sending or receiving data.

When data is ready to be sent on either end of the link, the device will “wake up” and exit dormant mode in order to send/receive data, keeping the link fully active until the session has been idle for enough time for the dormant idle timer to expire. (Entering and exiting dormant mode is mostly seamless, but may involve delays of 1-2 seconds, at least in my experience, so doing so on every packet would be highly undesirable.)

What all of this means is that there are some significant gains to be had in terms of radio power consumption if network traffic can be limited to where it is really necessary. There are a couple of things that I initially had running on my device that didn’t really fit this criterion. For example, I have an always-active SSH connection to a screen session that, among other things, runs my console SILC client, which is based on irssi (an IRC client). One gotcha that I encountered there is that in the default configuration, this client displays a clock on the ncurses console UI. The clock is updated every minute with the current time, which is (normally) handy, as you can compare the current timestamp with message timestamps in the backlog at a glance.

However, this becomes problematic in my case, because the clock forces the device out of dormant mode every minute in order to repaint that portion of the console UI. (Aside from the clock, the remainder of the SILC/irssi UI is static until there is actual chat activity, which would otherwise make it relatively well suited for this situation.) Fortunately, it turned out to be relatively easy to remove the clock from the client’s UI, which prevents the SSH session from causing network activity every minute while I’ve got the SILC client up.

Another network power management “gotcha” that I ran into was the built-in L2TP/IPsec VPN client. It turns out that while an L2TP/IPsec link is active, the device was, in my case, unable to enter dormant mode for any non-trivial length of time. I suspect that this is caused by the default behavior of L2TP, which sends keep-alive messages every minute to ensure that the session is still alive (at the very least, server-side packet captures did not seem to indicate any traffic destined to the device’s projected IP address).

In any case, this unpleasant side effect of the L2TP/IPsec client ruled out using it for full network connectivity in an always-on fashion. Unfortunately, dialing the VPN link on-demand has its own share of pitfalls, as the device seems to make the VPN link the default gateway. When I configured IMAPv4 mail to be fetched every so often via demand-dialing the VPN link (which would alleviate the difficulty with dormant mode not operating as desired with the VPN link always on), my SSH session would often die if either side happened to try and send data while Pocket Outlook was polling IMAP mail.

My solution in this case was to fall back to using SSH port-forwards through PocketPuTTY for IMAP mail (after I fixed SSH port forwards to work reliably, that is, of course). Unlike the L2TP/IPsec link, the SSH link doesn’t inherently force the device out of dormant mode without there being real activity, which allows me to continue to leave the SSH connection always on without sacrificing my battery life.

These two changes seem to have fixed the worst offenders in terms of battery drain, but there’s still plenty of room for improvement. For example, PocketPuTTY will tell the remote end of the session that the window has resized when you switch the device from portrait to landscape mode (or vice versa), such as when pulling out the device’s built-in keyboard. This is a very handy feature if you’re actually using the SSH session, as it allows the remote console to resize itself appropriately (which is how SILC/irssi magically reshapes itself to fit the screen when the device keyboard is engaged). However, it does present the annoyance that pulling out the keyboard while PocketPuTTY is not active will still cause the packet radio link to be fully established, just to send the terminal resize notification and receive the resulting redraw data.

This behavior could, for instance, be further optimized to defer sending such notifications until the PocketPuTTY window becomes the foreground window (potentially dropping the resize notification entirely if the screen size is switched back to its previous value before PocketPuTTY is activated).

One of the nice things about having a (relatively) easily end-user-programmable device instead of a locked-down, DRM’ified BREW-based handset is that there’s actually opportunity to change these sorts of things instead of just blithely accepting what’s offered with the device with no real hope of improving it.

How to not write code for a mobile device

December 10th, 2007

Earlier this week, I got a shiny, brand-new XV6800 to replace my aging (and rather limited in terms of running user-created programs) BREW-ified phone.

After setting up ActiveSync, IPsec, and all of the other usual required configuration settings, the next step was, of course, to install the baseline minimum set of applications on the device to get by. Close to the top of this list is an SSH client (being able to project arbitrary console programs over SSH and screen to the device is, at the very least, a workable enough solution in the interim for programs that are not feasible to run directly on the device, such as my SILC client). I’ve found PuTTY a fairly acceptable client for “big Windows”, and there just so happened to be a Windows Mobile port of it. Great, or so I thought…

While PocketPuTTY does technically work, I noticed a couple of deficiencies with it pretty quickly:

  1. The page up / page down keys are treated the same as the up arrow / down arrow keys when sent to the remote system. This sucks, because I can’t access my scrollback buffer in SILC easily without page up / page down. Other applications can tell the difference between the two keys (obviously, or there wouldn’t be much of a point to having the keys at all), so this one is clearly a PocketPuTTY bug.
  2. There is no way to restart a session once it is closed, or at least, not that I’ve found. Normally, on “big Windows”, there’s a “Restart Session” menu option in the window menu of the PuTTY session, but (as far as I can tell) there’s no such equivalent to the window menu on PocketPuTTY. There is a “Tools” menu, although it has some rather useless menu items (e.g. About) instead of some actually useful menu items, like “Restart Session”.
  3. Running PocketPuTTY seems to have a significantly negative effect on battery life. This is really unfortunate for me, since the expected use is to leave an SSH session to a terminal running screen for a long period of time. (Note that this was resolved, partially, by locating a slightly more maintained copy of PocketPuTTY.)
  4. SSH port forward support seems to be fairly broken, in that as soon as a socket is cleaned up, all receives in the process break until you restart it. This is annoying, but workable if one can go without SSH port forwards.

Most of these problems are actually not all that difficult to fix, and since the source code is available, I’m actually planning on trying my hand at doing just that, since I expect that this is an app that I’ll make fairly heavy use of.

The latter problem is one I really want to call attention to among these deficiencies, however. My intent here is not to bash the PocketPuTTY folks (and I’m certainly happy that they’ve at least gotten a (semi)-working port of PuTTY with source code out, so that other people can improve on it from there), but rather to point out some things that should really just not be done when you’re writing code that is intended to run on a mobile device (especially if the code is intended to run exclusively on a mobile device).

On a portable device, one of the things that most users really expect is long battery life. Though this particular point certainly holds true for laptops as well, it is arguably even more important for converged mobile phone devices. After all, most people consider their phone an “always on” contact mechanism, and unexpectedly running out of battery life is extremely annoying in this respect. Furthermore, if your mobile phone has the capability to run all sorts of useful programs on it, but doing so eats up your battery in a couple of hours, then there is really not that much point in having that capability at all, is there?

Returning to PocketPuTTY, one of the main problems (at least with the version I initially used) was, again, that PocketPuTTY would reduce battery life significantly. Looking around in the source code for possible causes, I noticed the following gem, which turned out to be the core of the network read loop for the program.
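(The original snippet isn’t reproduced verbatim here; the loop was of roughly this shape, reconstructed for illustration:)

```c
/* Reconstruction for illustration -- not the verbatim PocketPuTTY source */
for (;;) {
    if (socket_has_data(sock))        /* non-blocking readiness check */
        process_incoming_data(sock);

    Sleep(1);  /* wake up every millisecond, forever */
}
```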

Yes, there really is a Sleep(1) spin loop in there, in a software port that is designed to run on battery powered devices. For starters, that blows any sort of processor power management completely out of the water. Mobile devices have a lot of different components vying for power, and the easiest (and most effective) way to save on power (and thus battery life) is to not turn those components on. Of course, it becomes difficult to do that for power hungry components like the 400MHz CPU in my XV6800 if there’s a program that has an always-ready-to-run thread…

Fortunately, there happened to be a newer revision of PocketPuTTY floating about with the issue fixed (although getting ahold of the source code for that version proved to be slightly more difficult, as the original maintainer of the project seems to have disappeared). I did eventually manage to get into contact with someone who had been maintaining their own set of improvements and grab a not-so-crusty-old source tree from them to do my own improvements, primarily for the purposes of fixing some of the annoyances I mentioned previously (thus beginning my initial foray into Windows CE development).

Mini USB charging for devices is a great idea

December 6th, 2007

One innovation that I recently stumbled across is the use of mini USB ports for power. I first encountered a device doing this with the Motorola HT820 Bluetooth headphones I got some months ago, although at the time, I didn’t realize that the power brick that came with the device was mini USB and not yet another proprietary power cable connection.

For those unaware, mini USB is a small form factor USB plug (otherwise essentially identical to standard USB). As with regular USB, mini USB devices can be powered off the bus. Now, standardizing on mini USB for power connectors has some really nice advantages:

  1. Everyone’s power bricks are now compatible. Normally, each gadget you own is going to have its own (incompatible) power transformer brick, which means that if you’re travelling, you’ll quite likely have to lug around several such power bricks (in addition to your laptop AC adapter), even if every device doesn’t really need to be plugged into “wall power” at the same time.
  2. You can charge these devices off of a laptop USB port. Of course, you’ll need a USB to mini USB cable, assuming your laptop doesn’t have any mini USB ports (I haven’t seen any that do). The really cool thing about being able to do this is that for all your mini USB, battery powered devices, you only need to bring one cable for all of them while on the move, since you can just plug the device into your laptop and charge it off of that.

Furthermore, because the connection is still USB, devices can use the same port to charge and transfer data to a computer at the same time.

Now, I didn’t really put all the pieces together about my HT820 being mini USB until I had to go and buy a replacement set of headphones (a different model this time, as Best Buy didn’t carry the HT820 anymore). The volume button on my set had died, which was rather inconvenient, to say the least.

Anyways, I noticed that the new headphones, from a different manufacturer, seemed to have a compatible power cable port from the AC adapter brick to the device’s charging port. This time, the power cable even had the ubiquitous USB logo on it (the Motorola’s charging cable didn’t, for one reason or another). Sure enough, I could use the Motorola charger with the new set of headphones, even though they weren’t of a Motorola make. Cool.

Recently, I finally ditched my old cell phone for a proper smart phone (an XV6800). This, too, is mini USB powered (and it can use the mini USB connection for data as well, at the same time). There’s a lot of other neat things about the XV6800, but the fact that I don’t need to (ever) worry about spare chargers or anything of that sort is really just the icing on the cake.

Thanks to this little advancement, I can now ditch two additional power bricks (one for the cell phone and one for the Bluetooth headset) when traveling, and just charge both devices off of my laptop. I’m actually kind of surprised that nobody implemented this earlier, given the immediate advantages of standardizing on a uniform power source (especially one that is as easily multiplexed as USB is).

So, the next time you’re looking for a new gadget of some sort or another, look for one chargeable via mini USB and circumvent the gadget charger nightmare (or at least, begin to do so, one gadget at a time).

A catalog of NTDLL kernel mode to user mode callbacks, part 6: LdrInitializeThunk

November 28th, 2007

Previously, I described the mechanism by which the kernel mode to user mode callback dispatcher (KiUserCallbackDispatcher) operates, and how it is utilized by win32k.sys for various window manager related operations.

The next special NTDLL kernel mode to user mode “callback” up on the list is LdrInitializeThunk, which is not so much a callback as the entry point at which all user mode threads begin their execution system-wide. Although the Win32 CreateThread API (and even the NtCreateThread system service that is used to implement the Win32 CreateThread) provide the illusion that a thread begins its execution at a specified start routine (or instruction pointer, in the case of NtCreateThread), this is not truly the case.

CreateThread internally layers in a special kernel32 stub routine (BaseThreadStart or BaseProcessStart) in between the specified thread routine and the actual initial instruction of the thread. The kernel32 stub routine wraps the call to the user-supplied thread start routine to provide services such as a “top-level” SEH frame for the support of UnhandledExceptionFilter.

However, there exists yet another layer of indirection before a thread begins its execution at the specified thread start routine, even beyond the kernel32 stub routine at the start of all Win32 threads (the way the kernel32 stub routine works changes slightly with Windows Vista, though that is outside the scope of this discussion). The presence of this extra layer of indirection can be inferred by examining the documentation on MSDN for DllMain, which states that all threads call out to the DllMain routine at some point during thread start-up. The kernel32 stub routine is not involved in this process, and obviously the user-supplied thread entry point does not have to explicitly attempt to call DllMain for every loaded DLL with DLL_THREAD_ATTACH. This leaves us with the question of who actually arranges for these DllMain calls to happen when a thread begins.

The answer to this question is, of course, the feature routine of this article, LdrInitializeThunk. When a user mode thread is readied to begin initial execution after being created, the initial context that is realized is not the context value supplied to NtCreateThread (which would eventually end up in the user-supplied thread entry point). Instead, execution really begins at LdrInitializeThunk within NTDLL, which is supplied a CONTEXT record that describes the initially requested state of the thread (as supplied to NtCreateThread, or as created by NtCreateThreadEx on Windows Vista). This context record is provided as an argument to LdrInitializeThunk, in order to allow for control to (eventually) be transferred to the user-supplied thread entry point.

When invoked by a new thread, LdrInitializeThunk invokes LdrpInitialize to perform the remainder of the initialization tasks, and then calls upon the NtContinue system service to restore the supplied context record. I have made available a C-like representation of this process for illustration purposes.
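The control flow just described can be sketched in C-like pseudocode (a reconstruction for illustration only; the real routine is implemented in assembly and its exact signature and internal names vary across Windows versions):

```c
/* C-like pseudocode -- not compilable; names and signature approximate */
VOID NTAPI LdrInitializeThunk(PCONTEXT Context, PVOID NtDllBase)
{
    /* Performs process initialization on the first call in a process,
       or per-thread initialization (DllMain / TLS callbacks) otherwise. */
    LdrpInitialize(Context, NtDllBase);

    /* Realize the originally requested thread context; control then
       transfers to the user-supplied start address and never returns. */
    NtContinue(Context, FALSE);
}
```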

LdrpInitialize makes a determination as to whether the process has already been initialized (for the purposes of NTDLL). This step is necessary as LdrInitializeThunk (and by extension, LdrpInitialize) is not only the actual entry point for a new thread in an already initialized process, but it is also the entry point for the initial thread in a completely new process (making it the first piece of code that is run in user mode in a new process). If the process has not already been initialized, then LdrpInitialize performs process initialization tasks by invoking LdrpInitializeProcess (some process initialization is performed inline by LdrpInitialize in the process initialization case as well). If the process is a Wow64 process, then the Wow64 NTDLL is loaded and invoked to initialize the 32-bit NTDLL’s per-process state.

Otherwise, if the process is already initialized, then LdrpInitialize invokes LdrpInitializeThread to perform per-thread initialization, which primarily involves invoking DllMain and TLS callbacks for loaded modules while the loader lock is held. (This is the reason why it is not supported to wait on a new thread to run its thread initialization routine from DllMain, because the new thread will immediately become blocked upon the loader lock, waiting for the thread already in DllMain to complete its processing.) If the process is a Wow64 process, then there is again support for making a call to the 32-bit NTDLL for purposes of running 32-bit per-thread initialization code.

When the required process or thread initialization tasks have been completed, LdrpInitialize returns to LdrInitializeThunk, which then realizes the user-supplied thread start context with a call to the NtContinue system service.

One consequence of this architecture for process and thread initialization is that it becomes (slightly) more difficult to step through process initialization, because the thread that initializes the process is not the first thread created in the process, but rather the first thread that executes LdrInitializeThunk. This means that one cannot simply create a process as suspended and attach a debugger to the process in order to step through process initialization, as the debugger break-in thread will run the very process initialization code that one wishes to step through before executing the break-in thread breakpoint instruction!

There is, fortunately, support for debugging this scenario built in to the Windows debugger package. By setting the debugger to break on the ‘create process’ event, it is possible to manually set a breakpoint on a new process (created by the debugger) or a child process (if child process debugging is enabled) before LdrInitializeThunk is run. This support is activated by breaking on the cpr (Create Process) event. For example, using the following ntsd command line, it is possible to examine the target and set breakpoints before any user mode code runs:

ntsd.exe -xe cpr C:\windows\system32\cmd.exe

Note that as the loaded module list will not have been initialized at this point, symbols will not be functional, so it is necessary to resolve the offset of LdrInitializeThunk manually in order to set a breakpoint. The process can be continued with the typical “g” command.

Update: Pavel Labedinsky points out that you can use “-xe ld:ntdll.dll” to achieve the same effect, but with the benefit that symbols for NTDLL are available. This is a better option, as it alleviates the necessity to manually resolve the addresses of locations on which you wish to set a breakpoint.

Next up: Examining RtlUserThreadStart (present on Windows Vista and beyond).

A catalog of NTDLL kernel mode to user mode callbacks, part 5: KiUserCallbackDispatcher

November 21st, 2007

Last time, I briefly outlined the operation of KiRaiseUserExceptionDispatcher, and how it is used by the NtClose system service to report certain classes of handle misuse under the debugger.

All of the NTDLL kernel mode to user mode “callbacks” that I have covered thus far have been, for the most part, fairly “passive” in nature. By this, I mean that the kernel does not explicitly call any of these callbacks, at least in the usual sense of making a function call. Instead, all of the routines that we have discussed thus far are invoked only in place of the normal return procedure for a system call or interrupt, under certain conditions. (Conceptually, this is similar in some respects to returning to a different location using longjmp.)

In contrast to the other routines that we have discussed thus far, KiUserCallbackDispatcher breaks completely out of the passive callback model. The user mode callback dispatcher is, as the name implies, a trampoline that is used to make full-fledged calls to user mode, from kernel mode. (It is complemented by the NtCallbackReturn system service, which resumes execution in kernel mode following a user mode callback’s completion. Note that this means that a user mode callback can make auxiliary calls into the kernel without “returning” back to the original kernel mode caller.)

Calling into user mode from kernel mode is a very non-traditional approach in the Windows world, and for good reason. Such calls are typically dangerous and need to be implemented very carefully in order to avoid creating any number of system reliability or integrity issues. Beyond simply validating any data returned to kernel mode from user mode, there are a far greater number of concerns with a direct kernel mode to user mode call model as supported by KiUserCallbackDispatcher. For example, a thread running in user mode can be freely suspended, delayed for a very long period of time by a high priority user thread, or even terminated. These actions mean that any code spanning a call out to user mode must not hold locks, have acquired memory, or hold other resources that might need to be released, and so forth.

From a kernel mode perspective, the way a user mode callback using KiUserCallbackDispatcher works is that the kernel saves the current processor state on the current kernel stack, alters the view of the top of the current kernel stack to point after the saved register state, sets a field in the current thread (CallbackStack) to point to the stack frame containing the saved register state (the previous CallbackStack value is saved to allow for recursive callbacks), and then executes a return to user mode using the standard return mechanism.

The user mode return address is, of course, set to the feature NTDLL routine of this article, KiUserCallbackDispatcher. The way the user mode callback dispatcher operates is fairly simple. First, it indexes into an array stored in the PEB with an argument to the callback dispatcher that is used to select the function to be invoked. Then, the callback routine located in the array is invoked, and provided with a single pointer-sized argument from kernel mode (this argument is typically a structure pointer containing several parameters packaged up into one contiguous block of memory). The actual implementation of KiUserCallbackDispatcher is fairly simple, and I have posted a C representation of it.
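The dispatch logic can be expressed in C-like pseudocode (a simplified reconstruction; the real dispatcher is written in assembly, and the structure and routine names here are approximate):

```c
/* C-like pseudocode -- names and signature approximate */
NTSTATUS NTAPI KiUserCallbackDispatcher(ULONG ApiNumber,
                                        PVOID Argument,
                                        ULONG ArgumentLength)
{
    /* The PEB's KernelCallbackTable points to an array of callback
       routines registered by user32 at process initialization time. */
    PVOID *Table = (PVOID*)NtCurrentPeb()->KernelCallbackTable;
    PUSER_CALLBACK Routine = (PUSER_CALLBACK)Table[ApiNumber];

    NTSTATUS Status = Routine(Argument, ArgumentLength);

    /* Resume execution in kernel mode at the saved CallbackStack frame. */
    return NtCallbackReturn(NULL, 0, Status);
}
```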

In Win32, kernel mode to user mode callbacks are used exclusively by User32 for windowing related aspects, such as calling a window procedure to send a WM_NCCREATE message during the creation of a new window on behalf of a user mode caller that has invoked NtUserCreateWindowEx. For example, during window creation processing, if we set a breakpoint on KiUserCallbackDispatcher, we might see the following:

Breakpoint 1 hit
ntdll!KiUserCallbackDispatch:
00000000`77691ff7 488b4c2420  mov rcx,qword ptr [rsp+20h]
0:000> k
RetAddr           Call Site
00000000`775851ca ntdll!KiUserCallbackDispatch
00000000`7758514a USER32!ZwUserCreateWindowEx+0xa
00000000`775853f4 USER32!VerNtUserCreateWindowEx+0x27c
00000000`77585550 USER32!CreateWindowEx+0x3fe
000007fe`fddfa5b5 USER32!CreateWindowExW+0x70
000007fe`fde221d3 ole32!InitMainThreadWnd+0x65
000007fe`fde2150c ole32!wCoInitializeEx+0xfa
00000000`ff7e6db0 ole32!CoInitializeEx+0x18c
00000000`ff7ecf8b notepad!WinMain+0x5c
00000000`7746cdcd notepad!IsTextUTF8+0x24f
00000000`7768c6e1 kernel32!BaseThreadInitThunk+0xd
00000000`00000000 ntdll!RtlUserThreadStart+0x1d

If we step through this call a bit more, we’ll see that it eventually ends up in a function by the name of user32!_fnINLPCREATESTRUCT, which eventually calls user32!DispatchClientMessage with the WM_NCCREATE window message, allowing the window procedure of the new window to participate in the window creation process, despite the fact that win32k.sys handles the creation of a window in kernel mode.

Callbacks are, as previously mentioned, permitted to be nested (or even recursively made) as well. For example, after watching calls to KiUserCallbackDispatcher for a time, we’ll probably see something akin to the following:

Breakpoint 1 hit
ntdll!KiUserCallbackDispatch:
00000000`77691ff7 488b4c2420  mov rcx,qword ptr [rsp+20h]
0:000> k
RetAddr           Call Site
00000000`7758b45a ntdll!KiUserCallbackDispatch
00000000`7758b4a4 USER32!NtUserMessageCall+0xa
00000000`7758e55a USER32!RealDefWindowProcWorker+0xb1
000007fe`fca62118 USER32!RealDefWindowProcW+0x5a
000007fe`fca61fa1 uxtheme!_ThemeDefWindowProc+0x298
00000000`7758b992 uxtheme!ThemeDefWindowProcW+0x11
00000000`ff7e69ef USER32!DefWindowProcW+0xe6
00000000`7758e25a notepad!NPWndProc+0x217
00000000`7758cbaf USER32!UserCallWinProcCheckWow+0x1ad
00000000`77584e1c USER32!DispatchClientMessage+0xc3
00000000`77692016 USER32!_fnINOUTNCCALCSIZE+0x3c
00000000`775851ca ntdll!KiUserCallbackDispatcherContinue
00000000`7758514a USER32!ZwUserCreateWindowEx+0xa
00000000`775853f4 USER32!VerNtUserCreateWindowEx+0x27c
00000000`77585550 USER32!CreateWindowEx+0x3fe
00000000`ff7e9525 USER32!CreateWindowExW+0x70
00000000`ff7e6e12 notepad!NPInit+0x1f9
00000000`ff7ecf8b notepad!WinMain+0xbe
00000000`7746cdcd notepad!IsTextUTF8+0x24f
00000000`7768c6e1 kernel32!BaseThreadInitThunk+0xd

This support for recursive callbacks is a large factor in why threads that talk to win32k.sys often have so-called “large kernel stacks”. The kernel mode dispatcher for user mode calls will attempt to convert the thread to a large kernel stack when a call is made, as the typical-sized kernel stack is not large enough to support the number of recursive kernel mode to user mode calls present in many complicated window messaging sequences.

If the process is a Wow64 process, then the callback array in the PEB initially points to an array of conversion functions inside the Wow64 layer, which map the callback arguments to versions compatible with the 32-bit user32.dll, as appropriate.

Next up: Taking a look at LdrInitializeThunk, where all user mode threads really begin their execution.

A catalog of NTDLL kernel mode to user mode callbacks, part 4: KiRaiseUserExceptionDispatcher

November 20th, 2007

The previous post in this series outlined how KiUserApcDispatcher operates for the purposes of enabling user mode APCs. Unlike KiUserExceptionDispatcher, which is expected to modify the return information from the context of an interrupt (or exception) in kernel mode, KiUserApcDispatcher is intended to operate on the return context of an active system call.

In this vein, the third kernel mode to user mode NTDLL “callback” that we shall be investigating, KiRaiseUserExceptionDispatcher, is fairly similar. Although in some respects akin to KiUserExceptionDispatcher, at least as far as raising an exception in user mode on behalf of a kernel mode caller is concerned, KiRaiseUserExceptionDispatcher really has more in common with KiUserApcDispatcher from a usage and implementation standpoint. It also happens that the implementation of this callback routine, which is quite simple, is completely representable in C, as a standard calling convention is used.

KiRaiseUserExceptionDispatcher is used if a system call wishes to raise an exception in user mode instead of simply return an NTSTATUS, as is the standard convention. It simply constructs a standard exception record using a supplied status code (which must be written to a well-known location in the current thread’s TEB beforehand), and passes the exception to RtlRaiseException (the same routine that is used internally by the Win32 RaiseException API).
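That behavior can be sketched in C-like pseudocode (a reconstruction for illustration; the exact TEB field name and signature are approximate):

```c
/* C-like pseudocode -- reconstruction; names approximate */
NTSTATUS NTAPI KiRaiseUserExceptionDispatcher(VOID)
{
    EXCEPTION_RECORD Record = { 0 };

    /* The kernel stores the status code in a well-known TEB location
       before returning through this routine. */
    Record.ExceptionCode    = NtCurrentTeb()->ExceptionCode;
    Record.ExceptionFlags   = 0;
    Record.ExceptionAddress = _ReturnAddress();

    /* Same path used internally by the Win32 RaiseException API. */
    RtlRaiseException(&Record);

    return Record.ExceptionCode;
}
```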

This is a fairly atypical scenario, as most errors (including bad pointer parameters) will simply result in an appropriate status code being returned by the system call, such as STATUS_ACCESS_VIOLATION.

The one place that does currently use the services of KiRaiseUserExceptionDispatcher is NtClose, the system service responsible for closing handles (which implements the Win32 CloseHandle API). When a debugger is attached to a process and a protected handle (as set by a call to SetHandleInformation with HANDLE_FLAG_PROTECT_FROM_CLOSE) is passed to NtClose, then a STATUS_HANDLE_NOT_CLOSABLE exception is raised to user mode via KiRaiseUserExceptionDispatcher. For example, one might see the following when trying to close a protected handle with a debugger attached:

(2600.1994): Unknown exception - code c0000235 (first chance)
(2600.1994): Unknown exception - code c0000235 (second chance)
ntdll!KiRaiseUserExceptionDispatcher+0x3a:
00000000`776920e7 8b8424c0000000  mov eax,dword ptr [rsp+0C0h]
0:000> !error c0000235
Error code: (NTSTATUS) 0xc0000235 (3221226037) - NtClose was
called on a handle that was protected from close via
NtSetInformationObject.
0:000> k
RetAddr           Call Site
00000000`7746dadd ntdll!KiRaiseUserExceptionDispatcher+0x3a
00000000`01001955 kernel32!CloseHandle+0x29
00000000`01001e60 TestApp!wmain+0x35
00000000`7746cdcd TestApp!__tmainCRTStartup+0x120
00000000`7768c6e1 kernel32!BaseThreadInitThunk+0xd
00000000`00000000 ntdll!RtlUserThreadStart+0x1d

The exception can be manually continued in a debugger to allow the program to operate after this occurs, though such a bad handle reference is typically indicative of a serious bug.

A similar behavior is activated if a program is being debugged under Application Verifier and an invalid handle is closed. This behavior of raising an exception on a bad handle closure attempt is intended as a debugging aid, since most programs do not bother to check the return value of CloseHandle.

Incidentally, bad kernel handle closure references like this in kernel mode will result in a bugcheck instead of simply raising an exception that can be caught. Drivers do not have the “luxury” of continuing when they touch a bad handle, except when probing a user handle, of course. (Old versions of Process Explorer used to bring down the box with a bugcheck due to this, for instance, if you tried to close a protected handle. This bug has fortunately since been fixed.)

Next time, we’ll take a look at KiUserCallbackDispatcher, which is used to a great degree by win32k.sys.