Debugger tricks: Find all probable CONTEXT records in a crash dump

March 30th, 2009

If you’ve debugged crash dumps for awhile, then you’ve probably ran into a situation where the initial dump context provided by the debugger corresponds to a secondary exception that happened while processing an initial exception that’s likely closer to the original underlying problem in the issue you’re investigating.

This can be annoying, as the “.ecxr” command will point you at the site of the secondary failure exception, and not the original exception context itself. However, in most cases the original, primary exception context is still there on the stack; one just needs to know how to find it.

There are a couple of ways to go about this:

  • For hardware generated exceptions (such as access violations), one can look for ntdll!KiUserExceptionDispatcher on the stack, which takes a PCONTEXT and an PEXCEPTION_RECORD as arguments.
  • For software-generated exceptions (such as C++ exceptions), things get a bit dicier. One can look for ntdll!RtlDispatchException being called on the stack, and from there, grab the PCONTEXT parameter.

This can be a bit tedious if stack unwinding fails, or you’re dealing with one of those dumps where exceptions on multiple threads at the same time, typically due to crash dump writing going haywire (I’m looking at you, Outlook…). It would be nice if the debugger could automate this process a little bit.

Fortunately, it’s actually not hard to do this with a little bit of a brute-force approach. Specifically, just a plain old “dumb” memory scan for something common to most all CONTEXT records. It’s not exactly a finesse approach, but it’s usually a lot faster than manually groveling through the stack, especially if multiple threads or multiple nested exceptions are involved. While there may be false-positives, it’s usually immediately obvious as to what makes sense to be involved with a live exception or not. Sometimes, however, quick-and-dirty brute force type solutions really end up doing the trick, though.

In order to find CONTEXT records based on a memory search, though, we need some common data points that are typically the same for all CONTEXT structures, and, preferably, contiguous (for ease of use with the “s” command, the debugger’s memory search support). Fortunately, it turns out that this exists in the form of the segment registers of a CONTEXT structure:

0:000> dt ntdll!_CONTEXT
+0×000 ContextFlags : Uint4B
[...]

+0×08c SegGs : Uint4B
+0×090 SegFs : Uint4B
+0×094 SegEs : Uint4B
+0×098 SegDs : Uint4B

[...]

Now, it turns out that for all threads in a given process will almost always have the same segment selector values, excluding exotic and highly out of the ordinary cases like VDM processes. (The same goes for the segment selector values on x64 as well.) Four non-zero 32-bit values (actually, 16-bit values with zero padding to 32-bits) are enough to be able to reasonably pull a search off without being buried in false positives. Here’s how to do it with the infamous WinDbg debugger script (also applicable to other DbgEng-enabled programs, such as kd):

.foreach ( CxrPtr { s -[w1]d 0 l?ffffffff @gs @fs @es @ds } ) { .cxr CxrPtr - 8c }

This is a bit of a long-winded command, so let’s break it down into the individual components. First, we have a “.foreach” construct, which according to the debugger documentation, follows this convention:

.foreach [Options] ( Variable { InCommands } ) { OutCommands }

The .foreach command (actually one of the more versitle debugger-scripting commands, once one gets used to using it) basically takes a series of input strings generated by an input command (InCommands) and invokes an command to process that output (OutCommands), with the results of the input command being subsituted in as a macro specified by the Variable argument. It’s ugly and operates based on text parsing (and there’s support for skipping every X inputs, among other things; see the debugger documentation), but it gets the job done.

The next part of this operation is the s command, which instructs the debugger to search for a pattern in-memory in the target. The arguments supplied here instruct the debugger to search only writable memory (w), output only the address of each match (1), scan for DWORD (4-byte) sized quanitites (d) in the lower 4GB of the address space (0 l?ffffffff); in this case, we’re assuming that the target is a 32-bit process (which might be hosted on Wow64, hence 4GB instead of 3GB used). The remainder of the command specifies the search pattern to look for; the segment register values of the current thread. The “s” command sports a plethora of other options (with a rather unwieldy and convoluted syntax, unfortunately); the debugger documentation runs through the gamut of the other capabilities.

The final part of this command string is the output command, which simply directs the debugger to set the current context to the input command output replacement macro’s value at an offset of 0×8c. (If one recalls, 0×8c is the offset from the start of a struct _CONTEXT to the SegGs member, which is the first value we search for; as a result, the addresses returned by the “s” command will be the address of the SegGs member.) Remember that we restricted the output of the “s” command to just being the address itself, which lets us easily pass that on to a different command (which might give one the idea that the “s” and “.foreach” commands were made to work together).

Putting the command string together, it directs the debugger to search for a sequence of four 32-bit values (the gs, fs, es, and ds segment selector values for the current thread) in contiguous memory, and display the containing CONTEXT structure for each match.

You may find some other CONTEXT records aside from exception-related ones while executing this comamnd (in particular, initial thread contexts are common), but the ones related to a fault are usually pretty obvious and self-evident. Of course, this method isn’t foolproof, but it lets the debugger do some of the hard work for you (which beats manually groveling in corrupted stacks across multiple threads just to pull out some CONTEXT records).

Naturally, there are a number of other uses for both the “.foreach” and “s” commands; don’t be afraid to experiment with them. There are other helpers for automating certain tasks (!for_each_frame, !for_each_local, !for_each_module, !for_each_process, and !for_each_thread, to name a few) too, aside from the general “.foreach“. The debugger scripting support might not be the prettiest to look at, but it can be quite handy at speeding up common, repetitive tasks.

One parting tip with “.foreach” (well, actually two parting tips): The variable replacement macro only works if you separate it from other symbols with a space. This can be a problem in some cases (where you need to perform some arithmetic on the resulting expanded macro in particular, such as subtracting 0×8c in this case), however, as the spaces remain when the macro symbol is expanded. Some commands, such as “dt“, don’t follow the standard expression parsing rules (much to my eternal irritation), and choke if they’ve given arguments with spaces.

All is not lost for these commands, however; one way to work around this issue is to store the macro replacement into a pseudo-register (say, “r @$t0 = ReplacementVariableMacro - 0×8c“) and use that pseudo-register in the actual output command, as you can issue multiple, semi-colon delimited commands in the output commands section.

Examining kernel stacks on Vista/Srv08 using kdbgctrl -td, even when you haven’t booted /DEBUG

March 13th, 2009

Starting with Vista/Srv08, local kernel debugging support has been locked down to require that the system was booted /DEBUG before-hand.

For me, this has been the source of great annoyance, as if something strange happens to a Vista/Srv08 box that requires peering into kernel mode (usually to get a stack trace), then one tends to hit a brick wall if the box wasn’t booted with /DEBUG. (The only supported option usually available would be manually bugcheck the box and hope that the crash dump, if the system was configured to write one, has useful data. This is, of course, not a particularly great option; especially as in the default configuration for dumps on Vista/Srv08, it’s likely that user mode memory won’t be captured.)

However, it turns out that there’s limited support in the operating system to capture some kernel mode data for debugging purposes, even if you haven’t booted /DEBUG. Specifically, kdbgctrl.exe (a tool shipping with the WinDbg distribution) supports the notion of capturing a “kernel triage dump”, which basically takes a miniature snapshot of a given process and its active user mode threads. To use this feature, you need to have SeDebugPrivilege (i.e. you need to be an administrative-equivalently-privileged user).

Unlike conventional local KD support, triage dump writing doesn’t give unrestricted access to kernel mode memory on a live system. Instead, it instructs the kernel to create a small dump file that contains a limited, pre-set amount of information (mostly, just the process and associated threads, and if possible, the stacks of said threads). As a result, you can’t use it for general kernel memory examination. However, it’s sometimes enough to do the trick if you just need to capture what the kernel mode side stack trace of a particular thread is.

To use this feature, you can invoke “kdbgctrl.exe -td pid dump-filename“, which takes a snapshot of the process identified by a pid, and writes it out to a dump file named by dump-filename. This support is very tersely documented if you invoke kdbgctrl.exe on the command line with no arguments:

kdbgctrl
Usage: kdbgctrl <options>
Options:
[...]
-td <pid> <file> - Get a kernel triage dump

Now, the kernel’s support for writing out a triage dump isn’t by any means directly comparable to the power afforded by kd.exe -kl. As previously mentioned, the dump is extremely minimalistic, only containing information about a process object and its associated threads and the bare minimums allowing the debugger to function enough to pull this information from the dump. Just about all that you can reasonably expect to do with it is examine the stacks of the process that was captured, via the “!process -1 1f” command. (In addition, some extra information, such as the set of pending APCs attached to each thread, is also saved - enough to allow “!apc thread” to function.) Sometimes, however, it’s just enough to be able to figure out what something is blocking on kernel mode side (especially when troubleshooting deadlocks), and the triage dump functionality can aid with that.

However, there are some limitations with triage dumps, even above the relatively terse amount of data they convey. The triage dump support needs to be careful not to crash the system when writing out the dump, so all of this information is captured within the normal confines of safe kernel mode operation. In particular, this means that the triage dump writing logic doesn’t just blithely walk the thread or process lists like kd -kl would, or perform other potentially unsafe operations.

Instead, the triage dump creation process operates with the cooperation of all threads involved. This is immediately apparent when taking a look at the thread stacks captured in a triage dump:

BugCheck 69696969, {0, 0, 0, 0}

Probably caused by : ntkrnlmp.exe ( nt!IopKernSnapAPCMiniDump+55 )

Followup: MachineOwner
———

2: kd> k
Call Site
nt!IopKernSnapAPCMiniDump+0×55
nt!IopKernSnapSpecialApc+0×50
nt!KiDeliverApc+0×1e2
nt!KiSwapThread+0×491
nt!KeWaitForSingleObject+0×2da
nt!KiSuspendThread+0×29
nt!KiDeliverApc+0×420
nt!KiSwapThread+0×491
nt!KeWaitForMultipleObjects+0×2d6
nt!ObpWaitForMultipleObjects+0×26e
nt!NtWaitForMultipleObjects+0xe2
nt!KiSystemServiceCopyEnd+0×13
0×7741602a

(Note that the system doesn’t actually bugcheck as a result of creating a triage dump. Kernel dumps implicitly need a bug check parameters record, however, so one is synthesized for the dump file.)

As is immediately obvious from the call stack, thread stack collection in triage dumps operates (in Vista/Srv08) by tripping a kernel APC on the target thread, which then (as it’s in-thread-context) safely collects data about that thread’s stack.

This, of course, presents a limitation: If a thread is sufficiently wedged such that it can’t run kernel APCs, then a triage dump won’t be able to capture full information about that thread. However, for some classes of scenarios where simply capturing a kernel mode stack is sufficient, triage dumps can sometimes fill in the gap in conjuction with user mode debugging, even if the system wasn’t booted with /DEBUG.

Understanding the kernel address space on 32-bit Windows Vista

February 23rd, 2009

[Warning: The dynamic kernel address space is subject to future changes in future releases. In fact, the set of possible address space regions types on public 32-bit Win7 drops is different from the original release of the dynamic kernel address space logic in Windows Vista. You should not make hardcoded assumptions about this logic in production code, as the basic premises outlined in this post are subject to change without warning on future operating system revisions.]

(If you like, you may skip the history lesson and background information and skip to the gory details if you’re already familiar with the subject.)

With 32-bit Windows Vista, significant changes were made to the way the kernel mode address space was laid out. In previous operating system releases, the kernel address space was divvied up into large, relatively fixed-size regions ahead of time. For example, the paged- and non-paged pools resided within fixed virtual address ranges that were, for the most part, calculated at boot time. This meant that the memory manager needed to make a decision up front about how much address space to dedicate to the paged pool versus the non-paged pool, or how much address space to devote to the system cache, and soforth. (The address space is not strictly completely fixed on a 32-bit legacy system. There are several registry options that can be used to manually trade-off address space from one resource to another. However, these settings are only taken into account at memory manager initialization time during system boot.)

As a consequence of these mostly static (after boot time) calculations, the memory manager had limited flexibility to respond to differing memory usage workloads at runtime. In particular, because the memory manager needed to dedicate large address space regions up front to one of several different resource sets (i.e. paged pool, non-paged pool, system cache), situations may arise wherein one of these resources may be heavily utilized to the point of exhaustion of the address range carved out for it (such as the paged pool), but another “peer” resource (such as the non-paged pool) may be experiencing only comparatively light utilization. Because the memory manager has no way to “take back” address space from the non-paged pool, for example, and hand it off to the paged pool if it so turns out allocations from the paged pool (for example) may fail while there’s plenty of comparatively unused address space that has all been blocked off for the exclusive usage of the non-paged pool.

On 64-bit platforms, this is usually not a significant issue, as the amount of kernel address space available is orders of magnitude beyond the total amount of physical memory available in even the largest systems available (for today, anyway). However, consider that on 32-bit systems, the kernel only has 2GB (and sometimes only 1GB, if the system is booted with /3GB) of address space available to play around with. In this case, the fact that the memory manager needs to carve off hundreds of megabytes of address space (out of only 2 or even 1GB total) up-front for pools and the system cache becomes a much more eye-catching issue. In fact, the scenario I described previously is not even that difficult to achieve in practice, as the size of the non-paged or paged pools are typically quite small compared to the amount of physical memory available to even baseline consumer desktop systems nowadays. Throw in a driver or workload that heavily utilizes paged- or non-paged pool for some reason, and this sort of “premature” pool exhaustion may occur much sooner than one might naively expect.

Now, while it’s possible to manually address this issue to a degree on 32-bit system using the aforementioned registry knobs, determining the optimum values for these knobs on a particular workload is not necessarily an easy task. (Most sysadmins I know of tend to have their eyes glaze over when step one includes “Attach a kernel debugger to the computer.”)

To address this growing problem, the memory manager was restructured by the Windows Vista timeframe to no longer treat the kernel address space as a large (or, not so large, depending upon how one wishes to view it) slab to carve up into very large, fixed-size regions at boot time. Hence, the concept of the dynamic kernel address space was born. The basic idea is that instead of carving the relatively meagre kernel address space available on 32-bit systems up into large chunks at boot time, the memory manager “reserves its judgement” until there is need for additional address space for a partial resource (such as the paged- or non-paged pool), and only then hands out another chunk of address space to the component in question (such as the kernel pool allocator).

While this has been mentioned publicly for some time, to the best of my knowledge, nobody had really sat down and talked about how this feature really worked under the hood. This is relevant for a couple of reasons:

  1. The !address debugger extension doesn’t understand the dynamic kernel address space right now. This means that you can’t easily figure out where an address came from in the debugger on 32-bit Windows Vista systems (or later systems that might use a similar form of dynamic kernel address space).
  2. Understanding the basic concepts of how the feature works provides some insight into what it can (and cannot do) to help alleviate the kernel address space crunch.
  3. While the kernel address space layout has more or less been a relatively well understood concept for 32-bit systems to date, much of that knowledge doesn’t really directly translate to Windows Vista (and potentially later) systems.


Please note that future implementations may not necessarily function the same under the hood with respect to how the kernel address space operates.

Internally, the way the memory manager provides address space to resources such as the non-paged or paged- pools has been restructured such that each of the large address space resources act as clients to a new kernel virtual address space allocator. Wherein these address space resources previously received their address ranges in the form of large, pre-calculated regions at boot time, instead, each now calls upon the memory manager’s internal kernel address space allocator (MiObtainSystemVa) to request additional address space.

The kernel address space allocator can be thought of as maintaining a logical pool of unused virtual address space regions. It’s important to note that the address space allocator doesn’t actually allocate any backing store for the returned address space, nor any other management structures (such as PTEs); it simply reserves a chunk of address space exclusively for the use of the caller. This architecture is required due to the fact that everything from driver image mapping to the paged- and non-paged pool backends have been converted to use the kernel address space allocator routines. Each of these components has very different views of what they’ll actually want to use the address space for, but they all commonly need an address space to work in.

Similarly, if a component has obtained an address space region that it no longer requires, then it may return it to the memory manager with a call to the internal kernel address space allocator routine MiReturnSystemVa.

To place things into perspective, this system is conceptually analogous to reserving a large address space region using VirtualAlloc with MEM_RESERVE in user mode. A MEM_RESERVE reservation doesn’t commit any backing store that allows data to reside at a particular place in the address space, but it does grant exclusive usage of an address range of a particular size, which can then be used in conjunction with whatever backing store the caller requires. Likewise, it is similarly up to the caller to decide how they wish to back the address space returned by MiObtainSystemVa.

The address space chunks dealt with by the kernel address space allocator, similarly to the user mode address space reservation system, don’t necessarily need to be highly granular. Instead, a granularity that is convenient for the memory management is chosen. This is because the clients of the kernel address space allocator will then subdivide the address ranges they receive for their own usage. (The exact granularity of a kernel address space allocation is, of course, subject to change in future releases based upon the whims of what is preferable for the memory manager)

For example, if a driver requests an allocation from the non-paged pool, and there isn’t enough address space assigned to the non-paged pool to handle the request, then the non-paged pool allocator backend will call MiObtainSystemVa to retrieve additional address space. If successful, it will conceptually add this address space to its free address space list, and then return a subdivided chunk of this address space (mated with physical memory backing store, as in this example, we are speaking of the non-paged pool) to the driver. The next request for memory from the non-paged pool might then come from the same address range previously obtained by the non-paged pool allocator backend. This behavior is, again, conceptually similar to how the user mode heap obtains large amounts of address space from the memory manager and then subdivides these address regions into smaller allocations for a caller of, say, operator new.

All of this happens transparently to the existing public clients of the memory manager. For example, drivers don’t observe any particularly different behavior from ExAllocatePoolWithTag. However, because the address space isn’t pre-carved into large regions ahead of time, the memory manager no longer has its hands proverbially tied with respect to allowing one particular consumer of address space to obtain comparatively much more address space than usual (of course, at a cost to the available address space to other components). In effect, the memory manager is now much more able to self-tune for a wide variety of workloads without the user of the system needing to manually discover how much address space would be better allocated to the paged pool versus the system cache, for example.

In addition, the dynamic kernel address space infrastructure has had other benefits as well. As the address spans for the various major consumers of kernel address space are now demand-allocated, so too are PTE and other paging-related structures related to those address spans. This translates to reduced boot-time memory usage. Prior to the dynamic kernel address space’s introduction, the kernel would reduce the size of the various mostly-static address regions based on the amount of physical memory on the system for small systems. However, for large systems, and especially large 64-bit systems, paging-related structures potentially describing large address space regions were pre-allocated at boot time.

On small 64-bit systems, the demand-allocation of paging related structures also removes address space limits on the sizes of individual large address space regions that were previously present to avoid having to pre-allocate vast ranges of paging-related structures.

Presently, on 64-bit systems, many, though not all of the major kernel address space consumers still have their own dedicated address regions assigned internally by the kernel address space allocator, although this could easily change as need be in a future release. However, demand-creation of paging-describing structures is still realized (with the reduction in boot-time memory requirements as described above,

I mentioned earlier that the !address extension doesn’t understand the dynamic kernel address space as of the time of this writing. If you need to determine where a particular address came from while debugging on a 32-bit system that features a dynamic kernel address space, however, you can manually do this by looking into the memory manager’s internal tracking structures to discern for which reason a particular chunk of address space was checked out. (This isn’t quite as granular as !address as, for example, kernel stacks don’t have their own dedicated address range (in Windows Vista). However, it will at least allow you to quickly tell if a particular piece of address space is part of the paged- or non-paged pool, whether it’s part of a driver image section, and soforth.)

In the debugger, you can issue the following (admittedly long and ugly) command to ask the memory manager what type of address a particular virtual address region is. Replace <<ADDRESS>> with the kernel address that you wish to inquire about:

?? (nt!_MI_SYSTEM_VA_TYPE) ( ((unsigned char *)(@@masm(nt!MiSystemVaType)))[ @@masm( ( <<ADDRESS>> - poi(nt!MmSystemRangeStart)) / (@$pagesize *

@$pagesize / @@c++(sizeof(nt!_MMPTE))) ) ] )

Here’s a couple of examples:


kd> ?? (nt!_MI_SYSTEM_VA_TYPE) ( ((unsigned char *)(@@masm(nt!MiSystemVaType)))
[ @@masm( ( 89445008 - poi(nt!MmSystemRangeStart)) / (@$pagesize * @$pagesize / @@c++(sizeof(nt!_MMPTE))) ) ] )
_MI_SYSTEM_VA_TYPE MiVaNonPagedPool (5)

kd> ?? (nt!_MI_SYSTEM_VA_TYPE) ( ((unsigned char *)(@@masm(nt!MiSystemVaType)))
[ @@masm( ( ndis - poi(nt!MmSystemRangeStart)) / (@$pagesize * @$pagesize / @@c++(sizeof(nt!_MMPTE))) ) ] )
_MI_SYSTEM_VA_TYPE MiVaDriverImages (12)

kd> ?? (nt!_MI_SYSTEM_VA_TYPE) ( ((unsigned char *)(@@masm(nt!MiSystemVaType)))
[ @@masm( ( win32k - poi(nt!MmSystemRangeStart)) / (@$pagesize * @$pagesize / @@c++(sizeof(nt!_MMPTE))) ) ] )
_MI_SYSTEM_VA_TYPE MiVaSessionGlobalSpace (11)

The above technique will not work on 64-bit systems that utilize a dynamic (demand-allocated) kernel address space, as the address space tracking is performed differently internally.

(Many thanks to Andrew Rogers and Landy Wang, who were gracious enough to spend some time divulging insights on the subject.)

Recovering a process from a hung debugger

February 21st, 2009

One of the more annoying things that can happen while debugging processes that deal with network traffic is happening to attach to something that is in the “critical path” for accessing the debugger’s active symbol path.

In such a scenario, the debugger will usually deadlock itself trying to request symbols, which requires going through some code path that involves the debuggee, which being frozen by the debugger, never gets to run. Usually, one would think that this means a lost repro (and, depending on the criticality of the process attached to the debugger, possibly a forced reboot as well), neither of which happen to be particularly fun outcomes.

It turns out that you’re not actually necessarily hosed if this happens, though. If you can still start a new process on the computer, then there’s actually a way to steal the process back from the debugger (on Windows XP and later), with the new-fangled fancy kernel debug object-based debugging support. Here’s what you need to do:

  1. Attach a new debugger to the process, with symbol support disabled. (You will be attaching to the debuggee and not the debugger.)

    Normally, you can’t attach a debugger to process while it’s already being debugged. However, there’s an option in windbg (and ntsd/cdb as well) that allows you to do this: the “-pe” debugger command-line parameter (documented in the debugger documentation), which forcibly attaches to the target process despite the presence of the hung debugger.

    Of course, force-attaching to the process won’t do any good if the new debugger process will just deadlock right away. As a result, you should make sure that the debugger won’t try and do any symbol-loading activity that might engender a deadlock. This is the command line that I usually use for that purpose, which disables _NT_SYMBOL_PATH appending (”-sins“), disables CodeView pdb pointer following (”-sicv“), and resets the symbol path to a known good value (”-y .“):

    ntsd -sicv -sins -y . -pe -p hung_debuggee_pid

    I recommend using ntsd and not WinDbg for this purpose in order to reduce the chance of a symbol path that might be stored in a WinDbg workspace from causing the debugger to deadlock itself again.

  2. Kill the hung debugger.

    After successfully attaching with “-pe“, you can safely kill the hung debugger (by whatever means necessary) without causing the former debuggee to get terminated along with it.

  3. Resume all threads in the target.

    The suspend count of most threads in the target is likely to be wrong. You can correct this by issuing the “~*M” command set several times (which resumes all threads in the process with the “~M” command).

    To determine the suspend count of all processes in the thread, you can use the “~” command. For example, you might see the following:

    0:001> ~
       0  Id: 18c4.16f0 Suspend: 2 Teb: 7efdd000 Unfrozen
    .  1  Id: 18c4.1a04 Suspend: 2 Teb: 7efda000 Unfrozen
    

    You should issue the “~*M” command enough times to bring the suspend count of all threads down to zero. (Don’t worry if you need to resume a thread more times than it is suspended.) Typically, this would be two times, for the common case, but by checking the suspend count of active threads, you can be certain of the number of times that you need you need to resume all threads in the process.

  4. Detach the new debugger from the debuggee.

    After resuming all threads in the target, use the “qd” command to detach the debugger. Do not attempt to resume the debugger with the “g” command (as it will stay suspended), or quit the debugger without attaching (as that would cause the debuggee to get terminated).

    If you needed to keep a particular thread in the debuggee suspended so that you can re-attach the debugger without losing your place, you can leave that thread with its suspend count above zero.

Voila, the debuggee should return back to life. Now, you should be able to re-attach a debugger (hopefully, with a safe symbol path this time), or not, as desired.

Hotpatching MS08-067

October 24th, 2008

If you have been watching the Microsoft security bulletins lately, then you’ve likely noticed yesterday’s bulletin, MS08-067. This is a particularly nasty bug, as it doesn’t require authentication to exploit in the default configuration for Windows Server 2003 and earlier systems (assuming that an attacker can talk over port 139 or port 445 to your box).

The usual mitigation for this particular vulnerability is to block off TCP/139 (NetBIOS) and TCP/445 (Direct hosted SMB), thus cutting off remote access to the srvsvc pipe, a prerequisite for exploiting the vulnerability in question. In my case, however, I had a box that I really didn’t want to reboot immediately. In addition, for the box in question, I did not wish to leave SMB blocked off remotely.

Given that I didn’t want to assume that there’d be no possible way for an untrusted user to be able to establish a TCP/139 or TCP/445, this left me with limited options; either I could simply hope that there wasn’t a chance for the box to get compromised before I had a chance for it to be convenient to reboot, or I could see if I could come up with some form of alternative mitigation on my own. After all, a debugger is the software development equivalent of a swiss army knife and duct-tape; I figured that it would be worth a try seeing if I could cobble together some sort of mitigation by manually patching the vulnerable netapi32.dll. To do this, however, it would be necessary to gain enough information about the flaw in question in order to discern what the fix was, in the hope of creating some form of alternative countermeasure for the vulnerability.

The first stop for gaining more information about the bug in question would be the Microsoft advisory. As usual, however, the bulletin released for the MS08-067 issue was lacking in sufficiently detailed technical information as required to fully understand the flaw in question to the degree necessary down to the level of what functions were patched, aside from the fact that the vulnerability resided somewhere in netapi32.dll (the Microsoft rationale behind this policy is that providing that level of technical detail would simply aid the creation of exploits). However, as Pusscat presented at Blue Hat Fall ‘07, reverse engineering most present-day Microsoft security patches is not particularly insurmountable.

The usual approach to the patch reverse engineering process is to use a program called bindiff (an IDA plugin) that analyzes two binaries in order to discover the differences between the two. In my case, however, I didn’t have a copy of bindiff handy (it’s fairly pricey). Fortunately (or unfortunately, depending on your persuasion), there already existed a public exploit for this bug, as well as some limited public information from persons who had already reverse engineered the patch to a degree. To this end, I had a particular function in the affected module (netapi32!NetpwPathCanonicalize) which I knew was related to the vulnerability in some form.

At this point, I brought up a copy of the unpatched netapi32.dll in IDA, as well as a patched copy of netapi32.dll, then started looking through and comparing disassembly one page at a time until an interesting difference popped up in a subfunction of netapi32!NetpwPathCanonicalize:

Unpatched code:

.text:000007FF7737AF90 movzx   eax, word ptr [rcx]
.text:000007FF7737AF93 xor     r10d, r10d
.text:000007FF7737AF96 xor     r9d, r9d
.text:000007FF7737AF99 cmp     ax, 5Ch
.text:000007FF7737AF9D mov     r8, rcx
.text:000007FF7737AFA0 jz      loc_7FF7737515E

Patched code:

.text:000007FF7737AFA0 mov     r8, rcx

.text:000007FF7737AFA3 xor     eax, eax

.text:000007FF7737AFA5 mov     [rsp+arg_10], rbx
.text:000007FF7737AFAA mov     [rsp+arg_18], rdi
.text:000007FF7737AFAF jmp     loc_7FF7738E5D6

.text:000007FF7738E5D6 mov     rcx, 0FFFFFFFFFFFFFFFFh
.text:000007FF7738E5E0 mov     rdi, r8
.text:000007FF7738E5E3 repne scasw

.text:000007FF7738E5E6 movzx   eax, word ptr [r8]
.text:000007FF7738E5EA xor     r11d, r11d

.text:000007FF7738E5ED not     rcx

.text:000007FF7738E5F0 xor     r10d, r10d

.text:000007FF7738E5F3 dec     rcx

.text:000007FF7738E5F6 cmp     ax, 5Ch

.text:000007FF7738E5FA lea     rbx, [r8+rcx*2+2]

.text:000007FF7738E5FF jnz     loc_7FF7737AFB4

Now, without even really understanding what’s going on here on the function as a whole, it’s pretty obvious that here’s where (at least one) modification is being made; the new code involved the addition of an inline wcslen call. Typically, security fixes for buffer overrun conditions involve the creation of previously missing boundary checks, so a new call to a string-length function such as wcslen is a fairly reliable indicator that one’s found the site of the fix for the the vulnerability in question.

(The repne scasw instruction set scans a memory region two-bytes at a time until a particular value (in rax) is reached, or the maximum count (in rcx, typically initialized to (size_t)-1) is reached. Since we’re scanning two bytes at a time, and we’ve initialized rax to zero, we’re looking for an 0×0000 value in a string of two-byte quantities; in other words, an array of WCHARs (or a null terminated Unicode string). The resultant value on rcx after executing the repne scasw can be used to derive the length of the string, as it will have been decremented based on the number of WCHARs encountered before the 0×0000 WCHAR.)

My initial plan was, assuming that the fix was trivial, to simply perform a small opcode patch on the unpatched version of netapi32.dll in the Server service process on the box in question. In this particular instance, however, there were a number of other changes throughout the patched function that made use of the additional length check. As a result, a small opcode patch wasn’t ideal, as large parts of the function would need to be rewritten to take advantage of the extra length check.

Thus, plan B evolved, wherein the Microsoft-supplied patched version of netapi32.dll would be injected into an already-running Server service process. From there, the plan was to detour buggy_netapi32!NetpwPathCanonicalize to fixed_netapi32!NetpwPathCanonicalize.

As it turns out, netapi32!NetpwPathCanonicalize and all of its subfunctions are stateless with respect to global netapi32 variables (aside from the /GS cookie), which made this approach feasible. If the call tree involved a dependancy on netapi32 global state, then simply detouring the entire call tree wouldn’t have been a valid option, as the globals in the fixed netapi32.dll would be distinct from the globals in the buggy netapi32.dll.

This approach also makes the assumption that the only fixes made for the patch were in netapi32!NetpwPathCanonicalize and its call tree; as far as I know, this is the case, but this method is (of course) completely unsupported by Microsoft. Furthermore, as x64 binaries are built without hotpatch nop stubs at their prologue, the possibility for atomic patching in of a detour appeared to be out, so this approach has a chance of failing in the (unlikely) scenario where the first few instructions of netapi32!NetpwPathCanonicalize were being executed at the time of the detour.

Nonetheless, the worst case scenario would be that the box went down, in which case I’d be rebooting now instead of later. As the whole point of this exercise was to try and delay rebooting the system in question, I decided that this was an acceptable risk in my scenario, and elected to proceed. For the first step, I needed a program to inject a DLL into the target process (SDbgExt does not support !loaddll on 64-bit targets, sadly). The program that I came up with is certainly quick’n'dirty, as it fudges the thread start routine in terms of using kernel32!LoadLibraryA as the initial start address (which is a close enough analogue to LPTHREAD_START_ROUTINE to work), but it does the trick in this particular case.

The next step was to actually load the app into the svchost instance containing the Server service instance. To determine which svchost process this happens to be, one can use “tasklist /svc” from a cmd.exe console, which provides a nice formatted view of which services are hosted in which processes:

C:\WINDOWS\ms08-067-hotpatch>tasklist /svc
[...]
svchost.exe 840 AeLookupSvc, AudioSrv, BITS, Browser,
CryptSvc, dmserver, EventSystem, helpsvc,
HidServ, IAS, lanmanserver,
[...]

That being done, the next step was to inject the DLL into the process. Unfortunately, the default security descriptor on svchost.exe instances doesn’t allow Administrators the access required to inject a thread. One way to solve this problem would have been to write the code to enable the debug privilege in the injector app, but I elected to simply use the age-old trick of using the scheduler service (at.exe) to launch the program in question as LocalSystem (this, naturally, requires that you already be an administrator in order to succeed):

C:\WINDOWS\ms08-067-hotpatch>at 21:32 C:\windows\ms08-067-hotpatch\testapp.exe 840 C:\windows\ms08-067-hotpatch\netapi32.dll
Added a new job with job ID = 1

(21:32 was one minute from the time when I entered that command, minutes being the minimum granularity for the scheduler service.)

Roughly one minute later, the debugger (attached to the appropriate svchost instance) confirmed that the patched DLL was loaded successfully:

ModLoad: 00000000`04ff0000 00000000`05089000
C:\windows\ms08-067-hotpatch\netapi32.dll

Stepping back for a moment, attaching WinDbg to an svchost instance containing services in the symbol server lookup code path is risky business, as you can easily deadlock the debugger. Proceed with care!

Now that the patched netapi32.dll was loaded, it was time to detour the old netapi32.dll to refer to the new netapi32.dll. Unfortunately, WinDbg doesn’t support assembling amd64 instructions very well (64-bit addresses and references to the extended registers don’t work properly), so I had to use a separate assembler (HIEW, [Hacker's vIEW]) and manually patch in the opcode bytes for the detour sequence (mov rax, <absolute addresss> ; jmp rax):

0:074> eb NETAPI32!NetpwPathCanonicalize 48 C7 C0 40 AD FF 04 FF E0
0:074> u NETAPI32!NetpwPathCanonicalize
NETAPI32!NetpwPathCanonicalize:
000007ff`7737ad30 48c7c040adff04
mov rax,offset netapi32_4ff0000!NetpwPathCanonicalize
000007ff`7737ad37 ffe0            jmp     rax
0:076> bp NETAPI32!NetpwPathCanonicalize
0:076> g

This said and done, all that remained was to set a breakpoint on netapi32!NetpwPathCanonicalize and give the proof of concept exploit a try against my hotpatched system (it survived). Mission accomplished!

The obvious disclaimer: This whole procedure is a complete hack, and not recommended for production use, for reasons that should be relatively obvious. Additionally, MS08-067 did not come built as officially “hotpatch-enabled” (i.e. using the Microsoft supported hotpatch mechanism); “hotpatch-enabled” patches do not entail such a sequence of hacks in the deployment process.

(Thanks to hdm for double-checking some assumptions for me.)

Why hooking system services is more difficult (and dangerous) than it looks

July 26th, 2008

System service hooking (and kernel mode hooking in general) is one of those near-and-dear things to me which I’ve got mixed feelings about. (System service hooking refers to intercepting system calls, like NtCreateFile, in kernel mode using a custom driver.)

On one hand, hooking things like system services can be extremely tricky at best, and outright dangerous at worst. There are a number of ways to write code that is almost right, but will fail in rare edge cases. Furthermore, there are even more ways to write code that will behave correctly as long as system service callers “play nicely” and do the right thing, but sneaks in security holes that only become visible once somebody in user mode tries to “bend the rules” a bit.

On the other hand, though, there are certainly some interesting things that one can do in kernel mode, for which there really aren’t any supported or well-defined extensibility interfaces without “getting one’s hands dirty” with a little hooking, here or there.

Microsoft’s policy on hooking in kernel mode (whether patching system services or otherwise) is pretty clear: They would very much rather that nobody even thought about doing anything remotely like code patching or hooking. Having seen virtually every anti-virus product under the sun employ hooking in one way or another at some point during their lifetime (and subsequently fail to do it correctly, usually with catastrophic consequences), I can certainly understand where they are coming from. Then again, I have also been on the other side of the fence once or twice, having been blocked from implementing a particular useful new capability due to a certain anti-patching system on recent Windows versions that shall remain nameless, at least for this posting.

Additionally, in further defense of Microsoft’s position, the vast majority of code I have seen in the field that has performed some sort of kernel mode hooking has done some out of a lack of understanding of the available and supported kernel mode extensibility interfaces. There are, furthermore, a number of things where hooking system services won’t even get the desired result (even if it weren’t otherwise fraught with problems); for instance, attempting to catch file I/O by hooking NtReadFile/NtReadFileScatter/NtWriteFile/NtWriteFileGather will cause one to completely miss any I/O that is performed using a mapped section view.

Nonetheless, system service hooking seems to be an ever-increasing trend, one that doesn’t seem to be likely to go away any time soon (at least for 32-bit Windows). There are also, of course, bits of malicious code out there which attempt to do system service hooking as well, though in my experience, once again, the worst (and most common) offenders have been off-the-shelf software that was likely written by someone who didn’t really understand the consequences of what they were doing.

Rather than weigh in on the side of “Microsoft should allow all kernel mode hooking”, or “All code hooking is bad”, however, I thought that a different viewpoint might be worth considering on this subject. At the risk of incurring Don Burn’s wrath by even mentioning the subject (an occurrence that is quite common on the NTDEV list (if one happens to follow that), I figured that it might be instructive to provide an example of just how easy it is to slip up in a seemingly innocent way while writing a system service hook. Of course, “innocent slip-ups” in kernel mode often equate to bugchecks at their best, or security holes at their worst.

Thus, the following, an example of how a perfectly honest attempt at hooking a system service can go horribly wrong. This is not intended to be a tutorial in how to hook system services, but rather an illustration of how one can easily create all sorts of nasty bugs even while trying to be careful.

The following code assumes that the system service in question (NtCreateSection) has had a hook “safely” installed (BuggyDriverHookNtCreateSection). This posting doesn’t even touch on the many and varied problems of safely trying to hook (or even worse, unhook) a system service, which are also typically done wrong in my experience. Even discounting the diciness of those problems, there’s plenty that can go wrong here.

I’ll post a discussion of some of the worst bugs in this code later on, after people have had a chance to mull it over for a little while. This routine is a fairly standard attempt to post-process a system service by doing some additional work after it completes. Feel free to post a comment if you think you have found one of the problems (there are several). Bonus points if you can identify why something is broken and not just that it is broken (e.g. provide a scenario where the existing code breaks). Even more bonus points if you can explain how to fix the problem you have found without introducing yet another problem or otherwise making things worse - by asking that, I am more wanting to show that there are a great deal of subleties at play here, instead of simply showing how to operate an NtCreateSection hook correctly. Oh, and there’s certainly more than one bug here to be found, as well.

N.B. I haven’t run any of this through a compiler, so excuse any syntax errors that I missed.

Without further adeu, here’s the code:

//
// Note: This code has bugs. Please don’t actually try this at home!
//

NTSTATUS
NTAPI
BuggyDriverNtCreateSection(
 OUT PHANDLE SectionHandle,
 IN ACCESS_MASK DesiredAccess,
 IN POBJECT_ATTRIBUTES ObjectAttributes,
 IN PLARGE_INTEGER SectionSize OPTIONAL,
 IN ULONG Protect,
 IN ULONG Attributes,
 IN HANDLE FileHandle
 )
{
 NTSTATUS Status;
 HANDLE Section;
 SECTION_IMAGE_INFORMATION ImageInfo;
 ULONG ReturnLength;
 PVOID SectionObject;
 PVOID FileObject;
 BOOLEAN SavedSectionObject;
 HANDLE SectionKernelHandle;
 LARGE_INTEGER LocalSectionSize;

 SavedSectionObject = FALSE;

 //
 // Let’s call the original NtCreateSection, as we only care about successful
 // calls with valid parameters.
 //

 Status = RealNtCreateSection(
  SectionHandle,
  DesiredAccess,
  ObjectAttributes,
  SectionSize,
  Protect,
  Attributes,
  FileHandle
  );

 //
 // Failed? We’ll bail out now.
 //

 if (!NT_SUCCESS( Status ))
  return Status;

 //
 // Okay, we’ve got a successful call made, let’s do our work.
 // First, capture the returned section handle. Note that we do not need to
 // do a probe as that was already done by NtCreateSection, but we still do
 // need to use SEH.
 //

 __try
 {
  Section = *SectionHandle;
 }
 __except( EXCEPTION_EXECUTE_HANDLER )
 {
  Status = (NTSTATUS)GetExceptionCode();
 }

 //
 // The user unmapped our buffer, let’s bail out.
 //

 if (!NT_SUCCESS( Status ))
  return Status;

 //
 // We need a pointer to the section object for our work. Let’s grab it now.
 //

 Status = ObReferenceObjectByHandle(
  Section,
  0,
  NULL,
  KernelMode,
  &SectionObject,
  NULL
  );

 if (!NT_SUCCESS( Status ))
  return Status;

 //
 // Just for fun, let’s check if the section was an image section, and if so,
 // we’ll do special work there.
 //

 Status = ZwQuerySection(
  Section,
  SectionImageInformation,
  &ImageInfo,
  sizeof( SECTION_IMAGE_INFORMATION ),
  &ReturnLength
  );

 //
 // If we are an image section, then let’s save away a pointer to to the
 // section object for our own use later.
 //

 
 if (NT_SUCCESS(Status))
 {
  //
  // Save pointer away for something that we might do with it later. We might
  // want to care about the section image information for some unspecified
  // reason, so we will copy that and save it in our tracking list. For
  // example, maybe we want to map a view of the section into the initial
  // system process from a worker thread.
  //

  Status = SaveImageSectionObjectInList(
   SectionObject,
   &ImageInfo
   );

  if (!NT_SUCCESS( Status ))
  {
   ObDereferenceObject( SectionObject );
   return Status;
  }

  SavedSectionObject = TRUE;
 }

 //
 // Let’s also grab a kernel handle for the file object so that we can do some
 // sort of work with it later on.
 //

 Status = ObReferenceObjectByHandle(
  Section,
  0,
  NULL,
  KernelMode,
  &FileObject,
  NULL
  );

 if (!NT_SUCCESS( Status ))
 {
  if (SavedSectionObject)
   DeleteImageSectionObjectInList( SectionObject );

  ObDereferenceObject( SectionObject );

  return Status;
 }

 //
 // Save the file object away, as well as maximum size of the section object.
 // We need the size of the section object for a length check when accessing
 // the section later.
 //

 if (SectionSize)
 {
  __try
  {
   LocalSectionSize = *SectionSize;
  }
  __except( EXCEPTION_EXECUTE_HANDLER )
  {
   Status = (NTSTATUS)GetExceptionCode();

   ObDereferenceObject( FileObject );

   if (SavedSectionObject)
    DeleteImageSectionObjectInList( SectionObject );

   ObDereferenceObject( SectionObject );

   return Status;
  }
 }
 else
 {
  //
  // Ask the file object for it’s length, this could be done by any of the
  // usual means to do that.
  //

  Status = QueryAllocationLengthFromFileObject(
   FileObject,
   &LocalSectionSize
   );

  if (!NT_SUCCESS( Status ))
  {
   ObDereferenceObject( FileObject );

   if (SavedSectionObject)
    DeleteImageSectionObjectInList( SectionObject );

   ObDereferenceObject( SectionObject );

   return Status;
  }
 }

 //
 // Save the file object + section object + section length away for future
 // reference.
 //

 Status = SaveSectionFileInfoInList(
  FileObject,
  SectionObject,
  &LocalSectionSize
  );

 if (!NT_SUCCESS( Status ))
 {
  ObDereferenceObject( FileObject );

  if (SavedSectionObject)
   DeleteImageSectionObjectInList( SectionObject );

  ObDereferenceObject( SectionObject );

  return Status;
 }

 //
 // All done. Lose our references now. Assume that the Save*InList routines
 // took their own references on the objects in question. Return to the caller
 // successfully.
 //

 ObDereferenceObject( FileObject );
 ObDereferenceObject( SectionObject );

 return STATUS_SUCCESS;
}

Why does every heap trace in UMDH get stuck at “malloc”?

February 21st, 2008

One of the more useful tools for tracking down memory leaks in Windows is a utility called UMDH that ships with the WinDbg distribution. Although I’ve previously covered what UMDH does at a high level, and how it functions, the basic principle for it, in a nutshell, is that it uses special instrumentation in the heap manager that is designed to log stack traces when heap operations occur.

UMDH utilizes the heap manager’s stack trace instrumentation to associate call stacks with outstanding allocations. More specifically, UMDH is capable of taking a “snapshot” of the current state of all heaps in a process, associating like-sized allocations from like-sized callstacks, and aggregrating them in a useful form.

The general principle of operation is that UMDH is typically run two (or more times), once to capture a “baseline” snapshot of the process after it has finished initializing (as there are expected to always be a number of outstanding allocations while the process is running that would not be normally expected to be freed until process exit time, for example, any allocations used to build the command line parameter arrays provided to the main function of a C program, or any other application-derived allocations that would be expected to remain checked out for the lifetime of the program.

This first “baseline” snapshot is essentially intended to be a means to filter out all of these expected, long-running allocations that would otherwise show up as useless noise if one were to simply take a single snapshot of the heap after the process had leaked memory.

The second (and potentially subsequent) snapshots are intended to be taken after the process has leaked a noticeable amount of memory. UMDH is then run again in a special mode that is designed to essentially do a logical “diff” between the “baseline” snapshot and the “leaked” snapshot, filtering out any allocations that were present in both of them and returning a list of new, outstanding allocations, which would generally include any leaked heap blocks (although there may well be legitimate outstanding allocations as well, which is why it is important to ensure that the “leaked” snapshot is taken only after a non-trivial amount of memory has been leaked, if at all possible).

Now, this is all well and good, and while UMDH proves to be a very effective tool for tracking down memory leaks with this strategy, taking a “before” and “after” diff of a problem and analyzing the two to determine what’s gone wrong is hardly a new, ground-breaking concept.

While the theory behind UMDH is sound, however, there are some situations where it can work less than optimally. The most common failure case of UMDH in my experience is not actually so much related to UMDH itself, but rather the heap manager instrumentation code that is responsible for logging stack traces in the first place.

As I had previously discussed, the heap manager stack trace instrumentation logic does not have access to symbols, and on x86, “perfect” stack traces are not generally possible, as there is no metadata attached with a particular function (outside of debug symbols) that describes how to unwind past it.

The typical approach taken on x86 is to assume that all functions in the call stack do not use frame pointer omission (FPO) optimizations that allow the compiler to eliminate the usage of ebp for a function entirely, or even repurpose it for a scratch register.

Now, most of the libraries that ship with the operating system in recent OS releases have FPO explicitly turned off for x86 builds, with the sole intent of allowing the built-in stack trace instrumentation logic to be able to traverse through system-supplied library functions up through to application code (after all, if every heap stack trace dead-ended at kernel32!HeapAlloc, the whole concept of heap allocation traces would be fairly useless).

Unfortunately, there happens to be a notable exception to this rule, one that actually came around to bite me at work recently. I was attempting to track down a suspected leak with UMDH in one of our programs, and noticed that all of the allocations were grouped into a single stack trace that dead-ended in a rather spectacularly unhelpful way. Digging in a bit deeper, in the individual snapshot dumps from UMDH contained scores of allocations with the following backtrace logged:

00000488 bytes in 0x1 allocations
   (@ 0x00000428 + 0x00000018) by: BackTrace01786

        7C96D6DC : ntdll!RtlDebugAllocateHeap+000000E1
        7C949D18 : ntdll!RtlAllocateHeapSlowly+00000044
        7C91B298 : ntdll!RtlAllocateHeap+00000E64
        211A179A : program!malloc+0000007A

This particular outcome happened to be rather unfortunate, as in the specific case of the program I was debugging at work, virtually all memory allocations in the program (including the ones I suspected of leaking) happened to ultimately get funneled through malloc.

Obviously, getting told that “yes, every leaked memory allocation goes through malloc” isn’t really all that helpful if (most) every allocation in the program in question happened to go through malloc. The UMDH output begged the question, however, as to why exactly malloc was breaking the stack traces. Digging in a bit deeper, I discovered the following gem while disassembling the implementation of malloc:

0:011> u program!malloc
program!malloc
[f:\sp\vctools\crt_bld\self_x86\crt\src\malloc.c @ 155]:
211a1720 55              push    ebp
211a1721 8b6c2408        mov     ebp,dword ptr [esp+8]
211a1725 83fde0          cmp     ebp,0FFFFFFE0h
[...]

In particular, it would appear that the default malloc implementation on the static link CRT on Visual C++ 2005 not only doesn’t use a frame pointer, but it trashes ebp as a scratch register (here, using it as an alias register for the first parameter, the count in bytes of memory to allocate). Disassembling the DLL version of the CRT revealed the same problem; ebp was reused as a scratch register.

What does this all mean? Well, anything using malloc that’s built with Visual C++ 2005 won’t be diagnosable with UMDH or anything else that relies on ebp-based stack traces, at least not on x86 builds. Given that many things internally go through malloc, including operator new (at least in the default implementation), this means that in the default configuration, things get a whole lot harder to debug than they should be.

One workaround here would be to build your own copy of the CRT with /Oy- (force frame pointer usage), but I don’t really consider building the CRT a very viable option, as that’s a whole lot of manual work to do and get up and running correctly on every developer’s machine, not to mention all the headaches that service releases that will require rebuilds will bring with such an approach.

For operator new, it’s fortunately relatively doable to overload it in a relatively supported way to be implemented against a different allocation strategy. In the case of malloc, however, things don’t really have such a happy ending; one is either forced to re-alias the name using preprocessor macro hackery to a custom implementation that does not suffer from a lack of frame pointer usage, or otherwise change all references to malloc/free to refer to a custom allocator function (perhaps implemented against the process heap directly instead of the CRT heap a-la malloc).

So, the next time you use UMDH and get stuck scratching your head while trying to figure out why your stack traces are all dead-ending somewhere less than useful, keep in mind that the CRT itself may be to blame, especially if you’re relying on CRT allocators. Hopefully, in a future release of Visual Studio, the folks responsible for turning off FPO in the standard OS libraries can get in touch with the persons responsible for CRT builds and arrange for the same to be done, if not for the entire CRT, then at least for all the code paths in the standard heap routines. Until then, however, these CRT allocator routines remain roadblocks for effective leak diagnosis, at least when using the better tools available for the job (UMDH).

Power management observations with my XV6800

December 11th, 2007

As I previously mentioned, I recently switched to a full-blown Windows Mobile-based phone. One of the interesting conundrums that I’ve ran into along the way is feeling out what’s actually killer on the battery life for the device and what isn’t.

It’s actually kind of surprising how seemingly innocent little things can have a dramatic impact on battery life in a situation like an always-on handheld phone. For example, one of the main power saving features of modern CDMA cell networks and packet switched data connections is the ability for the device to go into dormant mode, a state in which the device relinquishes it’s active radio link resources and essentially goes into a standby mode that is in some respects akin to how the device operates when it is otherwise idle and waiting for a call. This is beneficial for the device’s battery life (because while in dormant mode, much of the logic related to maintaining the active high-rate radio link can be powered off while the network link is idle). Dormant mode is also beneficial for network operators, in that it allows the resources previously dedicated to a particular device to be used by other users that are actively sending or receiving data.

When data is ready to be sent on either end of the link, the device will “wake up” and exit dormant mode in order to send/receive data, keeping the link fully active until the session has been idle for enough time for the dormant idle timer to expire. (Entering and exiting dormant mode is mostly seamless, but may involve delays of 1-2 seconds, at least in my experience, so doing so on every packet would be highly undesirable.)

What all of this means is that there’s some significant gains to be had in terms of radio power consumption if network traffic can be limited to where it is really necessary. There are a couple of things that I had initially running on my device that didn’t really fit this criteria. For example, I have an always-active SSH connection to a screen session that, among other things, runs my console SILC client, which is based off of irssi (an IRC client). One gotcha that I encountered there is that in the default configuration, this client has a clock on the ncurses console UI. The clock is updated every minute with the current time, which is (normally) handy as you can at-a-glance compare the current timestamp with message timestamps in the backlog.

However, this becomes problematic in my case, because the clock forces the device out of dormant mode every minute in order to repaint that portion of the console UI. (Aside from the clock, the remainder of the SILC/irssi UI is static until there is actual chat activity, which would otherwise make it relatively well suited for this situation.) Fortunately, it turned out to be relatively easy to remove the clock from the client’s UI, which prevents the SSH session from causing network activity every minute while I’ve got the SILC client up.

Another network power management “gotcha” that I ran into was the built-in L2TP/IPsec VPN client. It turns out that while an L2TP/IPsec link is active, the device was, in my case, unable to enter dormant mode for any non-trivial length of time. I suspect that this is caused by the default behavior of L2TP to send keep-alive messages every minute to ensure that the session is still alive (at the very least, packet captures VPN-server-side did not seem to indicate any traffic destined to the device’s projected IP Address).

In any case, this unpleasant side effect of the L2TP/IPsec client ruled out using it for full network connectivity in an always-on fashion. Unfortunately, dialing the VPN link on-demand has it’s own share of pitfalls, as the device seems to make the VPN link the default gateway. When I configured IMAPv4 mail to be fetched every so often via demand-dialing the VPN link (which would alleviate the difficulty with dormant mode not operating as desired with the VPN link always on), my SSH session would often die if either side happened to try and send data while Pocket Outlook was polling IMAP mail.

My solution in this case was to fall back to using SSH port-forwards through PocketPuTTY for IMAP mail (after I fixed SSH port forwards to work reliably, that is, of course). Unlike the L2TP/IPsec link, the SSH link doesn’t inherently force the device out of dormant mode without there being real activity, which allows me to continue to leave the SSH connection always on without sacrificing my battery life.

These two changes seem to have fixed the worst offenders in terms of battery drain, but there’s still plenty of room for improvement. For example, PocketPuTTY will tell the remote end of the session that the window has resized when you switch the device from portrait to landscape mode (or vice versa), such as when pulling out the device’s built in keyboard. While this is a very handy feature if you’re actually using the SSH session (as it allows the remote console to resize itself appropriately, which is how SILC/irssi magically reshapes itself to fit the screen when the device keyboard is engaged), it does present the annoyance that pulling out the keyboard while PocketPuTTY is not active will still cause the packet radio link to be fully established to send the terminal resize notification and receive the resulting redraw data.

This behavior could, for instance, be further optimized to defer sending such notifications until the PocketPuTTY window becomes the foreground window (potentially aborting the resize notification entirely if the screen size is switched back to how it previously was before PocketPuTTY is activated).

One of the nice things about having a (relatively) easily end-user-programmable device instead of a locked-down, DRM’ified BREW-based handset is that there’s actually opportunity to change these sorts of things instead of just blithely accepting what’s offered with the device with no real hope of improving it.

How to not write code for a mobile device

December 10th, 2007

Earlier this week, I got a shiny, brand-new XV6800 to replace my aging (and rather limited in terms of running user-created programs) BREW-ified phone.

After setting up ActiveSync, IPsec, and all of the other usual required configuration settings, the next step was, of course, to install the baseline minimum set of applications on the device to get by. Close to the top of this list is an SSH client (being able to project arbitrary console programs over SSH and screen to the device is, at the very least, a workable enough solution in the interm for programs that are not feasible to run directly on the device, such as my SILC client). I’ve found PuTTY a fairly acceptable client for “big Windows”, and there just so happened to be a Windows Mobile port of it. Great, or so I thought…

While PocketPuTTY does technically work, I noticed a couple of deficiencies with it pretty quickly:

  1. The page up / page down keys are treated the same as the up arrow / down arrow keys when sent to the remote system. This sucks, because I can’t access my scrollback buffer in SILC easily without page up / page down. Other applications can tell the difference between the two keys (obviously, or there wouldn’t be much of a point to having the keys at all), so this one is clearly a PocketPuTTY bug.
  2. There is no way to restart a session once it is closed, or at least, not that I’ve found. Normally, on “big Windows”, there’s a “Restart Session” menu option in the window menu of the PuTTy session, but (as far as I can tell) there’s no such equivalent to the window menu on PocketPuTTY. There is a “Tools” menu, although it has some rather useless menu items (e.g. About) instead of some actually useful menu items, like “Restart Session”.
  3. Running PocketPuTTY seems to have a significantly negative effect on battery life. This is really unfortunate for me, since the expected use is to leave an SSH session to a terminal running screen for a long period of time. (Note that this was resolved, partially, by locating a slightly more maintained copy of PocketPuTTY.)
  4. SSH port forward support seems to be fairly broken, in that as soon as a socket is cleaned up, all receives in the process break until you restart it. This is annoying, but workable if one can go without SSH port forwards.

Most of these problems are actually not all that difficult to fix, and since the source code is available, I’m actually planning on trying my hand at doing just that, since I expect that this is an app that I’ll make fairly heavy use of.

The latter problem is one I really want to call attention to among these deficiencies, however. My intent here is not to bash the PocketPuTTY folks (and I’m certainly happy that they’ve at least gotten a (semi)-working port of PuTTY with source code out, so that other people can improve on it from there), but rather to point out some things that should really just not be done when you’re writing code that is intended to run on a mobile device (especially if the code is intended to run exclusively on a mobile device).

On a portable device, one of the things that most users really expect is long battery life. Though this particular point certainly holds true for laptops as well, it is arguably even more important that for converged mobile phone devices. After all, most people consider their phone an “always on” contact mechanism, and unexpectedly running out of battery life is extremely annoying in this aspect. Furthermore, if your mobile phone has the capability to run all sorts of useful programs on it, but doing so eats up your battery in a couple of hours, then there is really not that much point in having that capability at all, is there?

Returning to PocketPuTTY, one of the main problems (at least with the version I initially used) was, again, that PocketPuTTY would reduce battery life significantly. Looking around in the source code for possible causes, I noticed the following gem, which turned out to be the core of the network read loop for the program.

Yes, there really is a Sleep(1) spin loop in there, in a software port that is designed to run on battery powered devices. For starters, that blows any sort of processor power management completely out of the water. Mobile devices have a lot of different components vying for power, and the easiest (and most effective) way to save on power (and thus battery life) is to not turn those components on. Of course, it becomes difficult to do that for power hungry components like the 400MHz CPU in my XV6800 if there’s a program that has an always-ready-to-run thread…

Fortunately, there happened to be a newer revision of PocketPuTTY floating about with the issue fixed (although getting ahold of the source code for that version proved to be slightly more difficult, as the original maintainer of the project seems to have disappeared). I did eventually manage to get into contact with someone who had been maintaining their own set of improvements and grab a not-so-crusty-old source tree from them to do my own improvements, primarily for the purposes of fixing some of the annoyances I mentioned previously (thus beginning my initial forey into Windows CE development).

Mini USB charging for devices is a great idea

December 6th, 2007

One recent innovation that I recently stumbled across is the use of mini USB ports for power. I actually originally encountered a device doing this with the Morotola HT820 Bluetooth headphones I got some months ago, although at the time, I didn’t realize that the power brick that came with the device was mini USB and not yet another proprietary power cable connection.

For those unaware, mini USB is a small form factor of USB plug (otherwise essentially identical to standard USB). As with regular USB, mini USB devices can be off the bus. Now, standardizing on mini USB for power connectors has some really nice advantages:

  1. Everyone’s power bricks are now compatible. Normally, each gadget you own is going to have its own (incompatible) power transformer brick, which means that if you’re travelling, you’ll quite likely have to lug around several such power bricks (in addition to your laptop AC adapter), even if every device doesn’t really need to be plugged into “wall power” at the same time.
  2. You can charge these devices off of a laptop USB port. Of course, you’ll need a USB to mini USB cable, assuming your laptop doesn’t have any mini USB ports (I haven’t seen any that do). The really cool thing about being able to do this is that for all your mini USB, battery powered devices, you only need to bring one cable for all of them while on the move, since you can just plug the device into your laptop and charge it off of that.

Furthermore, because the connection is still USB, devices can use the same port to charge and transfer data to a computer at the same time.

Now, I didn’t really put all the pieces together about my HT820 being mini USB until I had to go and buy a replacement set of headphones (a different model this time, as Best Buy didn’t carry the HT820 anymore). The volume button on my set had died, which was rather inconvenient, to say the least.

Anyways, I noticed that the new headphones, from a different manufacturer, seemed to have a compatible power cable port from the AC adapter brick to the device’s charging port. This time, the power cable even had the ubiquitous USB logo on it (the Motorola’s charging cable didn’t, for one reason or another). Sure enough, I could use the Motorola charger with the new set of headphones, even though they weren’t of a Motorola make. Cool.

Recently, I finally ditched my old cell phone for a proper smart phone (an XV6800). This, too, is mini USB powered (and it can use the mini USB connection for data as well, at the same time). There’s a lot of other neat things about the XV6800, but the fact that I don’t need to (ever) worry about spare chargers or anything of that sort is really just the icing on the cake.

Thanks to this little advancement, I can now ditch two additional power bricks (one for the cell phone and one for the Bluetooth headset) when traveling, and just charge both of devices off of my laptop. I’m actually kind of surprised that nobody implemented this earlier, given the immediate advantages of standardizing on a uniform power source (especially one that is as easily multiplexed as USB is).

So, the next time you’re looking for a new gadget of some sort or another, look for one chargeable via mini USB and circumvent the gadget charger nightmare (or at least, begin to do so, one gadget at a time).