Archive for the ‘Debugging’ Category

Handy debugger tricks: Setting osloader options on a per-boot basis

Friday, July 30th, 2010

Sometimes, you’ll find yourself wishing that you could edit the boot options for a particular Windows instance just for a single boot (perhaps to enable debugging with non-default parameters, above and beyond what the F8 menus allows).

While the F8 boot menu has a lot of options, it’s sometimes the case that you need greater flexibility than what it provides (say, to try and enable USB2 debugging on the fly for a single boot — perhaps you need to use the debugger to rescue a system that won’t boot after a change you’ve made, for instance).

It turns out that there’s actually a (perhaps little-known) way to do this starting with Windows Vista and later; at boot time (whenever you could enter the F8 menu), you can strike F10 and find yourself at a prompt that allows you to directly edit the osloader options for the current boot. (Remember that the settings you enter here aren’t persisted across reboots.)

From here, can enable debugging or change any other osloader setting which you could permanently configure through bcdedit (but only for this boot). This capability is especially helpful if you need to debug setup with non-standard debugger connection setting, which otherwise presents a painful problem.

Remember that osloader options are the old-style options that you used to set in boot.ini (the debugger documentation and MSDN outline these). Don’t use the new-style bcdedit names, as those are only recognized by bcdedit; internally, the options passed on the osloader command line continue to be their old forms (i.e. /DEBUG /DEBUGPORT=1394 /CHANNEL=0).

Debugger tricks: Find all probable CONTEXT records in a crash dump

Monday, March 30th, 2009

If you’ve debugged crash dumps for awhile, then you’ve probably ran into a situation where the initial dump context provided by the debugger corresponds to a secondary exception that happened while processing an initial exception that’s likely closer to the original underlying problem in the issue you’re investigating.

This can be annoying, as the “.ecxr” command will point you at the site of the secondary failure exception, and not the original exception context itself. However, in most cases the original, primary exception context is still there on the stack; one just needs to know how to find it.

There are a couple of ways to go about this:

  • For hardware generated exceptions (such as access violations), one can look for ntdll!KiUserExceptionDispatcher on the stack, which takes a PCONTEXT and an PEXCEPTION_RECORD as arguments.
  • For software-generated exceptions (such as C++ exceptions), things get a bit dicier. One can look for ntdll!RtlDispatchException being called on the stack, and from there, grab the PCONTEXT parameter.

This can be a bit tedious if stack unwinding fails, or you’re dealing with one of those dumps where exceptions on multiple threads at the same time, typically due to crash dump writing going haywire (I’m looking at you, Outlook…). It would be nice if the debugger could automate this process a little bit.

Fortunately, it’s actually not hard to do this with a little bit of a brute-force approach. Specifically, just a plain old “dumb” memory scan for something common to most all CONTEXT records. It’s not exactly a finesse approach, but it’s usually a lot faster than manually groveling through the stack, especially if multiple threads or multiple nested exceptions are involved. While there may be false-positives, it’s usually immediately obvious as to what makes sense to be involved with a live exception or not. Sometimes, however, quick-and-dirty brute force type solutions really end up doing the trick, though.

In order to find CONTEXT records based on a memory search, though, we need some common data points that are typically the same for all CONTEXT structures, and, preferably, contiguous (for ease of use with the “s” command, the debugger’s memory search support). Fortunately, it turns out that this exists in the form of the segment registers of a CONTEXT structure:

0:000> dt ntdll!_CONTEXT
+0x000 ContextFlags : Uint4B
[…]

+0x08c SegGs : Uint4B
+0x090 SegFs : Uint4B
+0x094 SegEs : Uint4B
+0x098 SegDs : Uint4B

[…]

Now, it turns out that for all threads in a given process will almost always have the same segment selector values, excluding exotic and highly out of the ordinary cases like VDM processes. (The same goes for the segment selector values on x64 as well.) Four non-zero 32-bit values (actually, 16-bit values with zero padding to 32-bits) are enough to be able to reasonably pull a search off without being buried in false positives. Here’s how to do it with the infamous WinDbg debugger script (also applicable to other DbgEng-enabled programs, such as kd):

.foreach ( CxrPtr { s -[w1]d 0 l?ffffffff @gs @fs @es @ds } ) { .cxr CxrPtr – 8c }

This is a bit of a long-winded command, so let’s break it down into the individual components. First, we have a “.foreach” construct, which according to the debugger documentation, follows this convention:

.foreach [Options] ( Variable { InCommands } ) { OutCommands }

The .foreach command (actually one of the more versitle debugger-scripting commands, once one gets used to using it) basically takes a series of input strings generated by an input command (InCommands) and invokes an command to process that output (OutCommands), with the results of the input command being subsituted in as a macro specified by the Variable argument. It’s ugly and operates based on text parsing (and there’s support for skipping every X inputs, among other things; see the debugger documentation), but it gets the job done.

The next part of this operation is the s command, which instructs the debugger to search for a pattern in-memory in the target. The arguments supplied here instruct the debugger to search only writable memory (w), output only the address of each match (1), scan for DWORD (4-byte) sized quanitites (d) in the lower 4GB of the address space (0 l?ffffffff); in this case, we’re assuming that the target is a 32-bit process (which might be hosted on Wow64, hence 4GB instead of 3GB used). The remainder of the command specifies the search pattern to look for; the segment register values of the current thread. The “s” command sports a plethora of other options (with a rather unwieldy and convoluted syntax, unfortunately); the debugger documentation runs through the gamut of the other capabilities.

The final part of this command string is the output command, which simply directs the debugger to set the current context to the input command output replacement macro’s value at an offset of 0x8c. (If one recalls, 0x8c is the offset from the start of a struct _CONTEXT to the SegGs member, which is the first value we search for; as a result, the addresses returned by the “s” command will be the address of the SegGs member.) Remember that we restricted the output of the “s” command to just being the address itself, which lets us easily pass that on to a different command (which might give one the idea that the “s” and “.foreach” commands were made to work together).

Putting the command string together, it directs the debugger to search for a sequence of four 32-bit values (the gs, fs, es, and ds segment selector values for the current thread) in contiguous memory, and display the containing CONTEXT structure for each match.

You may find some other CONTEXT records aside from exception-related ones while executing this comamnd (in particular, initial thread contexts are common), but the ones related to a fault are usually pretty obvious and self-evident. Of course, this method isn’t foolproof, but it lets the debugger do some of the hard work for you (which beats manually groveling in corrupted stacks across multiple threads just to pull out some CONTEXT records).

Naturally, there are a number of other uses for both the “.foreach” and “s” commands; don’t be afraid to experiment with them. There are other helpers for automating certain tasks (!for_each_frame, !for_each_local, !for_each_module, !for_each_process, and !for_each_thread, to name a few) too, aside from the general “.foreach“. The debugger scripting support might not be the prettiest to look at, but it can be quite handy at speeding up common, repetitive tasks.

One parting tip with “.foreach” (well, actually two parting tips): The variable replacement macro only works if you separate it from other symbols with a space. This can be a problem in some cases (where you need to perform some arithmetic on the resulting expanded macro in particular, such as subtracting 0x8c in this case), however, as the spaces remain when the macro symbol is expanded. Some commands, such as “dt“, don’t follow the standard expression parsing rules (much to my eternal irritation), and choke if they’ve given arguments with spaces.

All is not lost for these commands, however; one way to work around this issue is to store the macro replacement into a pseudo-register (say, “r @$t0 = ReplacementVariableMacro – 0x8c“) and use that pseudo-register in the actual output command, as you can issue multiple, semi-colon delimited commands in the output commands section.

Examining kernel stacks on Vista/Srv08 using kdbgctrl -td, even when you haven’t booted /DEBUG

Friday, March 13th, 2009

Starting with Vista/Srv08, local kernel debugging support has been locked down to require that the system was booted /DEBUG before-hand.

For me, this has been the source of great annoyance, as if something strange happens to a Vista/Srv08 box that requires peering into kernel mode (usually to get a stack trace), then one tends to hit a brick wall if the box wasn’t booted with /DEBUG. (The only supported option usually available would be manually bugcheck the box and hope that the crash dump, if the system was configured to write one, has useful data. This is, of course, not a particularly great option; especially as in the default configuration for dumps on Vista/Srv08, it’s likely that user mode memory won’t be captured.)

However, it turns out that there’s limited support in the operating system to capture some kernel mode data for debugging purposes, even if you haven’t booted /DEBUG. Specifically, kdbgctrl.exe (a tool shipping with the WinDbg distribution) supports the notion of capturing a “kernel triage dump”, which basically takes a miniature snapshot of a given process and its active user mode threads. To use this feature, you need to have SeDebugPrivilege (i.e. you need to be an administrative-equivalently-privileged user).

Unlike conventional local KD support, triage dump writing doesn’t give unrestricted access to kernel mode memory on a live system. Instead, it instructs the kernel to create a small dump file that contains a limited, pre-set amount of information (mostly, just the process and associated threads, and if possible, the stacks of said threads). As a result, you can’t use it for general kernel memory examination. However, it’s sometimes enough to do the trick if you just need to capture what the kernel mode side stack trace of a particular thread is.

To use this feature, you can invoke “kdbgctrl.exe -td pid dump-filename“, which takes a snapshot of the process identified by a pid, and writes it out to a dump file named by dump-filename. This support is very tersely documented if you invoke kdbgctrl.exe on the command line with no arguments:

kdbgctrl
Usage: kdbgctrl <options>
Options:
[...]
-td <pid> <file> - Get a kernel triage dump

Now, the kernel’s support for writing out a triage dump isn’t by any means directly comparable to the power afforded by kd.exe -kl. As previously mentioned, the dump is extremely minimalistic, only containing information about a process object and its associated threads and the bare minimums allowing the debugger to function enough to pull this information from the dump. Just about all that you can reasonably expect to do with it is examine the stacks of the process that was captured, via the “!process -1 1f” command. (In addition, some extra information, such as the set of pending APCs attached to each thread, is also saved – enough to allow “!apc thread” to function.) Sometimes, however, it’s just enough to be able to figure out what something is blocking on kernel mode side (especially when troubleshooting deadlocks), and the triage dump functionality can aid with that.

However, there are some limitations with triage dumps, even above the relatively terse amount of data they convey. The triage dump support needs to be careful not to crash the system when writing out the dump, so all of this information is captured within the normal confines of safe kernel mode operation. In particular, this means that the triage dump writing logic doesn’t just blithely walk the thread or process lists like kd -kl would, or perform other potentially unsafe operations.

Instead, the triage dump creation process operates with the cooperation of all threads involved. This is immediately apparent when taking a look at the thread stacks captured in a triage dump:

BugCheck 69696969, {0, 0, 0, 0}

Probably caused by : ntkrnlmp.exe ( nt!IopKernSnapAPCMiniDump+55 )

Followup: MachineOwner
---------

2: kd> k
Call Site
nt!IopKernSnapAPCMiniDump+0x55
nt!IopKernSnapSpecialApc+0x50
nt!KiDeliverApc+0x1e2
nt!KiSwapThread+0x491
nt!KeWaitForSingleObject+0x2da
nt!KiSuspendThread+0x29
nt!KiDeliverApc+0x420
nt!KiSwapThread+0x491
nt!KeWaitForMultipleObjects+0x2d6
nt!ObpWaitForMultipleObjects+0x26e
nt!NtWaitForMultipleObjects+0xe2
nt!KiSystemServiceCopyEnd+0x13
0x7741602a

(Note that the system doesn’t actually bugcheck as a result of creating a triage dump. Kernel dumps implicitly need a bug check parameters record, however, so one is synthesized for the dump file.)

As is immediately obvious from the call stack, thread stack collection in triage dumps operates (in Vista/Srv08) by tripping a kernel APC on the target thread, which then (as it’s in-thread-context) safely collects data about that thread’s stack.

This, of course, presents a limitation: If a thread is sufficiently wedged such that it can’t run kernel APCs, then a triage dump won’t be able to capture full information about that thread. However, for some classes of scenarios where simply capturing a kernel mode stack is sufficient, triage dumps can sometimes fill in the gap in conjuction with user mode debugging, even if the system wasn’t booted with /DEBUG.

Understanding the kernel address space on 32-bit Windows Vista

Monday, February 23rd, 2009

[Warning: The dynamic kernel address space is subject to future changes in future releases. In fact, the set of possible address space regions types on public 32-bit Win7 drops is different from the original release of the dynamic kernel address space logic in Windows Vista. You should not make hardcoded assumptions about this logic in production code, as the basic premises outlined in this post are subject to change without warning on future operating system revisions.]

(If you like, you may skip the history lesson and background information and skip to the gory details if you’re already familiar with the subject.)

With 32-bit Windows Vista, significant changes were made to the way the kernel mode address space was laid out. In previous operating system releases, the kernel address space was divvied up into large, relatively fixed-size regions ahead of time. For example, the paged- and non-paged pools resided within fixed virtual address ranges that were, for the most part, calculated at boot time. This meant that the memory manager needed to make a decision up front about how much address space to dedicate to the paged pool versus the non-paged pool, or how much address space to devote to the system cache, and soforth. (The address space is not strictly completely fixed on a 32-bit legacy system. There are several registry options that can be used to manually trade-off address space from one resource to another. However, these settings are only taken into account at memory manager initialization time during system boot.)

As a consequence of these mostly static (after boot time) calculations, the memory manager had limited flexibility to respond to differing memory usage workloads at runtime. In particular, because the memory manager needed to dedicate large address space regions up front to one of several different resource sets (i.e. paged pool, non-paged pool, system cache), situations may arise wherein one of these resources may be heavily utilized to the point of exhaustion of the address range carved out for it (such as the paged pool), but another “peer” resource (such as the non-paged pool) may be experiencing only comparatively light utilization. Because the memory manager has no way to “take back” address space from the non-paged pool, for example, and hand it off to the paged pool if it so turns out allocations from the paged pool (for example) may fail while there’s plenty of comparatively unused address space that has all been blocked off for the exclusive usage of the non-paged pool.

On 64-bit platforms, this is usually not a significant issue, as the amount of kernel address space available is orders of magnitude beyond the total amount of physical memory available in even the largest systems available (for today, anyway). However, consider that on 32-bit systems, the kernel only has 2GB (and sometimes only 1GB, if the system is booted with /3GB) of address space available to play around with. In this case, the fact that the memory manager needs to carve off hundreds of megabytes of address space (out of only 2 or even 1GB total) up-front for pools and the system cache becomes a much more eye-catching issue. In fact, the scenario I described previously is not even that difficult to achieve in practice, as the size of the non-paged or paged pools are typically quite small compared to the amount of physical memory available to even baseline consumer desktop systems nowadays. Throw in a driver or workload that heavily utilizes paged- or non-paged pool for some reason, and this sort of “premature” pool exhaustion may occur much sooner than one might naively expect.

Now, while it’s possible to manually address this issue to a degree on 32-bit system using the aforementioned registry knobs, determining the optimum values for these knobs on a particular workload is not necessarily an easy task. (Most sysadmins I know of tend to have their eyes glaze over when step one includes “Attach a kernel debugger to the computer.”)

To address this growing problem, the memory manager was restructured by the Windows Vista timeframe to no longer treat the kernel address space as a large (or, not so large, depending upon how one wishes to view it) slab to carve up into very large, fixed-size regions at boot time. Hence, the concept of the dynamic kernel address space was born. The basic idea is that instead of carving the relatively meagre kernel address space available on 32-bit systems up into large chunks at boot time, the memory manager “reserves its judgement” until there is need for additional address space for a partial resource (such as the paged- or non-paged pool), and only then hands out another chunk of address space to the component in question (such as the kernel pool allocator).

While this has been mentioned publicly for some time, to the best of my knowledge, nobody had really sat down and talked about how this feature really worked under the hood. This is relevant for a couple of reasons:

  1. The !address debugger extension doesn’t understand the dynamic kernel address space right now. This means that you can’t easily figure out where an address came from in the debugger on 32-bit Windows Vista systems (or later systems that might use a similar form of dynamic kernel address space).
  2. Understanding the basic concepts of how the feature works provides some insight into what it can (and cannot do) to help alleviate the kernel address space crunch.
  3. While the kernel address space layout has more or less been a relatively well understood concept for 32-bit systems to date, much of that knowledge doesn’t really directly translate to Windows Vista (and potentially later) systems.


Please note that future implementations may not necessarily function the same under the hood with respect to how the kernel address space operates.

Internally, the way the memory manager provides address space to resources such as the non-paged or paged- pools has been restructured such that each of the large address space resources act as clients to a new kernel virtual address space allocator. Wherein these address space resources previously received their address ranges in the form of large, pre-calculated regions at boot time, instead, each now calls upon the memory manager’s internal kernel address space allocator (MiObtainSystemVa) to request additional address space.

The kernel address space allocator can be thought of as maintaining a logical pool of unused virtual address space regions. It’s important to note that the address space allocator doesn’t actually allocate any backing store for the returned address space, nor any other management structures (such as PTEs); it simply reserves a chunk of address space exclusively for the use of the caller. This architecture is required due to the fact that everything from driver image mapping to the paged- and non-paged pool backends have been converted to use the kernel address space allocator routines. Each of these components has very different views of what they’ll actually want to use the address space for, but they all commonly need an address space to work in.

Similarly, if a component has obtained an address space region that it no longer requires, then it may return it to the memory manager with a call to the internal kernel address space allocator routine MiReturnSystemVa.

To place things into perspective, this system is conceptually analogous to reserving a large address space region using VirtualAlloc with MEM_RESERVE in user mode. A MEM_RESERVE reservation doesn’t commit any backing store that allows data to reside at a particular place in the address space, but it does grant exclusive usage of an address range of a particular size, which can then be used in conjunction with whatever backing store the caller requires. Likewise, it is similarly up to the caller to decide how they wish to back the address space returned by MiObtainSystemVa.

The address space chunks dealt with by the kernel address space allocator, similarly to the user mode address space reservation system, don’t necessarily need to be highly granular. Instead, a granularity that is convenient for the memory management is chosen. This is because the clients of the kernel address space allocator will then subdivide the address ranges they receive for their own usage. (The exact granularity of a kernel address space allocation is, of course, subject to change in future releases based upon the whims of what is preferable for the memory manager)

For example, if a driver requests an allocation from the non-paged pool, and there isn’t enough address space assigned to the non-paged pool to handle the request, then the non-paged pool allocator backend will call MiObtainSystemVa to retrieve additional address space. If successful, it will conceptually add this address space to its free address space list, and then return a subdivided chunk of this address space (mated with physical memory backing store, as in this example, we are speaking of the non-paged pool) to the driver. The next request for memory from the non-paged pool might then come from the same address range previously obtained by the non-paged pool allocator backend. This behavior is, again, conceptually similar to how the user mode heap obtains large amounts of address space from the memory manager and then subdivides these address regions into smaller allocations for a caller of, say, operator new.

All of this happens transparently to the existing public clients of the memory manager. For example, drivers don’t observe any particularly different behavior from ExAllocatePoolWithTag. However, because the address space isn’t pre-carved into large regions ahead of time, the memory manager no longer has its hands proverbially tied with respect to allowing one particular consumer of address space to obtain comparatively much more address space than usual (of course, at a cost to the available address space to other components). In effect, the memory manager is now much more able to self-tune for a wide variety of workloads without the user of the system needing to manually discover how much address space would be better allocated to the paged pool versus the system cache, for example.

In addition, the dynamic kernel address space infrastructure has had other benefits as well. As the address spans for the various major consumers of kernel address space are now demand-allocated, so too are PTE and other paging-related structures related to those address spans. This translates to reduced boot-time memory usage. Prior to the dynamic kernel address space’s introduction, the kernel would reduce the size of the various mostly-static address regions based on the amount of physical memory on the system for small systems. However, for large systems, and especially large 64-bit systems, paging-related structures potentially describing large address space regions were pre-allocated at boot time.

On small 64-bit systems, the demand-allocation of paging related structures also removes address space limits on the sizes of individual large address space regions that were previously present to avoid having to pre-allocate vast ranges of paging-related structures.

Presently, on 64-bit systems, many, though not all of the major kernel address space consumers still have their own dedicated address regions assigned internally by the kernel address space allocator, although this could easily change as need be in a future release. However, demand-creation of paging-describing structures is still realized (with the reduction in boot-time memory requirements as described above,

I mentioned earlier that the !address extension doesn’t understand the dynamic kernel address space as of the time of this writing. If you need to determine where a particular address came from while debugging on a 32-bit system that features a dynamic kernel address space, however, you can manually do this by looking into the memory manager’s internal tracking structures to discern for which reason a particular chunk of address space was checked out. (This isn’t quite as granular as !address as, for example, kernel stacks don’t have their own dedicated address range (in Windows Vista). However, it will at least allow you to quickly tell if a particular piece of address space is part of the paged- or non-paged pool, whether it’s part of a driver image section, and soforth.)

In the debugger, you can issue the following (admittedly long and ugly) command to ask the memory manager what type of address a particular virtual address region is. Replace <<ADDRESS>> with the kernel address that you wish to inquire about:

?? (nt!_MI_SYSTEM_VA_TYPE) ( ((unsigned char *)(@@masm(nt!MiSystemVaType)))[ @@masm( ( <<ADDRESS>> - poi(nt!MmSystemRangeStart)) / (@$pagesize *

@$pagesize / @@c++(sizeof(nt!_MMPTE))) ) ] )

Here’s a couple of examples:


kd> ?? (nt!_MI_SYSTEM_VA_TYPE) ( ((unsigned char *)(@@masm(nt!MiSystemVaType)))
[ @@masm( ( 89445008 - poi(nt!MmSystemRangeStart)) / (@$pagesize * @$pagesize / @@c++(sizeof(nt!_MMPTE))) ) ] )
_MI_SYSTEM_VA_TYPE MiVaNonPagedPool (5)

kd> ?? (nt!_MI_SYSTEM_VA_TYPE) ( ((unsigned char *)(@@masm(nt!MiSystemVaType)))
[ @@masm( ( ndis - poi(nt!MmSystemRangeStart)) / (@$pagesize * @$pagesize / @@c++(sizeof(nt!_MMPTE))) ) ] )
_MI_SYSTEM_VA_TYPE MiVaDriverImages (12)

kd> ?? (nt!_MI_SYSTEM_VA_TYPE) ( ((unsigned char *)(@@masm(nt!MiSystemVaType)))
[ @@masm( ( win32k - poi(nt!MmSystemRangeStart)) / (@$pagesize * @$pagesize / @@c++(sizeof(nt!_MMPTE))) ) ] )
_MI_SYSTEM_VA_TYPE MiVaSessionGlobalSpace (11)

The above technique will not work on 64-bit systems that utilize a dynamic (demand-allocated) kernel address space, as the address space tracking is performed differently internally.

(Many thanks to Andrew Rogers and Landy Wang, who were gracious enough to spend some time divulging insights on the subject.)

Recovering a process from a hung debugger

Saturday, February 21st, 2009

One of the more annoying things that can happen while debugging processes that deal with network traffic is happening to attach to something that is in the “critical path” for accessing the debugger’s active symbol path.

In such a scenario, the debugger will usually deadlock itself trying to request symbols, which requires going through some code path that involves the debuggee, which being frozen by the debugger, never gets to run. Usually, one would think that this means a lost repro (and, depending on the criticality of the process attached to the debugger, possibly a forced reboot as well), neither of which happen to be particularly fun outcomes.

It turns out that you’re not actually necessarily hosed if this happens, though. If you can still start a new process on the computer, then there’s actually a way to steal the process back from the debugger (on Windows XP and later), with the new-fangled fancy kernel debug object-based debugging support. Here’s what you need to do:

  1. Attach a new debugger to the process, with symbol support disabled. (You will be attaching to the debuggee and not the debugger.)

    Normally, you can’t attach a debugger to process while it’s already being debugged. However, there’s an option in windbg (and ntsd/cdb as well) that allows you to do this: the “-pe” debugger command-line parameter (documented in the debugger documentation), which forcibly attaches to the target process despite the presence of the hung debugger.

    Of course, force-attaching to the process won’t do any good if the new debugger process will just deadlock right away. As a result, you should make sure that the debugger won’t try and do any symbol-loading activity that might engender a deadlock. This is the command line that I usually use for that purpose, which disables _NT_SYMBOL_PATH appending (“-sins“), disables CodeView pdb pointer following (“-sicv“), and resets the symbol path to a known good value (“-y .“):

    ntsd -sicv -sins -y . -pe -p hung_debuggee_pid

    I recommend using ntsd and not WinDbg for this purpose in order to reduce the chance of a symbol path that might be stored in a WinDbg workspace from causing the debugger to deadlock itself again.

  2. Kill the hung debugger.

    After successfully attaching with “-pe“, you can safely kill the hung debugger (by whatever means necessary) without causing the former debuggee to get terminated along with it.

  3. Resume all threads in the target.

    The suspend count of most threads in the target is likely to be wrong. You can correct this by issuing the “~*M” command set several times (which resumes all threads in the process with the “~M” command).

    To determine the suspend count of all processes in the thread, you can use the “~” command. For example, you might see the following:

    0:001> ~
       0  Id: 18c4.16f0 Suspend: 2 Teb: 7efdd000 Unfrozen
    .  1  Id: 18c4.1a04 Suspend: 2 Teb: 7efda000 Unfrozen
    

    You should issue the “~*M” command enough times to bring the suspend count of all threads down to zero. (Don’t worry if you need to resume a thread more times than it is suspended.) Typically, this would be two times, for the common case, but by checking the suspend count of active threads, you can be certain of the number of times that you need you need to resume all threads in the process.

  4. Detach the new debugger from the debuggee.

    After resuming all threads in the target, use the “qd” command to detach the debugger. Do not attempt to resume the debugger with the “g” command (as it will stay suspended), or quit the debugger without attaching (as that would cause the debuggee to get terminated).

    If you needed to keep a particular thread in the debuggee suspended so that you can re-attach the debugger without losing your place, you can leave that thread with its suspend count above zero.

Voila, the debuggee should return back to life. Now, you should be able to re-attach a debugger (hopefully, with a safe symbol path this time), or not, as desired.

Hotpatching MS08-067

Friday, October 24th, 2008

If you have been watching the Microsoft security bulletins lately, then you’ve likely noticed yesterday’s bulletin, MS08-067. This is a particularly nasty bug, as it doesn’t require authentication to exploit in the default configuration for Windows Server 2003 and earlier systems (assuming that an attacker can talk over port 139 or port 445 to your box).

The usual mitigation for this particular vulnerability is to block off TCP/139 (NetBIOS) and TCP/445 (Direct hosted SMB), thus cutting off remote access to the srvsvc pipe, a prerequisite for exploiting the vulnerability in question. In my case, however, I had a box that I really didn’t want to reboot immediately. In addition, for the box in question, I did not wish to leave SMB blocked off remotely.

Given that I didn’t want to assume that there’d be no possible way for an untrusted user to be able to establish a TCP/139 or TCP/445, this left me with limited options; either I could simply hope that there wasn’t a chance for the box to get compromised before I had a chance for it to be convenient to reboot, or I could see if I could come up with some form of alternative mitigation on my own. After all, a debugger is the software development equivalent of a swiss army knife and duct-tape; I figured that it would be worth a try seeing if I could cobble together some sort of mitigation by manually patching the vulnerable netapi32.dll. To do this, however, it would be necessary to gain enough information about the flaw in question in order to discern what the fix was, in the hope of creating some form of alternative countermeasure for the vulnerability.

The first stop for gaining more information about the bug in question would be the Microsoft advisory. As usual, however, the bulletin released for the MS08-067 issue was lacking in sufficiently detailed technical information as required to fully understand the flaw in question to the degree necessary down to the level of what functions were patched, aside from the fact that the vulnerability resided somewhere in netapi32.dll (the Microsoft rationale behind this policy is that providing that level of technical detail would simply aid the creation of exploits). However, as Pusscat presented at Blue Hat Fall ’07, reverse engineering most present-day Microsoft security patches is not particularly insurmountable.

The usual approach to the patch reverse engineering process is to use a program called bindiff (an IDA plugin) that analyzes two binaries in order to discover the differences between the two. In my case, however, I didn’t have a copy of bindiff handy (it’s fairly pricey). Fortunately (or unfortunately, depending on your persuasion), there already existed a public exploit for this bug, as well as some limited public information from persons who had already reverse engineered the patch to a degree. To this end, I had a particular function in the affected module (netapi32!NetpwPathCanonicalize) which I knew was related to the vulnerability in some form.

At this point, I brought up a copy of the unpatched netapi32.dll in IDA, as well as a patched copy of netapi32.dll, then started looking through and comparing disassembly one page at a time until an interesting difference popped up in a subfunction of netapi32!NetpwPathCanonicalize:

Unpatched code:

.text:000007FF7737AF90 movzx   eax, word ptr [rcx]
.text:000007FF7737AF93 xor     r10d, r10d
.text:000007FF7737AF96 xor     r9d, r9d
.text:000007FF7737AF99 cmp     ax, 5Ch
.text:000007FF7737AF9D mov     r8, rcx
.text:000007FF7737AFA0 jz      loc_7FF7737515E

Patched code:

.text:000007FF7737AFA0 mov     r8, rcx

.text:000007FF7737AFA3 xor     eax, eax

.text:000007FF7737AFA5 mov     [rsp+arg_10], rbx
.text:000007FF7737AFAA mov     [rsp+arg_18], rdi
.text:000007FF7737AFAF jmp     loc_7FF7738E5D6

.text:000007FF7738E5D6 mov     rcx, 0FFFFFFFFFFFFFFFFh
.text:000007FF7738E5E0 mov     rdi, r8
.text:000007FF7738E5E3 repne scasw

.text:000007FF7738E5E6 movzx   eax, word ptr [r8]
.text:000007FF7738E5EA xor     r11d, r11d

.text:000007FF7738E5ED not     rcx

.text:000007FF7738E5F0 xor     r10d, r10d

.text:000007FF7738E5F3 dec     rcx

.text:000007FF7738E5F6 cmp     ax, 5Ch

.text:000007FF7738E5FA lea     rbx, [r8+rcx*2+2]

.text:000007FF7738E5FF jnz     loc_7FF7737AFB4

Now, without even really understanding what’s going on here on the function as a whole, it’s pretty obvious that here’s where (at least one) modification is being made; the new code involved the addition of an inline wcslen call. Typically, security fixes for buffer overrun conditions involve the creation of previously missing boundary checks, so a new call to a string-length function such as wcslen is a fairly reliable indicator that one’s found the site of the fix for the the vulnerability in question.

(The repne scasw instruction set scans a memory region two-bytes at a time until a particular value (in rax) is reached, or the maximum count (in rcx, typically initialized to (size_t)-1) is reached. Since we’re scanning two bytes at a time, and we’ve initialized rax to zero, we’re looking for an 0x0000 value in a string of two-byte quantities; in other words, an array of WCHARs (or a null terminated Unicode string). The resultant value on rcx after executing the repne scasw can be used to derive the length of the string, as it will have been decremented based on the number of WCHARs encountered before the 0x0000 WCHAR.)

My initial plan was, assuming that the fix was trivial, to simply perform a small opcode patch on the unpatched version of netapi32.dll in the Server service process on the box in question. In this particular instance, however, there were a number of other changes throughout the patched function that made use of the additional length check. As a result, a small opcode patch wasn’t ideal, as large parts of the function would need to be rewritten to take advantage of the extra length check.

Thus, plan B evolved, wherein the Microsoft-supplied patched version of netapi32.dll would be injected into an already-running Server service process. From there, the plan was to detour buggy_netapi32!NetpwPathCanonicalize to fixed_netapi32!NetpwPathCanonicalize.

As it turns out, netapi32!NetpwPathCanonicalize and all of its subfunctions are stateless with respect to global netapi32 variables (aside from the /GS cookie), which made this approach feasible. If the call tree involved a dependancy on netapi32 global state, then simply detouring the entire call tree wouldn’t have been a valid option, as the globals in the fixed netapi32.dll would be distinct from the globals in the buggy netapi32.dll.

This approach also makes the assumption that the only fixes made for the patch were in netapi32!NetpwPathCanonicalize and its call tree; as far as I know, this is the case, but this method is (of course) completely unsupported by Microsoft. Furthermore, as x64 binaries are built without hotpatch nop stubs at their prologue, the possibility for atomic patching in of a detour appeared to be out, so this approach has a chance of failing in the (unlikely) scenario where the first few instructions of netapi32!NetpwPathCanonicalize were being executed at the time of the detour.

Nonetheless, the worst case scenario would be that the box went down, in which case I’d be rebooting now instead of later. As the whole point of this exercise was to try and delay rebooting the system in question, I decided that this was an acceptable risk in my scenario, and elected to proceed. For the first step, I needed a program to inject a DLL into the target process (SDbgExt does not support !loaddll on 64-bit targets, sadly). The program that I came up with is certainly quick’n’dirty, as it fudges the thread start routine in terms of using kernel32!LoadLibraryA as the initial start address (which is a close enough analogue to LPTHREAD_START_ROUTINE to work), but it does the trick in this particular case.

The next step was to actually load the app into the svchost instance containing the Server service instance. To determine which svchost process this happens to be, one can use “tasklist /svc” from a cmd.exe console, which provides a nice formatted view of which services are hosted in which processes:

C:\WINDOWS\ms08-067-hotpatch>tasklist /svc
[…]
svchost.exe 840 AeLookupSvc, AudioSrv, BITS, Browser,
CryptSvc, dmserver, EventSystem, helpsvc,
HidServ, IAS, lanmanserver,
[…]

That being done, the next step was to inject the DLL into the process. Unfortunately, the default security descriptor on svchost.exe instances doesn’t allow Administrators the access required to inject a thread. One way to solve this problem would have been to write the code to enable the debug privilege in the injector app, but I elected to simply use the age-old trick of using the scheduler service (at.exe) to launch the program in question as LocalSystem (this, naturally, requires that you already be an administrator in order to succeed):

C:\WINDOWS\ms08-067-hotpatch>at 21:32 C:\windows\ms08-067-hotpatch\testapp.exe 840 C:\windows\ms08-067-hotpatch\netapi32.dll
Added a new job with job ID = 1

(21:32 was one minute from the time when I entered that command, minutes being the minimum granularity for the scheduler service.)

Roughly one minute later, the debugger (attached to the appropriate svchost instance) confirmed that the patched DLL was loaded successfully:

ModLoad: 00000000`04ff0000 00000000`05089000 
C:\windows\ms08-067-hotpatch\netapi32.dll

Stepping back for a moment, attaching WinDbg to an svchost instance containing services in the symbol server lookup code path is risky business, as you can easily deadlock the debugger. Proceed with care!

Now that the patched netapi32.dll was loaded, it was time to detour the old netapi32.dll to refer to the new netapi32.dll. Unfortunately, WinDbg doesn’t support assembling amd64 instructions very well (64-bit addresses and references to the extended registers don’t work properly), so I had to use a separate assembler (HIEW, [Hacker’s vIEW]) and manually patch in the opcode bytes for the detour sequence (mov rax, <absolute addresss> ; jmp rax):

0:074> eb NETAPI32!NetpwPathCanonicalize 48 C7 C0 40 AD FF 04 FF E0
0:074> u NETAPI32!NetpwPathCanonicalize
NETAPI32!NetpwPathCanonicalize:
000007ff`7737ad30 48c7c040adff04
mov rax,offset netapi32_4ff0000!NetpwPathCanonicalize
000007ff`7737ad37 ffe0            jmp     rax
0:076> bp NETAPI32!NetpwPathCanonicalize
0:076> g

This said and done, all that remained was to set a breakpoint on netapi32!NetpwPathCanonicalize and give the proof of concept exploit a try against my hotpatched system (it survived). Mission accomplished!

The obvious disclaimer: This whole procedure is a complete hack, and not recommended for production use, for reasons that should be relatively obvious. Additionally, MS08-067 did not come built as officially “hotpatch-enabled” (i.e. using the Microsoft supported hotpatch mechanism); “hotpatch-enabled” patches do not entail such a sequence of hacks in the deployment process.

(Thanks to hdm for double-checking some assumptions for me.)

Why does every heap trace in UMDH get stuck at “malloc”?

Thursday, February 21st, 2008

One of the more useful tools for tracking down memory leaks in Windows is a utility called UMDH that ships with the WinDbg distribution. Although I’ve previously covered what UMDH does at a high level, and how it functions, the basic principle for it, in a nutshell, is that it uses special instrumentation in the heap manager that is designed to log stack traces when heap operations occur.

UMDH utilizes the heap manager’s stack trace instrumentation to associate call stacks with outstanding allocations. More specifically, UMDH is capable of taking a “snapshot” of the current state of all heaps in a process, associating like-sized allocations from like-sized callstacks, and aggregrating them in a useful form.

The general principle of operation is that UMDH is typically run two (or more times), once to capture a “baseline” snapshot of the process after it has finished initializing (as there are expected to always be a number of outstanding allocations while the process is running that would not be normally expected to be freed until process exit time, for example, any allocations used to build the command line parameter arrays provided to the main function of a C program, or any other application-derived allocations that would be expected to remain checked out for the lifetime of the program.

This first “baseline” snapshot is essentially intended to be a means to filter out all of these expected, long-running allocations that would otherwise show up as useless noise if one were to simply take a single snapshot of the heap after the process had leaked memory.

The second (and potentially subsequent) snapshots are intended to be taken after the process has leaked a noticeable amount of memory. UMDH is then run again in a special mode that is designed to essentially do a logical “diff” between the “baseline” snapshot and the “leaked” snapshot, filtering out any allocations that were present in both of them and returning a list of new, outstanding allocations, which would generally include any leaked heap blocks (although there may well be legitimate outstanding allocations as well, which is why it is important to ensure that the “leaked” snapshot is taken only after a non-trivial amount of memory has been leaked, if at all possible).

Now, this is all well and good, and while UMDH proves to be a very effective tool for tracking down memory leaks with this strategy, taking a “before” and “after” diff of a problem and analyzing the two to determine what’s gone wrong is hardly a new, ground-breaking concept.

While the theory behind UMDH is sound, however, there are some situations where it can work less than optimally. The most common failure case of UMDH in my experience is not actually so much related to UMDH itself, but rather the heap manager instrumentation code that is responsible for logging stack traces in the first place.

As I had previously discussed, the heap manager stack trace instrumentation logic does not have access to symbols, and on x86, “perfect” stack traces are not generally possible, as there is no metadata attached with a particular function (outside of debug symbols) that describes how to unwind past it.

The typical approach taken on x86 is to assume that all functions in the call stack do not use frame pointer omission (FPO) optimizations that allow the compiler to eliminate the usage of ebp for a function entirely, or even repurpose it for a scratch register.

Now, most of the libraries that ship with the operating system in recent OS releases have FPO explicitly turned off for x86 builds, with the sole intent of allowing the built-in stack trace instrumentation logic to be able to traverse through system-supplied library functions up through to application code (after all, if every heap stack trace dead-ended at kernel32!HeapAlloc, the whole concept of heap allocation traces would be fairly useless).

Unfortunately, there happens to be a notable exception to this rule, one that actually came around to bite me at work recently. I was attempting to track down a suspected leak with UMDH in one of our programs, and noticed that all of the allocations were grouped into a single stack trace that dead-ended in a rather spectacularly unhelpful way. Digging in a bit deeper, in the individual snapshot dumps from UMDH contained scores of allocations with the following backtrace logged:

00000488 bytes in 0x1 allocations
   (@ 0x00000428 + 0x00000018) by: BackTrace01786

        7C96D6DC : ntdll!RtlDebugAllocateHeap+000000E1
        7C949D18 : ntdll!RtlAllocateHeapSlowly+00000044
        7C91B298 : ntdll!RtlAllocateHeap+00000E64
        211A179A : program!malloc+0000007A

This particular outcome happened to be rather unfortunate, as in the specific case of the program I was debugging at work, virtually all memory allocations in the program (including the ones I suspected of leaking) happened to ultimately get funneled through malloc.

Obviously, getting told that “yes, every leaked memory allocation goes through malloc” isn’t really all that helpful if (most) every allocation in the program in question happened to go through malloc. The UMDH output begged the question, however, as to why exactly malloc was breaking the stack traces. Digging in a bit deeper, I discovered the following gem while disassembling the implementation of malloc:

0:011> u program!malloc
program!malloc
[f:\sp\vctools\crt_bld\self_x86\crt\src\malloc.c @ 155]:
211a1720 55              push    ebp
211a1721 8b6c2408        mov     ebp,dword ptr [esp+8]
211a1725 83fde0          cmp     ebp,0FFFFFFE0h
[...]

In particular, it would appear that the default malloc implementation on the static link CRT on Visual C++ 2005 not only doesn’t use a frame pointer, but it trashes ebp as a scratch register (here, using it as an alias register for the first parameter, the count in bytes of memory to allocate). Disassembling the DLL version of the CRT revealed the same problem; ebp was reused as a scratch register.

What does this all mean? Well, anything using malloc that’s built with Visual C++ 2005 won’t be diagnosable with UMDH or anything else that relies on ebp-based stack traces, at least not on x86 builds. Given that many things internally go through malloc, including operator new (at least in the default implementation), this means that in the default configuration, things get a whole lot harder to debug than they should be.

One workaround here would be to build your own copy of the CRT with /Oy- (force frame pointer usage), but I don’t really consider building the CRT a very viable option, as that’s a whole lot of manual work to do and get up and running correctly on every developer’s machine, not to mention all the headaches that service releases that will require rebuilds will bring with such an approach.

For operator new, it’s fortunately relatively doable to overload it in a relatively supported way to be implemented against a different allocation strategy. In the case of malloc, however, things don’t really have such a happy ending; one is either forced to re-alias the name using preprocessor macro hackery to a custom implementation that does not suffer from a lack of frame pointer usage, or otherwise change all references to malloc/free to refer to a custom allocator function (perhaps implemented against the process heap directly instead of the CRT heap a-la malloc).

So, the next time you use UMDH and get stuck scratching your head while trying to figure out why your stack traces are all dead-ending somewhere less than useful, keep in mind that the CRT itself may be to blame, especially if you’re relying on CRT allocators. Hopefully, in a future release of Visual Studio, the folks responsible for turning off FPO in the standard OS libraries can get in touch with the persons responsible for CRT builds and arrange for the same to be done, if not for the entire CRT, then at least for all the code paths in the standard heap routines. Until then, however, these CRT allocator routines remain roadblocks for effective leak diagnosis, at least when using the better tools available for the job (UMDH).

The default invalid parameter behavior for the VC8 CRT doesn’t break into the debugger

Thursday, November 8th, 2007

One of the problems that confuses people from time to time here at work is that if you happen to hit a condition that trips the “invalid parameter” handler for VC8, and you’ve got a debugger attached to the process that fails, then the process mysteriously exits without giving the debugger a chance to inspect the condition of the program in question.

For those unfamiliar with the concept, the “invalid parameter” handler is a new addition to the Microsoft CRT, which kills the process if various invalid states are encountered. For example, dereferencing a bogus iterator in a release build might trip the invalid parameter handler if you’re lucky (if not, you might see random memory corruption, of course).

The reason why there is no debugger interaction here is that the default CRT invalid parameter handler (present in invarg.c if you’ve got the CRT source code handy) invokes UnhandledExceptionFilter in an attempt to (presumably) give the debugger a crack at the exception. Unfortunately, in reality, UnhandledExceptionFilter will just return immediately if a debugger is attached to the process, assuming that this will cause the standard SEH dispatcher logic to pass the event to the debugger. Because the default invalid parameter handler doesn’t really go through the SEH dispatcher but is in fact simply a direct call to UnhandledExceptionFilter, this results in no notification to the debugger whatsoever.

This counter-intuitive behavior can be more than a little bit confusing when you’re trying to debug a problem, since from the debugger, all you might see in a case like a bad iterator dereference would be this:

0:000:x86> g
ntdll!NtTerminateProcess+0xa:
00000000`7759053a c3              ret

If we pull up a stack trace, then things become a bit more informative:

0:000:x86> k
RetAddr           
ntdll32!ZwTerminateProcess+0x12
kernel32!TerminateProcess+0x20
MSVCR80!_invoke_watson+0xe6
MSVCR80!_invalid_parameter_noinfo+0xc
TestApp!wmain+0x10
TestApp!__tmainCRTStartup+0x10f
kernel32!BaseThreadInitThunk+0xe
ntdll32!_RtlUserThreadStart+0x23

However, while we can get a stack trace for the thread that tripped the invalid parameter event in cases like this with a simple single threaded program, adding multiple threads will throw a wrench into the debuggability of this scenario. For example, with the following simple test program, we might see the following when running the process under the debugger after we continue the initial process breakpoint (this example is being run as a 32-bit program under Vista x64, though the same principle should apply elsewhere):

0:000:x86> g
ntdll!RtlUserThreadStart:
sub     rsp,48h
0:000> k
Call Site
ntdll!RtlUserThreadStart

What happened? Well, the last thread in the process here happened to be the newly created thread instead of the thread that called TerminateProcess. To make matters worse, the other thread (which was the one that caused the actual problem) is already gone, killed by TerminateProcess, and its stack has been blown away. This means that we can’t just figure out what’s happened by asking for a stack trace of all threads in the process:

0:000> ~*k

.  0  Id: 1888.1314 Suspend: -1 Unfrozen
Call Site
ntdll!RtlUserThreadStart

Unfortunately, this scenario is fairly common in practice, as most non-trivial programs use multiple threads for one reason or another. If nothing else, many OS-provided APIs internally create or make use of worker threads.

There is a way to make out useful information in a scenario like this, but it is unfortunately not easy to do after the fact, which means that you’ll need to have a debugger attached and at your disposal before the failure happens. The simplest way to catch the culprit red-handed here is to just breakpoint on ntdll!NtTerminateProcess. (A conditional breakpoint could be employed to check for NtCurrentProcess ((HANDLE)-1) in the first parameter if the process frequently calls TerminateProcess, but this is typically not the case and often it is sufficient to simply set a blind breakpoint on the routine.)

For example, in the case of the provided test program, we get much more useful results with the breakpoint in place:

0:000:x86> bp ntdll32!NtTerminateProcess
0:000:x86> g
Breakpoint 0 hit
ntdll32!ZwTerminateProcess:
mov     eax,29h
0:000:x86> k
RetAddr           
ntdll32!ZwTerminateProcess
kernel32!TerminateProcess+0x20
MSVCR80!_invoke_watson+0xe6
MSVCR80!_invalid_parameter_noinfo+0xc
TestApp!wmain+0x2e
TestApp!__tmainCRTStartup+0x10f
kernel32!BaseThreadInitThunk+0xe
ntdll32!_RtlUserThreadStart+0x23

That’s much more diagnosable than a stack trace for the completely wrong thread.

Note that from an error reporting perspective, it is possible to catch these errors by registering an invalid parameter handler (via _set_invalid_parameter_handler), which is rougly analogus to the mechanism one uses to register a custom handler for pure virtual function call failures.

I tend to prefer debugging with release builds instead of debug builds.

Friday, November 2nd, 2007

One of the things that I find myself espousing both at work and outside of work from time to time is the value of debugging using release builds of programs (for Windows applications, anyways). This may seem contradictory to some at first glance, as one would tend to believe that the debug build is in fact better for debugging (it is named the “debug build”, after all).

However, I tend to disagree with this sentiment, on several grounds:

  1. Debugging on debug builds only is an unrealistic situation. Most of the “interesting” problems that crop up in real life tend to be with release builds on customer sites or production environments. Many of the time, we do not have the luxury of being able to ship out a debug build to a customer or production environment.

    There is no doubt that debugging using the debug build can be easier, but I am of the opinion that it is disadvantageous to be unable to effectively debug release builds. Debugging with release builds all the time ensures that you can do this when you’ve really got no choice, or when it is not feasible to try and repro a problem using a debug build.

  2. Debug builds sometimes interfere with debugging. This is a highly counterintuitive concept initially, one that many people seem to be surprised at. To see what I mean, consider the scenario where one has a random memory corruption bug.

    This sort of problem is typically difficult and time consuming to track down, so one would want to use all available tools to help in this process. One most useful tool in the toolkit of any competent Windows debugger should be page heap, which is a special mode of the RTL heap (which implements the Win32 heap as exposed by APIs such as HeapAlloc).

    Page heap places a guard page at the end (or before, depending on its configuration) of every allocation. This guard page is marked inaccessible, such that any attempt to write to an allocation that exceeds the bounds of the allocated memory region will immediately fault with an access violation, instead of leaving the corruption to cause random failures at a later time. In effect, page heap allows one to catch the guility party “red handed” in many classes of heap corruption scenarios.

    Unfortunately, the debug build greatly diminishes the ability of page heap to operate. This is because when the debug version of the C runtime is used, any memory allocations that go through the CRT (such as new, malloc, and soforth) have special check and fill patterns placed before and after the allocation. These fill patterns are intended to be used to help detect memory corruption problems. When a memory block is returned using an API such as free, the CRT first checks the fill patterns to ensure that they are intact. If a discrepancy is found, the CRT will break into the debugger and notify the user that memory corruption has occured.

    If one has been following along thus far, it should not be too difficult to see how this conflicts with page heap. The problem lies in the fact that from the heap’s perspective, the debug CRT per-allocation metadata (including the check and fill patterns) are part of the user allocation, and so the special guard page is placed after (or before, if underrun protection is enabled) the fill patterns. This means that some classes of memory corruption bugs will overwrite the debug CRT metadata, but won’t trip page heap up, meaning that the only indication of memory corruption will be when the allocation is released, instead of when the corruption actually occured.

  3. Local variable and source line stepping are unreliable in release builds. Again, as with the first point, it is dangerous to get into a pattern of relying on these conveniences as they simply do not work correctly (or in the expected fashion) in release builds, after the optimizer has had its way with the program. If you get used to always relying on local variable and source line support, when used in conjunction with debug builds, then you’re going to be in for a rude awakening when you have to debug a release build. More than once at work I’ve been pulled in to help somebody out after they had gone down a wrong path when debugging something because the local variable display showed the wrong contents for a variable in a release build.

    The moral of the story here is to not rely on this information from the debugger, as it is only reliable for debug builds. Even then, local variable display will not work correctly unless you are stepping in source line mode, as within a source line (while stepping in assembly mode), local variables may not be initialized in the way that the debugger expects given the debug information.

Now, just to be clear, I’m not saying that anyone should abandon debug builds completely. There are a lot of valuable checks added by debug builds (assertions, the enhanced iterator validation in the VS2005 CRT, and stack variable corruption checks, just to name a few). However, it is important to be able to debug problems with release builds, and it seems to me that always relying on debug builds is detrimental to being able to do this. (Obviously, this can vary, but this is simply speaking on my personal experience.)

When I am debugging something, I typically only use assembly mode and line number information, if available (for manually matching up instructions with source code). Source code is still of course a useful time saver in many instances (if you have it), but I prefer not relying on the debugger to “get it right” with respect to such things, having been burned too many times in the past with incorrect results being returned in non-debug builds.

With a little bit of practice, you can get the same information that you would out of local variable display and the like with some basic reading of disassembly text and examination of the stack and register contents. As an added bonus, if you can do this in debug builds, you should by definition be able to do so in release builds as well, even when the debugger is unable to track locals correctly due to limitations in the debug information format.

VMKD 1.1.1.7 released

Sunday, October 28th, 2007

I have posted an update to VMKD VMKD (1.1.1.7). Since the last release (1.1.1.4), the following things have changed (there is a changelog included with the package):

  1. Fixed an assert that kdvmware.sys was tripping on checked builds of the kernel (whoops). There was a bug in the code that was reprotecting kdcom.dll as a part of assuming control over the KD I/O routines.
  2. Added a potential fix for occasional difficulties resynchronizing with the guest across a reboot if DbgEng is not restarted. If you are still seeing synchronization problems from time to time, I’d be interested to see debug output from vmxpatch (available by attaching a debugger to vmware-vmx.exe) and DbgEng itself (available with CTRL-D/CTRL-ALT-D in kd.exe or WinDbg.exe, respectively).
  3. Added support for partial checked builds in a rather limited fashion. Any checked kernel that is used with VMKD ought to be named “krnltest.exe” in the guest. This seemingly arbitrary limitation is present because the file name specified via /KERNEL= is the actual name that appears in the loaded module list, and VMKD uses string comparisons on loaded module list file names to find the kernel image in-memory. There are certainly “better” ways to do this, but the current approach is fairly simple and aside from checked builds, tends to be the most reliable and officially supported way across a wide range of OS versions. Any file name may be specified for the checked HAL module in a partial checked build configuration.

    In the future, I may update the check to be more clever about finding the kernel so as to not rely on string comparisons, but it does not really appear to be worth the time for most purposes at this point.

Additionally, it has been confirmed that VMKD works with VMware Server 1.0.4 (no changes were required on VMKD’s end, and previous releases will work with VMware Server 1.0.4 as well). I still have not gotten around to verifying the operation on VMware Workstation, as for most purposes I have moved my VMware usage almost completely over to VMware Server.

Now, back to your regularly scheduled coverage on the depths of thread local storage on Windows…