Archive for the ‘Programming’ Category

I tend to prefer debugging with release builds instead of debug builds.

Friday, November 2nd, 2007

One of the things that I find myself espousing both at work and outside of work from time to time is the value of debugging using release builds of programs (for Windows applications, anyways). This may seem contradictory to some at first glance, as one would tend to believe that the debug build is in fact better for debugging (it is named the “debug build”, after all).

However, I tend to disagree with this sentiment, on several grounds:

  1. Debugging on debug builds only is an unrealistic situation. Most of the “interesting” problems that crop up in real life tend to be with release builds on customer sites or production environments. Much of the time, we do not have the luxury of being able to ship out a debug build to a customer or production environment.

    There is no doubt that debugging using the debug build can be easier, but I am of the opinion that it is disadvantageous to be unable to effectively debug release builds. Debugging with release builds all the time ensures that you can do this when you’ve really got no choice, or when it is not feasible to try and repro a problem using a debug build.

  2. Debug builds sometimes interfere with debugging. This is a highly counterintuitive concept initially, one that many people seem to be surprised at. To see what I mean, consider the scenario where one has a random memory corruption bug.

    This sort of problem is typically difficult and time consuming to track down, so one would want to use all available tools to help in this process. One of the most useful tools in the toolkit of any competent Windows debugger should be page heap, which is a special mode of the RTL heap (which implements the Win32 heap as exposed by APIs such as HeapAlloc).

    Page heap places a guard page at the end (or before, depending on its configuration) of every allocation. This guard page is marked inaccessible, such that any attempt to write to an allocation that exceeds the bounds of the allocated memory region will immediately fault with an access violation, instead of leaving the corruption to cause random failures at a later time. In effect, page heap allows one to catch the guilty party “red handed” in many classes of heap corruption scenarios.

    Unfortunately, the debug build greatly diminishes the ability of page heap to operate. This is because when the debug version of the C runtime is used, any memory allocations that go through the CRT (such as new, malloc, and so forth) have special check and fill patterns placed before and after the allocation. These fill patterns are intended to be used to help detect memory corruption problems. When a memory block is returned using an API such as free, the CRT first checks the fill patterns to ensure that they are intact. If a discrepancy is found, the CRT will break into the debugger and notify the user that memory corruption has occurred.

    If one has been following along thus far, it should not be too difficult to see how this conflicts with page heap. The problem lies in the fact that from the heap’s perspective, the debug CRT per-allocation metadata (including the check and fill patterns) are part of the user allocation, and so the special guard page is placed after (or before, if underrun protection is enabled) the fill patterns. This means that some classes of memory corruption bugs will overwrite the debug CRT metadata, but won’t trip page heap up, meaning that the only indication of memory corruption will be when the allocation is released, instead of when the corruption actually occurred.

  3. Local variable and source line stepping are unreliable in release builds. Again, as with the first point, it is dangerous to get into a pattern of relying on these conveniences as they simply do not work correctly (or in the expected fashion) in release builds, after the optimizer has had its way with the program. If you get used to always relying on local variable and source line support, when used in conjunction with debug builds, then you’re going to be in for a rude awakening when you have to debug a release build. More than once at work I’ve been pulled in to help somebody out after they had gone down a wrong path when debugging something because the local variable display showed the wrong contents for a variable in a release build.

    The moral of the story here is to not rely on this information from the debugger, as it is only reliable for debug builds. Even then, local variable display will not work correctly unless you are stepping in source line mode, as within a source line (while stepping in assembly mode), local variables may not be initialized in the way that the debugger expects given the debug information.

Now, just to be clear, I’m not saying that anyone should abandon debug builds completely. There are a lot of valuable checks added by debug builds (assertions, the enhanced iterator validation in the VS2005 CRT, and stack variable corruption checks, just to name a few). However, it is important to be able to debug problems with release builds, and it seems to me that always relying on debug builds is detrimental to being able to do this. (Obviously, this can vary, but this is simply speaking from my personal experience.)

When I am debugging something, I typically only use assembly mode and line number information, if available (for manually matching up instructions with source code). Source code is still of course a useful time saver in many instances (if you have it), but I prefer not relying on the debugger to “get it right” with respect to such things, having been burned too many times in the past with incorrect results being returned in non-debug builds.

With a little bit of practice, you can get the same information that you would out of local variable display and the like with some basic reading of disassembly text and examination of the stack and register contents. As an added bonus, if you can do this in debug builds, you should by definition be able to do so in release builds as well, even when the debugger is unable to track locals correctly due to limitations in the debug information format.

How does one retrieve the 32-bit context of a Wow64 program from a 64-bit process on Windows Server 2003 x64?

Thursday, November 1st, 2007

Recently, Jimmy asked me what the recommended way to retrieve the 32-bit context of a Wow64 application on Windows XP x64 / Windows Server 2003 x64 was.

I originally responded that the best way to do this was to use Wow64GetThreadContext, but Jimmy mentioned that this doesn’t exist on Windows XP x64 / Windows Server 2003 x64. Sure enough, I checked and it’s really not there, which is rather a bummer if one is trying to implement a 64-bit debugger process capable of debugging 32-bit processes on pre-Vista operating systems.

Normally, I don’t recommend using undocumented implementation details in production code, but in this case, there seems to be little choice as there’s no documented mechanism to perform this operation prior to Vista. Because Vista introduces a documented way to perform this task, going an undocumented route is at least slightly less questionable, as there’s an upper bound on what operating systems need to be supported, and major changes to the implementation of things on downlevel operating systems are rarer than with new operating system releases.

Clearly, this is not always the case; Windows XP Service Pack 2 changed an enormous amount of things, for instance. However, as a general rule, service packs tend to be relatively conservative with this sort of thing. That’s not to say that one has carte blanche with using undocumented implementation details on downlevel platforms, but perhaps one can sleep a bit easier at night knowing that things are less likely to break than in the next Windows release.

I had previously mentioned that the Wow64 layer takes a rather unexpected approach to how to implement GetThreadContext and SetThreadContext. While I mentioned at a high level what was going on, I didn’t really go into the details all that much.

The basic implementation of these routines is to determine whether the thread is running in 64-bit mode or not (determined by examining the SegCs value of the 64-bit context record for the thread as returned by NtGetContextThread). If the thread is running in 64-bit mode, and the thread is a Wow64 thread, then an assumption can be made that the thread is in the middle of a callout to the Wow64 layer (say, a system call).

In this case, the 32-bit context is saved at a well-known location by the process that translates from running in 32-bit mode to running in 64-bit mode for system calls and other voluntary, user mode “32-bit break out” events. Specifically, the Wow64 layer repurposes the second TLS slot of each 64-bit thread (that is, Teb->TlsSlots[ 1 ]) to point to a structure of the following layout:

typedef struct _WOW64_THREAD_INFO
{
   ULONG UnknownPrefix;
   WOW64_CONTEXT Wow64Context;
   ULONG UnknownSuffix;
} WOW64_THREAD_INFO, *PWOW64_THREAD_INFO;

(The real structure name is not known.)

Normally, system components do not use the TLS array, but the Wow64 layer is an exception. Because there is not normally any third party 64-bit code running in a Wow64 process, the Wow64 layer is free to do what it wants with the TlsSlots array of the 64-bit TEB for a Wow64 thread. (Each Wow64 thread has its own, separate 32-bit TEB, so this does not interfere with the operation of TLS by the 32-bit program that is currently executing.)

In the case where the requested Wow64 thread is in a 64-bit Wow64 callout, all one needs to do is to retrieve the base address of the 64-bit TEB of the thread in question, read the second entry in the TlsSlots array, and then read the WOW64_CONTEXT structure out of the memory block referred to by the second 64-bit TLS slot.

The other case that is significant is that where the Wow64 thread is running 32-bit code and is not in a Wow64 callout. In this case, because Wow64 runs x86 code natively, one simply needs to capture the 64-bit context of the desired thread and truncate all of the 64-bit registers to their 32-bit counterparts.

Setting the context of a Wow64 thread works exactly like retrieving the context of a Wow64 thread, except in reverse; one either modifies the 64-bit thread context if the thread is running 32-bit code, or one modifies the saved context record based off of the 64-bit TEB of the desired thread (which will be restored when the thread resumes execution).

I have posted a basic implementation of a version of Wow64GetThreadContext that operates on pre-Windows-Vista platforms. Note that this implementation is incomplete; it does not translate floating point registers, nor does it only act on the subset of registers requested by the caller in CONTEXT::ContextFlags. The provided code also does not implement Wow64SetThreadContext; implementing the “set” operation and extending the “get” operation to fully conform to GetThreadContext semantics are left as an exercise for the reader.

This code will operate on Vista x64 as well, although I would strongly recommend using the documented API on Vista and later platforms instead.

Note that the operation of Wow64 on IA64 platforms is completely different from that on x64. This information does not apply in any way to the IA64 version of Wow64.

Thread Local Storage, part 4: Accessing __declspec(thread) data

Thursday, October 25th, 2007

Yesterday, I outlined how the compiler and linker cooperate to support TLS. However, I didn’t mention just what exactly goes on under the hood when one declares a __declspec(thread) variable and accesses it.

Before the inner workings of a __declspec(thread) variable access can be explained, however, it is necessary to discuss several more special variables in tlssup.c. These special variables are referenced by _tls_used to create the TLS directory for the image.

The first variable of interest is _tls_index, which is implicitly referenced by the compiler in the per-thread storage resolution mechanism any time a thread local variable is referenced (well, almost every time; there’s an exception to this, which I’ll mention later on). _tls_index is also the only variable declared in tlssup.c that uses the default allocation storage class. Internally, it represents the current module’s TLS index. The per-module TLS index is, in principle, similar to a TLS index returned by TlsAlloc. However, the two are not compatible, and there exists significantly more work behind the per-module TLS index and its supporting code. I’ll cover all of that later as well; for now, just bear with me.

The definitions of _tls_start and _tls_end appear as so in tlssup.c:

#pragma data_seg(".tls")

#if defined (_M_IA64) || defined (_M_AMD64)
_CRTALLOC(".tls")
#endif
char _tls_start = 0;

#pragma data_seg(".tls$ZZZ")

#if defined (_M_IA64) || defined (_M_AMD64)
_CRTALLOC(".tls$ZZZ")
#endif
char _tls_end = 0;

This code creates the two variables and places them at the start and end of the “.tls” section. The compiler and linker will automatically assume a default allocation section of “.tls” for all __declspec(thread) variables, such that they will be placed between _tls_start and _tls_end in the final image. The two variables are used to tell the linker the bounds of the TLS storage template section, via the image’s TLS directory (_tls_used).

Now that we know how __declspec(thread) works from a language level, it is necessary to understand the supporting code the compiler generates for an access to a __declspec(thread) variable. This supporting code is, fortunately, fairly straightforward. Consider the following test program:

__declspec(thread) int threadedint = 0;

int __cdecl wmain(int ac,
   wchar_t **av)
{
   threadedint = 42;

   return 0;
}

For x64, the compiler generated the following code:

mov	 ecx, DWORD PTR _tls_index
mov	 rax, QWORD PTR gs:88
mov	 edx, OFFSET FLAT:threadedint
mov	 rax, QWORD PTR [rax+rcx*8]
mov	 DWORD PTR [rdx+rax], 42

Recall that the gs segment register refers to the base address of the TEB on x64. 88 (0x58) is the offset in the TEB for the ThreadLocalStoragePointer member on x64 (more on that later):

   +0x058 ThreadLocalStoragePointer : Ptr64 Void

If we examine the code after the linker has run, however, we’ll notice something strange:

mov     ecx, cs:_tls_index
mov     rax, gs:58h
mov     edx, 4
mov     rax, [rax+rcx*8]
mov     dword ptr [rdx+rax], 2Ah ; 42
xor     eax, eax

If you haven’t noticed it already, the offset of the “threadedint” variable was resolved to a small value (4). Recall that in the pre-link disassembly, the “mov edx, 4” instruction was “mov edx, OFFSET FLAT:threadedint”.

Now, 4 isn’t a very flat address (one would expect an address within the confines of the executable image to be used). What happened?

Well, it turns out that the linker has some tricks up its sleeve that were put into play here. The “offset” of a __declspec(thread) variable is assumed to be relative to the base of the “.tls” section by the linker when it is resolving address references. If one examines the “.tls” section of the image, things begin to make a bit more sense:

0000000001007000 _tls segment para public 'DATA' use64
0000000001007000      assume cs:_tls
0000000001007000     ;org 1007000h
0000000001007000 _tls_start        dd 0
0000000001007004 ; int threadedint
0000000001007004 ?threadedint@@3HA dd 0
0000000001007008 _tls_end          dd 0

The offset of “threadedint” from the start of the “.tls” section is indeed 4 bytes. But all of this still doesn’t explain how the instructions the compiler generated access a variable that is instanced per thread.

The “secret sauce” here lies in the following three instructions:

mov     ecx, cs:_tls_index
mov     rax, gs:58h
mov     rax, [rax+rcx*8]

These instructions fetch ThreadLocalStoragePointer out of the TEB and index it by _tls_index. The resulting pointer is then indexed again with the offset of threadedint from the start of the “.tls” section to form a complete pointer to this thread’s instance of the threadedint variable.

In C, the code that the compiler generated could be visualized as follows:

// This represents the ".tls" section
struct
{
   int tls_start;
   int threadedint;
   int tls_end;
} *TlsData;

PTEB Teb;

Teb     = NtCurrentTeb();
TlsData = Teb->ThreadLocalStoragePointer[ _tls_index ];

TlsData->threadedint = 42;

This should look familiar if you’ve used explicit TLS before. The typical paradigm for explicit TLS is to place a structure pointer in a TLS slot, and then to access your thread local state, the per thread instance of the structure is retrieved and the appropriate variable is then referenced off of the structure pointer. The difference here is that the compiler and linker (and loader, more on that later) cooperated to save you (the programmer) from having to do all of that explicitly; all you had to do was declare a __declspec(thread) variable and all of this happens magically behind the scenes.

There’s actually an additional curve that the compiler will sometimes throw with respect to how implicit TLS variables work from a code generation perspective. You may have noticed how I showed the x64 version of an access to a __declspec(thread) variable; this is because, by default, x86 builds of a .exe involve a special optimization (/GA (Optimize for Windows Application, quite possibly the worst name for a compiler flag ever)) that eliminates the step of referencing the special _tls_index variable by assuming that it is zero.

This optimization is only possible with a .exe that will run as the main process image. The assumption works in this case because the loader assigns per-module TLS index values on a sequential basis (based on the loaded module list), and the main process image should be the second thing in the loaded module list, after NTDLL (which, now that this optimization is being used, can never have any __declspec(thread) variables, or it would get TLS index zero instead of the main process image). It’s worth noting that in the (extremely rare) case that a .exe exports functions and is imported by another .exe, this optimization will cause random corruption if the imported .exe happens to use __declspec(thread).

For reference, with /GA enabled, the x86 build of the above code results in the following instructions:

mov     eax, large fs:2Ch
mov     ecx, [eax]
mov     dword ptr [ecx+4], 2Ah ; 42

Remember that on x86, fs points to the base address of the TEB, and that ThreadLocalStoragePointer is at offset +0x2C from the base of the x86 TEB.

Notice that there is no reference to _tls_index; the compiler assumes that it will take on the value zero. If one examines a .dll built with the x86 compiler, the /GA optimization is always disabled, and _tls_index is used as expected.

The magic behind __declspec(thread) extends beyond just the compiler and linker, however. Something still has to set up the storage for each module’s per-thread state, and that something is the loader. More on how the loader plays a part in this complex process next time.

Thread Local Storage, part 3: Compiler and linker support for implicit TLS

Wednesday, October 24th, 2007

Last time, I discussed the mechanisms by which so-called explicit TLS operates (the TlsGetValue, TlsSetValue and other associated supporting routines).

Although explicit TLS is certainly fairly heavily used, many of the more “interesting” pieces about how TLS works in fact relate to the work that the loader does to support implicit TLS, or __declspec(thread) variables (in CL). While both TLS mechanisms are designed to provide a similar effect, namely the capability to store information on a per-thread basis, many aspects of the implementations of the two different mechanisms are very different.

When you declare a variable with the __declspec(thread) extended storage class, the compiler and linker cooperate to allocate storage for the variable in a special region in the executable image. By convention, all variables with the __declspec(thread) storage class are placed in the .tls section of a PE image, although this is not technically required (in fact, the thread local variables do not even really need to be in their own section, merely contiguous in memory, at least from the loader’s perspective). On disk, this region of memory contains the initializer data for all thread local variables in a particular image. However, this data is never actually modified and references to a particular thread local variable will never refer to an address within this section of the PE image; the data is merely a “template” to be used when allocating storage for thread local variables after a thread has been created.

The compiler and linker also make use of several special variables in the context of implicit TLS support. Specifically, a variable by the name of _tls_used (of the type IMAGE_TLS_DIRECTORY) is created by a portion of the C runtime that is static linked into every program to represent the TLS directory that will be used in the final image (references to this variable should be extern “C” in C++ code for name decoration purposes, and storage for the variable need not be allocated as the supporting CRT stub code already creates the variable). The TLS directory is a part of the PE header of an executable image which describes to the loader how the image’s thread local variables are to be managed. The linker looks for a variable by the name of _tls_used and ensures that in the on-disk image, it overlaps with the actual TLS directory in the final image.

The source code for the particular section of C runtime logic that declares _tls_used lives in the tlssup.c file (which comes with Visual Studio), making the variable pseudo-documented. The standard declaration for _tls_used is as so:

const IMAGE_TLS_DIRECTORY _tls_used =
{
 (ULONG)(ULONG_PTR) &_tls_start, // start of tls data
 (ULONG)(ULONG_PTR) &_tls_end,   // end of tls data
 (ULONG)(ULONG_PTR) &_tls_index, // address of tls_index
 (ULONG)(ULONG_PTR) (&__xl_a+1), // pointer to callbacks
 (ULONG) 0,                      // size of tls zero fill
 (ULONG) 0                       // characteristics
};

The CRT code also provides a mechanism to allow a program to register a set of TLS callbacks, which are functions with a similar prototype to DllMain that are called when a thread starts or exits (cleanly) in the current process. (These callbacks can even be registered for a main process image, where there is no DllMain routine.) The callbacks are typed as PIMAGE_TLS_CALLBACK, and the TLS directory points to a null-terminated array of callbacks (called in sequence).

For a typical image, there will not exist any TLS callbacks (in practice, almost everything uses DllMain to perform per-thread initialization tasks). However, the support is retained and is fully functional. To use the support that the CRT provides for TLS callbacks, one needs to declare a variable that is stored in the specially named “.CRT$XLx” section, where x is a value between A and Z. For example, one might write the following code:

#pragma section(".CRT$XLY",long,read)

extern "C" __declspec(allocate(".CRT$XLY"))
  PIMAGE_TLS_CALLBACK _xl_y  = MyTlsCallback;

The strange business with the special section names is required because the in-memory ordering of the TLS callback pointers is significant. To understand what is happening with this peculiar looking declaration, it is first necessary to understand a bit about how the compiler and linker organize data in the final PE image that is produced.

Non-header data in a PE image is placed into one or more sections, which are regions of memory with a common set of attributes (such as page protection). The __declspec(allocate(“section-name”)) keyword (CL-specific) tells the compiler that a particular variable is to be placed in a specific section in the final executable. The compiler additionally has support for concatenating similarly-named sections into one larger section. This support is activated by prefixing a section name with a $ character followed by any other text. The compiler concatenates the resulting section with the section of the same name, truncated at the $ character (inclusive).

The compiler alphabetically orders individual sections when concatenating them (due to the usage of the $ character in the section name). This means that in-memory (in the final executable image), a variable in the “.CRT$XLB” section will be after a variable in the “.CRT$XLA” section but before a variable in “.CRT$XLZ” section. The C runtime uses this quirk of the compiler to create an array of null-terminated function pointers to TLS callbacks (with the pointer stored in the “.CRT$XLZ” section being the null terminator). Thus, in order to ensure that the declared function pointer resides within the confines of the TLS callback array being referenced by _tls_used, it is necessary to place it in a section of the form “.CRT$XLx“.

The creation of the TLS directory is, however, only one portion of how the compiler and linker work together to support __declspec(thread) variables. Next time, I’ll discuss just how the compiler and linker manage accesses to such variables.

Update: Phil mentions that this support for TLS callbacks does not work before the Visual Studio 2005 release. Be warned if you are still using an old compiler package.

Thread Local Storage, part 1: Overview

Monday, October 22nd, 2007

Windows, like practically any other mainstream multithreading operating system, provides a mechanism to allow programmers to efficiently store state on a per-thread basis. This capability is typically known as Thread Local Storage, and it’s quite handy in a number of circumstances where global variables might need to be instanced on a per-thread basis.

Although the usage of TLS on Windows is fairly well documented, the implementation details of it are not so much (though there are a smattering of pieces of third party documentation floating out there).

Conceptually, TLS is in principle not all that complicated (famous last words), at least from a high level. The general design is that all TLS accesses go through either a pointer or array that is present on the TEB, which is a system-defined data structure that is already instanced per thread.

The “per-thread” resolution of the TEB is fairly well documented, but for the benefit of those that are unaware, the general idea is that one of the segment registers (fs on x86, gs on x64) is repurposed by the OS to point to the base address of the TEB for the current thread. This allows, say, an access to fs:[0x0] (or gs:[0x0] on x64) to always access the TEB allocated for the current thread, regardless of other threads in the address space. The TEB does really exist in the flat address space of the process (and indeed there is a field in the TEB that contains the flat virtual address of it), but the segmentation mechanism is simply used to provide a convenient way to access the TEB quickly without having to search through a list of thread IDs and TEB pointers (or other relatively slow mechanisms).

On non-x86 and non-x64 architectures, the underlying mechanism by which the TEB is accessed varies, but the general theme is that there is a register of some sort which is always set to the base address of the current thread’s TEB for easy access.

The TEB itself is probably one of the best-documented undocumented Windows structures, primarily because there is type information included for the debugger’s benefit in all recent ntdll and ntoskrnl.exe builds. With this information and a little disassembly work, it is not that hard to understand the implementation behind TLS.

Before we can look at the implementation of how TLS works on Windows, however, it is necessary to know the documented mechanisms to use it. There are two ways to accomplish this task on Windows. The first mechanism is a set of kernel32 APIs (comprising TlsGetValue, TlsSetValue, TlsAlloc, and TlsFree) that allows explicit access to TLS. The usage of the functions is fairly straightforward; TlsAlloc reserves space on all threads for a pointer-sized variable, and TlsGetValue can be used to read this per-thread storage on any thread (TlsSetValue and TlsFree are conceptually similar).

The second mechanism by which TLS can be accessed on Windows is through some special support from the loader (residing in ntdll) and the compiler and linker, which allow “seamless”, implicit usage of thread local variables, just as one would use any global variable, provided that the variables are tagged with __declspec(thread) (when using the Microsoft build utilities). This is more convenient than using the TLS APIs as one doesn’t need to go and call a function every time you want to use a per-thread variable. It also relieves the programmer of having to explicitly remember to call TlsAlloc and TlsFree at initialization time and deinitialization time, and it implies an efficient usage of per-thread storage space (implicit TLS operates by allocating a single large chunk of memory, the size of which is defined by the sum of all per-thread variables, for each thread so that only one index into the implicit TLS array is used for all variables in a module).

With the advantages of implicit TLS, why would anyone use the explicit TLS API? Well, it turns out that prior to Windows Vista, there are some rather annoying limitations baked into the loader’s implicit TLS support. Specifically, implicit TLS does not operate when a module using it is not being loaded at process initialization time (during static import resolution). In practice, this means that it is typically not usable except by the main process image (.exe) of a process, and any DLL(s) that are guaranteed to be loaded at initialization time (such as DLL(s) that the main process image static links to).

Next time: Taking a closer look at explicit TLS and how it operates under the hood.

Fast kernel debugging for VMware, part 4: Communicating with the VMware VMM

Tuesday, October 9th, 2007

Yesterday, I outlined some of the general principles behind how guest to host communication in VMs work, and why the virtual serial port isn’t really all that great of a way to talk to the outside world from a VM. Keeping this information in mind, it should be possible to do much better in a VM, but it is first necessary to develop a way to communicate with the outside world from within a VMware guest.

As previously mentioned, it turns out that there happen to be a lot of things already built in to VMware that need to escape from the guest in order to notify the host of some special event. Aside from the enhanced (virtualization-aware) hardware drivers that ship with VMware Tools (the VMware virtualization-aware addon package for VMware guests), for example, there are a number of “convenience features” that utilize specialized back-channel communication interfaces to talk to code running in the VMM.

While not publicly documented by VMware, these interfaces have been reverse engineered and pseudo-documented publicly by enterprising third parties. It turns out that VMware has a generalized interface (a “fake” I/O port) that can be accessed to essentially call a predefined function running in the VMM, which performs the requested task and returns to the VM. This “fake” I/O port does not correspond to how other I/O ports work (in particular, additional registers are used). Virtually all (no pun intended) of the VMware Tools “convenience features”, from mouse pointer tracking to host to guest time synchronization use the VMware I/O port to perform their magic.

Because there is already information publicly available regarding the I/O port, and because many of the tasks performed using it are relatively easy to find host-side in terms of the code that runs, the I/O port is an attractive target for a communication mechanism. The mechanisms by which to use it guest-side have been publicly documented enough to be fairly easy to use from a code standpoint. However, there’s still the problem of what happens once the I/O port is triggered, as there isn’t exactly a built-in command that does anything like take data and magically send it to the kernel debugger.

For this, as alluded to previously, it is necessary to do a bit of poking around in the VMware VMM in order to locate a handler for an I/O port command that would be feasible to replace for purposes of shuttling data in and out of the VM to the kernel debugger. Although the VMware Tools I/O port interface is likely not designed (VMM-side, anyway) for high performance, high speed data transfers (at least compared to the mechanisms that, say, the virtualization-aware NIC driver might use), it is at the very least orders of magnitude better than the virtual serial port, certainly enough to provide serious performance improvements with respect to kernel debugging, assuming all goes according to plan.

Looking through the list of I/O port commands that have been publicly documented (if somewhat unofficially), there are one or two that could possibly be replaced without any real negative impacts on the operation of the VM itself. One of these commands (0x12) is designed to pop up the “Operating System Not Found” dialog. This command is actually used by the VMware BIOS code in the VM if it can’t find a bootable OS, and not typically by VMware Tools itself. Since any VM that anyone would possibly be kernel debugging must by definition have a bootable operating system installed, axing the “OS Not Found” dialog is certainly no great loss for that case. As an added bonus, because this I/O port command displays UI and accesses string resources, the handler for it happened to be fairly easy to locate in the VMM code.

In terms of the VMM code, the handler for the OS Not Found dialog command looks something like so:

int __cdecl OSNotFoundHandler()
{
   if (!IsVMInPrivilegedMode()) /* CPL=0 check */
   {
     log error message;
     return -1;
   }

   load string resources;
   display message box;

   return somefunc();
}

Our mission here is really just to patch out the existing code with something that knows how to take data from the guest and move it to the kernel debugger, and vice versa. A naive approach might be to try and access the guest’s registers and use them to convey the data words to transfer (it would appear that many of the I/O port handlers do have access to the guest’s registers, as many of the I/O port commands modify the register data of the guest), but this approach would incur a large number of VM exits and therefore be suboptimal.

A better approach would be to create some sort of shared memory region in the VM and then simply use the I/O port command as a signal that data is ready to be sent or that the VM is now waiting for data to be received. (The execution of the VM, or at least the current virtual CPU, appears to be suspended while the I/O port handler is running. In the case of the kernel debugger, all but one CPU would be halted while a KdSendPacket or KdReceivePacket call is being made, making the call essentially one that blocks execution of the entire VM until it returns.)

There’s a slight problem with this approach, however. There needs to be a way to communicate the address of the shared memory region from the guest to the modified VMM code, and then the modified VMM code needs to be able to translate the address supplied by the guest to an address in the VMM’s address space host-side. While the VMware VMM most assuredly has some sort of capability to do this, finding it and using it would make the (already somewhat invasive) patches to the VMM even more likely to break across VMware versions, making such an address translation approach undesirable from the perspective of someone doing this without the help of the actual vendor.

There is, however, a more unsophisticated approach that can be taken: The code running in the guest can simply allocate non-paged physical memory, fill it with a known value, and then have the host-side code (in the VMM) simply scan the entire VMM virtual address space for the known value set by the guest in order to locate the shared memory region in host-relative virtual address space. The approach is slow and about the farthest thing from elegant, but it does work (and it only needs to be done once per boot, assuming the VMM doesn’t move pinned physical pages around in its virtual address space). Even if the VMM does occasionally move pages around, it is possible to compensate for this, and assuming such moves are infrequent still achieve acceptable performance.
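A toy model of that hunt might look like the following. The signature bytes and the page-granular stepping are invented for illustration (the post doesn’t specify VMKD’s actual values), and the VMM’s virtual address space is modeled here as a flat buffer:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical 16-byte signature the guest writes into its shared region. */
static const uint8_t kSignature[16] = {
    0x56, 0x4D, 0x4B, 0x44, 0xAA, 0x55, 0xAA, 0x55,
    0x13, 0x37, 0xC0, 0xDE, 0x56, 0x4D, 0x4B, 0x44
};

/* Host-side: brute-force scan of the VMM's virtual address space (modeled
   as a flat buffer) for the guest's signature. The scan steps a page at a
   time on the assumption that the guest's page-aligned physical pages map
   to page-aligned host virtual addresses. */
static const uint8_t *FindSharedRegion(const uint8_t *Space, size_t Size)
{
    for (size_t i = 0; i + sizeof(kSignature) <= Size; i += 0x1000) {
        if (memcmp(Space + i, kSignature, sizeof(kSignature)) == 0)
            return Space + i;   /* host-relative address of shared region */
    }
    return NULL;
}
```

Because the result is cached after the first successful scan, the cost of the brute-force approach is paid only once per boot (barring the VMM relocating the pages).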

The astute reader might note that this introduces a slight hole whereby a user mode caller in the VM could spoof the signature used to locate the shared memory block and trick the VMM-side code into talking to it instead of the KD logic running in kernel mode (after creating the spoofed signature, a malicious user mode process would wait for the kernel mode code to try and contact the VMM, and hope that its spoofed region would be located first). This could certainly be solved by tighter integration with the VMM (and could be quite easily eliminated by having the guest code pass an address in a register which the VMM could translate instead of doing a string hunt through virtual address space), but in the interest of maintaining broad compatibility across VMware VMMs, I have not chosen to go this route for the initial release.

As it turns out, spoofing the link with the kernel debugger is not really all that much of a problem here, as due to the way VMKD is designed, it is up to the guest-side code to actually act on the data that is moved into the shared memory region, and a non-privileged user mode process would have limited ability to do so. It could certainly attempt to confuse the kernel debugger, however.

After the guest-side virtual address of the shared memory region is established, the guest and the host-side code running in the VMM can now communicate by filling the shared memory region with data. The guest can then send the I/O port command in order to tell the host-side code to send the data in the shared memory region, and/or wait for and copy in data arriving from a remote kernel debugger to the code running in the guest.

With this model, the guest is entirely responsible for driving the kernel debugger connection in that the VMM code is not allowed to touch the shared memory region unless it has exclusive access (which is true if and only if the VM is currently waiting on an I/O port call to the patched handler in the VMM). However, as the low-level KD data transmission model is synchronous and not event-driven, this does not pose a problem for our purposes, thus allowing a fairly simple and yet relatively performant mechanism to connect the KD stub in the kernel to the actual kernel debugger.
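Modeled in miniature (with invented names, and the I/O port trigger reduced to a direct call, since the VM blocks inside the handler anyway), one synchronous exchange looks roughly like this:

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the shared memory region; names are invented. */
typedef struct {
    size_t Length;
    unsigned char Data[512];
} VMKD_SHARED_REGION;

/* Host-side handler, standing in for the patched I/O port handler: while
   it runs, the triggering VCPU is suspended, so it has exclusive access to
   the region. Here it just substitutes a canned debugger reply. */
static void PatchedPortHandler(VMKD_SHARED_REGION *Region)
{
    /* ... forward Region->Data to the remote kernel debugger ... */
    const char Reply[] = "KD-REPLY";
    memcpy(Region->Data, Reply, sizeof(Reply));
    Region->Length = sizeof(Reply);
}

/* Guest-side: fill the region, then trigger the magic I/O port command.
   In the real code the trigger is an OUT to VMware's backdoor port; the
   direct call here models the fact that the guest blocks until the
   handler returns. */
static size_t GuestSendReceive(VMKD_SHARED_REGION *Region,
                               const void *Packet, size_t Length)
{
    memcpy(Region->Data, Packet, Length);
    Region->Length = Length;
    PatchedPortHandler(Region);   /* models the blocking VM exit */
    return Region->Length;        /* bytes of reply now in Region->Data */
}
```

The guest-driven, strictly alternating ownership of the region is what makes locking unnecessary: the VMM side only ever touches the buffer while the guest is parked inside the port handler.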

Now that data can be both received from the guest and sent to the guest by means of the I/O port interface combined with the shared memory region, all that remains is to interface the debugger (DbgEng.dll) with the patched I/O port handler running in the VMM.

It turns out that there’s a couple of twists relating to this final step that make it more difficult than what one might initially expect, however. Expect to see details on that (and more) in the next installment of the VMKD series…

The beginning of the end of the single-processor era

Tuesday, July 10th, 2007

I came across a quote on CNet that stuck with me yesterday:

It’s hard to see how there’s room for single-core processors when prices for nearly half of AMD’s dual-core Athlon 64 X2 chips have crept well below the $100 mark.

I think that this sentiment is especially true nowadays (at least for conventional PC-style computers – not counting embedded things). Multiprocessing (at least pseudo-multiprocessing, in the form of Intel’s HyperThreading) has been available on end-user computers for some time now. Furthermore, full multiprocessing, in the form of multi-core chips, is now mainstream. What I mean by that is that by now, most of the computers you’ll get from Dell, Best Buy, and the like will be MP, whether via HyperThreading or multi-core.

To give you an idea, I recently got a 4-way server (a single quad core chip) for ~$2300 or so (though it was also reasonably equipped other than in the CPU department). At work, we got an 8-way box (2x dual core chips) for under ~$3000 as well, for running VMs for our quality assurance department. Just a few years ago, getting an 8-way box “just like that” would have been unheard of (and ridiculously expensive), and yet here we are, with medium-level servers that Dell ships coming with that kind of multiprocessing “out of the box”.

Even laptops are coming with multicore chips in today’s day and age, and laptops have historically not been exactly performance leaders due to size, weight, and battery life constraints. All but the most entry-level laptops Dell ships nowadays are dual core, for instance (and this is hardly limited to Dell either; Apple is shipping dual-core Intel Macs as well for their laptop systems, and has been for some time in fact.)

Microsoft seems to have recognized this as well; for instance, there is no single processor kernel shipping with Windows Vista, Windows Server 2008, or future Windows versions. That doesn’t mean that Windows doesn’t support single processor systems, but just that there is no longer an optimized single processor kernel (e.g. replacing spinlocks with a simple KeRaiseIrql(DISPATCH_LEVEL) call) anymore. The reason is that for new systems, which are expected to be the vast majority of Vista/Server 2008 installs, multiprocessing capability is just so common that it’s not worth maintaining a separate kernel and HAL just for the odd single processor system’s benefit anymore.

What all this means is that if, as a developer, you haven’t really been paying attention to the multiprocessor scene, now’s the time to start – it’s a good bet that within a few years, even on very low end systems, single processor boxes are going to become very rare. For intensive applications, the capability to take advantage of MP is going to start being a defining point now, especially as chip makers have realized that they can’t just indefinitely increase clock rates and have accordingly begun to transition to multiprocessing as an alternative way to increase performance.

Microsoft isn’t the only company that’s taking notice of MP becoming mainstream, either. For instance, VMware now fully supports multiprocessor virtual machines (even on its free VMware Server product), as a way to boost performance on machines with true multiprocessing capability. (And to their credit, it’s actually not half-bad as long as you aren’t loading the processors down completely, at which point it seems to turn into a slowdown – perhaps due to VMs competing with each other for scheduling while waiting on spinlocks, though I didn’t dig in deeper.)

(Sorry if I sound a bit like Steve when talking about MP, but it really is here, now, and now’s the time to start modifying your programs to take advantage of it. That’s not to say that we’re about to see 100+ core computers becoming mainstream tomorrow, but small-scale multiprocessing is very rapidly becoming the standard in all but the most low cost systems.)

Why doesn’t the publicly available kernrate work on Windows x64? (and how to fix it)

Monday, June 4th, 2007

Previously, I wrote up an introduction to kernrate (the Windows kernel profiler). In that post, I erroneously stated that the KrView distribution includes a version of kernrate that works on x64. Actually, it supports IA64 and x86, but not x64. A week or two ago, I had a problem on an x64 box of mine that I wanted to help track down using kernrate. Unfortunately, I couldn’t actually find a working kernrate for Windows x64 (Srv03 / Vista).

It turns out that nowhere is there a published kernrate that works on any production version of Windows x64. KrView ships with an IA64 version, and the Srv03 resource kit ships with an x86 version only. (While you can run the x86 version of kernrate in Wow64, you can’t use it to profile kernel mode; for that, you need a native x64 version.)

A bit of digging turned up a version of kernrate labeled ‘AMD64’ in the Srv03 SP0 DDK (the 3790.1830 / Srv03 SP1 DDK doesn’t include kernrate at all, only documentation for it, and the 6000 WDK omits even the documentation, which is really quite a shame). Unfortunately, that version of kernrate, while compiled as native x64 and theoretically capable of profiling the x64 kernel, doesn’t actually work. If you try to run it, you’ll get the following error:

NtQuerySystemInformation failed status c0000004

So, no dice even with the Srv03 SP0 DDK version of kernrate, or so I thought. Ironically, the various flavors of x86 (including 3790.0) kernrate I could find all worked on Srv03 x64 SP1 (3790.1830) / Vista x64 SP0 (6000.0), but as I mentioned earlier, you can’t use the profiling APIs to profile kernel mode from a Wow64 program. So, I had a kernrate that worked but couldn’t profile kernel mode, and a kernrate that should have worked and been able to profile kernel mode except that it bombed out right away.

After running into that wall, I decided to try and get ahold of someone at Microsoft who I suspected might be able to help, Andrew Rogers from the Windows Serviceability group. After trading mails (and speculation) back and forth a bit, we eventually determined just what was going on with that version of kernrate.

To understand what the deal is with the 3790.0 DDK x64 version of kernrate, a little history lesson is in order. When the production (“RTM”) version of Windows Server 2003 was released, it was supported on two platforms: x86, and IA64 (“Windows Server 2003 64-bit”). This continued until the Srv03 Service Pack 1 timeframe, when support for another platform “went gold” – x64 (or AMD64). Now, this means that the production code base for Srv03 x64 RTM (3790.1830) is essentially comparable to Srv03 x86 SP1 (3790.1830). While normally there aren’t “breaking changes” (or at least, they are kept to a minimum) from service pack to service pack, Srv03 x64 is kind of a special case.

You see, there was no production / RTM release of Srv03 x64 3790.0, or a “Service Pack 0” for the x64 platform. As a result, in a special case, it was acceptable for there to be breaking changes from 3790.0/x64 to 3790.1830/x64, as pre-3790.1830 builds could essentially be considered beta/RC builds and not full production releases (and indeed, they were not generally publicly available as in a normal production release).

If you’re following me this far, you might be thinking “wait a minute, didn’t he say that the 3790.0 x86 build of kernrate worked on Srv03 3790.1830?” – and in fact, I did imply that (it does work). The breaking change in this case only relates to 64-bit-specific parts, which for the most part excludes things visible to 32-bit programs (such as the 3790.0 x86 kernrate).

In this particular case, it turns out that part of the SYSTEM_BASIC_INFORMATION structure was changed from the 3790.0 timeframe to the 3790.1830 timeframe, with respect to x64 platforms.

The 3790.0 structure, as viewed from x64 builds, is approximately like this:

typedef struct _SYSTEM_BASIC_INFORMATION_3790 {
    ULONG Reserved;
    ULONG TimerResolution;
    ULONG PageSize;
    ULONG_PTR NumberOfPhysicalPages;
    ULONG_PTR LowestPhysicalPageNumber;
    ULONG_PTR HighestPhysicalPageNumber;
    ULONG AllocationGranularity;
    ULONG_PTR MinimumUserModeAddress;
    ULONG_PTR MaximumUserModeAddress;
    KAFFINITY ActiveProcessorsAffinityMask;
    CCHAR NumberOfProcessors;
} SYSTEM_BASIC_INFORMATION_3790, *PSYSTEM_BASIC_INFORMATION_3790;

However, by the time 3790.1830 (the “RTM” version of Srv03 x64 / XP x64) was released, a subtle change had been made:

typedef struct _SYSTEM_BASIC_INFORMATION {
	ULONG Reserved;
	ULONG TimerResolution;
	ULONG PageSize;
	ULONG NumberOfPhysicalPages;
	ULONG LowestPhysicalPageNumber;
	ULONG HighestPhysicalPageNumber;
	ULONG AllocationGranularity;
	ULONG_PTR MinimumUserModeAddress;
	ULONG_PTR MaximumUserModeAddress;
	KAFFINITY ActiveProcessorsAffinityMask;
	CCHAR NumberOfProcessors;
} SYSTEM_BASIC_INFORMATION, *PSYSTEM_BASIC_INFORMATION;

Essentially, three ULONG_PTR fields were “contracted” from 64-bits to 32-bits (by changing the type to a fixed-length type, such as ULONG). The cause for this relates again to the fact that the first 64-bit RTM build of Srv03 was for IA64, and not x64 (in other words, “64-bit Windows” originally just meant “Windows for Itanium”, something that is to this day still unfortunately propagated in certain MSDN documentation).

According to Andrew, the reasoning behind the difference in the two structure versions is that those three fields were originally defined as either a pointer-sized data type (such as SIZE_T or ULONG_PTR – for 32-bit builds), or a 32-bit data type (such as ULONG – for 64-bit builds) in terms of the _IA64_ preprocessor constant, which indicates an Itanium build. However, 64-bit builds for x64 do not use the _IA64_ constant, which resulted in the structure fields being erroneously expanded to 64-bits for the 3790.0/x64 builds. By the time Srv03 x64 RTM (3790.1830) had been released, the page counts had been fixed to be defined in terms of the _WIN64 preprocessor constant (indicating any 64-bit build, not just an Itanium build). As a result, the fields that were originally 64-bits long for the prerelease 3790.0/x64 build became only 32-bits long for the RTM 3790.1830/x64 production build. Note that because the page counts are expressed in terms of a count of physical pages, which are at least 4096 bytes each on x86 and x64 (and significantly more for large physical pages on those platforms), keeping them as 32-bit quantities is not a terribly limiting factor on total physical memory given today’s technology, and that of the foreseeable future (at least relating to systems that NT-based kernels will operate on).
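The practical effect is easy to demonstrate with approximations of the two layouts using fixed-width types (ULONG as a 32-bit integer, ULONG_PTR and KAFFINITY as 64-bit values). The old layout is larger, and every field from the page counts onward sits at a greater offset, so a caller compiled against the old layout passes the wrong buffer length, which is exactly the condition STATUS_INFO_LENGTH_MISMATCH reports:

```c
#include <stdint.h>
#include <stddef.h>

/* Approximation of the prerelease 3790.0/x64 layout. */
typedef struct {
    uint32_t Reserved;
    uint32_t TimerResolution;
    uint32_t PageSize;
    uint64_t NumberOfPhysicalPages;    /* erroneously 64-bit in 3790.0 */
    uint64_t LowestPhysicalPageNumber;
    uint64_t HighestPhysicalPageNumber;
    uint32_t AllocationGranularity;
    uint64_t MinimumUserModeAddress;
    uint64_t MaximumUserModeAddress;
    uint64_t ActiveProcessorsAffinityMask;
    char     NumberOfProcessors;
} SBI_3790_0;

/* Approximation of the RTM 3790.1830/x64 layout. */
typedef struct {
    uint32_t Reserved;
    uint32_t TimerResolution;
    uint32_t PageSize;
    uint32_t NumberOfPhysicalPages;    /* contracted to 32-bit for RTM */
    uint32_t LowestPhysicalPageNumber;
    uint32_t HighestPhysicalPageNumber;
    uint32_t AllocationGranularity;
    uint64_t MinimumUserModeAddress;
    uint64_t MaximumUserModeAddress;
    uint64_t ActiveProcessorsAffinityMask;
    char     NumberOfProcessors;
} SBI_3790_1830;
```

Comparing sizeof() and offsetof() on the two definitions shows the size difference (hence the length-mismatch failure) and the fact that every shared tail field lands at a greater or equal offset in the older, wider layout.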

The end result of this change is that the SYSTEM_BASIC_INFORMATION structure that kernrate 3790.0/x64 tries to retrieve is incompatible with the 3790.1830/x64 (RTM) version of Srv03, hence the call failing with c0000004 (otherwise known as STATUS_INFO_LENGTH_MISMATCH). The culmination of this is that the 3790.0/x64 version of kernrate will abort on production builds of Srv03, as the structure format it expects is not the one the production kernel provides.

Normally, this would be pretty bad news; the program was compiled against a different layout of an OS-supplied structure. Nonetheless, I decided to crack open kernrate.exe with IDA and HIEW to see what I could find, in the hopes of fixing the binary to work on RTM builds of Srv03 x64. It turned out that I was in luck, and there was only one place that actually retrieved the SYSTEM_BASIC_INFORMATION structure, storing it in a global variable for future reference. Unfortunately, there were a great many references to that global; too many to be practical to fix to reflect the new structure layout.

However, since there was only one place where the SYSTEM_BASIC_INFORMATION structure was actually retrieved (and even more, it was in an isolated, dedicated function), there was another option: Patch in some code to retrieve the 3790.1830/x64 version of SYSTEM_BASIC_INFORMATION, and then convert it to appear to kernrate as if it were actually the 3790.0/x64 layout. Unfortunately, a change like this involves adding new code to an existing binary, which means that there needs to be a place to put it. Normally, the way you do this when patching a binary on-disk is to find some padding between functions, or at the end of a PE section that is marked as executable but otherwise unused, and place your code there. For large patches, this may involve “spreading” the patch code out among various different “slices” of padding, if there is no one contiguous block that is long enough to contain the entire patch.

In this case, however, due to the fact that the routine to retrieve the SYSTEM_BASIC_INFORMATION structure was a dedicated, isolated routine, and that it had some error handling code for “unlikely” situations – such as failing a memory allocation, or failing the NtQuerySystemInformation(…SystemBasicInformation…) call (although it would appear the latter is not quite so “unlikely” in this case) – a different option presented itself: Dispense with the error checking, most of which would rarely be used in a realistic situation, and use the extra space to write in some code to convert the structure layout to the version expected by kernrate 3790.0. Obviously, while not a completely “clean” solution per se, the idea does have its merits when you’re already into patch-the-binary-on-disk-land (which pretty much rules out the idea of a “clean” solution altogether at that point).

The structure conversion is fairly straightforward code, and after a bit of poking around, I had a version to try out. Lo and behold, it actually worked; it turns out that the only thing that was preventing the prerelease 3790.0/x64 kernrate from doing basic kernel profiling on production x64 kernels was the layout of the SYSTEM_BASIC_INFORMATION structure.

For those so inclined, the patch I created effectively rewrites the original version of the function, which is shown below, translated to C:

PSYSTEM_BASIC_INFORMATION_3790
GetSystemBasicInformation(VOID)
{
    NTSTATUS                       Status;
    PSYSTEM_BASIC_INFORMATION_3790 Sbi;

    Sbi = (PSYSTEM_BASIC_INFORMATION_3790)malloc(
        sizeof(SYSTEM_BASIC_INFORMATION_3790));

    if (!Sbi)
    {
        fprintf(stderr, "Buffer allocation failed"
            " for SystemInformation in "
            "GetSystemBasicInformation\n");
        return 0;
    }

    if (!NT_SUCCESS((Status = NtQuerySystemInformation(
        SystemBasicInformation,
        Sbi,
        sizeof(SYSTEM_BASIC_INFORMATION_3790),
        0))))
    {
        fprintf(stderr, "NtQuerySystemInformation failed"
            " status %08lx\n",
            Status);

        free(Sbi);
        Sbi = 0;
    }

    return Sbi;
}

Conceptually represented in C, the modified (patched) function would appear something like the following (note that the error checking code has been removed to make room for the structure conversion logic, as previously mentioned; the substantial new changes are the conversion assignments at the end):

(Note that the patch code was not compiler-generated and is not truly a function written in C; below is simply how it would look if it were translated from assembler to C.)


PSYSTEM_BASIC_INFORMATION_3790
GetSystemBasicInformation(VOID)
{
    PSYSTEM_BASIC_INFORMATION_3790 Sbi;
    PSYSTEM_BASIC_INFORMATION      Native;

    Sbi     = (PSYSTEM_BASIC_INFORMATION_3790)malloc(
        sizeof(SYSTEM_BASIC_INFORMATION_3790));
    Native  = (PSYSTEM_BASIC_INFORMATION)Sbi;

    NtQuerySystemInformation(
        SystemBasicInformation,
        Native,
        sizeof(SYSTEM_BASIC_INFORMATION),
        0);

    /* Convert in place, tail first. */
    Sbi->NumberOfProcessors           = Native->NumberOfProcessors;
    Sbi->ActiveProcessorsAffinityMask = Native->ActiveProcessorsAffinityMask;
    Sbi->MaximumUserModeAddress       = Native->MaximumUserModeAddress;
    Sbi->MinimumUserModeAddress       = Native->MinimumUserModeAddress;
    Sbi->AllocationGranularity        = Native->AllocationGranularity;
    Sbi->HighestPhysicalPageNumber    = Native->HighestPhysicalPageNumber;
    Sbi->LowestPhysicalPageNumber     = Native->LowestPhysicalPageNumber;
    Sbi->NumberOfPhysicalPages        = Native->NumberOfPhysicalPages;

    return Sbi;
}

(The structure can be converted in-place due to how the two versions are laid out in memory; the “expected” version is larger than the “real” version (and more specifically, the offsets of all the fields we care about are greater in the “expected” version than the “real” version), so structure fields can safely be copied from the tail of the “real” structure to the tail of the “expected” structure.)
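A toy version of this tail-first widening (with invented, two-field layouts) shows why the ordering is safe: each source field is consumed before the widened copies can overwrite its bytes.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Toy layouts: the "real" (RTM) struct packs its count into 32 bits; the
   "expected" struct that the caller was compiled against uses 64. */
typedef struct { uint32_t Pages; uint32_t Granularity; } RealSbi;
typedef struct { uint64_t Pages; uint64_t Granularity; } ExpectedSbi;

/* Widen a RealSbi into an ExpectedSbi within the same buffer (which must
   be at least sizeof(ExpectedSbi) bytes). Fields are processed tail-first:
   every destination offset is >= the matching source offset, so nothing is
   clobbered before it is read. */
static void ConvertInPlace(unsigned char *Buf)
{
    uint32_t Narrow;
    uint64_t Wide;

    /* Granularity: read from +4, write to +8. */
    memcpy(&Narrow, Buf + offsetof(RealSbi, Granularity), sizeof(Narrow));
    Wide = Narrow;
    memcpy(Buf + offsetof(ExpectedSbi, Granularity), &Wide, sizeof(Wide));

    /* Pages: read from +0, write (widened) to +0; the bytes this overwrites
       at +4..+7 held the old Granularity, which was already copied above. */
    memcpy(&Narrow, Buf + offsetof(RealSbi, Pages), sizeof(Narrow));
    Wide = Narrow;
    memcpy(Buf + offsetof(ExpectedSbi, Pages), &Wide, sizeof(Wide));
}
```

Running the conversion front-to-back instead would destroy the second field before it was read, which is precisely why the patch copies from the tail forward.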

If you find yourself in a similar bind, and need a working kernrate for x64 (until if and when Microsoft puts out a new version that is compatible with production x64 kernels), I’ve posted the patch (assembler, with opcode diffs) that I made to the “amd64” release of kernrate.exe from the 3790.0 DDK. Any conventional hex editor should be sufficient to apply it (as far as I know, third parties aren’t authorized to redistribute kernrate in its entirety, so I’m not posting the entire binary). Note that DDKs after the 3790.0 DDK don’t include any kernrate for x64 (even the prerelease version, broken as it was), so you’ll also need the original 3790.0 DDK to use the patch. Hopefully, we may see an update to kernrate at some point, but for now, the patch suffices in a pinch if you really need to profile a Windows x64 system.

Beware GetThreadContext on Wow64

Friday, May 11th, 2007

If you’re planning on running a 32-bit program of yours under Wow64, one of the things that you may need to watch out for is a subtle change in how GetThreadContext / SetThreadContext operate for Wow64 processes.

Specifically, these operations require additional access rights when operating on a Wow64 context (as is the case of a Wow64 process calling Get/SetThreadContext). This is hinted at in MSDN in the documentation for GetThreadContext, with the following note:

WOW64: The handle must also have THREAD_QUERY_INFORMATION access.

However, the reality of the situation is not completely documented by MSDN.

Under the hood, the Wow64 context (e.g. x86 context) for a thread on Windows x64 is actually stored at a relatively well-known location in the virtual address space of the thread’s process, in user-mode. Specifically, one of the TLS slots in the thread’s TLS array is repurposed to point to a block of memory that contains the Wow64 context for the thread (used to read or write the context when the Wow64 “portion” of the thread is not currently executing). This is presumably done for performance reasons, as the Wow64 thunk layer needs to be able to quickly transition to/from x86 mode and x64 mode. By storing the x86 context in user mode, this transition can be managed without a kernel mode call. Instead, a simple far call is used to make the transition from x86 to x64 (and an iretq is used to transition from x64 to x86). For example, the following is what you might see when stepping into a Wow64 layer call with the 64-bit debugger while debugging a 32-bit process:

00000000`7783aebe 64ff15c0000000  call    dword ptr fs:[0C0h]
0:000:x86> t
00000000`759c31b0 ea27369c753300  jmp     0033:759C3627
0:012:x86> t
00000000`759c3627 67448b0424      mov     r8d,dword ptr [esp]

A side effect of the Wow64 register context being stored in user mode, however, is that it is not so easily accessible to a remote process. In order to access the Wow64 context, what needs to occur is that the TEB for a thread in question must be located, then read out of the thread’s process’s address space. From there, the TLS array is processed to locate the pointer to the Wow64 context structure, which is then read (or written) from the thread’s process’s address space.
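To make the sequence concrete, here is a toy model of the lookup, with the remote process’s memory simulated by a flat buffer. The TLS slot index and TEB offset below are placeholders, not the real (undocumented) values used by the Wow64 layer:

```c
#include <stdint.h>
#include <string.h>

#define FAKE_TLS_SLOT      1     /* assumed slot holding the context block */
#define FAKE_TLS_ARRAY_OFF 0x20  /* assumed TlsSlots offset within the TEB */

/* Stand-in for ReadProcessMemory against the thread's process. */
static void ReadRemote(const uint8_t *Space, uint64_t Va,
                       void *Out, size_t Len)
{
    memcpy(Out, Space + Va, Len);
}

/* Given the remote TEB address, chase TEB -> TlsSlots[n] -> Wow64 context
   pointer. This mirrors the two VM reads described above: one to fetch the
   slot from the TEB, after which the context itself can be read or written
   at the returned address. */
static uint64_t LocateWow64Context(const uint8_t *Space, uint64_t TebVa)
{
    uint64_t SlotVa = TebVa + FAKE_TLS_ARRAY_OFF +
                      FAKE_TLS_SLOT * sizeof(uint64_t);
    uint64_t ContextVa;

    ReadRemote(Space, SlotVa, &ContextVa, sizeof(ContextVa));
    return ContextVa;
}
```

The key observation is that every step is a cross-process memory read, which is what forces the Wow64 layer to have a process handle in the first place.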

If you’ve been following along so far, you might see the potential problem here. From the perspective of GetThreadContext, there is no handle to the process associated with the thread in question. In other words, we have a missing link here: In order to retrieve the Wow64 context of the thread whose handle we are given, we need to perform a VM read operation on its process. However, to do that, we need a process handle, but we’ve only got a thread handle.

The way that the Wow64 layer solves this problem is to query the process ID associated with the requested thread, open a new handle to the process with the required access rights, and then perform the necessary VM read / write operations.

Now, normally, this works fine; if you can get a handle to the thread of a process, then you should almost always be able to get a handle to the process itself.

However, the devil is in the details, here. There are situations where you might not have access to open a handle to a process, even though you have a handle to the thread (or even an existing process handle, but you can’t open a new one). These situations are relatively rare, but they do occur from time to time (in fact, we ran into one at work here recently).

The most likely scenario for this issue is when you are dealing with (e.g. creating) a process that is operating under a different security context than your process. For example, if you are creating a process operating as a different user, or are working with a process that has a higher integrity level than the current process, then you might not have access to open a new handle to the process (even if you might already have an existing handle returned by a CreateProcess* family routine).

This turns into a rather frustrating problem, especially if you have a handle to both the process and a thread in the process that you’re modifying; in that case, you already have the handle that the Wow64 layer is trying to open, but you have no way to communicate it to Wow64 (and Wow64 will try and fail to open the handle on its own when you make the GetThreadContext / SetThreadContext call).

There are two effective solutions to this problem, neither of which are particularly pretty.

First, you could reverse engineer the Wow64 layer and figure out the exact specifics behind how it locates the PWOW64_CONTEXT, and implement that logic inline (using your already-existing process handle instead of creating a new one). This has the downside that you’re way into undocumented implementation details land, so there isn’t a guarantee that your code will continue to operate on future Windows versions.

The other option is to temporarily modify the security descriptor of the process to allow you to open a second handle to it for the duration of the GetThreadContext / SetThreadContext calls. Although this works, it’s definitely a pain to have to go muddle around with security descriptors just to get the Wow64 layer to work properly.

Note that if you’re a native x64 process on x64 Windows, and you’re setting the 64-bit context of a 64-bit process, this problem does not apply. (Similarly, if you’re a 32-bit process on 32-bit Windows, things work “as expected” as well.)

So, in case you’ve been getting mysterious STATUS_ACCESS_DENIED errors out of GetThreadContext / SetThreadContext in a Wow64 program, now you know why.

Don’t perform complicated tasks in your unhandled exception filter

Thursday, May 10th, 2007

When it comes to crash reporting, the mechanism favored by many for globally catching “crashes” is the unhandled exception filter (as set by SetUnhandledExceptionFilter).

However, many people tend to go wrong with this mechanism with respect to what actions they take when the unhandled exception filter is called. To understand what I am talking about, it’s necessary to define the conditions under which an unhandled exception filter is executed.

The unhandled exception filter is, by definition, called when an unhandled exception occurs in any Win32 thread in a process. This sort of event is virtually always caused by some sort of corruption of the process state somewhere, such that something eventually probably touched a non-allocated page somewhere and caused an unhandled access violation (or some other similarly severe problem).

In other words, in the context of the unhandled exception filter, you don’t really know what led up to the current unhandled exception, and more importantly, you don’t know what you can rely on in the process. For example, if you get an AV that bubbles up to your UEF, it might have been caused by corruption in the process heap, which would mean that you probably can’t safely perform heap allocations or you’re risking running into the same problem that caused the original crash in the first place. Or perhaps the problem was an unhandled allocation failure, and another attempt by your unhandled exception filter to allocate memory might just similarly fail.

Actually, the problem gets a bit worse, because you aren’t even guaranteed anything about what the other threads in the process are doing when the crash occurs (in fact, if there are any other threads in the process at the time of the crash, chances are that they’re still running when your unhandled exception filter is called – there is no magical logic to suspend all other activity in the process while your filter is called). This has a couple of other implications for you:

  1. You can’t really rely on the state of synchronization objects in the process. For all you know, the thread that crashed owned a lock that will cause a deadlock if you try to acquire a second lock, which might be owned by a thread that is waiting on the lock owned by the crashed thread.
  2. You can’t with 100% certainty assume that a secondary failure (precipitated by the “original” crash) won’t occur in a different thread, causing your exception filter to be entered by an additional thread at the same time as it is already processing an event from the “first” crash.

In fact, it would be safe to say that there is even less that you can safely do in an unhandled exception filter than under the infamous loader lock in DllMain (which is saying something indeed).

Given these rather harsh conditions, actions like heap allocations or writing minidumps from within the current process are likely to fail. Essentially, as whatever recovery action you take from the UEF grows more complicated, it becomes far more likely to fail (possibly causing secondary failures that obscure the original problem) as a result of the corruption that caused (or was caused by) the original crash. Even something as seemingly innocuous as creating a new process is potentially dangerous (are you sure that nothing in CreateProcess will ever touch the process heap? What about acquiring the loader lock? I’ll give you a hint – the latter is definitely not a safe assumption, such as in the case where Software Restriction Policies are defined).

If you’ve ever taken a look at the kernel32 JIT debugger support in Windows XP, you may have noticed that it doesn’t even follow these rules – it calls CreateProcess, after all. This is part of the reason why programs will sometimes silently crash even with a JIT debugger configured. For programs where you want truly robust crash reporting, I would recommend putting as much of the crash reporting logic as possible into a separate process that is started before a crash occurs (e.g. during program initialization), instead of following the JIT launch-reporting-process approach. This “watchdog process” can then sit idle until it is signaled by the process it is watching over that a crash has occurred.

This signaling mechanism should, ideally, be pre-constructed during initialization so that the actual logic within the unhandled exception filter just signals the watchdog process that an event has occurred, with a pointer to the exception/context information. The filter should then wait for the watchdog process to signal that it is finished before exiting the process.

The mechanism that we use here at work with our programs to communicate between the “guarded” process and the watchdog is simply a file mapping that is mapped into both processes (for passing information between the two, such as the address of the exception record and context record for an active exception event) and a pair of events that are used to communicate “exception occurred” and “dump writing completed” between the two processes. With this configuration, all the exception filter needs to do is to store some data in the file mapping view (already mapped ahead of time) and call SetEvent to notify the watchdog process to wake up. It then waits for the watchdog process to signal completion before terminating the process. (This particular mechanism does not address the issue of multiple crashes occurring at the same time, which is something that I deemed acceptable in this case.) The watchdog process is responsible for all of the “heavy lifting” of the crash reporting process, namely, writing the actual dump with MiniDumpWriteDump.

An alternative to this approach is to have the watchdog process act as a debugger on the guarded process; however, I do not typically recommend this, as acting as a debugger has a number of adverse side effects (notably, that a great many debugger events cause the entire process to be suspended while the debugger/watchdog process inspects its state). The watchdog process mechanism is more performant (if ever so slightly less robust), as there is virtually no run-time overhead in the guarded process unless an unhandled exception occurs.

So, the moral of the story is: keep it simple (at least with respect to your unhandled exception filters). I’ve dealt with mechanisms that try to do the error reporting logic in-process, and with those that punt the hard work off to a watchdog process in a clean state, and the latter is significantly more reliable in real-world cases. The last thing you want is for your crash reporting mechanism to cause secondary problems that hide the original crash, so spend the bit of extra work to make your reporting logic that much more reliable and save yourself the headaches later on.