Archive for the ‘NT Internals’ Category

Why are certain DLLs required to be at the same base address system-wide?

Tuesday, November 13th, 2007

There are several Windows DLLs that are, for various reasons, required to be at the same base address system-wide (though for several of these, a case could be made that alternate base addresses could be used in different Terminal Server sessions). Although not explicitly documented by Microsoft (as far as I know), a number of programs rely on these “fixed base” DLLs.

That is not to say that the base addresses of these DLLs cannot change, but that while the system is running, all processes will have these DLLs mapped at the same base address (if they are indeed mapped at all).

The current set of DLLs that require the same base address system-wide includes NTDLL, kernel32, and user32, though the reasons for this requirement vary a bit between each DLL.

NTDLL must be at the same address system wide because there are a number of routines it exports that are used by the kernel to arrange for (indirect) calls to user mode. For example, ntdll!LdrInitializeThunk is the true start address of every user mode thread, and ntdll!KiUserApcDispatcher is used to invoke a user mode APC if one is ready to be processed while a thread is in a user mode wait. The kernel resolves the address of these (and other) “special” exports at system initialization time, and then uses these addresses when it needs to arrange for user mode code execution. Because the kernel caches these resolved function pointers, NTDLL cannot typically be based at a different address from process to process, as the kernel will always reference the same address for these special exports across all processes.

Additionally, some special NTDLL exports are used by user mode code in such a way that they are assumed to be at the same base address cross-process. For example, ntdll!DbgUiRemoteBreakIn is used by the debugger to break in to a process, and the debugger assumes that the local address of DbgUiRemoteBreakIn matches the remote address of DbgUiRemoteBreakIn in the target process.
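
To illustrate the technique (not any particular debugger’s actual implementation), a minimal sketch of such a break-in might look like this, relying entirely on the local and remote NTDLL base addresses matching:

   #include <windows.h>

   // Minimal sketch: resolve ntdll!DbgUiRemoteBreakIn in *our* process and
   // use that address as the start routine of a thread in the *target*
   // process. This only works because NTDLL is based identically in every
   // process. Error handling is trimmed for brevity.
   BOOL BreakInToProcess(HANDLE hProcess)
   {
      PVOID  pfnBreakIn;
      HANDLE hThread;

      pfnBreakIn = (PVOID)GetProcAddress(
         GetModuleHandleW(L"ntdll.dll"), "DbgUiRemoteBreakIn");

      if (!pfnBreakIn)
         return FALSE;

      // The remote thread begins execution at the same virtual address
      // that we resolved locally.
      hThread = CreateRemoteThread(hProcess, NULL, 0,
         (LPTHREAD_START_ROUTINE)pfnBreakIn, NULL, 0, NULL);

      if (!hThread)
         return FALSE;

      CloseHandle(hThread);
      return TRUE;
   }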

Kernel32 is required to be at the same base address because there are a number of internal kernel32 routines that, similar to ntdll!DbgUiRemoteBreakIn, are used in cross-process thread injection. One example of this used to be the console control event handler. In the case of console events, during kernel32.dll initialization, the address of the Ctrl-C event dispatcher is passed to WinSrv.dll (in CSRSS space).

Originally, WinSrv simply cached the dispatcher pointer after the first process was created (thus requiring kernel32 to be at the same base address across all processes in the session). On modern systems, however, WinSrv now tracks the client’s kernel32 dispatcher pointer on a per-process basis to account for the fact that the dispatcher is at a different address in the 32-bit kernel32 (versus the 64-bit kernel32). Ironically, the developer who made this change to WinSrv actually forgot to add support for using the current process’s dispatcher pointer in several corner cases (such as kernel32!SetLastConsoleActiveEvent and the corresponding winsrv!SrvConsoleNotifyLastClose and winsrv!RemoveConsole CSRSS-side routines). In the cases where WinSrv still incorrectly passes the cached (64-bit) Ctrl-C dispatcher value to CreateRemoteThread, wow64.dll has a special hack (wow64!MapContextAddress64TO32) that cleans up after WinSrv and fixes the thread start address to refer to the 32-bit kernel32.dll.

By the time this change to WinSrv and Ctrl-C processing was made, though, the application compatibility impact of relaxing the requirement that kernel32 be at the same base address system-wide would have been too severe (virtually all third party code injection code now relies heavily on this assumption). Thus, for this reason (and others), kernel32 retains the restriction that it may not be relocated to a different base address cross-process.

User32 is required to be at the same address cross-process because there is an array of user32 function addresses provided to win32k.sys for the built-in window class window procedures (among other things). This function pointer array is captured via a call to NtUserInitializeClientPfnArrays at session start up time, when WinSrv is initializing win32k during CSRSS initialization. Wow64win.dll, the NtUser/win32k Wow64 support library, provides support for mapping 32-bit to 64-bit (and vice versa) for these function addresses, as necessary for the support of 32-bit processes on 64-bit platforms.

The user32 and kernel32 requirements could arguably be relaxed to only apply within a Terminal Server session, although the Windows XP (and later) cross-session debugging support muddies the waters with respect to kernel32 due to a necessity to support debugger break-in with debuggers that utilize DebugBreak for their break-in threads. (The Wow64 layer provides translation assistance for mapping DebugBreak to a 32-bit address if a 64-bit thread is created at the address of the 64-bit kernel32 DebugBreak export.)

Note that the ASLR support in Windows Vista does not run afoul of these restrictions, as Vista’s ASLR always picks the same base address for a given DLL per operating system boot. Thus, even though, say, NTDLL can have its base address randomized at boot time under Windows Vista, the particular randomized base address that was chosen is still used by all processes until the operating system is restarted.

Update: skape points out that I (somehow) neglected to mention the most important restriction on kernel32 base addressing: on Windows Server 2003 and earlier operating systems, internal kernel32 routines are used as the start address of new threads created by CreateRemoteThread and CreateProcess.

Viridian guest hypercall interface published

Friday, November 9th, 2007

Recently, Microsoft made a rather uncharacteristic move and (mostly) freely published the specifications for the Viridian hypercall interface (otherwise known as “Windows Server virtualization”). Publishing this documentation is, to be clear, a great thing for Microsoft to have done (in my mind, anyway).

The hypercall interface is in some respects analogous to the “native API” of the Windows kernel. Essentially, the hypercall interface is the mechanism by which a privileged, virtualization-aware component running in a hypervisor partition can request assistance from the hypervisor for a particular task. In that respect, a hypercall is to a hypervisor what a system call is to an operating system kernel.

It’s important to note that the documentation attempts to outline the hypercall interface from the perspective of documenting what one would need to implement a compatible hypervisor, and not from the perspective of how the Microsoft hypervisor implements said hypercall interface. However, it’s still worth a read and provides valuable insight into many aspects of how Viridian is architected at a high level.

I’m still working on digesting the whole specification (as the document is 241 pages long), but one thing that caught my eye was that there is special support in the hypervisor for debugging (in other words, kernel debugging). This support is implemented in the form of the HvPostDebugData, HvRetrieveDebugData, and HvResetDebugSession hypercalls (documented in the hypercall specification).

While I’m certainly happy to see that Microsoft is considering kernel debugging when it comes to the Viridian hypervisor, some aspects of how the Viridian hypercall interface works seem rather odd to me. After (re)reading the documentation for the debugging hypercalls a couple of times, I arrived at the conclusion that the Viridian debugging support is more oriented towards simply virtualizing and multiplexing an unreliable physical debugger link. The goal of this approach would seem to me to be that multiple partitions (operating system instances running under the hypervisor) would share the same physical connection between the physical machine hosting the hypervisor and the kernel debugger machine. Additionally, the individual partitions would be insulated from what actual physical medium the kernel debugger connection operates over (for example, 1394 or serial cable), such that only one kernel debugger transport module is needed per partition, regardless of what physical connection is used to connect the partition to the kernel debugger.

While this is a huge step forward from where shipping virtualization products are today with respect to kernel debugging (serial port debugging only), I think that this approach still falls short of completely ideal. There are a number of aspects of the debugging hypercalls that still carry much of the baggage of a physical, machine-to-machine kernel debugging interface, baggage that is arguably unnecessary and undesirable from a virtualization perspective. Besides the possibility of further improving the performance of the virtualized kernel debugger connection, it is possible to support partition-to-partition kernel debugging in a more convenient fashion than Viridian presently supports.

The debugging hypercalls, as currently defined, are in fact very much reminiscent of how I originally implemented VMKD. The hypercalls define an interface for a partition to send large chunks of kernel debugger data over as discrete units, without any guarantee of reception. Additionally, they provide a mechanism to notify the hypervisor that the partition is polling for kernel debugger data, so that the hypervisor can take action to reduce the resource consumption of the partition while it is awaiting new data (thus alleviating the CPU spin issue that one often runs into while broken into the kernel debugger with existing virtualization solutions, VMKD notwithstanding of course).

The original approach that I took to VMKD is fairly similar to this. I essentially replaced the serial port I/O instructions in kdcom.dll with a mechanism that buffered data up until a certain point, and then transmitted (or received) data to the VMM in a discrete unit. Like the Viridian approach to debugging, this greatly reduces the number of VM exits (as compared to a traditional virtual serial port) and provides the VMM with an opportunity to reduce the CPU usage of the guest while it is awaiting new kernel debugger data.

However, I believe that it’s possible to improve upon the Viridian debugging hypercalls in much the same way as I improved upon VMKD before I arrived at the current release version. For instance, by dispensing with the provision that data posted to the debugging interface will not be reliably delivered, and enforcing several additional requirements on the debugger protocol, it is possible to further improve performance of partition kernel debugging. The suggested additional debugger protocol requirements include stipulating that data transmitted or received by the debugging hypercalls consists of discrete protocol data units, and that both ends of the kernel debugger connection will be able to recover if an unexpected discrete PDU is received after a guest (or kernel debugger) reset.

These restrictions would further reduce VM exits by moving any data retransmit and recovery procedures outside of the partition being debugged. Furthermore, with the ability to reliably and transactionally transmit and receive (or fail in a transacted fashion) as a function of the debugging hypercall itself, there is no longer any necessity for the hypervisor to ever schedule a partition that is frozen waiting for kernel debugger data until new data is available (or a transactional failure, such as a partition-defined timeout, occurs). (This is, essentially, the approach that VMKD currently takes.)

In actuality, I believe that it should be possible to implement all of the above improvements by moving the Viridian debugging support out of the hypervisor and into the parent partition for a debuggee partition, with the parent partition being responsible for making hypercalls to set up a shared memory mapping for data transfer (HvMapGpaPages) and allow for event-driven communication with the debuggee partition (HvCreatePort and related APIs) that could be used to request that debugger command data in the shared memory region be processed. Above and beyond performance implications, this approach has the added advantage of more easily supporting partition-to-partition debugging (unless I’ve missed something in the documentation, the Viridian debugging hypercalls do not provide any mechanism to pass debugging data from one partition to another for processing).

Additionally, this approach would also completely eliminate the need to provide any specialized kernel debugging support at all in the hypervisor microkernel, instead moving this support into a parent (or the root) partition, leaving it to that partition to deal with the particulars of data transfer. If that partition, or another partition on the same physical computer as the debuggee partition, is acting as the debugger, then data can be “transferred” using the shared memory mapping. Otherwise, the parent (or root) partition can implement whatever reliable transport mechanism it desires for the kernel debugger data (say, a TCP connection to a remote kernel debugger over an IP-based network). Thus, this proposed approach could potentially not only open up additional remote kernel debugger transport options, but also reduce code complexity of the hypervisor itself (which I would like to think is almost always a desirable thing, as non-existent code doesn’t have security holes, and the hypervisor is the absolute most trusted (software) component of the system when it is used).

Given that Viridian has some time yet before RTM, perhaps if we keep our fingers crossed, we’ll yet see some further improvements to the Viridian kernel debugging scene.

How does one retrieve the 32-bit context of a Wow64 program from a 64-bit process on Windows Server 2003 x64?

Thursday, November 1st, 2007

Recently, Jimmy asked me what the recommended way to retrieve the 32-bit context of a Wow64 application on Windows XP x64 / Windows Server 2003 x64 was.

I originally responded that the best way to do this was to use Wow64GetThreadContext, but Jimmy mentioned that this doesn’t exist on Windows XP x64 / Windows Server 2003 x64. Sure enough, I checked and it’s really not there, which is rather a bummer if one is trying to implement a 64-bit debugger process capable of debugging 32-bit processes on pre-Vista operating systems.

I don’t typically recommend using undocumented implementation details in production code, but in this case there seems to be little choice, as there’s no documented mechanism to perform this operation prior to Vista. Because Vista introduces a documented way to perform this task, going the undocumented route is at least slightly less questionable: there’s an upper bound on which operating systems need to be supported, and major changes to the implementation of things on downlevel operating systems are rarer than with new operating system releases.

Clearly, this is not always the case; Windows XP Service Pack 2 changed an enormous amount of things, for instance. However, as a general rule, service packs tend to be relatively conservative with this sort of thing. That’s not that one has carte blanche with using undocumented implementation details on downlevel platforms, but perhaps one can sleep a bit easier at night knowing that things are less likely to break than in the next Windows release.

I had previously mentioned that the Wow64 layer takes a rather unexpected approach to how to implement GetThreadContext and SetThreadContext. While I mentioned at a high level what was going on, I didn’t really go into the details all that much.

The basic implementation of these routines is to determine whether the thread is running in 64-bit mode or not (determined by examining the SegCs value of the 64-bit context record for the thread as returned by NtGetContextThread). If the thread is running in 64-bit mode, and the thread is a Wow64 thread, then an assumption can be made that the thread is in the middle of a callout to the Wow64 layer (say, a system call).

In this case, the 32-bit context is saved at a well-known location by the process that translates from running in 32-bit mode to running in 64-bit mode for system calls and other voluntary, user mode “32-bit break out” events. Specifically, the Wow64 layer repurposes the second TLS slot of each 64-bit thread (that is, Teb->TlsSlots[ 1 ]) to point to a structure of the following layout:

typedef struct _WOW64_THREAD_INFO
{
   ULONG UnknownPrefix;
   WOW64_CONTEXT Wow64Context;
   ULONG UnknownSuffix;
} WOW64_THREAD_INFO, * PWOW64_THREAD_INFO;

(The real structure name is not known.)

Normally, system components do not use the TLS array, but the Wow64 layer is an exception. Because there is not normally any third party 64-bit code running in a Wow64 process, the Wow64 layer is free to do what it wants with the TlsSlots array of the 64-bit TEB for a Wow64 thread. (Each Wow64 thread has its own, separate 32-bit TEB, so this does not interfere with the operation of TLS by the 32-bit program that is currently executing.)

In the case where the requested Wow64 thread is in a 64-bit Wow64 callout, all one needs to do is retrieve the base address of the 64-bit TEB of the thread in question, read the second entry in the TlsSlots array, and then read the WOW64_CONTEXT structure out of the memory block referred to by that slot.
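
To illustrate, a rough (untested) sketch of the read side for the in-callout case follows. It must be compiled as a 64-bit image and linked with ntdll.lib, and the 0x1480 TlsSlots offset in the 64-bit TEB plus the THREAD_BASIC_INFORMATION layout are assumptions taken from well-known (but undocumented) definitions, so verify them against your target OS:

   #include <windows.h>
   #include <winternl.h>

   #ifndef NT_SUCCESS
   #define NT_SUCCESS(s) (((NTSTATUS)(s)) >= 0)
   #endif

   // Not declared in winternl.h; layout per the well-known definition.
   typedef struct _THREAD_BASIC_INFORMATION_X
   {
      NTSTATUS  ExitStatus;
      PVOID     TebBaseAddress;
      ULONG_PTR ClientId[ 2 ];  // CLIENT_ID
      ULONG_PTR AffinityMask;
      LONG      Priority;
      LONG      BasePriority;
   } THREAD_BASIC_INFORMATION_X;

   BOOL ReadWow64ContextFromCallout(
      HANDLE hProcess, HANDLE hThread, WOW64_CONTEXT *Wow64Context)
   {
      THREAD_BASIC_INFORMATION_X tbi;
      ULONGLONG                  slotValue;
      SIZE_T                     read;

      // Locate the 64-bit TEB of the target thread.
      if (!NT_SUCCESS(NtQueryInformationThread(hThread,
            (THREADINFOCLASS)0 /* ThreadBasicInformation */,
            &tbi, sizeof(tbi), NULL)))
         return FALSE;

      // Read Teb->TlsSlots[ 1 ] (TlsSlots assumed at offset 0x1480 in the
      // 64-bit TEB, per the public type information).
      if (!ReadProcessMemory(hProcess,
            (PUCHAR)tbi.TebBaseAddress + 0x1480 + sizeof(ULONGLONG),
            &slotValue, sizeof(slotValue), &read))
         return FALSE;

      // Skip the ULONG prefix and read the saved WOW64_CONTEXT.
      return ReadProcessMemory(hProcess,
         (PUCHAR)(ULONG_PTR)slotValue + sizeof(ULONG),
         Wow64Context, sizeof(*Wow64Context), &read);
   }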

The other case that is significant is that where the Wow64 thread is running 32-bit code and is not in a Wow64 callout. In this case, because Wow64 runs x86 code natively, one simply needs to capture the 64-bit context of the desired thread and truncate all of the 64-bit registers to their 32-bit counterparts.

Setting the context of a Wow64 thread works exactly like retrieving the context of a Wow64 thread, except in reverse; one either modifies the 64-bit thread context if the thread is running 32-bit code, or one modifies the saved context record based off of the 64-bit TEB of the desired thread (which will be restored when the thread resumes execution).

I have posted a basic implementation of a version of Wow64GetThreadContext that operates on pre-Windows-Vista platforms. Note that this implementation is incomplete; it does not translate floating point registers, nor does it only act on the subset of registers requested by the caller in CONTEXT::ContextFlags. The provided code also does not implement Wow64SetThreadContext; implementing the “set” operation and extending the “get” operation to fully conform to GetThreadContext semantics are left as an exercise for the reader.

This code will operate on Vista x64 as well, although I would strongly recommend using the documented API on Vista and later platforms instead.

Note that the operation of Wow64 on IA64 platforms is completely different from that on x64. This information does not apply in any way to the IA64 version of Wow64.

Thread Local Storage, part 8: Wrap-up

Wednesday, October 31st, 2007

This is the final post in the Thread Local Storage series, which is comprised of the following articles:

  1. Thread Local Storage, part 1: Overview
  2. Thread Local Storage, part 2: Explicit TLS
  3. Thread Local Storage, part 3: Compiler and linker support for implicit TLS
  4. Thread Local Storage, part 4: Accessing __declspec(thread) data
  5. Thread Local Storage, part 5: Loader support for __declspec(thread) variables (process initialization time)
  6. Thread Local Storage, part 6: Design problems with the Windows Server 2003 (and earlier) approach to implicit TLS
  7. Thread Local Storage, part 7: Windows Vista support for __declspec(thread) in demand loaded DLLs
  8. Thread Local Storage, part 8: Wrap-up

By now, much of the inner workings of TLS (both implicit and explicit) on Windows should appear less mysterious, and the reasons behind a number of the seemingly arbitrary restrictions and limitations (maximum counts of explicit TLS slots on various operating systems, and limitations with respect to the usage of __declspec(thread) in demand loaded DLLs) should be clearer. Although many of these things can (and should) be considered implementation details that are subject to change, knowing how things work “under the hood” often comes in useful from time to time. For example, with an understanding of why there’s a hard limit to the number of available explicit TLS slots, the importance of reusing one TLS slot for many variables (by placing them into a structure that is pointed to by the contents of a TLS slot) should become clear, as in the sketch below.
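
As a concrete (hypothetical) sketch of that pattern, consider a module that keeps all of its per-thread state behind a single explicit TLS slot; all names below are invented for illustration:

   #include <windows.h>

   // All of the module's per-thread state lives in one structure...
   typedef struct _PER_THREAD_STATE
   {
      int  RequestCount;
      char LastMessage[ 128 ];
   } PER_THREAD_STATE;

   // ...so only a single TLS slot is consumed, no matter how many
   // per-thread variables the module accumulates over time.
   static DWORD g_TlsSlot = TLS_OUT_OF_INDEXES;

   BOOL ModuleTlsInit(void)
   {
      g_TlsSlot = TlsAlloc();
      return g_TlsSlot != TLS_OUT_OF_INDEXES;
   }

   PER_THREAD_STATE* GetPerThreadState(void)
   {
      PER_THREAD_STATE* State = (PER_THREAD_STATE*)TlsGetValue(g_TlsSlot);

      if (!State)
      {
         // First access on this thread; allocate the block on demand.
         State = (PER_THREAD_STATE*)HeapAlloc(GetProcessHeap(),
            HEAP_ZERO_MEMORY, sizeof(*State));
         TlsSetValue(g_TlsSlot, State);
      }

      return State;
   }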

Many of the details of implicit TLS are actually rather set in stone at this point, due to the fact that the compiler has been emitting code to directly access the ThreadLocalStoragePointer field in the TEB. Interestingly enough, this makes ThreadLocalStoragePointer a “guaranteed portable” part of the TEB, along with the NT_TIB header, despite the fact that the contents between the two are not defined to be portable (and are certainly not across, say, Windows 95).

Most of the inner workings of TLS are fairly straightforward, although there are some clever tricks employed to deal with scenarios such as TLS slots being released while threads are active. Many of the operational details of day to day TLS operation, such as how explicit TLS operates, are significantly different on Windows 95 and other operating systems of the 16-bit Windows lineage, so I would not recommend relying on the details of the implementation of TLS for non-NT-based systems.

Incidentally, most of the operating system itself does not use TLS in the way that it is exposed to third party programs. Instead, many operating system components either have their own dedicated fields in the TEB, or for larger amounts of data that may not need to be allocated for every thread in the system, a pointer field that can be filled with a pointer to a memory block at runtime if desired. For instance, there’s a ReservedForNtRpc field, a number of fields set aside for OpenGL ICDs (so much for Microsoft not supporting OpenGL), a WinSockData field for ws2_32, and many other similar fields for various operating system components.

This doesn’t mean that these components are really getting preferential treatment, as for the most part, an access to such a field in the TEB is in practice not really slower than an access through the documented TLS APIs. The benefit from providing these components with their own dedicated storage in the TEB is that in many cases, these components are already going to be active. If said operating system components used conventional TLS, then this would significantly detract from the already limited number of TLS slots available for use by third party components.

Some components do actually use standard TLS, or at least the space allocated in the TEB for standard TLS slots (though in special circumstances and without going through the standard explicit TLS APIs). For example, the 64-bit portion of the Wow64 layer in a 32-bit process repurposes some of the 64-bit TLS slots (which would normally be completely unused in such a process) for its own internal usage, thereby avoiding the need for dedicated storage in the TEB. That, however, is a story for another day.

Thread Local Storage, part 7: Windows Vista support for __declspec(thread) in demand loaded DLLs

Tuesday, October 30th, 2007

Yesterday, I outlined some of the pitfalls behind the approach that the loader has traditionally taken to implicit TLS, in Windows Server 2003 and earlier releases of the operating system.

With Windows Vista, Microsoft has taken a stab at alleviating some of the issues that make __declspec(thread) unusable for demand loaded DLLs. Although solving the problem may appear simple at first (one would tend to think that all that would need to be done would be to track and process TLS data for new modules as they’re loaded), the reality of the situation is unfortunately a fair amount more complicated than that.

At heart is the fact that implicit TLS is really only designed from the get-go to support operation at process initialization time. For example, this becomes evident when one considers what would need to be done to allocate a TLS slot for a new module. This is in and of itself problematic, as the per-module TLS array is allocated at process initialization time, with only enough space for the modules that were present (and using TLS) at that time. Expanding the array is in this case a difficult thing to safely do, considering the code that the compiler generates for accessing TLS data.

The problem resides in the fact that the compiler-generated code reads the current thread’s ThreadLocalStoragePointer and then later dereferences the returned TLS array with the current module’s TLS index. Because all of this is done without synchronization, it is not in general safe to just switch out the old ThreadLocalStoragePointer with a new array and then release the old array from another thread context, as there is no way to ensure that the thread whose TLS array is being modified was not in the middle of accessing the TLS array.
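
Conceptually, the compiler-generated access is equivalent to the following sketch, built for x86 (MY_MODULE_TLS and g_tls_index are stand-ins for the per-module TLS section and the compiler’s _tls_index; the 0x2C offset of ThreadLocalStoragePointer in the x86 TEB comes from the debugger type information). Nothing synchronizes the two steps:

   #include <windows.h>
   #include <intrin.h>

   typedef struct _MY_MODULE_TLS { int someVariable; } MY_MODULE_TLS;
   ULONG g_tls_index; // stand-in for the compiler's _tls_index

   void SetSomeVariable(int value)
   {
      // Load TEB->ThreadLocalStoragePointer (fs:[0x2C] on x86)...
      PVOID *tlsArray = (PVOID *)__readfsdword(0x2C);

      // ...then, with no lock held, index it by the module's TLS index.
      // If another thread swapped and freed the array between these two
      // steps, this dereference would touch freed memory.
      MY_MODULE_TLS *vars = (MY_MODULE_TLS *)tlsArray[ g_tls_index ];
      vars->someVariable = value;
   }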

A further difficulty presents itself in that there now needs to be a mechanism to proactively go out and place a new TLS module block into each running thread’s TLS array, as there may be multiple threads active when a module is demand-loaded. This is further complicated by the fact that said modifications are required to be performed before DllMain is called for the incoming module, and while the loader lock is still held by the current thread. This implies that, once again, the alterations to the TLS arrays of other threads will need to be performed by the current thread, without the cooperation of additional threads that are active in the process at the time of the DLL load.

These constraints are responsible for the bulk of the complexity of the new loader code in Windows Vista for TLS-related operations. The general concept behind how the new TLS support operates is as follows:

First, a new module is loaded via LdrLoadDll (which is used to implement LoadLibrary and similar Win32 functions). The loader examines the module to determine if it makes use of implicit TLS. If not, then no TLS-specific handling is performed and the typical loaded module processing occurs.

If an incoming module does make use of TLS, however, then LdrpHandleTlsData (an internal helper routine) is called to initialize support for the new module’s implicit TLS usage. LdrpHandleTlsData determines whether there is room in the ThreadLocalStoragePointer arrays of currently loaded threads for the new module’s TLS slot (with Windows Vista, the array can initially be larger than the total number of modules using TLS at process initialization time, for cheaper expansion of TLS data when a new module using TLS is demand-loaded). Because all running threads will at any given time have the same amount of space in their ThreadLocalStoragePointer, this is easily accomplished by a global variable to keep track of the array length. This variable is the SizeOfBitMap member of LdrpTlsBitmap, an RTL_BITMAP structure.

Depending on whether the existing ThreadLocalStoragePointer arrays are sufficient to contain the new module, LdrpHandleTlsData allocates room for the TLS variable block for the new module and possibly new TLS arrays to store in the TEB of running threads. After the new data is allocated for each thread for the incoming module, a new process information class (ProcessTlsInformation) is utilized with an NtSetInformationProcess call to ask the kernel for help in switching out TLS data for any threads that are currently running in the process. Conceptually, this behavior is similar to ThreadZeroTlsCell, although its implementation is significantly more complicated. This step does not really appear to need to occur in kernel mode and does introduce significant (arguably unnecessary) complexity, so it is unclear why the designers elected to go this route.

In response to the ProcessTlsInformation request, the kernel enumerates threads in the current process and either swaps out one member of the ThreadLocalStoragePointer array for all threads, or swaps out the entire pointer to the ThreadLocalStoragePointer array itself in the TEB for all threads. The previous values for either the requested TLS index or the entire array pointer are then returned to user mode.

LdrpHandleTlsData then inspects the data that was returned to it by the kernel. Generally, this data represents either a TLS data block for a module that has since been unloaded (which is always safe to immediately free), or it represents an old TLS array for an already running thread. In the latter case, it is not safe to release the memory backing the array, as without the cooperation of the thread in question, there is no way to determine when the thread has released all possible references to the old memory block. Since the code to access the TLS array is hardcoded into every program using implicit TLS by the compiler, for practical purposes there is no particularly elegant way to make this determination.

Because it is not easily possible to determine (prove) when the old TLS array pointer will never again be referenced, the loader enqueues the pointer into a list of heap blocks to be released at thread exit time when the thread that owns the old TLS array performs a clean exit. Thus, the old TLS array pointer (if the TLS array was expanded) is essentially intentionally leaked until the thread exits. This is a fairly minor memory loss in practice, as the array itself is an array of pointers only. Furthermore, the array is expanded in such a way that most of the time, a new module will take an unused slot in the array instead of requiring the TLS array to be reallocated each time. This sort of intentional leak is, once again, necessary due to the design of implicit TLS not being particularly conducive to supporting demand loaded modules.

The loader lock itself is used for synchronization with respect to switching out TLS pointers in other threads in the current process. While a thread owns the loader lock, it is guaranteed that no other thread will attempt to modify its TLS array (or that of any other thread). Because the old TLS array pointers are kept if the TLS array is reallocated, there is no risk of touching deallocated memory when the swap is made, even though the threads whose TLS pointers are being swapped have no synchronization with respect to reading the TLS arrays in their TEBs.

When a module is unloaded, the TLS slot occupied by the module is released back into the TLS slot pool, but the module’s TLS variable space is not immediately freed until either individual threads for which TLS variable space were allocated exit, or a new module is loaded and happens to claim the outgoing module’s previous TLS slot.

For those interested, I have posted my interpretation of the new implicit TLS support in Vista. This code has not been completely tested, though it is expected to be correct enough for purposes of understanding the details of the TLS implementation. In particular, I have not verified every SEH scope in the ProcessTlsInformation implementation; the SEH scope statements (handlers in particular) are in many cases logical extrapolations of what the expected behavior should be in such cases. As always, this should be considered an implementation detail that is subject to change without notice in future operating system releases.

(There also appear to be several unfortunate bugs in the Vista implementation of TLS, mostly related to inconsistent states and potential corruption if heap allocations fail at “bad” points in time. These are commented in the above code.)

The handler for the ProcessTlsInformation process set information class does not appear to be a subfunction in reality, but is instead a (rather large) case statement in the implementation of NtSetInformationProcess. It is presented as a subfunction for purposes of clarity. For reference, a control flow graph of NtSetInformationProcess is provided, with the basic blocks relevant to the ProcessTlsInformation case statement shaded. I suspect that this information class holds the record for the most convoluted usage of SEH scopes due to its heavy use of dual input/output parameters.

The information class implementation also appears to take many unconventional shortcuts that, while technically workable for its use cases, would appear to be rather inconsistent with the general way that most other system calls and information classes are architected. The reasoning behind these inconsistencies is not known (perhaps it was a time saver). For example, unlike most other process information classes, the only valid handle that can be used with this information class is NtCurrentProcess(). In other words, the information class handler implementation assumes the caller is the process to be modified.

Thread Local Storage, part 6: Design problems with the Windows Server 2003 (and earlier) approach to implicit TLS

Monday, October 29th, 2007

Last week, I described how the loader handles implicit TLS (as of Windows Server 2003). Although the loader’s support for implicit TLS works out well enough for what it was originally designed for, there are some cases where things do not turn out so happily. If you’ve been following along closely so far, you’ve probably already noticed some of the deficiencies relating to the design of implicit TLS. These defects in the design and implementation of TLS eventually spurred Microsoft to significantly revamp the loader’s implicit TLS support in Windows Vista.

The primary problem with respect to how Windows Server 2003 and earlier Windows versions support implicit TLS is that it just plain doesn’t work at all with DLLs that are dynamically loaded (via LoadLibrary, or LdrLoadDll). In fact, the way that implicit TLS fails if you try to dynamically load a DLL written to utilize it is actually rather spectacularly catastrophic.

What ends up happening is that the new DLL will have no TLS processing by the loader happen whatsoever. With our knowledge of how implicit TLS works at this point, the unfortunate consequences of this behavior should be readily apparent.

When a DLL using implicit TLS is loaded, because the loader doesn’t process the TLS directory, the _tls_index value is not initialized by the loader, nor is there space allocated for the module’s TLS data in the ThreadLocalStoragePointer arrays of running threads. The DLL continues to load, however, and things will appear to work… until the first access to a __declspec(thread) variable occurs, that is.

The compiler typically initializes _tls_index to zero by default, so it retains the value zero in the case where a DLL using implicit TLS is loaded after process initialization time. When an access to a __declspec(thread) variable occurs, the typical implicit TLS variable resolution process occurs. That is, ThreadLocalStoragePointer is fetched from the TEB and is indexed by _tls_index (which will always be zero), and the resultant pointer is assumed to be a pointer to the current thread’s thread local variables. Unfortunately, because the loader didn’t actually set _tls_index to a valid value, the DLL will reference the thread local variable storage of whichever module was legitimately assigned TLS index zero. This is typically going to be the main process executable, although it could be a DLL if the main process executable doesn’t use TLS but is static linked to a DLL that does use TLS.

This results in one of the absolute worst possible kinds of problems to debug. Now you’ve got one module trampling all over another module’s state, with the guilty module under the (mistaken) belief that the state that it’s referencing is really its own. If you’re lucky, the process has no other users of implicit TLS at all (at process initialization time), in which case ThreadLocalStoragePointer will not be allocated for the current thread and the initial access to a __declspec(thread) variable will simply result in an immediate null pointer dereference. More common, however, is the case that there is somebody in the process already using implicit TLS, in which case the module owning TLS index zero will have its thread local variables corrupted by the newly loaded module.

In this situation, the actual crash is typically long delayed until the first module finally gets around to using its thread local variable storage and fails due to the fact that it’s been overwritten, far after the fact. It is also possible that you’ll get lucky and the newly loaded module’s TLS variables will be much larger in size than those of the module with TLS index zero, in which case the initial access to the __declspec(thread) variable might immediately fault if it is sufficiently beyond the length of the heap allocation used for the already loaded module’s TLS variable storage. Of course, the offset of the variable accessed might fall somewhere between the end of the allocation used for the original module’s TLS variable storage and the edge of the current heap segment (page), in which case heap corruption will occur instead of corruption of the original module’s TLS variables for the current thread. (The loader uses the process heap to satisfy module TLS variable block allocations.)

Perhaps the only saving grace of the loader’s limitation with respect to implicit TLS and demand loaded DLLs is that due to the fact that the loader’s support for this situation has (not) operated correctly for so long now, many programmers know well enough to stay away from implicit TLS when used in conjunction with DLLs (or so I would hope).

These dire consequences of demand loading a module using __declspec(thread) variables are the reason for the seemingly after-the-fact warning about using implicit TLS with demand loaded DLLs in the LoadLibrary documentation on MSDN:

Windows Server 2003 and Windows XP: The Visual C++ compiler supports a syntax that enables you to declare thread-local variables: _declspec(thread). If you use this syntax in a DLL, you will not be able to load the DLL explicitly using LoadLibrary on versions of Windows prior to Windows Vista. If your DLL will be loaded explicitly, you must use the thread local storage functions instead of _declspec(thread). For an example, see Using Thread Local Storage in a Dynamic Link Library.

Clearly, the failure mode of demand loaded DLLs using implicit TLS is far from acceptable from a debugging perspective. Furthermore, this restriction puts a serious crimp in the practical usefulness of the otherwise highly useful __declspec(thread) support that has been baked into the compiler and linker, at least with respect to its usage in DLLs.

Fortunately, the Windows Vista loader takes some steps to address this problem, such that it becomes possible to use __declspec(thread) safely on Windows Vista and future operating system versions. The new loader support for implicit TLS in demand loaded DLLs is fairly complicated, though, due to some unfortunate design consequences of how implicit TLS works.

Next time, I’ll go into some more details on just how the Windows Vista loader supports this scenario, as well as some of the caveats behind the implementation that is used in the loader going forward with Vista.

Thread Local Storage, part 5: Loader support for __declspec(thread) variables (process initialization time)

Friday, October 26th, 2007

Last time, I described the mechanism by which the compiler and linker generate code to access a variable that has been instanced per-thread via the __declspec(thread) extended storage class. Although the compiler and linker have essentially “set the stage” with respect to implicit TLS at this point, the loader is the component that “fills in the dots” and supplies the necessary run-time infrastructure to allow everything to operate.

Specifically, the loader is responsible for managing the allocation of per-module TLS index values, and for the allocation and management of the memory for the ThreadLocalStoragePointer array referred to by the TEB of every thread. Additionally, the loader is also responsible for managing the memory for each module’s thread-instanced (that is, __declspec(thread)-decorated) variables.

The loader’s TLS-related allocation and management duties can conceptually be split up into four distinct areas (Note that this represents the Windows Server 2003 and earlier view of things; I will go over some of the changes that Windows Vista makes to this model in a future posting in the TLS series.):

  1. At process initialization time, allocate _tls_index values, determine the extent of memory required for each module’s TLS block, and call TLS and DLL initializers (in that order).
  2. At thread initialization time, allocate and initialize TLS memory blocks for each module utilizing TLS, allocate the ThreadLocalStoragePointer array for the current thread, and link the TLS memory blocks in to the ThreadLocalStoragePointer array. Additionally, TLS initializers and then DLL initializers (in that order) are invoked for the current thread.
  3. At thread deinitialization time, call TLS deinitializers and then DLL deinitializers (in that order), and release the current thread’s TLS memory blocks for each module using TLS, and release the ThreadLocalStoragePointer array.
  4. At process deinitialization time, call TLS and DLL deinitializers (in that order).

Of course, the loader performs a number of other tasks when these events occur; this is simply a list of those that have some bearing on TLS support.

Most of these operations are fairly straightforward, with the arguable exception of process initialization. Process initialization of TLS is primarily handled in two subroutines inside ntdll, LdrpInitializeTls and LdrpAllocateTls.

LdrpInitializeTls is invoked during process initialization after all DLLs have been loaded, but before any initializer (or TLS) routines have been called. It essentially walks the loaded module list and sums the length of TLS data for each module that contains a valid TLS directory. For each module that contains TLS, a data structure is allocated that contains the length of the module’s TLS data and the TLS index that has been assigned to that module. (The TlsIndex field in the LDR_DATA_TABLE_ENTRY structure appears to be unused except as a flag that the module has TLS (being always set to -1), at least as far back as Windows XP. It is worth mentioning that the WINE implementation of implicit TLS incorrectly uses TlsIndex as the real module TLS index, so it may be unreliable to assume that it is always -1 if you care about working on WINE.)
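
For those who want to poke at this themselves, the TLS directory that LdrpInitializeTls consumes can be located with a short sketch like the following (ImageDirectoryEntryToData is the documented dbghelp routine; link with dbghelp.lib):

   #include <windows.h>
   #include <dbghelp.h>

   // Returns the IMAGE_TLS_DIRECTORY of a loaded module, or NULL if the
   // module does not use implicit TLS.
   PIMAGE_TLS_DIRECTORY FindTlsDirectory(HMODULE Module)
   {
      ULONG Size = 0;
      return (PIMAGE_TLS_DIRECTORY)ImageDirectoryEntryToData(
         (PVOID)Module, TRUE, IMAGE_DIRECTORY_ENTRY_TLS, &Size);
   }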

Modules that use implicit TLS and which are present at initialization time are additionally marked as pinned in memory for the lifetime of the process by LdrpInitializeProcess (the LoadCount of any such module is fixed to 0xFFFF). In practice, this is typically unlikely to matter, as for such modules to be present at process initialization time, they must also, by definition, be static linked to by either the main process image or a dependency of the main process image.

After LdrpInitializeTls has determined which modules use TLS in the current process and has assigned those modules TLS index values, LdrpAllocateTls is called to allocate and initialize module TLS values for the initial thread.

At this point, process initialization continues, eventually resulting in TLS initializers and then DLL initializers (DllMain) being called for loaded modules. (Note that the main process image can have one or more TLS callbacks, even though it cannot have a DLL initializer routine.)

One interesting fact about TLS initializers is that they are always called before DLL initializers for their corresponding DLL. (The process occurs in sequence, such that DLL A’s TLS and DLL initializers are called, then DLL B’s TLS and DLL initializers, and so forth.) This means that TLS initializers need to be careful about making, say, CRT calls (as the C runtime is initialized before the user’s DllMain routine is called, by the actual DLL initializer entrypoint, such that the CRT will not be initialized when a TLS initializer for the module is invoked). This can be dangerous, as global objects will not have been constructed yet; the module will be in a completely uninitialized state except that imports have been snapped.
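
For reference, wiring up a TLS callback with the Microsoft toolchain looks roughly like the sketch below (the .CRT$XLB section placement is the CRT’s registration convention; if the module uses no __declspec(thread) variables, the linker may additionally need to be told to keep _tls_used). Per the caveat above, the callback should be wary of CRT calls:

   #include <windows.h>

   // The callback runs for DLL_PROCESS_ATTACH and friends, before DllMain
   // (and thus before the CRT has initialized this module).
   VOID NTAPI MyTlsCallback(PVOID DllHandle, DWORD Reason, PVOID Reserved)
   {
      if (Reason == DLL_THREAD_ATTACH)
      {
         // Per-thread setup; avoid anything that depends on the CRT or
         // global object construction for the process attach case.
      }
   }

   // Place a pointer to the callback in the .CRT$XLB section, which the
   // CRT links into the image's TLS directory callback array.
   #ifdef _WIN64
   #pragma const_seg(".CRT$XLB")
   EXTERN_C const PIMAGE_TLS_CALLBACK MyTlsCallbackEntry = MyTlsCallback;
   #pragma const_seg()
   #else
   #pragma data_seg(".CRT$XLB")
   EXTERN_C PIMAGE_TLS_CALLBACK MyTlsCallbackEntry = MyTlsCallback;
   #pragma data_seg()
   #endif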

Another point worth mentioning about the loader’s TLS support is that contrary to the Portable Executable specification, the SizeOfZeroFill member of the IMAGE_TLS_DIRECTORY structure is not used (or supported) by the linker or the loader. This means that in practice, all TLS template data is initialized, and the size of the memory block allocated for per-module implicit TLS does not include the SizeOfZeroFill member as the PE documentation (or certain other publications that appear to be based on said documentation) would seem to state. (It seems that the WINE folks happened to get it wrong as well, thanks to the implication in the PE specification that the field is actually used.)

Some programs abuse TLS callbacks for anti-debugging purposes (gaining code execution before the normal process entrypoint routine is executed by creating a TLS callback for the main process image), although this is, in practice, quite obvious as almost all PE images do not use TLS callbacks at all.

Up through Windows Server 2003, the above is really all the loader needs to do with respect to supporting __declspec(thread). While this approach would appear to work quite well, it turns out that there are, in fact, some problems with it (if you’ve been following along thus far, you can probably figure out what they are). More on some of the limitations of the Windows Server 2003 approach to implicit TLS next week.

Thread Local Storage, part 2: Explicit TLS

Tuesday, October 23rd, 2007

Previously, I outlined some of the general design principles behind both flavors of TLS in use on Windows. Anyone can see the design and high level interface to TLS by reading MSDN, though; the interesting parts relate to the implementation itself.

The explicit TLS API is (by far) the simplest of the two classes of TLS in terms of the implementation, as it touches the fewest “moving parts”. As I mentioned last time, there are really just four key functions in the explicit TLS API. The most important two are TlsGetValue and TlsSetValue, which manage the actual setting and retrieving of per-thread pointers.

These two functions are simple enough to annotate entirely. The essential mechanism behind them is that they are basically just “dumb accessors” into an array (two arrays in actuality, TlsSlots and TlsExpansionSlots) in the TEB, which is indexed by the dwTlsIndex argument to return (or set) the desired per-thread variable. The implementation of TlsGetValue on Vista (32-bit) is as follows (TlsSetValue is similar, except that it writes to the arrays instead of reading from them, and has support for demand-allocating the TlsExpansionSlots array; more on that later):

PVOID
WINAPI
TlsGetValue(
   __in DWORD dwTlsIndex
   )
{
   PTEB Teb = NtCurrentTeb(); // fs:[0x18]

   // Reset the last error state.
   Teb->LastErrorValue = 0;

   // If the variable is in the main array, return it.
   if (dwTlsIndex < 64)
      return Teb->TlsSlots[ dwTlsIndex ];

   if (dwTlsIndex > 1088)
      return 0;

   // Otherwise it's in the expansion array.
   // If it's not allocated, we default to zero.
   if (!Teb->TlsExpansionSlots)
      return 0;

   // Fetch the value from the expansion array.
   return Teb->TlsExpansionSlots[ dwTlsIndex - 64 ];
}

(The assembler version (annotated) is also available.)

The TlsSlots array in the TEB is a part of every thread, which gives each thread a guaranteed set of 64 thread local storage indexes. Later on, Microsoft decided that 64 was not enough TLS slots to go around and added the TlsExpansionSlots array, for an additional 1024 TLS slots. The TlsExpansionSlots array is demand-allocated in TlsAlloc if the initial set of 64 slots is exceeded.

(This is, by the way, the nature of the seemingly arbitrary 64 and 1088 TLS slot limitations mentioned by MSDN, for those keeping score.)

TlsAlloc and TlsFree are, for all intents and purposes, implemented just as one would expect. They acquire a lock, search for a free TLS slot (returning the index if one is found), and otherwise indicate to the caller that there are no free slots. If the first 64 slots are exhausted and the TlsExpansionSlots array has not been created, then TlsAlloc will allocate and zero space for 1024 more TLS slots (pointer-sized values), and then update TlsExpansionSlots to refer to the newly allocated storage.

Internally, TlsAlloc and TlsFree utilize the Rtl bitmap package to track usage of individual TLS slots; each bit in a bitmap describes whether a particular TLS slot is free or in use. This allows for reasonably fast (and space efficient) mapping of TLS slot usage for book-keeping purposes.
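
The allocation pattern might be sketched as follows, using the Rtl bitmap routines (shown with their kernel-mode declarations from the WDK; the ntdll exports that kernel32 actually uses are equivalent, and locking is omitted here):

   // Bitmap-backed slot allocation, TlsAlloc/TlsFree style.
   RTL_BITMAP TlsBitmap;
   ULONG TlsBitmapData[ (64 + 1024) / 32 ];

   VOID InitSlotBitmap(VOID)
   {
      RtlInitializeBitMap(&TlsBitmap, TlsBitmapData, 64 + 1024);
   }

   ULONG AllocateSlot(VOID)
   {
      // Find one clear bit, set it, and return its index
      // (0xFFFFFFFF if no free slots remain).
      return RtlFindClearBitsAndSet(&TlsBitmap, 1, 0);
   }

   VOID FreeSlot(ULONG Index)
   {
      // Clear the bit to release the slot.
      RtlClearBits(&TlsBitmap, Index, 1);
   }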

If one has been following along so far, then the question as to what happens when TlsAlloc is called such that it must create the TlsExpansionSlots array after there is already more than one thread in the current process may have come to mind. This might appear to be a problem at first glance, as TlsAlloc only creates the array for the current thread. Although one might be tempted to conclude that, given this behavior of TlsAlloc, explicit TLS therefore doesn’t work reliably above 64 TLS slots if the extra slots are allocated after the second thread in the process is created, this is in fact not the case. There exists some clever sleight of hand that is performed by TlsGetValue and TlsSetValue, which compensates for the fact that TlsAlloc can only create the TlsExpansionSlots memory block for the current thread.

Specifically, if TlsGetValue is called with an array index within the confines of the TlsExpansionSlots array, but the array has not been allocated for the current thread, then zero is returned. (This is the default value for an uninitialized TLS slot, and is thus consequently legal.) Similarly, if TlsSetValue is called with an array index that falls under the TlsExpansionSlots array, and the array has not yet been created, TlsSetValue allocates the memory block on demand and initializes the requested TLS slot.

There also exists one final twist in TlsFree that is required to support the behavior of releasing a TLS slot while there are multiple threads running. A potential problem exists whereby a thread releases a TLS slot, and then it becomes reallocated, following which the previous contents of the TLS slot are still present on other threads running in the process. TlsFree alleviates this problem by asking the kernel for help, in the form of the ThreadZeroTlsCell thread information class. When the kernel sees a NtSetInformationThread call for ThreadZeroTlsCell, it enumerates all threads in the process and writes a zero pointer-length value to each running thread’s instance of the requested TLS slot, thus flushing the old contents and resetting the slot to the unallocated default state. (It is not strictly necessary for this to have been done in kernel mode, although the designers chose to go this route.)
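
The request that TlsFree makes is, to the best of my knowledge, shaped roughly like the sketch below (ThreadZeroTlsCell is class 10 in the well-known THREADINFOCLASS enumeration; as an internal interface, the details here are assumptions rather than a documented contract). Link with ntdll.lib:

   #include <windows.h>
   #include <winternl.h>

   // Ask the kernel to zero the given TLS slot in every thread of the
   // current process.
   VOID ZeroTlsCellInAllThreads(ULONG TlsIndex)
   {
      NtSetInformationThread(GetCurrentThread(),
                             (THREADINFOCLASS)10 /* ThreadZeroTlsCell */,
                             &TlsIndex,
                             sizeof(TlsIndex));
   }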

When a thread exits normally, if the TlsExpansionSlots pointer has been allocated, it is freed to the process heap. (Of course, if a thread is terminated by TerminateThread, the TlsExpansionSlots array is leaked. This is yet one reason among innumerable others why you should stay away from TerminateThread.)

Next up: Examining implicit TLS support (__declspec(thread) variables).

Thread Local Storage, part 1: Overview

Monday, October 22nd, 2007

Windows, like practically any other mainstream multithreading operating system, provides a mechanism to allow programmers to efficiently store state on a per-thread basis. This capability is typically known as Thread Local Storage, and it’s quite handy in a number of circumstances where global variables might need to be instanced on a per-thread basis.

Although the usage of TLS on Windows is fairly well documented, the implementation details of it are not so much (though there is a smattering of third party documentation floating around out there).

Conceptually, TLS is not all that complicated (famous last words), at least from a high level. The general design is that all TLS accesses go through either a pointer or array that is present on the TEB, which is a system-defined data structure that is already instanced per thread.

The “per-thread” resolution of the TEB is fairly well documented, but for the benefit of those that are unaware, the general idea is that one of the segment registers (fs on x86, gs on x64) is repurposed by the OS to point to the base address of the TEB for the current thread. This allows, say, an access to fs:[0x0] (or gs:[0x0] on x64) to always access the TEB allocated for the current thread, regardless of other threads in the address space. The TEB does really exist in the flat address space of the process (and indeed there is a field in the TEB that contains the flat virtual address of it), but the segmentation mechanism is simply used to provide a convenient way to access the TEB quickly without having to search through a list of thread IDs and TEB pointers (or other relatively slow mechanisms).
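
This “flat virtual address of itself” detail is easy to demonstrate with the compiler intrinsics; the Self field of the NT_TIB header at the start of the TEB lives at offset 0x18 (x86) or 0x30 (x64):

   #include <windows.h>
   #include <intrin.h>

   // Returns the flat (linear) address of the current thread's TEB by
   // reading NT_TIB::Self through the segment-based TEB mapping.
   PVOID GetTebFlatAddress(VOID)
   {
   #if defined(_M_IX86)
      return (PVOID)__readfsdword(0x18);
   #elif defined(_M_X64)
      return (PVOID)__readgsqword(0x30);
   #else
      return (PVOID)NtCurrentTeb(); // other architectures
   #endif
   }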

On non-x86 and non-x64 architectures, the underlying mechanism by which the TEB is accessed varies, but the general theme is that there is a register of some sort which is always set to the base address of the current thread’s TEB for easy access.

The TEB itself is probably one of the best-documented undocumented Windows structures, primarily because there is type information included for the debugger’s benefit in all recent ntdll and ntoskrnl.exe builds. With this information and a little disassembly work, it is not that hard to understand the implementation behind TLS.

Before we can look at the implementation of how TLS works on Windows, however, it is necessary to know the documented mechanisms to use it. There are two ways to accomplish this task on Windows. The first mechanism is a set of kernel32 APIs (comprising TlsGetValue, TlsSetValue, TlsAlloc, and TlsFree) that allows explicit access to TLS. The usage of the functions is fairly straightforward; TlsAlloc reserves space on all threads for a pointer-sized variable, and TlsGetValue can be used to read this per-thread storage on any thread (TlsSetValue and TlsFree are conceptually similar).

The second mechanism by which TLS can be accessed on Windows is through some special support from the loader (residing in ntdll) and the compiler and linker, which allow “seamless”, implicit usage of thread local variables, just as one would use any global variable, provided that the variables are tagged with __declspec(thread) (when using the Microsoft build utilities). This is more convenient than using the TLS APIs, as one doesn’t need to call a function every time one wants to use a per-thread variable. It also relieves the programmer of having to explicitly remember to call TlsAlloc and TlsFree at initialization and deinitialization time, and it implies an efficient usage of per-thread storage space (implicit TLS operates by allocating a single large chunk of memory, the size of which is defined by the sum of all per-thread variables, for each thread, so that only one index into the implicit TLS array is used for all variables in a module).
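
For contrast with the explicit API, implicit TLS at the source level is just a storage-class annotation (a trivial sketch):

   #include <windows.h>

   // Each thread gets its own copy of this variable; no TlsAlloc or
   // TlsGetValue calls are involved at the source level.
   __declspec(thread) int t_requestDepth = 0;

   void EnterRequest(void) { t_requestDepth++; }
   void LeaveRequest(void) { t_requestDepth--; }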

With the advantages of implicit TLS, why would anyone use the explicit TLS API? Well, it turns out that prior to Windows Vista, there are some rather annoying limitations baked into the loader’s implicit TLS support. Specifically, implicit TLS does not operate when a module using it is not being loaded at process initialization time (during static import resolution). In practice, this means that it is typically not usable except by the main process image (.exe) of a process, and any DLL(s) that are guaranteed to be loaded at initialization time (such as DLL(s) that the main process image static links to).

Next time: Taking a closer look at explicit TLS and how it operates under the hood.

Fast kernel debugging for VMware, part 6: Roadmap to Future Improvements

Thursday, October 11th, 2007

Yesterday’s article described how VMKD currently communicates with DbgEng.dll in order to complete the high-speed connection between a local kernel debugger and the KD stub code running in a VM. At this point, VMKD is essentially operational, with significant improvements over conventional virtual serial port kernel debugging.

That is not to say, however, that nothing remains that could be improved in VMKD. There are a number of areas where significant steps forward could be taken with respect to either performance or end user experience, given a native (in the OS and in VMMs) implementation of the basic concepts behind VMKD. For example, despite the greatly accelerated data rate of VMKD-style kernel debugging, the 1394 kernel debugger transport still outpaces it for writing dump files. (Practically speaking, all operations except writing dump files are much faster on VMKD when compared to 1394.)

This is because the 1394 KD transport can “cheat” when it comes to physical memory reads. As the reader may or may not be aware, 1394 essentially provides an interface to directly access the target’s raw physical memory. DbgEng takes advantage of this capability, and overrides the normal functionality for reading physical memory on the target. Where all other transports send a multitude of DbgKdReadPhysicalMemoryApi packets to the target computer, requesting chunks of physical memory 4000 bytes at a time (4000 bytes is the maximum size of a KD packet across any transport), the 1394 KD client in DbgEng simply pulls the target computer’s physical memory directly “off the wire”, without needing to invoke the DbgKdReadPhysicalMemoryApi request for every 4000 bytes.

This optimization turns out to present very large performance improvements with respect to reading physical memory, as a request to write a dump file is at heart just a large memcpy request, asking to copy the entire contents of physical memory of the target computer to the debugger so that the data can be written to a file. The 1394 KD client approach greatly reduces the amount of code that needs to run for every 4000 bytes of memory, especially in the VM case, where every KD request and response pair involves separate VM exits and all the code that such operations entail, on top of all the guest-side processing logic for handling the DbgKdReadPhysicalMemoryApi request and sending the response data.
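
To put rough numbers on that (a back-of-the-envelope estimate, not a measurement): dumping a guest with 512MB of physical memory at 4000 bytes per DbgKdReadPhysicalMemoryApi request works out to 512 * 1024 * 1024 / 4000, or roughly 134,000 request/response pairs, each paying for guest-side packet processing and, in the VM case, a pair of VM exits, whereas the 1394-style direct read eliminates all of that guest-side work.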

The same sort of optimization can of course be done in principle for virtual machine kernel debugging, but DbgEng lacks a pluggable interface to perform the highly optimized transfer of raw physical memory contents across the wire. One optimization that could be done without the assistance of DbgEng would be to locally interpret the DbgKdReadPhysicalMemoryApi request VMM-side and handle it without ever passing the request on to the guest-side code, but even this is suboptimal, as it introduces a (admittedly short for a local KD process) round trip for every 4000 bytes of physical memory. If the DbgEng team stepped up to the plate and provided such an extensible interface, it would be much easier to provide the sort of speeds that one sees when writing dumps with local KD.

Another enhancement that could be done Microsoft-side would be a better interface for replacing KD transport modules. Right now, due to the fact that ntoskrnl is static linked to KDCOM.DLL, the OS loader has a hardcoded hack that interprets the KD type in the OS loader options, loads one of the (hardcoded filenames) “kdcom.dll”, “kd1394.dll”, or “kdusb2.dll” modules, and inserts them into the loaded module list under the name “kdcom.dll”. Additionally, the KD transport module appears to be guarded by PatchGuard on Windows x64 editions (at least from the standpoint of PatchGuard 3), and on Windows Vista, Winload.exe enforces a signature check on the KD transport module. These checks are, unfortunately, not particularly conducive to allowing a third party to easily plug themselves into the KD transport path. (Unless virtualization vendors standardize on a way to signal the VMM that the guest wants attention, each virtualization platform is likely to need some slightly different code to effect a VM exit on each KdSendPacket and KdReceivePacket operation.)

Similarly, there are a number of enhancements that virtualization platform vendors could make VMM-side to make the VMKD-style approach more performant. For example, documented pluggable interfaces for communicating with the guest would be a huge step forward (although the virtualization vendor could just implement the whole KD transport replacement themselves instead of relying on a third party solution). VMware appears to be exploring this approach with VMCI, although this interface is unfortunately not supported on VMware Server or any other platforms besides VMware Workstation 6 to the best of my knowledge. Additionally, VMM authors are in the best position to provide documented and supported interfaces to allow pluggable code designed to interface with a VMM to directly access the register, physical, and virtual memory contexts of a given VM.

Virtualization vendors are also in a better position to integrate the installation and activation process for VMM plugins than a third party operating with no support or documentation. For example, the clumsy vmxinject.exe approach that VMKD takes to load its plugin code into the VMware VMM could be completely eliminated by a native architecture for installing, configuring, and loading VMM plugins (VMCI promises to take care of some of this, though not entirely to the extent that I’d hope).

I would strongly encourage Microsoft and virtualization vendors to work together on this front, as at least from the debugging experience (which is a non-trivial, popular use of virtual machines), there’s a significant potential for a better customer experience in the VM kernel debugging arena with a little cooperation here and there. VMKD is essentially a proof of concept showing that fast kernel debugging is absolutely technically possible for Windows virtual machines. Furthermore, with “inside knowledge” of either the kernel or the VMM, it would likely be trivial to implement the sort of pluggable interfaces that would have made the development and testing of VMKD a virtual walk in the park. In other words, if VMKD can be done without help from either Microsoft or VMware, it should be simple for virtualization vendors and Microsoft to implement similar functionality if they work together.

Next time: Parting shots, and thoughts on other improvements beyond simply fast kernel debugging in the virtualization space.