Posts Tagged ‘Loader’

Thread Local Storage, part 7: Windows Vista support for __declspec(thread) in demand loaded DLLs

Tuesday, October 30th, 2007

Yesterday, I outlined some of the pitfalls behind the approach that the loader has traditionally taken to implicit TLS, in Windows Server 2003 and earlier releases of the operating system.

With Windows Vista, Microsoft has taken a stab at alleviating some of the issues that make __declspec(thread) unusable for demand loaded DLLs. Although solving the problem may initially appear simple at first (one would tend to think that all that would need to be done would be to track and procesS TLS data for new modules as they’re loaded), the reality of the situation is unfortunately a fair amount more complicated than that.

At heart is the fact that implicit TLS is really only designed from the get-go to support operation at process initialization time. For example, this becomes evident when ones considers what would need to be done to allocate a TLS slot for a new module. This is in and of itself problematic, as the per-module TLS array is allocated at process initialization time, with only enough space for the modules that were present (and using TLS) at that time. Expanding the array is in this case a difficult thing to safely do, considering the code that the compiler generates for accessing TLS data.

The problem resides in the fact that the compiler reads the address of the current thread’s ThreadLocalStoragePointer and then later on dereferences the returned TLS array with the current module’s TLS index. Because all of this is done without synchronization, it is not in general safe to just switch out the old ThreadLocalStoragePointer with a new array and then release the old array from another thread context, as there is no way to ensure that the thread whose TLS array is being modified was not in the middle of accessing the TLS array.

A further difficulty presents itself in that there now needs to be a mechanism to proactively go out and place a new TLS module block into each running thread’s TLS array, as there may be multiple threads active when a module is demand-loaded. This is further complicated by the fact that said modifications are required to be performed before DllMain is called for the incoming module, and while the loader lock is still held by the current thread. This implies that, once again, the alterations to the TLS arrays of other threads will need to be performed by the current thread, without the cooperation of additional threads that are active in the process at the time of the DLL load.

These constraints are responsible for the bulk of the complexity of the new loader code in Windows Vista for TLS-related operations. The general concept behind how the new TLS support operates is as follows:

First, a new module is loaded via LdrLoadDll (which is used to implement LoadLibrary and similar Win32 functions). The loader examines the module to determine if it makes use of implicit TLS. If not, then no TLS-specific handling is performed and the typical loaded module processing occurs.

If an incoming module does make use of TLS, however, then LdrpHandleTlsData (an internal helper routine) is called to initialize support for the new module’s implicit TLS usage. LdrpHandleTlsData determines whether there is room in the ThreadLocalStoragePointer arrays of currently loaded threads for the new module’s TLS slot (with Windows Vista, the array can initially be larger than the total number of modules using TLS at process initialization time, for cheaper expansion of TLS data when a new module using TLS is demand-loaded). Because all running threads will at any given time have the same amount of space in their ThreadLocalStoragePointer, this is easily accomplished by a global variable to keep track of the array length. This variable is the SizeOfBitMap member of LdrpTlsBitmap, an RTL_BITMAP structure.

Depending on whether the existing ThreadLocalStoragePointer arrays are sufficient to contain the new module, LdrpHandleTlsdata allocates room for the TLS variable block for the new module and possibly new TLS arrays to store in the TEB of running threads. After the new data is allocated for each thread for the incoming module, a new process information class (ProcessTlsInformation) is utilized with an NtSetInformationProcess call to ask the kernel for help in switching out TLS data for any threads that are currently running in the process. Conceptually, this behavior is similar to ThreadZeroTlsCell, although its implementation is significantly more complicated. This step does not really appear to need to occur in kernel mode and does introduce significant (arguably unnecessary) complexity, so it is unclear why the designers elected to go this route.

In response to the ProcessTlsInformation request, the kernel enumerates threads in the current process and either swaps out one member of the ThreadLocalStoragePointer array for all threads, or swaps out the entire pointer to the ThreadLocalStoragePointer array itself in the TEB for all threads. The previous values for either the requested TLS index or the entire array pointer are then returned to user mode.

LdrpHandleTlsData then inspects the data that was returned to it by the kernel. Generally, this data represents either a TLS data block for a module that has been since unloaded (which is always safe to immediately free), or it represents an old TLS array for an already running thread. In the latter case, it is not safe to release the memory backing the array, as without the cooperation of the thread in question, there is no way to determine when the thread has released all possible references to the old memory block. Since the code to access the TLS array is hardcoded into every program using implicit TLS by the compiler, for practical purposes there is no particularly elegant way to make this determinatiion.

Because it is not easily possible to determine (prove) when the old TLS array pointer will never again be referenced, the loader enqueues the pointer into a list of heap blocks to be released at thread exit time when the thread that owns the old TLS array performs a clean exit. Thus, the old TLS array pointer (if the TLS array was expanded) is essentially intentionally leaked until the thread exits. This is a fairly minor memory loss in practice, as the array itself is an array of pointers only. Furthermore, the array is expanded in such a way that most of the time, a new module will take an unused slot in the array instead of requiring the TLS array to be reallocated each time. This sort of intentional leak is, once again, necessary due to the design of implicit TLS not being particular conducive to supporting demand loaded modules.

The loader lock itself is used for synchronization with respect to switching out TLS pointers in other threads in the current process. While a thread owns the loader lock, it is guaranteed that no other thread will attempt to modify the TLS array of it (or any other threads). Because the old TLS array pointers are kept if the TLS array is reallocated, there is no risk of touching deallocated memory when the swap is made, even though the threads whose TLS pointers are being swapped have no synchronization with respect to reading the TLS array in their TEBs.

When a module is unloaded, the TLS slot occupied by the module is released back into the TLS slot pool, but the module’s TLS variable space is not immediately freed until either individual threads for which TLS variable space were allocated exit, or a new module is loaded and happens to claim the outgoing module’s previous TLS slot.

For those interested, I have posted my interpretration of the new implicit TLS support in Vista. This code has not been completely tested, though it is expected to be correct enough for purposes of understanding the details of the TLS implementation. In particular, I have not verified every SEH scope in the ProcessTlsInformation implementation; the SEH scope statements (handlers in particular) are in many cases logical extrapolations of what the expected behavior should be in such cases. As always, it should be considered implementation details and subject to change without notice in future operating system releases.

(There also appear to be several unfortunate bugs in the Vista implementation of TLS, mostly related to inconsistent states and potential corruption if heap allocations fail at “bad” points in time. These are commented in the above code.)

The handler for the ProcessTlsInformation process set information class does not appear to be subfunction in reality, but instead a (rather large) case statement in the implementation of NtSetInformationProcess. It is presented as a subfunction for purposes of clarity. For reference, a control flow graph of NtSetInformationProcess is provided, with the basic blocks relevant to the ProcessTlsInformation case statement shaded. I suspect that this information class holds the record for the most convoluted usage of SEH scopes due to its heavy use of dual input/output parameters.

The information class implementation also appears to take many unconventional shortcuts that while technically workable for the use cases, would appear to be rather inconsistent with the general way that most other system calls and information classes are architected. The reasoning behind these inconsistencies is not known (perhaps as a time saver). For example, unlike most other process information classes, the only valid handle that can be used with this information class is NtCurrentProcess(). In other words, the information class handler implementation assumes the caller is the process to be modified.

Thread Local Storage, part 6: Design problems with the Windows Server 2003 (and earlier) approach to implicit TLS

Monday, October 29th, 2007

Last week, I described how the loader handles implicit TLS (as of Windows Server 2003). Although the loader’s support for implicit TLS works out well enough for what it was originally designed for, there are some cases where things do not turn out so happily. If you’ve been following along closely so far, you’ve probably already noticed some of the deficiencies relating to the design of implicit TLS. These defects in the design and implementation of TLS eventually spurred Microsoft to significantly revamp the loader’s implicit TLS support in Windows Vista.

The primary problem with respect to how Windows Server 2003 and earlier Windows versions support implicit TLS is that it just plain doesn’t work at all with DLLs that are dynamically loaded (via LoadLibrary, or LdrLoadDll). In fact, the way that implicit TLS fails if you try to dynamically load a DLL written to utilize it is actually rather spectacularly catastrophic.

What ends up happening is that the new DLL will have no TLS processing by the loader happen whatsoever. With our knowledge of how implicit TLS works at this point, the unfortunate consequences of this behavior should be readily apparent.

When a DLL using implicit TLS is loaded, because the loader doesn’t process the TLS directory, the _tls_index value is not initialized by the loader, nor is there space allocated for module’s TLS data in the ThreadLocalStoragePointer arrays of running threads. The DLL continues to load, however, and things will appear to work… until the first access to a __declspec(thread) variable occurs, that is.

The compiler typically initializes _tls_index to zero by default, so the value retains the value zero in the case where an implicit TLS using DLL is loaded after process initialization time. When an access to a __declspec(thread) variable occurs, the typical implicit TLS variable resolution process occurs. That is, ThreadLocalStoragePointer is fetched from the TEB and is indexed by _tls_index (which will always be zero), and the resultant pointer is assumed to be a pointer to the current thread’s thread local variables. Unfortunately, because the loader didn’t actually set _tls_index to a valid value, the DLL will reference the thread local variable storage of whichever module was legitimately assigned TLS index zero. This is typically going to be the main process executable, although it could be a DLL if the main process executable doesn’t use TLS but is static linked to a DLL that does use TLS.

This results in one of the absolute worst possible kinds of problems to debug. Now you’ve got one module trampling all over another module’s state, with the guilty module under the (mistaken) belief that the state that it’s referencing is really the guilty module’s own state. If you’re lucky, the process has no implicit TLS using at all (at process initialization time), and the ThreadLocalStoragePointer will not be allocated for the current thread and the initial access to a __declspec(thread) variable will simply result in an immediate null pointer dereference. More common, however, is the case that there is somebody in the process already using implicit TLS, in which case the module owning TLS index zero will have its thread local variables corrupted by the newly loaded module.

In this situation, the actual crash is typically long delayed until the first module finally gets around to using its thread local variable stage and fails due to the fact that it’s been overwritten, far after the fact. It is also possible that you’ll get lucky and the newly loaded module’s TLS variables will be much larger in size than the module with TLS index zero, in which case the initial access to the __declspec(thread) variable might immediately fault if it is sufficiently beyond the length of the heap allocation used for the already loaded module’s TLS variable storage. Of course, the offset of the variable accessed might be somewhere in between the edge of the current heap segment (page) and the end of the allocation used for the original module’s TLS variable storage, in which case heap corruption will occur instead of original module’s TLS variables for the current thread. (The loader uses the process heap to satisfy module TLS variable block allocations.)

Perhaps the only saving grace of the loader’s limitation with respect to implicit TLS and demand loaded DLLs is that due to the fact that the loader’s support for this situation has (not) operated correctly for so long now, many programmers know well enough to stay away from implicit TLS when used in conjunction with DLLs (or so I would hope).

These dire consequences of demand loading a module using __declspec(thread) variables are the reason for the seemingly after-the-fact warning about using implicit TLS with demand loaded DLLs in the LoadLibrary documentation on MSDN:

Windows Server 2003 and Windows XP: The Visual C++ compiler supports a syntax that enables you to declare thread-local variables: _declspec(thread). If you use this syntax in a DLL, you will not be able to load the DLL explicitly using LoadLibrary on versions of Windows prior to Windows Vista. If your DLL will be loaded explicitly, you must use the thread local storage functions instead of _declspec(thread). For an example, see Using Thread Local Storage in a Dynamic Link Library.

Clearly, the failure mode of demand loaded DLLs using implicit TLS is far from acceptable from a debugging perspective. Furthermore, this restriction puts a serious crimp in the practical usefulness of the otherwise highly useful __declspec(thread) support that has been baked into the compiler and linker, at least with respect to its usage in DLLs.

Fortunately, the Windows Vista loader takes some steps to address this problem, such that it becomes possible to use __declspec(thread) safely on Windows Vista and future operating system versions. The new loader support for implicit TLS in demand loaded DLLs is fairly complicated, though, due to some unfortunate design consequences of how implicit TLS works.

Next time, I’ll go in to some more details on just how the Windows Vista loader supports this scenario, as well as some of the caveats behind the implementation that is used in the loader going forward with Vista.

Thread Local Storage, part 5: Loader support for __declspec(thread) variables (process initialization time)

Friday, October 26th, 2007

Last time, I described the mechanism by which the compiler and linker generate code to access a variable that has been instanced per-thread via the __declspec(thread) extended storage class. Although the compiler and linker have essentially “set the stage” with respect to implicit TLS at this point, the loader is the component that “fills in the dots” and supplies the necessary run-time infrastructure to allow everything to operate.

Specifically, the loader is responsible for managing the allocation of per-module TLS index values, the allocation and management of the memory for the ThreadLocalStoragePointer array referred to by the TEB of every thread. Additionally, the loader is also responsible for managing the memory for each module’s thread-instanced (that is, __declspec(thread)-decorated) variables.

The loader’s TLS-related allocation and management duties can conceptually be split up into four distinct areas (Note that this represents the Windows Server 2003 and earlier view of things; I will go over some of the changes that Windows Vista makes this this model in a future posting in the TLS series.):

  1. At process initialization time, allocate _tls_index values, determine the extent of memory required for each module’s TLS block, and call TLS and DLL initializers (in that order).
  2. At thread initialization time, allocate and initialize TLS memory blocks for each module utilizing TLS, allocate the ThreadLocalStoragePointer array for the current thread, and link the TLS memory blocks in to the ThreadLocalStoragePointer array. Additionally, TLS initializers and then DLL initializers (in that order) are invoked for the current thread.
  3. At thread deinitialization time, call TLS deinitializers and then DLL deinitializers (in that order), and release the current thread’s TLS memory blocks for each module using TLS, and release the ThreadLocalStoragePointer array.
  4. At process deinitialization time, call TLS and DLL initializers (in that order).

Of course, the loader performs a number of other tasks when these events occur; this is simply a list of those that have some bearing on TLS support.

Most of these operations are fairly straightforward, with the arguable exception of process initialization. Process initialization of TLS is primarily handled in two subroutines inside ntdll, LdrpInitializeTls and LdrpAllocateTls.

LdrpInitializeTls is invoked during process initialization after all DLLs have been loaded, but before any initializer (or TLS) routines have been called. It essentially walks the loaded module list and sums the length of TLS data for each module that contains a valid TLS directory. For each module that contains TLS, a data structure is allocated that contains the length of the module’s TLS data and the TLS index that has been assigned to that module. (The TlsIndex field in the LDR_DATA_TABLE_ENTRY structure appears to be unused except as a flag that the module has TLS (being always set to -1), at least as far back as Windows XP. It is worth mentioning that the WINE implementation of implicit TLS incorrectly uses TlsIndex as the real module TLS index, so it may be unreliable to assume that it is always -1 if you care about working on WINE.)

Modules that use implicit TLS and which are present at initialization time are additionally marked as pinned in memory for the lifetime of the process by LdrpInitializeProcess (the LoadCount of any such module is fixed to 0xFFFF). In practice, this is typically unlikely to matter, as for such modules to be present at process initialization time, they must also by definition static linked by either the main process image or a dependency of the main process image.

After LdrpInitializeTls has determined which modules use TLS in the current process and has assigned those modules TLS index values, LdrpAllocateTls is called to allocate and initialize module TLS values for the initial thread.

At this point, process initialization continues, eventually resulting in TLS initializers and then DLL initializers (DllMain) being called for loaded modules. (Note that the main process image can have one or more TLS callbacks, even though it cannot have a DLL initializer routine.)

One interesting fact about TLS initializers is that they are always called before DLL initializers for their corresponding DLL. (The process occurs in sequence, such that DLL A’s TLS and DLL initializers are called, then DLL B’s TLS and DLL initializers, and so forth.) This means that TLS initializers need to be careful about making, say, CRT calls (as the C runtime is initialized before the user’s DllMain routine is called, by the actual DLL initializer entrypoint, such that the CRT will not be initialized when a TLS initializer for the module is invoked). This can be dangerous, as global objects will not have been constructed yet; the module will be in a completely uninitialized state except that imports have been snapped.

Another point worth mentioning about the loader’s TLS support is that contrary to the Portable Executable specification, the SizeOfZeroFill member of the IMAGE_TLS_DIRECTORY structure is not used (or supported) by the linker or the loader. This means that in practice, all TLS template data is initialized, and the size of the memory block allocated for per-module implicit TLS does not include the SizeOfZeroFill member as the PE documentation (or certain other publications that appear to be based on said documentation) would seem to state. (It seems that the WINE folks happened to get it wrong as well, thanks to the implication in the PE specification that the field is actually used.)

Some programs abuse TLS callbacks for anti-debugging purposes (gaining code execution before the normal process entrypoint routine is executed by creating a TLS callback for the main process image), although this is, in practice, quite obvious as almost all PE images do not use TLS callbacks at all.

Up through Windows Server 2003, the above is really all the loader needs to do with respect to supporting __declspec(thread). While this approach would appear to work quite well, it turns out that there are, in fact, some problems with it (if you’ve been following along thus far, you can probably figure out what they are). More on some of the limitations of the Windows Server 2003 approach to implicit TLS next week.