Archive for October, 2007

Thread Local Storage, part 8: Wrap-up

Wednesday, October 31st, 2007

This is the final post in the Thread Local Storage series, which consists of the following articles:

  1. Thread Local Storage, part 1: Overview
  2. Thread Local Storage, part 2: Explicit TLS
  3. Thread Local Storage, part 3: Compiler and linker support for implicit TLS
  4. Thread Local Storage, part 4: Accessing __declspec(thread) data
  5. Thread Local Storage, part 5: Loader support for __declspec(thread) variables (process initialization time)
  6. Thread Local Storage, part 6: Design problems with the Windows Server 2003 (and earlier) approach to implicit TLS
  7. Thread Local Storage, part 7: Windows Vista support for __declspec(thread) in demand loaded DLLs
  8. Thread Local Storage, part 8: Wrap-up

By now, much of the inner workings of TLS (both implicit and explicit) on Windows should appear less mysterious, and a number of the seemingly arbitrary restrictions and limitations (maximum counts of explicit TLS slots on various operating systems, and limitations with respect to the usage of __declspec(thread) in demand loaded DLLs) should make more sense. Although many of these things can (and should) be considered implementation details that are subject to change, knowing how things work “under the hood” often comes in useful. For example, with an understanding of why there’s a hard limit to the number of available explicit TLS slots, the importance of reusing one TLS slot for many variables (by placing them into a structure that is pointed to by the contents of a TLS slot) should become clear.
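As a rough illustration of why slot reuse matters, the following sketch models a fixed pool of explicit TLS slots in portable C. It is a simulation and not the Win32 API: sim_tls_alloc, SIM_SLOT_COUNT, struct sim_thread and struct component_state are all hypothetical names, and the pool size of 64 is picked only for illustration. A real program would go through TlsAlloc, TlsSetValue and TlsGetValue instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of a per-process pool of explicit TLS slots. */
#define SIM_SLOT_COUNT 64

unsigned char sim_slot_used[SIM_SLOT_COUNT];

/* Returns a free slot index, or -1 when the pool is exhausted. */
int sim_tls_alloc(void)
{
    for (int i = 0; i < SIM_SLOT_COUNT; i++) {
        if (!sim_slot_used[i]) {
            sim_slot_used[i] = 1;
            return i;
        }
    }
    return -1;
}

/* Per-thread slot storage; one instance stands in for one thread. */
struct sim_thread {
    void *slots[SIM_SLOT_COUNT];
};

/* All of a component's per-thread variables live in one structure... */
struct component_state {
    int  last_error;
    int  call_depth;
    char scratch[32];
};

/* ...so the component consumes a single slot, however many fields it has. */
struct component_state *component_get_state(struct sim_thread *t, int slot)
{
    assert(slot >= 0 && slot < SIM_SLOT_COUNT);

    struct component_state *s = t->slots[slot];
    if (s == NULL) {
        s = calloc(1, sizeof(*s));   /* lazily allocated on first use */
        t->slots[slot] = s;
    }
    return s;
}
```

A component written this way calls sim_tls_alloc once at startup and then reaches every per-thread field through component_get_state, so adding a field to the structure never costs another slot.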

Many of the details of implicit TLS are actually rather set in stone at this point, due to the fact that the compiler has been emitting code to directly access the ThreadLocalStoragePointer field in the TEB. Interestingly enough, this makes ThreadLocalStoragePointer a “guaranteed portable” part of the TEB, along with the NT_TIB header, despite the fact that the contents between the two are not defined to be portable (and are certainly not across, say, Windows 95).

Most of the inner workings of TLS are fairly straightforward, although there are some clever tricks employed to deal with scenarios such as TLS slots being released while threads are active. Many of the operational details of day to day TLS operation, such as how explicit TLS operates, are significantly different on Windows 95 and other operating systems of the 16-bit Windows lineage, so I would not recommend relying on the details of the implementation of TLS for non-NT-based systems.

Incidentally, most of the operating system itself does not use TLS in the way that it is exposed to third party programs. Instead, many operating system components either have their own dedicated fields in the TEB, or for larger amounts of data that may not need to be allocated for every thread in the system, a pointer field that can be filled with a pointer to a memory block at runtime if desired. For instance, there’s a ReservedForNtRpc field, a number of fields set aside for OpenGL ICDs (so much for Microsoft not supporting OpenGL), a WinSockData field for ws2_32, and many other similar fields for various operating system components.

This doesn’t mean that these components are really getting preferential treatment, as for the most part, an access to such a field in the TEB is in practice not really slower than an access through the documented TLS APIs. The benefit from providing these components with their own dedicated storage in the TEB is that in many cases, these components are already going to be active. If said operating system components used conventional TLS, then this would significantly detract from the already limited number of TLS slots available for use by third party components.

Some components do actually use standard TLS, or at least the space allocated in the TEB for standard TLS slots (though in special circumstances and without going through the standard explicit TLS APIs). For example, the 64-bit portion of the Wow64 layer in a 32-bit process repurposes some of the 64-bit TLS slots (which would normally be completely unused in such a process) for its own internal usage, thereby avoiding the need for dedicated storage in the TEB. That, however, is a story for another day.

Thread Local Storage, part 7: Windows Vista support for __declspec(thread) in demand loaded DLLs

Tuesday, October 30th, 2007

Yesterday, I outlined some of the pitfalls behind the approach that the loader has traditionally taken to implicit TLS, in Windows Server 2003 and earlier releases of the operating system.

With Windows Vista, Microsoft has taken a stab at alleviating some of the issues that make __declspec(thread) unusable for demand loaded DLLs. Although solving the problem may appear simple at first (one would tend to think that all that would need to be done would be to track and process TLS data for new modules as they’re loaded), the reality of the situation is unfortunately a fair amount more complicated than that.

At heart is the fact that implicit TLS is really only designed from the get-go to support operation at process initialization time. For example, this becomes evident when one considers what would need to be done to allocate a TLS slot for a new module. This is in and of itself problematic, as the per-module TLS array is allocated at process initialization time, with only enough space for the modules that were present (and using TLS) at that time. Expanding the array is in this case a difficult thing to safely do, considering the code that the compiler generates for accessing TLS data.

The problem resides in the fact that the compiler reads the address of the current thread’s ThreadLocalStoragePointer and then later on dereferences the returned TLS array with the current module’s TLS index. Because all of this is done without synchronization, it is not in general safe to just switch out the old ThreadLocalStoragePointer with a new array and then release the old array from another thread context, as there is no way to ensure that the thread whose TLS array is being modified was not in the middle of accessing the TLS array.

A further difficulty presents itself in that there now needs to be a mechanism to proactively go out and place a new TLS module block into each running thread’s TLS array, as there may be multiple threads active when a module is demand-loaded. This is further complicated by the fact that said modifications are required to be performed before DllMain is called for the incoming module, and while the loader lock is still held by the current thread. This implies that, once again, the alterations to the TLS arrays of other threads will need to be performed by the current thread, without the cooperation of additional threads that are active in the process at the time of the DLL load.

These constraints are responsible for the bulk of the complexity of the new loader code in Windows Vista for TLS-related operations. The general concept behind how the new TLS support operates is as follows:

First, a new module is loaded via LdrLoadDll (which is used to implement LoadLibrary and similar Win32 functions). The loader examines the module to determine if it makes use of implicit TLS. If not, then no TLS-specific handling is performed and the typical loaded module processing occurs.

If an incoming module does make use of TLS, however, then LdrpHandleTlsData (an internal helper routine) is called to initialize support for the new module’s implicit TLS usage. LdrpHandleTlsData determines whether there is room in the ThreadLocalStoragePointer arrays of currently loaded threads for the new module’s TLS slot (with Windows Vista, the array can initially be larger than the total number of modules using TLS at process initialization time, for cheaper expansion of TLS data when a new module using TLS is demand-loaded). Because all running threads will at any given time have the same amount of space in their ThreadLocalStoragePointer, this is easily accomplished by a global variable to keep track of the array length. This variable is the SizeOfBitMap member of LdrpTlsBitmap, an RTL_BITMAP structure.

Depending on whether the existing ThreadLocalStoragePointer arrays are sufficient to contain the new module, LdrpHandleTlsData allocates room for the TLS variable block for the new module and possibly new TLS arrays to store in the TEB of running threads. After the new data is allocated for each thread for the incoming module, a new process information class (ProcessTlsInformation) is utilized with an NtSetInformationProcess call to ask the kernel for help in switching out TLS data for any threads that are currently running in the process. Conceptually, this behavior is similar to ThreadZeroTlsCell, although its implementation is significantly more complicated. This step does not really appear to need to occur in kernel mode and does introduce significant (arguably unnecessary) complexity, so it is unclear why the designers elected to go this route.

In response to the ProcessTlsInformation request, the kernel enumerates threads in the current process and either swaps out one member of the ThreadLocalStoragePointer array for all threads, or swaps out the entire pointer to the ThreadLocalStoragePointer array itself in the TEB for all threads. The previous values for either the requested TLS index or the entire array pointer are then returned to user mode.
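The two swap operations described above can be sketched in portable C as follows. This is a user-mode model only: struct sim_teb, sim_swap_tls_slot and sim_swap_tls_array are hypothetical names, and the single in/out buffer is an assumption that merely mirrors the idea of the previous values being handed back to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the part of a TEB that matters here. */
struct sim_teb {
    void **tls_array;   /* models ThreadLocalStoragePointer */
};

/* Swap one element of every thread's TLS array.
 * new_blocks[i] is installed for thread i; the previous value is written
 * back into new_blocks[i] so the caller can decide what to do with it. */
void sim_swap_tls_slot(struct sim_teb *threads, size_t count,
                       size_t index, void **new_blocks)
{
    assert(threads != NULL && new_blocks != NULL);

    for (size_t i = 0; i < count; i++) {
        void *old = threads[i].tls_array[index];
        threads[i].tls_array[index] = new_blocks[i];
        new_blocks[i] = old;
    }
}

/* Swap the whole array pointer of every thread, returning the old arrays
 * through the same in/out buffer. */
void sim_swap_tls_array(struct sim_teb *threads, size_t count,
                        void ***new_arrays)
{
    assert(threads != NULL && new_arrays != NULL);

    for (size_t i = 0; i < count; i++) {
        void **old = threads[i].tls_array;
        threads[i].tls_array = new_arrays[i];
        new_arrays[i] = old;
    }
}
```

In both cases the caller ends up holding whatever was previously installed, which is exactly the data that has to be examined before anything can be freed.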

LdrpHandleTlsData then inspects the data that was returned to it by the kernel. Generally, this data represents either a TLS data block for a module that has since been unloaded (which is always safe to immediately free), or it represents an old TLS array for an already running thread. In the latter case, it is not safe to release the memory backing the array, as without the cooperation of the thread in question, there is no way to determine when the thread has released all possible references to the old memory block. Since the code to access the TLS array is hardcoded into every program using implicit TLS by the compiler, for practical purposes there is no particularly elegant way to make this determination.

Because it is not easily possible to determine (prove) when the old TLS array pointer will never again be referenced, the loader enqueues the pointer into a list of heap blocks to be released at thread exit time when the thread that owns the old TLS array performs a clean exit. Thus, the old TLS array pointer (if the TLS array was expanded) is essentially intentionally leaked until the thread exits. This is a fairly minor memory loss in practice, as the array itself is an array of pointers only. Furthermore, the array is expanded in such a way that most of the time, a new module will take an unused slot in the array instead of requiring the TLS array to be reallocated each time. This sort of intentional leak is, once again, necessary due to the design of implicit TLS not being particularly conducive to supporting demand loaded modules.
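The “retire now, free at thread exit” idea can be modeled in a few lines of portable C. This is only a sketch of the concept and not the loader’s code: struct sim_tls_thread, struct retired_array, sim_expand_tls_array and sim_thread_exit are hypothetical names, and a single structure stands in for one thread.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Node in a per-thread list of retired TLS arrays (hypothetical). */
struct retired_array {
    struct retired_array *next;
    void **array;
};

struct sim_tls_thread {
    void **tls_array;              /* models ThreadLocalStoragePointer */
    size_t capacity;               /* number of entries in tls_array */
    struct retired_array *retired; /* old arrays, freed only at exit */
};

/* Grow the thread's TLS array. The old array is not freed, because the
 * thread may still be reading through a pointer it loaded earlier. */
int sim_expand_tls_array(struct sim_tls_thread *t, size_t new_capacity)
{
    assert(new_capacity > t->capacity);

    void **bigger = calloc(new_capacity, sizeof(*bigger));
    if (bigger == NULL)
        return 0;

    if (t->tls_array != NULL) {
        struct retired_array *node = malloc(sizeof(*node));
        if (node == NULL) {
            free(bigger);
            return 0;              /* nothing was changed on failure */
        }
        memcpy(bigger, t->tls_array, t->capacity * sizeof(*bigger));
        node->array = t->tls_array;    /* retire, do not free */
        node->next = t->retired;
        t->retired = node;
    }

    t->tls_array = bigger;
    t->capacity = new_capacity;
    return 1;
}

/* At thread exit no reader can remain, so everything is released.
 * Returns the number of retired arrays that were freed. */
size_t sim_thread_exit(struct sim_tls_thread *t)
{
    size_t freed = 0;
    while (t->retired != NULL) {
        struct retired_array *node = t->retired;
        t->retired = node->next;
        free(node->array);
        free(node);
        freed++;
    }
    free(t->tls_array);
    t->tls_array = NULL;
    t->capacity = 0;
    return freed;
}
```

A reader that captured the old array pointer before the expansion can keep dereferencing it safely; the cost is one small array of pointers per expansion, held until the owning thread exits.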

The loader lock itself is used for synchronization with respect to switching out TLS pointers in other threads in the current process. While a thread owns the loader lock, it is guaranteed that no other thread will attempt to modify its TLS array (or that of any other thread). Because the old TLS array pointers are kept if the TLS array is reallocated, there is no risk of touching deallocated memory when the swap is made, even though the threads whose TLS pointers are being swapped have no synchronization with respect to reading the TLS array in their TEBs.

When a module is unloaded, the TLS slot occupied by the module is released back into the TLS slot pool, but the module’s TLS variable space is not freed until either the individual threads for which TLS variable space was allocated exit, or a new module is loaded and happens to claim the outgoing module’s previous TLS slot.
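The slot lifetime described above can be sketched with a small single-thread model in portable C. All names here (struct sim_tls_state, sim_module_load, sim_module_unload, sim_thread_cleanup, SIM_MAX_MODULES) are hypothetical, and the fixed capacity is an illustrative simplification.

```c
#include <assert.h>
#include <stdlib.h>

#define SIM_MAX_MODULES 8   /* illustrative capacity, not a Windows limit */

/* Single-thread model: slot_in_use mirrors the loader's slot bitmap and
 * blocks[] mirrors one thread's TLS array. */
struct sim_tls_state {
    unsigned char slot_in_use[SIM_MAX_MODULES];
    void *blocks[SIM_MAX_MODULES];
};

/* Load: claim the lowest free slot. If a stale block from an unloaded
 * module is still parked there, this is the point where it is freed. */
int sim_module_load(struct sim_tls_state *s, size_t tls_size)
{
    for (int i = 0; i < SIM_MAX_MODULES; i++) {
        if (!s->slot_in_use[i]) {
            void *block = calloc(1, tls_size);
            if (block == NULL)
                return -1;
            free(s->blocks[i]);        /* stale block, or NULL */
            s->blocks[i] = block;
            s->slot_in_use[i] = 1;
            return i;
        }
    }
    return -1;
}

/* Unload: only the slot goes back to the pool. The block stays put. */
void sim_module_unload(struct sim_tls_state *s, int slot)
{
    assert(slot >= 0 && slot < SIM_MAX_MODULES && s->slot_in_use[slot]);
    s->slot_in_use[slot] = 0;
}

/* Thread exit: release every block, including stale ones. */
void sim_thread_cleanup(struct sim_tls_state *s)
{
    for (int i = 0; i < SIM_MAX_MODULES; i++) {
        free(s->blocks[i]);
        s->blocks[i] = NULL;
    }
}
```

Between sim_module_unload and the next sim_module_load that reuses the slot, the old block is still reachable through the array, which matches the delayed release described in the text.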

For those interested, I have posted my interpretation of the new implicit TLS support in Vista. This code has not been completely tested, though it is expected to be correct enough for purposes of understanding the details of the TLS implementation. In particular, I have not verified every SEH scope in the ProcessTlsInformation implementation; the SEH scope statements (handlers in particular) are in many cases logical extrapolations of what the expected behavior should be in such cases. As always, it should be considered implementation details and subject to change without notice in future operating system releases.

(There also appear to be several unfortunate bugs in the Vista implementation of TLS, mostly related to inconsistent states and potential corruption if heap allocations fail at “bad” points in time. These are commented in the above code.)

The handler for the ProcessTlsInformation process set information class does not appear to be a subfunction in reality, but instead a (rather large) case statement in the implementation of NtSetInformationProcess. It is presented as a subfunction for purposes of clarity. For reference, a control flow graph of NtSetInformationProcess is provided, with the basic blocks relevant to the ProcessTlsInformation case statement shaded. I suspect that this information class holds the record for the most convoluted usage of SEH scopes due to its heavy use of dual input/output parameters.

The information class implementation also appears to take many unconventional shortcuts that while technically workable for the use cases, would appear to be rather inconsistent with the general way that most other system calls and information classes are architected. The reasoning behind these inconsistencies is not known (perhaps as a time saver). For example, unlike most other process information classes, the only valid handle that can be used with this information class is NtCurrentProcess(). In other words, the information class handler implementation assumes the caller is the process to be modified.

Thread Local Storage, part 6: Design problems with the Windows Server 2003 (and earlier) approach to implicit TLS

Monday, October 29th, 2007

Last week, I described how the loader handles implicit TLS (as of Windows Server 2003). Although the loader’s support for implicit TLS works out well enough for what it was originally designed for, there are some cases where things do not turn out so happily. If you’ve been following along closely so far, you’ve probably already noticed some of the deficiencies relating to the design of implicit TLS. These defects in the design and implementation of TLS eventually spurred Microsoft to significantly revamp the loader’s implicit TLS support in Windows Vista.

The primary problem with respect to how Windows Server 2003 and earlier Windows versions support implicit TLS is that it just plain doesn’t work at all with DLLs that are dynamically loaded (via LoadLibrary, or LdrLoadDll). In fact, the way that implicit TLS fails if you try to dynamically load a DLL written to utilize it is actually rather spectacularly catastrophic.

What ends up happening is that the new DLL will have no TLS processing by the loader happen whatsoever. With our knowledge of how implicit TLS works at this point, the unfortunate consequences of this behavior should be readily apparent.

When a DLL using implicit TLS is loaded, because the loader doesn’t process the TLS directory, the _tls_index value is not initialized by the loader, nor is there space allocated for the module’s TLS data in the ThreadLocalStoragePointer arrays of running threads. The DLL continues to load, however, and things will appear to work… until the first access to a __declspec(thread) variable occurs, that is.

The compiler typically initializes _tls_index to zero by default, so it retains the value zero in the case where a DLL using implicit TLS is loaded after process initialization time. When an access to a __declspec(thread) variable occurs, the typical implicit TLS variable resolution process occurs. That is, ThreadLocalStoragePointer is fetched from the TEB and is indexed by _tls_index (which will always be zero), and the resultant pointer is assumed to be a pointer to the current thread’s thread local variables. Unfortunately, because the loader didn’t actually set _tls_index to a valid value, the DLL will reference the thread local variable storage of whichever module was legitimately assigned TLS index zero. This is typically going to be the main process executable, although it could be a DLL if the main process executable doesn’t use TLS but is statically linked to a DLL that does use TLS.

This results in one of the absolute worst possible kinds of problems to debug. Now you’ve got one module trampling all over another module’s state, with the guilty module under the (mistaken) belief that the state that it’s referencing is really the guilty module’s own state. If you’re lucky, the process has no modules using implicit TLS at all (at process initialization time), in which case the ThreadLocalStoragePointer will not be allocated for the current thread and the initial access to a __declspec(thread) variable will simply result in an immediate null pointer dereference. More common, however, is the case that there is somebody in the process already using implicit TLS, in which case the module owning TLS index zero will have its thread local variables corrupted by the newly loaded module.

In this situation, the actual crash is typically long delayed until the first module finally gets around to using its thread local variable storage and fails due to the fact that it’s been overwritten, far after the fact. It is also possible that you’ll get lucky and the newly loaded module’s TLS variables will be much larger in size than those of the module with TLS index zero, in which case the initial access to the __declspec(thread) variable might immediately fault if it is sufficiently beyond the length of the heap allocation used for the already loaded module’s TLS variable storage. Of course, the offset of the variable accessed might be somewhere in between the end of the allocation used for the original module’s TLS variable storage and the edge of the current heap segment (page), in which case heap corruption will occur instead of corruption of the original module’s TLS variables for the current thread. (The loader uses the process heap to satisfy module TLS variable block allocations.)
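The aliasing failure can be reproduced in miniature with a portable C model. Nothing here touches the real TEB: struct sim_module, sim_tls_var and sim_demo_corruption are hypothetical names, and a local array stands in for the thread’s ThreadLocalStoragePointer array.

```c
#include <assert.h>
#include <stddef.h>

/* One simulated module: its _tls_index and the offset of one of its
 * thread-local variables inside the module's TLS block. */
struct sim_module {
    unsigned tls_index;     /* stays 0 if the loader never sets it */
    size_t   var_offset;
};

/* Mirrors the compiler-generated sequence: fetch the TLS array, index it
 * by _tls_index, then add the variable's offset within the block. */
int *sim_tls_var(void **tls_array, const struct sim_module *m)
{
    char *block = tls_array[m->tls_index];
    return (int *)(block + m->var_offset);
}

/* Returns the executable's variable after the late-loaded DLL writes to
 * what it believes is its own variable. */
int sim_demo_corruption(void)
{
    int exe_block[2] = { 0, 1234 };           /* the EXE's real TLS block */
    void *tls_array[1] = { exe_block };       /* only index 0 exists */

    struct sim_module exe = { 0, sizeof(int) };  /* index 0, assigned */
    struct sim_module dll = { 0, sizeof(int) };  /* index 0, by default */

    assert(*sim_tls_var(tls_array, &exe) == 1234);
    *sim_tls_var(tls_array, &dll) = 42;       /* DLL "sets its own" variable */
    return *sim_tls_var(tls_array, &exe);     /* EXE state has been replaced */
}
```

The DLL’s store goes through without any fault, and the damage only becomes visible when the executable next reads its own variable.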

Perhaps the only saving grace of the loader’s limitation with respect to implicit TLS and demand loaded DLLs is that due to the fact that the loader’s support for this situation has (not) operated correctly for so long now, many programmers know well enough to stay away from implicit TLS when used in conjunction with DLLs (or so I would hope).

These dire consequences of demand loading a module using __declspec(thread) variables are the reason for the seemingly after-the-fact warning about using implicit TLS with demand loaded DLLs in the LoadLibrary documentation on MSDN:

Windows Server 2003 and Windows XP: The Visual C++ compiler supports a syntax that enables you to declare thread-local variables: _declspec(thread). If you use this syntax in a DLL, you will not be able to load the DLL explicitly using LoadLibrary on versions of Windows prior to Windows Vista. If your DLL will be loaded explicitly, you must use the thread local storage functions instead of _declspec(thread). For an example, see Using Thread Local Storage in a Dynamic Link Library.

Clearly, the failure mode of demand loaded DLLs using implicit TLS is far from acceptable from a debugging perspective. Furthermore, this restriction puts a serious crimp in the practical usefulness of the otherwise highly useful __declspec(thread) support that has been baked into the compiler and linker, at least with respect to its usage in DLLs.

Fortunately, the Windows Vista loader takes some steps to address this problem, such that it becomes possible to use __declspec(thread) safely on Windows Vista and future operating system versions. The new loader support for implicit TLS in demand loaded DLLs is fairly complicated, though, due to some unfortunate design consequences of how implicit TLS works.

Next time, I’ll go into some more details on just how the Windows Vista loader supports this scenario, as well as some of the caveats behind the implementation that is used in the loader going forward with Vista.

VMKD released

Sunday, October 28th, 2007

I have posted an update to VMKD. Since the last release, the following things have changed (there is a changelog included with the package):

  1. Fixed an assert that kdvmware.sys was tripping on checked builds of the kernel (whoops). There was a bug in the code that was reprotecting kdcom.dll as a part of assuming control over the KD I/O routines.
  2. Added a potential fix for occasional difficulties resynchronizing with the guest across a reboot if DbgEng is not restarted. If you are still seeing synchronization problems from time to time, I’d be interested to see debug output from vmxpatch (available by attaching a debugger to vmware-vmx.exe) and DbgEng itself (available with CTRL-D/CTRL-ALT-D in kd.exe or WinDbg.exe, respectively).
  3. Added support for partial checked builds in a rather limited fashion. Any checked kernel that is used with VMKD ought to be named “krnltest.exe” in the guest. This seemingly arbitrary limitation is present because the file name specified via /KERNEL= is the actual name that appears in the loaded module list, and VMKD uses string comparisons on loaded module list file names to find the kernel image in-memory. There are certainly “better” ways to do this, but the current approach is fairly simple and aside from checked builds, tends to be the most reliable and officially supported way across a wide range of OS versions. Any file name may be specified for the checked HAL module in a partial checked build configuration.

    In the future, I may update the check to be more clever about finding the kernel so as to not rely on string comparisons, but it does not really appear to be worth the time for most purposes at this point.

Additionally, it has been confirmed that VMKD works with VMware Server 1.0.4 (no changes were required on VMKD’s end, and previous releases will work with VMware Server 1.0.4 as well). I still have not gotten around to verifying the operation on VMware Workstation, as for most purposes I have moved my VMware usage almost completely over to VMware Server.

Now, back to your regularly scheduled coverage on the depths of thread local storage on Windows…

Thread Local Storage, part 5: Loader support for __declspec(thread) variables (process initialization time)

Friday, October 26th, 2007

Last time, I described the mechanism by which the compiler and linker generate code to access a variable that has been instanced per-thread via the __declspec(thread) extended storage class. Although the compiler and linker have essentially “set the stage” with respect to implicit TLS at this point, the loader is the component that “fills in the dots” and supplies the necessary run-time infrastructure to allow everything to operate.

Specifically, the loader is responsible for managing the allocation of per-module TLS index values, and for the allocation and management of the memory for the ThreadLocalStoragePointer array referred to by the TEB of every thread. Additionally, the loader is also responsible for managing the memory for each module’s thread-instanced (that is, __declspec(thread)-decorated) variables.

The loader’s TLS-related allocation and management duties can conceptually be split up into four distinct areas (note that this represents the Windows Server 2003 and earlier view of things; I will go over some of the changes that Windows Vista makes to this model in a future posting in the TLS series):

  1. At process initialization time, allocate _tls_index values, determine the extent of memory required for each module’s TLS block, and call TLS and DLL initializers (in that order).
  2. At thread initialization time, allocate and initialize TLS memory blocks for each module utilizing TLS, allocate the ThreadLocalStoragePointer array for the current thread, and link the TLS memory blocks in to the ThreadLocalStoragePointer array. Additionally, TLS initializers and then DLL initializers (in that order) are invoked for the current thread.
  3. At thread deinitialization time, call TLS deinitializers and then DLL deinitializers (in that order), and release the current thread’s TLS memory blocks for each module using TLS, and release the ThreadLocalStoragePointer array.
  4. At process deinitialization time, call TLS and DLL deinitializers (in that order).

Of course, the loader performs a number of other tasks when these events occur; this is simply a list of those that have some bearing on TLS support.

Most of these operations are fairly straightforward, with the arguable exception of process initialization. Process initialization of TLS is primarily handled in two subroutines inside ntdll, LdrpInitializeTls and LdrpAllocateTls.

LdrpInitializeTls is invoked during process initialization after all DLLs have been loaded, but before any initializer (or TLS) routines have been called. It essentially walks the loaded module list and sums the length of TLS data for each module that contains a valid TLS directory. For each module that contains TLS, a data structure is allocated that contains the length of the module’s TLS data and the TLS index that has been assigned to that module. (The TlsIndex field in the LDR_DATA_TABLE_ENTRY structure appears to be unused except as a flag that the module has TLS (being always set to -1), at least as far back as Windows XP. It is worth mentioning that the WINE implementation of implicit TLS incorrectly uses TlsIndex as the real module TLS index, so it may be unreliable to assume that it is always -1 if you care about working on WINE.)

Modules that use implicit TLS and which are present at initialization time are additionally marked as pinned in memory for the lifetime of the process by LdrpInitializeProcess (the LoadCount of any such module is fixed to 0xFFFF). In practice, this is typically unlikely to matter, as for such modules to be present at process initialization time, they must also by definition be statically linked by either the main process image or a dependency of the main process image.

After LdrpInitializeTls has determined which modules use TLS in the current process and has assigned those modules TLS index values, LdrpAllocateTls is called to allocate and initialize module TLS values for the initial thread.
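The two-pass structure described above (assign indexes once per process, then build a TLS array per thread) can be sketched in portable C. This is a simplified model under stated assumptions, not the ntdll implementation: struct sim_image, sim_initialize_tls and sim_allocate_tls are hypothetical names, and a module “uses TLS” here simply when its template pointer is non-NULL.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified view of a loaded module. */
struct sim_image {
    const void *tls_template;   /* NULL when there is no TLS directory */
    size_t      tls_size;       /* size of the template in bytes */
    unsigned    tls_index;      /* filled in by sim_initialize_tls */
};

/* Pass 1 (process init): give each TLS-using module the next index.
 * Returns the number of modules that use TLS. */
unsigned sim_initialize_tls(struct sim_image *images, size_t count)
{
    unsigned next_index = 0;
    for (size_t i = 0; i < count; i++) {
        if (images[i].tls_template != NULL)
            images[i].tls_index = next_index++;
    }
    return next_index;
}

/* Pass 2 (per thread): build the TLS array and copy each module's
 * template into a fresh block. Returns NULL on allocation failure. */
void **sim_allocate_tls(const struct sim_image *images, size_t count,
                        unsigned tls_modules)
{
    void **array = calloc(tls_modules ? tls_modules : 1, sizeof(*array));
    if (array == NULL)
        return NULL;

    for (size_t i = 0; i < count; i++) {
        if (images[i].tls_template == NULL)
            continue;
        assert(images[i].tls_index < tls_modules);

        void *block = malloc(images[i].tls_size);
        if (block == NULL) {
            for (unsigned j = 0; j < tls_modules; j++)
                free(array[j]);
            free(array);
            return NULL;
        }
        memcpy(block, images[i].tls_template, images[i].tls_size);
        array[images[i].tls_index] = block;
    }
    return array;
}
```

Each thread gets private copies of every template, so a write through one thread’s array never touches the template or another thread’s block.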

At this point, process initialization continues, eventually resulting in TLS initializers and then DLL initializers (DllMain) being called for loaded modules. (Note that the main process image can have one or more TLS callbacks, even though it cannot have a DLL initializer routine.)

One interesting fact about TLS initializers is that they are always called before DLL initializers for their corresponding DLL. (The process occurs in sequence, such that DLL A’s TLS and DLL initializers are called, then DLL B’s TLS and DLL initializers, and so forth.) This means that TLS initializers need to be careful about making, say, CRT calls (as the C runtime is initialized before the user’s DllMain routine is called, by the actual DLL initializer entrypoint, such that the CRT will not be initialized when a TLS initializer for the module is invoked). This can be dangerous, as global objects will not have been constructed yet; the module will be in a completely uninitialized state except that imports have been snapped.

Another point worth mentioning about the loader’s TLS support is that contrary to the Portable Executable specification, the SizeOfZeroFill member of the IMAGE_TLS_DIRECTORY structure is not used (or supported) by the linker or the loader. This means that in practice, all TLS template data is initialized, and the size of the memory block allocated for per-module implicit TLS does not include the SizeOfZeroFill member as the PE documentation (or certain other publications that appear to be based on said documentation) would seem to state. (It seems that the WINE folks happened to get it wrong as well, thanks to the implication in the PE specification that the field is actually used.)

Some programs abuse TLS callbacks for anti-debugging purposes (gaining code execution before the normal process entrypoint routine is executed by creating a TLS callback for the main process image), although this is, in practice, quite obvious as almost all PE images do not use TLS callbacks at all.

Up through Windows Server 2003, the above is really all the loader needs to do with respect to supporting __declspec(thread). While this approach would appear to work quite well, it turns out that there are, in fact, some problems with it (if you’ve been following along thus far, you can probably figure out what they are). More on some of the limitations of the Windows Server 2003 approach to implicit TLS next week.

Thread Local Storage, part 4: Accessing __declspec(thread) data

Thursday, October 25th, 2007

Yesterday, I outlined how the compiler and linker cooperate to support TLS. However, I didn’t mention just what exactly goes on under the hood when one declares a __declspec(thread) variable and accesses it.

Before the inner workings of a __declspec(thread) variable access can be explained, however, it is necessary to discuss several more special variables in tlssup.c. These special variables are referenced by _tls_used to create the TLS directory for the image.

The first variable of interest is _tls_index, which is implicitly referenced by the compiler in the per-thread storage resolution mechanism any time a thread local variable is referenced (well, almost every time; there’s an exception to this, which I’ll mention later on). _tls_index is also the only variable declared in tlssup.c that uses the default allocation storage class. Internally, it represents the current module’s TLS index. The per-module TLS index is, in principle, similar to a TLS index returned by TlsAlloc. However, the two are not compatible, and there exists significantly more work behind the per-module TLS index and its supporting code. I’ll cover all of that later as well; for now, just bear with me.

The definitions of _tls_start and _tls_end appear as so in tlssup.c:

#pragma data_seg(".tls")

#if defined (_M_IA64) || defined (_M_AMD64)
_CRTALLOC(".tls")
#endif
char _tls_start = 0;

#pragma data_seg(".tls$ZZZ")

#if defined (_M_IA64) || defined (_M_AMD64)
_CRTALLOC(".tls$ZZZ")
#endif
char _tls_end = 0;

This code creates the two variables and places them at the start and end of the “.tls” section. The compiler and linker will automatically assume a default allocation section of “.tls” for all __declspec(thread) variables, such that they will be placed between _tls_start and _tls_end in the final image. The two variables are used to tell the linker the bounds of the TLS storage template section, via the image’s TLS directory (_tls_used).

Now that we know how __declspec(thread) works from a language level, it is necessary to understand the supporting code the compiler generates for an access to a __declspec(thread) variable. This supporting code is, fortunately, fairly straightforward. Consider the following test program:

__declspec(thread) int threadedint = 0;

int __cdecl wmain(int ac,
   wchar_t **av)
{
   threadedint = 42;

   return 0;
}

For x64, the compiler generated the following code:

mov	 ecx, DWORD PTR _tls_index
mov	 rax, QWORD PTR gs:88
mov	 edx, OFFSET FLAT:threadedint
mov	 rax, QWORD PTR [rax+rcx*8]
mov	 DWORD PTR [rdx+rax], 42

Recall that the gs segment register refers to the base address of the TEB on x64. 88 (0x58) is the offset in the TEB for the ThreadLocalStoragePointer member on x64 (more on that later):

   +0x058 ThreadLocalStoragePointer : Ptr64 Void

If we examine the code after the linker has run, however, we’ll notice something strange:

mov     ecx, cs:_tls_index
mov     rax, gs:58h
mov     edx, 4
mov     rax, [rax+rcx*8]
mov     dword ptr [rdx+rax], 2Ah ; 42
xor     eax, eax

If you haven’t noticed it already, the offset of the “threadedint” variable was resolved to a small value (4). Recall that in the pre-link disassembly, the “mov edx, 4” instruction was “mov edx, OFFSET FLAT:threadedint”.

Now, 4 isn’t a very flat address (one would expect an address within the confines of the executable image to be used). What happened?

Well, it turns out that the linker has some tricks up its sleeve that were put into play here. The “offset” of a __declspec(thread) variable is assumed to be relative to the base of the “.tls” section by the linker when it is resolving address references. If one examines the “.tls” section of the image, things begin to make a bit more sense:

0000000001007000 _tls segment para public 'DATA' use64
0000000001007000      assume cs:_tls
0000000001007000     ;org 1007000h
0000000001007000 _tls_start        dd 0
0000000001007004 ; int threadedint
0000000001007004 ?threadedint@@3HA dd 0
0000000001007008 _tls_end          dd 0

The offset of “threadedint” from the start of the “.tls” section is indeed 4 bytes. But all of this still doesn’t explain how the instructions the compiler generated access a variable that is instanced per thread.

The “secret sauce” here lies in the following three instructions:

mov     ecx, cs:_tls_index
mov     rax, gs:58h
mov     rax, [rax+rcx*8]

These instructions fetch ThreadLocalStoragePointer out of the TEB and index it by _tls_index. The resulting pointer is then indexed again with the offset of threadedint from the start of the “.tls” section to form a complete pointer to this thread’s instance of the threadedint variable.

In C, the code that the compiler generated could be visualized as follows:

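// This represents the ".tls" section
typedef struct _MODULE_TLS_DATA
{
   int tls_start;
   int threadedint;
   int tls_end;
} MODULE_TLS_DATA, * PMODULE_TLS_DATA;

PTEB             Teb;
PMODULE_TLS_DATA TlsData;

Teb     = NtCurrentTeb();
TlsData = Teb->ThreadLocalStoragePointer[ _tls_index ];

TlsData->threadedint = 42;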

This should look familiar if you’ve used explicit TLS before. The typical paradigm for explicit TLS is to place a structure pointer in a TLS slot; to access thread local state, one retrieves the per-thread instance of the structure and then references the appropriate variable through the structure pointer. The difference here is that the compiler and linker (and loader, more on that later) cooperated to save you (the programmer) from having to do all of that explicitly; all you had to do was declare a __declspec(thread) variable and all of this happens magically behind the scenes.
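As a rough illustration of the two-step lookup described above, the following portable C sketch models it with ordinary data structures. FakeTeb, ModuleTlsData, tls_resolve and set_threadedint are hypothetical names invented for this example; the real TEB has many more fields, and the real access is emitted inline by the compiler rather than through helper functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one module's ".tls" layout. */
typedef struct ModuleTlsData {
    int tls_start;
    int threadedint;
    int tls_end;
} ModuleTlsData;

/* Hypothetical, heavily simplified TEB: only the field used here. */
typedef struct FakeTeb {
    void **ThreadLocalStoragePointer; /* one entry per module TLS index */
} FakeTeb;

/* Step 1: index the per-thread array by the module's TLS index.
   Step 2: add the variable's offset from the start of ".tls". */
void *tls_resolve(const FakeTeb *teb, unsigned tls_index, size_t offset)
{
    assert(teb != NULL && teb->ThreadLocalStoragePointer != NULL);
    char *module_block = teb->ThreadLocalStoragePointer[tls_index];
    return module_block + offset;
}

/* Equivalent of "threadedint = value" as run by the thread owning teb. */
void set_threadedint(const FakeTeb *teb, unsigned tls_index, int value)
{
    int *p = tls_resolve(teb, tls_index, offsetof(ModuleTlsData, threadedint));
    *p = value;
}
```

Giving two FakeTeb instances their own ModuleTlsData block and calling set_threadedint on one of them leaves the other untouched, which is the whole point of the indirection through the TEB.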

There’s actually an additional curve that the compiler will sometimes throw with respect to how implicit TLS variables work from a code generation perspective. You may have noticed that I showed the x64 version of an access to a __declspec(thread) variable; this is because, by default, x86 builds of a .exe involve a special optimization, /GA (“Optimize for Windows Application”, quite possibly the worst name for a compiler flag ever), that eliminates the step of referencing the special _tls_index variable by assuming that it is zero.

This optimization is only possible with a .exe that will run as the main process image. The assumption works in this case because the loader assigns per-module TLS index values on a sequential basis (based on the loaded module list), and the main process image should be the second thing in the loaded module list, after NTDLL (which, now that this optimization is being used, can never have any __declspec(thread) variables, or it would get TLS index zero instead of the main process image). It’s worth noting that in the (extremely rare) case that a .exe exports functions and is imported by another .exe, this optimization will cause random corruption if the imported .exe happens to use __declspec(thread).
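To make the sequential assignment concrete, here is a small sketch of how indexes might be handed out while walking a module list in load order. FakeModule and assign_tls_indexes are hypothetical names, and the real loader tracks considerably more state; this only models the ordering argument made above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one entry in the loaded module list. */
typedef struct FakeModule {
    const char *name;
    int has_tls_directory; /* nonzero if the image has a TLS directory */
    int tls_index;         /* assigned index, or -1 if the module has no TLS */
} FakeModule;

/* Walk the list in load order and hand out indexes 0, 1, 2, ...
   Returns the number of indexes that were assigned. */
size_t assign_tls_indexes(FakeModule *modules, size_t count)
{
    size_t next = 0;
    assert(modules != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (modules[i].has_tls_directory)
            modules[i].tls_index = (int)next++;
        else
            modules[i].tls_index = -1;
    }
    return next;
}
```

With a list of ntdll (no TLS) followed by the main image (with TLS), the main image receives index zero, which is the value that /GA hard-codes.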

For reference, with /GA enabled, the x86 build of the above code results in the following instructions:

mov     eax, large fs:2Ch
mov     ecx, [eax]
mov     dword ptr [ecx+4], 2Ah ; 42

Remember that on x86, fs points to the base address of the TEB, and that ThreadLocalStoragePointer is at offset +0x2C from the base of the x86 TEB.

Notice that there is no reference to _tls_index; the compiler assumes that it will take on the value zero. If one examines a .dll built with the x86 compiler, the /GA optimization is always disabled, and _tls_index is used as expected.

The magic behind __declspec(thread) extends beyond just the compiler and linker, however. Something still has to set up the storage for each module’s per-thread state, and that something is the loader. More on how the loader plays a part in this complex process next time.

Thread Local Storage, part 3: Compiler and linker support for implicit TLS

Wednesday, October 24th, 2007

Last time, I discussed the mechanisms by which so-called explicit TLS operates (the TlsGetValue, TlsSetValue and other associated supporting routines).

Although explicit TLS is certainly fairly heavily used, many of the more “interesting” pieces about how TLS works in fact relate to the work that the loader does to support implicit TLS, or __declspec(thread) variables (in CL). While both TLS mechanisms are designed to provide a similar effect, namely the capability to store information on a per-thread basis, many aspects of the implementations of the two different mechanisms are very different.

When you declare a variable with the __declspec(thread) extended storage class, the compiler and linker cooperate to allocate storage for the variable in a special region in the executable image. By convention, all variables with the __declspec(thread) storage class are placed in the .tls section of a PE image, although this is not technically required (in fact, the thread local variables do not even really need to be in their own section, merely contiguous in memory, at least from the loader’s perspective). On disk, this region of memory contains the initializer data for all thread local variables in a particular image. However, this data is never actually modified and references to a particular thread local variable will never refer to an address within this section of the PE image; the data is merely a “template” to be used when allocating storage for thread local variables after a thread has been created.
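The “template” idea can be sketched in a few lines of portable C: each thread gets its own heap copy of the initializer data, and the template itself is never written to. The function name instantiate_tls_template is an assumption made for this example, and the real loader allocates from its own heap rather than calling malloc.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Build one thread's private copy of a module's TLS template.
   tmpl_start/tmpl_end bound the initializer data (end is exclusive).
   Returns NULL on allocation failure; the caller frees the block. */
void *instantiate_tls_template(const void *tmpl_start, const void *tmpl_end)
{
    assert(tmpl_start != NULL && tmpl_end != NULL);
    assert((const char *)tmpl_end >= (const char *)tmpl_start);

    size_t size = (size_t)((const char *)tmpl_end - (const char *)tmpl_start);
    void *block = malloc(size ? size : 1); /* avoid a zero-byte request */
    if (block != NULL)
        memcpy(block, tmpl_start, size);
    return block;
}
```

Writes made through one thread's copy are invisible both to other threads' copies and to the template, so every new thread starts from the same initial values.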

The compiler and linker also make use of several special variables in the context of implicit TLS support. Specifically, a variable by the name of _tls_used (of the type IMAGE_TLS_DIRECTORY) is created by a portion of the C runtime that is statically linked into every program to represent the TLS directory that will be used in the final image (references to this variable should be extern “C” in C++ code for name decoration purposes, and storage for the variable need not be allocated as the supporting CRT stub code already creates the variable). The TLS directory is a part of the PE header of an executable image which describes to the loader how the image’s thread local variables are to be managed. The linker looks for a variable by the name of _tls_used and ensures that in the on-disk image, it overlaps with the actual TLS directory in the final image.

The source code for the particular section of C runtime logic that declares _tls_used lives in the tlssup.c file (which comes with Visual Studio), making the variable pseudo-documented. The standard declaration for _tls_used is as so:

const IMAGE_TLS_DIRECTORY _tls_used =
{
 (ULONG)(ULONG_PTR) &_tls_start, // start of tls data
 (ULONG)(ULONG_PTR) &_tls_end,   // end of tls data
 (ULONG)(ULONG_PTR) &_tls_index, // address of tls_index
 (ULONG)(ULONG_PTR) (&__xl_a+1), // pointer to callbacks
 (ULONG) 0,                      // size of tls zero fill
 (ULONG) 0                       // characteristics
};

The CRT code also provides a mechanism to allow a program to register a set of TLS callbacks, which are functions with a similar prototype to DllMain that are called when a thread starts or exits (cleanly) in the current process. (These callbacks can even be registered for a main process image, where there is no DllMain routine.) The callbacks are typed as PIMAGE_TLS_CALLBACK, and the TLS directory points to a null-terminated array of callbacks (called in sequence).
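A minimal sketch of how a null-terminated callback array might be walked is shown below. FakeTlsCallback mirrors the three-argument shape of PIMAGE_TLS_CALLBACK (module handle, reason, reserved), but it and every other name here are hypothetical stand-ins so that the example builds without the Windows headers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for PIMAGE_TLS_CALLBACK. */
typedef void (*FakeTlsCallback)(void *module, unsigned long reason, void *reserved);

/* Call each entry in order until the null terminator is reached.
   Returns the number of callbacks that were invoked. */
size_t run_tls_callbacks(const FakeTlsCallback *callbacks, void *module,
                         unsigned long reason)
{
    size_t called = 0;
    assert(callbacks != NULL);
    for (; *callbacks != NULL; callbacks++) {
        (*callbacks)(module, reason, NULL);
        called++;
    }
    return called;
}

/* Two example callbacks that record the order in which they ran. */
int callback_log[4];
size_t callback_log_len;

void first_callback(void *module, unsigned long reason, void *reserved)
{
    (void)module; (void)reason; (void)reserved;
    callback_log[callback_log_len++] = 1;
}

void second_callback(void *module, unsigned long reason, void *reserved)
{
    (void)module; (void)reason; (void)reserved;
    callback_log[callback_log_len++] = 2;
}
```

Because the walk stops at the first null entry, the position of the terminator in memory determines which pointers count as part of the array.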

For a typical image, there will not exist any TLS callbacks (in practice, almost everything uses DllMain to perform per-thread initialization tasks). However, the support is retained and is fully functional. To use the support that the CRT provides for TLS callbacks, one needs to declare a variable that is stored in the specially named “.CRT$XLx” section, where x is a value between A and Z. For example, one might write the following code:

#pragma section(".CRT$XLY",long,read)

extern "C" __declspec(allocate(".CRT$XLY"))
  PIMAGE_TLS_CALLBACK _xl_y  = MyTlsCallback;

The strange business with the special section names is required because the in-memory ordering of the TLS callback pointers is significant. To understand what is happening with this peculiar looking declaration, it is first necessary to understand a bit about how the compiler and linker organize data in the final PE image that is produced.

Non-header data in a PE image is placed into one or more sections, which are regions of memory with a common set of attributes (such as page protection). The __declspec(allocate(“section-name”)) keyword (CL-specific) tells the compiler that a particular variable is to be placed in a specific section in the final executable. The linker additionally has support for concatenating similarly-named sections into one larger section. This support is activated by suffixing a section name with a $ character followed by any other text. The linker concatenates the resulting section into the section whose name is everything before the $ character (the $ itself is dropped).

The linker orders the individual sections alphabetically when concatenating them (due to the usage of the $ character in the section name). This means that in-memory (in the final executable image), a variable in the “.CRT$XLB” section will be after a variable in the “.CRT$XLA” section but before a variable in the “.CRT$XLZ” section. The C runtime uses this quirk of the linker to create a null-terminated array of function pointers to TLS callbacks (with the pointer stored in the “.CRT$XLZ” section being the null terminator). Thus, in order to ensure that the declared function pointer resides within the confines of the TLS callback array being referenced by _tls_used, it is necessary to place it in a section of the form “.CRT$XLx”.
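The ordering rule can be tried out with a short sketch that sorts a handful of section names the way the text describes. It assumes a plain byte-wise comparison of the full name, and order_sections and output_section_name_length are hypothetical helper names, not part of any toolchain.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: order section names byte-wise, as strcmp does. */
int compare_section_names(const void *a, const void *b)
{
    const char *const *lhs = a;
    const char *const *rhs = b;
    return strcmp(*lhs, *rhs);
}

/* Sort contributions such as ".CRT$XLY" into their final in-image order. */
void order_sections(const char **names, size_t count)
{
    assert(names != NULL || count == 0);
    if (count > 1)
        qsort(names, count, sizeof names[0], compare_section_names);
}

/* Length of the merged section's name: everything before the '$'. */
size_t output_section_name_length(const char *name)
{
    assert(name != NULL);
    return strcspn(name, "$");
}
```

Sorting “.CRT$XLZ”, “.CRT$XLY”, “.CRT$XLA” and “.CRT$XLB” this way puts the XLA entry first and the XLZ terminator last, with a user callback in XLY landing just before the terminator.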

The creation of the TLS directory is, however, only one portion of how the compiler and linker work together to support __declspec(thread) variables. Next time, I’ll discuss just how the compiler and linker manage accesses to such variables.

Update: Phil mentions that this support for TLS callbacks does not work before the Visual Studio 2005 release. Be warned if you are still using an old compiler package.

Thread Local Storage, part 2: Explicit TLS

Tuesday, October 23rd, 2007

Previously, I outlined some of the general design principles behind both flavors of TLS in use on Windows. Anyone can see the design and high level interface to TLS by reading MSDN, though; the interesting parts relate to the implementation itself.

The explicit TLS API is (by far) the simpler of the two classes of TLS in terms of the implementation, as it touches the fewest “moving parts”. As I mentioned last time, there are really just four key functions in the explicit TLS API. The most important two are TlsGetValue and TlsSetValue, which manage the actual setting and retrieving of per-thread pointers.

These two functions are simple enough to annotate entirely. The essential mechanism behind them is that they are basically just “dumb accessors” into an array (two arrays in actuality, TlsSlots and TlsExpansionSlots) in the TEB, which is indexed by the dwTlsIndex argument to return (or set) the desired per-thread variable. The implementation of TlsGetValue on Vista (32-bit) is as follows (TlsSetValue is similar, except that it writes to the arrays instead of reading from them, and has support for demand-allocating the TlsExpansionSlots array; more on that later):

LPVOID
WINAPI
TlsGetValue(
	__in DWORD dwTlsIndex
	)
{
   PTEB Teb = NtCurrentTeb(); // fs:[0x18]

   // Reset the last error state.
   Teb->LastErrorValue = 0;

   // If the variable is in the main array, return it.
   if (dwTlsIndex < 64)
      return Teb->TlsSlots[ dwTlsIndex ];

   if (dwTlsIndex > 1088)
      return 0;

   // Otherwise it's in the expansion array.
   // If it's not allocated, we default to zero.
   if (!Teb->TlsExpansionSlots)
      return 0;

   // Fetch the value from the expansion array.
   return Teb->TlsExpansionSlots[ dwTlsIndex - 64 ];
}

(The assembler version (annotated) is also available.)

The TlsSlots array in the TEB is a part of every thread, which gives each thread a guaranteed set of 64 thread local storage indexes. Later on, Microsoft decided that 64 was not enough TLS slots to go around and added the TlsExpansionSlots array, for an additional 1024 TLS slots. The TlsExpansionSlots array is demand-allocated in TlsAlloc if the initial set of 64 slots is exceeded.

(This is, by the way, the nature of the seemingly arbitrary 64 and 1088 TLS slot limitations mentioned by MSDN, for those keeping score.)

TlsAlloc and TlsFree are, for all intents and purposes, implemented just as one would expect. Both operate under a lock; TlsAlloc searches for a free TLS slot, returning the index if one is found and otherwise indicating to the caller that there are no free slots. If the first 64 slots are exhausted and the TlsExpansionSlots array has not been created, then TlsAlloc will allocate and zero space for 1024 more TLS slots (pointer-sized values), and then update the TlsExpansionSlots pointer to refer to the newly allocated storage.

Internally, TlsAlloc and TlsFree utilize the Rtl bitmap package to track usage of individual TLS slots; each bit in a bitmap describes whether a particular TLS slot is free or in use. This allows for reasonably fast (and space efficient) mapping of TLS slot usage for book-keeping purposes.

If you have been following along so far, you may be wondering what happens when TlsAlloc must create the TlsExpansionSlots array after there is already more than one thread in the current process. This might appear to be a problem at first glance, as TlsAlloc only creates the array for the current thread. One might be tempted to conclude that explicit TLS therefore doesn’t work reliably above 64 TLS slots if the extra slots are allocated after the second thread in the process is created, but this is in fact not the case. TlsGetValue and TlsSetValue perform some clever sleight of hand that compensates for the fact that TlsAlloc can only create the TlsExpansionSlots memory block for the current thread.

Specifically, if TlsGetValue is called with an array index within the confines of the TlsExpansionSlots array, but the array has not been allocated for the current thread, then zero is returned. (This is the default value for an uninitialized TLS slot, and is thus legal.) Similarly, if TlsSetValue is called with an array index that falls under the TlsExpansionSlots array, and the array has not yet been created, TlsSetValue allocates the memory block on demand and initializes the requested TLS slot.
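The get/set behavior just described can be modeled in portable C as follows. SlotTeb, fake_tls_get and fake_tls_set are hypothetical names, and calloc stands in for whatever allocator the real implementation uses. One deliberate difference from the TlsGetValue listing earlier: this sketch rejects index 1088 itself, on the assumption that valid expansion indexes run from 64 through 1087.

```c
#include <assert.h>
#include <stdlib.h>

#define MAIN_SLOTS 64u
#define EXPANSION_SLOTS 1024u

/* Hypothetical, simplified per-thread state. */
typedef struct SlotTeb {
    void *TlsSlots[MAIN_SLOTS];
    void **TlsExpansionSlots; /* NULL until first needed on this thread */
} SlotTeb;

void *fake_tls_get(const SlotTeb *teb, unsigned index)
{
    assert(teb != NULL);
    if (index < MAIN_SLOTS)
        return teb->TlsSlots[index];
    if (index >= MAIN_SLOTS + EXPANSION_SLOTS)
        return NULL;
    if (teb->TlsExpansionSlots == NULL)
        return NULL; /* nothing stored on this thread yet: default value */
    return teb->TlsExpansionSlots[index - MAIN_SLOTS];
}

/* Returns 1 on success, 0 on a bad index or allocation failure. */
int fake_tls_set(SlotTeb *teb, unsigned index, void *value)
{
    assert(teb != NULL);
    if (index < MAIN_SLOTS) {
        teb->TlsSlots[index] = value;
        return 1;
    }
    if (index >= MAIN_SLOTS + EXPANSION_SLOTS)
        return 0;
    if (teb->TlsExpansionSlots == NULL) {
        /* Demand-allocate and zero the expansion array for this thread only. */
        teb->TlsExpansionSlots = calloc(EXPANSION_SLOTS, sizeof(void *));
        if (teb->TlsExpansionSlots == NULL)
            return 0;
    }
    teb->TlsExpansionSlots[index - MAIN_SLOTS] = value;
    return 1;
}
```

A thread that has never stored anything in an expansion slot pays no memory cost for the array, yet reads on it still return the expected default of zero.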

There also exists one final twist in TlsFree that is required to support the behavior of releasing a TLS slot while there are multiple threads running. A potential problem exists whereby a thread releases a TLS slot, and then it becomes reallocated, following which the previous contents of the TLS slot are still present on other threads running in the process. TlsFree alleviates this problem by asking the kernel for help, in the form of the ThreadZeroTlsCell thread information class. When the kernel sees an NtSetInformationThread call for ThreadZeroTlsCell, it enumerates all threads in the process and writes a zero pointer-length value to each running thread’s instance of the requested TLS slot, thus flushing the old contents and resetting the slot to the unallocated default state. (It is not strictly necessary for this to have been done in kernel mode, although the designers chose to go this route.)
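In user-mode terms, the effect of the zeroing pass could be sketched like this. ThreadSlots and zero_slot_on_all_threads are hypothetical names; the real work happens in the kernel against live threads, and this model covers only the 64 main slots.

```c
#include <assert.h>
#include <stddef.h>

#define MAIN_SLOTS 64u

/* Hypothetical per-thread slot array (main slots only). */
typedef struct ThreadSlots {
    void *TlsSlots[MAIN_SLOTS];
} ThreadSlots;

/* Clear one slot on every thread so that whoever allocates the index
   next never observes a previous owner's stale pointer. */
void zero_slot_on_all_threads(ThreadSlots *threads, size_t thread_count,
                              unsigned index)
{
    assert(threads != NULL || thread_count == 0);
    assert(index < MAIN_SLOTS);
    for (size_t i = 0; i < thread_count; i++)
        threads[i].TlsSlots[index] = NULL;
}
```

Only the freed index is touched; every other slot on every thread keeps its contents.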

When a thread exits normally, if the TlsExpansionSlots pointer has been allocated, it is freed to the process heap. (Of course, if a thread is terminated by TerminateThread, the TlsExpansionSlots array is leaked. This is yet another reason among innumerable others why you should stay away from TerminateThread.)

Next up: Examining implicit TLS support (__declspec(thread) variables).

Thread Local Storage, part 1: Overview

Monday, October 22nd, 2007

Windows, like practically any other mainstream multithreading operating system, provides a mechanism to allow programmers to efficiently store state on a per-thread basis. This capability is typically known as Thread Local Storage, and it’s quite handy in a number of circumstances where global variables might need to be instanced on a per-thread basis.

Although the usage of TLS on Windows is fairly well documented, the implementation details are much less so (though there is a smattering of third party documentation floating around out there).

Conceptually, TLS is in principle not all that complicated (famous last words), at least from a high level. The general design is that all TLS accesses go through either a pointer or array that is present on the TEB, which is a system-defined data structure that is already instanced per thread.

The “per-thread” resolution of the TEB is fairly well documented, but for the benefit of those that are unaware, the general idea is that one of the segment registers (fs on x86, gs on x64) is repurposed by the OS to point to the base address of the TEB for the current thread. This allows, say, an access to fs:[0x0] (or gs:[0x0] on x64) to always access the TEB allocated for the current thread, regardless of other threads in the address space. The TEB does really exist in the flat address space of the process (and indeed there is a field in the TEB that contains the flat virtual address of it), but the segmentation mechanism is simply used to provide a convenient way to access the TEB quickly without having to search through a list of thread IDs and TEB pointers (or other relatively slow mechanisms).

On non-x86 and non-x64 architectures, the underlying mechanism by which the TEB is accessed varies, but the general theme is that there is a register of some sort which is always set to the base address of the current thread’s TEB for easy access.

The TEB itself is probably one of the best-documented undocumented Windows structures, primarily because there is type information included for the debugger’s benefit in all recent ntdll and ntoskrnl.exe builds. With this information and a little disassembly work, it is not that hard to understand the implementation behind TLS.

Before we can look at the implementation of how TLS works on Windows, however, it is necessary to know the documented mechanisms to use it. There are two ways to accomplish this task on Windows. The first mechanism is a set of kernel32 APIs (comprising TlsGetValue, TlsSetValue, TlsAlloc, and TlsFree) that allows explicit access to TLS. The usage of the functions is fairly straightforward; TlsAlloc reserves space on all threads for a pointer-sized variable, and TlsGetValue can be used to read this per-thread storage on any thread (TlsSetValue and TlsFree are conceptually similar).

The second mechanism by which TLS can be accessed on Windows is through some special support from the loader (residing in ntdll) and the compiler and linker, which allow “seamless”, implicit usage of thread local variables, just as one would use any global variable, provided that the variables are tagged with __declspec(thread) (when using the Microsoft build utilities). This is more convenient than using the TLS APIs, as you don’t need to call a function every time you want to use a per-thread variable. It also relieves the programmer of having to explicitly remember to call TlsAlloc and TlsFree at initialization time and deinitialization time, and it implies an efficient usage of per-thread storage space (implicit TLS operates by allocating a single large chunk of memory for each thread, the size of which is defined by the sum of all per-thread variables, so that only one index into the implicit TLS array is used for all variables in a module).

With the advantages of implicit TLS, why would anyone use the explicit TLS API? Well, it turns out that prior to Windows Vista, there are some rather annoying limitations baked into the loader’s implicit TLS support. Specifically, implicit TLS does not operate when a module using it is not being loaded at process initialization time (during static import resolution). In practice, this means that it is typically not usable except by the main process image (.exe) of a process, and any DLL(s) that are guaranteed to be loaded at initialization time (such as DLL(s) that the main process image statically links to).

Next time: Taking a closer look at explicit TLS and how it operates under the hood.

The optimizer has different traits between the x86 and x64 compilers

Friday, October 19th, 2007

Last time, I described how the x86 compiler sometimes optimizes conditional assignments.

It is worth adding to this sentiment, though, that the x64 compiler is in fact a different beast from the x86 compiler. The x64 instruction set brings new guaranteed-universally-supported instructions, a new calling convention (with new restrictions on how the stack is used in functions), and an increased set of registers. These can have a significant impact on the compiler.

As an example, let’s take a look at what the x64 compiler did when compiling what I can only assume is the same source file that produced the code we saw yesterday. There are a number of differences worth pointing out, even in the small section that I posted. The approximately equivalent section of code in the x64 version is as so:

cmp     cl, 30h
mov     ebp, 69696969h
mov     edx, 30303030h
mov     eax, ebp
mov     esi, 2
cmovnz   eax, edx
mov     [rsp+78h+PacketLeader], eax

There are a number of differences here. First, there’s a very significant change that is not readily apparent from the small code snippets that I’ve pasted. Specifically, in the x86 build of this particular module, this code resided in a small helper subfunction that was called by a large exported API. With the x64 build, by contrast, the compiler decided to inline this function into its caller. (This is why this version of the code just stores the value in a local variable instead of storing it through a parameter pointer.)

Secondly, the compiler used cmovnz instead of going to the trouble of the setcc, dec, and “and” route. Because all x64 processors support the cmovcc family of instructions, the compiler has a free hand to always use them for any x64 platform target.
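For readers who want to see the two approaches side by side, here is a C sketch of the same selection written both ways. The function names are hypothetical, and the mask version is only one plausible shape of a setcc/dec/and style sequence (with a final add), not a transcription of the x86 code from the earlier post.

```c
#include <assert.h>
#include <stdint.h>

#define LEADER_A UINT32_C(0x69696969)
#define LEADER_B UINT32_C(0x30303030)

/* What the cmovnz sequence expresses: keep A when the byte equals 0x30,
   otherwise take B. */
uint32_t select_leader_cmov(uint8_t byte)
{
    return (byte != 0x30) ? LEADER_B : LEADER_A;
}

/* Branch-free arithmetic in the spirit of setcc/dec/and. */
uint32_t select_leader_mask(uint8_t byte)
{
    uint32_t flag = (byte != 0x30); /* setnz: 1 or 0        */
    uint32_t mask = flag - 1u;      /* dec: 0 or 0xFFFFFFFF */
    assert(mask == 0u || mask == UINT32_MAX);
    return (mask & (LEADER_A - LEADER_B)) + LEADER_B; /* and, then add */
}
```

Both functions return the same value for every possible input byte; the conditional-move form simply needs fewer steps to get there.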

There are a number of different reasons why the compiler might perform particular optimizations. Although I’m hardly on the Microsoft compiler team, I might be able to hazard a guess as to why, for instance, the compiler might have decided to inline this code chunk instead of leaving it in a helper function as in the x86 build.

On x86, the call to this helper function looks like so:

push    [ebp+KdContext]
lea     eax, [ebp+PacketLeader]
push    eax
push    [ebp+PacketType]
call    _KdCompReceivePacketLeader@12

Following the call instruction, throughout the main function, there are a number of comparisons between the PacketLeader local variable (that was filled in by KdCompReceivePacketLeader) and one of the constants (0x69696969) that we saw PacketLeader being set to in the code fragment. To be exact, there are three occurrences of the following in short order after the call (though there are control flow structures such as loops in between):

cmp     [ebp+PacketLeader], 69696969h

These are invariably followed by a conditional jump to another location.

Now, when we move to x64, things change a bit. One of the most significant differences between x64 and x86 (aside from the large address space of course) is the addition of a number of new general purpose registers. These are great for the optimizer, as they can allow for a number of things to be cached in registers instead of having to be either spilled to the stack or, in this case, encoded as instruction operands.

Again, to be clear, I don’t work on the x64 compiler, so this is really just my impression of things based on logical deduction. That being said, it would seem to me that one sort of optimization that you might be able to make easier on x64 in this case would be to replace all the large “cmp” instructions that reference the 0x69696969 constant with a comparison against a value cached in a register. This is desirable because a cmp instruction that compares a value dereferenced based on ebp (the frame pointer, in other words, a local variable) with a 4-byte immediate value (0x69696969) is a whopping 7 bytes long.

Now, 7 bytes might not seem like much, but little things like this add up over time and contribute to additional paging I/O by virtue of the program code being larger. Paging I/O is very slow, so it is advantageous for the compiler to try to reduce code size where possible in the interest of cutting down on the number of code pages.

Because x64 has a large number of extra general purpose registers (compared to x86), it is easier for the compiler to “justify” the “expenditure” of devoting a register to, say, caching a frequently used value for purposes of reducing code size.

In this particular instance, because the 0x69696969 constant is referenced in both the helper function and the main function, one benefit of inlining the code is that the constant can be “shared” in a cached register across both the reference in the helper function code and all the comparisons in the main function.

This is essentially what the compiler does in the x64 version. 0x69696969 is loaded into ebp and copied into eax by a mov eax, ebp instruction; depending on the condition flags when the cmovnz executes, eax either keeps that value or is overwritten with 0x30303030 from edx.

Later on in the main function, comparisons against the 0x69696969 constant are performed via a check against ebp instead of an immediate 4-byte operand. For example, the long 7-byte cmp instructions on x86 become the following 4 byte instructions on x64:

cmp     [rsp+78h+PacketLeader], ebp

I’m sure this is probably not the only reason why the function was inlined for the x64 build and not the x86 build, but the optimizer is fairly competent, and I’d be surprised if this kind of possible optimization wasn’t factored in. Other reasons in favor of inlining on x64 are, for instance, the restrictions that the (required) calling convention places against the sort of custom calling conventions possible on x86, and the fact that any non-leaf function that isn’t inlined requires its own unwind metadata entries in the exception directory (which, for small functions, can be a non-trivial amount of overhead compared to the opcode length of the function itself).

Aside from changes about decisions on whether to inline code or not, there are a number of new optimizations that are exclusive to x64. That, however, is a topic for another day.