Thread Local Storage, part 2: Explicit TLS

Previously, I outlined some of the general design principles behind both flavors of TLS in use on Windows. Anyone can see the design and high level interface to TLS by reading MSDN, though; the interesting parts relate to the implementation itself.

The explicit TLS API is (by far) the simpler of the two classes of TLS in terms of implementation, as it touches the fewest “moving parts”. As I mentioned last time, there are really just four key functions in the explicit TLS API. The most important two are TlsGetValue and TlsSetValue, which manage the actual setting and retrieving of per-thread pointers.

These two functions are simple enough to annotate entirely. The essential mechanism behind them is that they are basically just “dumb accessors” into an array (two arrays in actuality, TlsSlots and TlsExpansionSlots) in the TEB, which is indexed by the dwTlsIndex argument to return (or set) the desired per-thread variable. The implementation of TlsGetValue on Vista (32-bit) is as follows (TlsSetValue is similar, except that it writes to the arrays instead of reading from them, and has support for demand-allocating the TlsExpansionSlots array; more on that later):

PVOID
WINAPI
TlsGetValue(
	__in DWORD dwTlsIndex
	)
{
   PTEB Teb = NtCurrentTeb(); // fs:[0x18]

   // Reset the last error state.
   Teb->LastErrorValue = 0;

   // If the variable is in the main array, return it.
   if (dwTlsIndex < 64)
      return Teb->TlsSlots[ dwTlsIndex ];

   // Valid indexes are 0 through 1087 (64 + 1024 slots).
   if (dwTlsIndex >= 1088)
   {
      BaseSetLastNTError( STATUS_INVALID_PARAMETER );
      return 0;
   }

   // Otherwise it's in the expansion array.
   // If it's not allocated, we default to zero.
   if (!Teb->TlsExpansionSlots)
      return 0;

   // Fetch the value from the expansion array.
   return Teb->TlsExpansionSlots[ dwTlsIndex - 64 ];
}

(The assembler version (annotated) is also available.)

The TlsSlots array in the TEB is a part of every thread, which gives each thread a guaranteed set of 64 thread local storage indexes. Later on, Microsoft decided that 64 was not enough TLS slots to go around and added the TlsExpansionSlots array, for an additional 1024 TLS slots. The TlsExpansionSlots array is demand-allocated in TlsAlloc if the initial set of 64 slots is exceeded.

(This is, by the way, the origin of the seemingly arbitrary 64 and 1088 TLS slot limits mentioned by MSDN, for those keeping score: 64 guaranteed slots plus 1024 expansion slots gives 1088.)

TlsAlloc and TlsFree are, for all intents and purposes, implemented just as one would expect. TlsAlloc acquires a lock and searches for a free TLS slot, returning the index if one is found and otherwise indicating to the caller that there are no free slots. If the first 64 slots are exhausted and the TlsExpansionSlots array has not been created, then TlsAlloc will allocate and zero space for 1024 more TLS slots (pointer-sized values), and then update TlsExpansionSlots to refer to the newly allocated storage.

Internally, TlsAlloc and TlsFree utilize the Rtl bitmap package to track usage of individual TLS slots; each bit in a bitmap describes whether a particular TLS slot is free or in use. This allows for reasonably fast (and space efficient) mapping of TLS slot usage for book-keeping purposes.
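
To make the bookkeeping concrete, here is a minimal sketch of bitmap-based slot tracking in portable C. It is not the Rtl bitmap package or the actual TlsAlloc code: slot_bitmap, slot_alloc, and slot_free are hypothetical names, the single 1088-bit bitmap is a simplifying assumption, and locking is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOT_COUNT    1088u                      /* 64 guaranteed + 1024 expansion */
#define BITS_PER_WORD 32u
#define WORD_COUNT    (SLOT_COUNT / BITS_PER_WORD)  /* 1088 / 32 = 34 words */
#define NO_FREE_SLOT  0xFFFFFFFFu                /* same value as TLS_OUT_OF_INDEXES */

/* Hypothetical stand-in for the slot-usage bitmap; a set bit means "in use". */
typedef struct {
    uint32_t words[WORD_COUNT];
} slot_bitmap;

/* Find the lowest clear bit, mark it in use, and return its index. */
uint32_t slot_alloc(slot_bitmap *bm)
{
    assert(bm != NULL);
    for (uint32_t i = 0; i < SLOT_COUNT; i++) {
        uint32_t mask = 1u << (i % BITS_PER_WORD);
        if (!(bm->words[i / BITS_PER_WORD] & mask)) {
            bm->words[i / BITS_PER_WORD] |= mask;
            return i;
        }
    }
    return NO_FREE_SLOT;
}

/* Mark a slot free again; returns 0 if the index is invalid or was not in use. */
int slot_free(slot_bitmap *bm, uint32_t index)
{
    assert(bm != NULL);
    if (index >= SLOT_COUNT)
        return 0;
    uint32_t mask = 1u << (index % BITS_PER_WORD);
    if (!(bm->words[index / BITS_PER_WORD] & mask))
        return 0;
    bm->words[index / BITS_PER_WORD] &= ~mask;
    return 1;
}
```

One bit per slot means the whole table fits in 136 bytes, and a freed index is simply handed out again by the next allocation that scans past it.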

If you have been following along so far, a question may have come to mind: what happens when TlsAlloc must create the TlsExpansionSlots array after there is already more than one thread in the process? This might appear to be a problem at first glance, as TlsAlloc only creates the array for the current thread. One might be tempted to conclude that explicit TLS therefore doesn’t work reliably above 64 TLS slots if the extra slots are allocated after a second thread has been created. This is in fact not the case. TlsGetValue and TlsSetValue perform some clever sleight of hand that compensates for the fact that TlsAlloc can only create the TlsExpansionSlots memory block for the current thread.

Specifically, if TlsGetValue is called with an array index within the confines of the TlsExpansionSlots array, but the array has not been allocated for the current thread, then zero is returned. (This is the default value for an uninitialized TLS slot, and is thus a legal return value.) Similarly, if TlsSetValue is called with an array index that falls within the TlsExpansionSlots array, and the array has not yet been created for the current thread, TlsSetValue allocates the memory block on demand and initializes the requested TLS slot.
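
The lazy handling of the expansion array can be modeled in a few lines of portable C. This is only a sketch under stated assumptions: sim_teb, sim_tls_get, and sim_tls_set are made-up names standing in for the TEB fields and the real functions, calloc stands in for the process heap, and last-error handling is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAIN_SLOTS      64u
#define EXPANSION_SLOTS 1024u

/* Hypothetical stand-in for the two TLS-related fields of a thread's TEB. */
typedef struct {
    void  *TlsSlots[MAIN_SLOTS];
    void **TlsExpansionSlots;   /* NULL until this thread first writes a high slot */
} sim_teb;

/* Read a slot; a missing expansion array reads as NULL, the default slot value. */
void *sim_tls_get(const sim_teb *teb, size_t index)
{
    assert(teb != NULL);
    if (index < MAIN_SLOTS)
        return teb->TlsSlots[index];
    if (index >= MAIN_SLOTS + EXPANSION_SLOTS)
        return NULL;            /* invalid index */
    if (teb->TlsExpansionSlots == NULL)
        return NULL;            /* never written on this thread */
    return teb->TlsExpansionSlots[index - MAIN_SLOTS];
}

/* Write a slot, creating the expansion array on demand; returns 1 on success. */
int sim_tls_set(sim_teb *teb, size_t index, void *value)
{
    assert(teb != NULL);
    if (index < MAIN_SLOTS) {
        teb->TlsSlots[index] = value;
        return 1;
    }
    if (index >= MAIN_SLOTS + EXPANSION_SLOTS)
        return 0;
    if (teb->TlsExpansionSlots == NULL) {
        /* Zero-filled allocation, so every untouched slot reads as NULL here. */
        teb->TlsExpansionSlots = calloc(EXPANSION_SLOTS, sizeof(void *));
        if (teb->TlsExpansionSlots == NULL)
            return 0;
    }
    teb->TlsExpansionSlots[index - MAIN_SLOTS] = value;
    return 1;
}
```

Because a read of a never-written slot and a read from a never-allocated array both produce NULL, a thread cannot tell the two cases apart, which is what lets the allocation be deferred until the first write.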

There is one final twist in TlsFree that is required to support releasing a TLS slot while multiple threads are running. A potential problem exists whereby a thread releases a TLS slot, the slot is then reallocated, and the previous contents of the slot are still present on other threads running in the process. TlsFree alleviates this problem by asking the kernel for help, in the form of the ThreadZeroTlsCell thread information class. When the kernel sees an NtSetInformationThread call for ThreadZeroTlsCell, it enumerates all threads in the process and writes a zero pointer-sized value to each running thread’s instance of the requested TLS slot, thus flushing the old contents and resetting the slot to the unallocated default state. (It is not strictly necessary for this to have been done in kernel mode, although the designers chose to go this route.)
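
The effect of that request can be pictured with a small user-mode model. This is an illustration, not the kernel's code: fake_thread and zero_slot_everywhere are hypothetical names, only the 64 guaranteed slots are modeled, and no synchronization with running threads is shown.

```c
#include <assert.h>
#include <stddef.h>

#define MAIN_SLOTS 64u

/* Hypothetical per-thread record holding only the guaranteed slot array. */
typedef struct {
    void *slots[MAIN_SLOTS];
} fake_thread;

/* Reset one slot to NULL in every thread's array; returns threads touched. */
size_t zero_slot_everywhere(fake_thread *threads, size_t count, size_t index)
{
    assert(threads != NULL || count == 0);
    if (index >= MAIN_SLOTS)
        return 0;               /* out of range for this simplified model */
    for (size_t i = 0; i < count; i++)
        threads[i].slots[index] = NULL;
    return count;
}
```

After the sweep, whoever allocates the same index next sees NULL on every thread, exactly as if the slot had never been used.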

When a thread exits normally, if the TlsExpansionSlots array has been allocated, it is freed back to the process heap. (Of course, if a thread is terminated by TerminateThread, the TlsExpansionSlots array is leaked. This is yet another reason, among innumerable others, why you should stay away from TerminateThread.)

Next up: Examining implicit TLS support (__declspec(thread) variables).


4 Responses to “Thread Local Storage, part 2: Explicit TLS”

  1. Marc Sherman says:

    In this article:

    PTEB Teb = NtCurrentTeb(); // fs:[0x18]

    conflicts with the previous article:

    “This allows, say, an access to fs:[0x0] (or gs:[0x0] on x64) to always access the TEB allocated for the current thread, …”

    Why the discrepancy of offsets (0x18 vs. 0x0)?

    thanks,
    Marc

  2. Skywing says:

    Well, here’s the deal. You can access any part of the TEB via two methods:

    1) You can reference the field you want off of fs, with fs:[offsetof( TEB, Field )]. This is inconvenient, though, as the compiler won’t do this automatically; there isn’t a way to tell the compiler that a structure is “based” at a segment register. Code that uses this method is typically either specially generated by the compiler or written in assembler (although with the newer __readfsdword family of intrinsics, this can now be done in straight C in a generic fashion with CL).

    2) You can resolve the flat address of the TEB and use it for future accesses. This is the most common way of doing it. On x86, at +0x18 into the TEB is a (flat address space) pointer to the TEB:

    +0x018 Self : Ptr32

    And, to be more exact, you would only reference fs:[0x00] if you want the first field of the TEB. (You could do this by resolving the flat address at +0x18 and then dereferencing the pointer with an offset of +0x00, or just reference fs:[0x00] directly.)

    Offset 0 from the TEB start happens to be the exception handler linked list:

    +0x000 ExceptionList : Ptr32

    Code that references fs:[0x00] specifically on x86 is thus often adding or removing an exception handler. Similarly, fs:[0x0E10] on x86 would get you the thread’s guaranteed TLS slot array, just as if you resolved the flat address and used that pointer.

    +0xe10 TlsSlots : [64] Ptr32 Void

    Most C code uses the latter method as there’s a common macro, NtCurrentTeb, which returns a flat pointer to the TEB based off of the Teb->NtTib.Self field (at +0x18 on x86). From there, one can access fields in the TEB naturally through a typed structure pointer.

  3. Marc Sherman says:

    Got it. Thanks.

    Marc
