Posts Tagged ‘CRT’

Why does every heap trace in UMDH get stuck at “malloc”?

Thursday, February 21st, 2008

One of the more useful tools for tracking down memory leaks in Windows is a utility called UMDH that ships with the WinDbg distribution. I’ve previously covered what UMDH does at a high level and how it functions; in a nutshell, it uses special instrumentation in the heap manager that is designed to log stack traces when heap operations occur.

UMDH utilizes the heap manager’s stack trace instrumentation to associate call stacks with outstanding allocations. More specifically, UMDH is capable of taking a “snapshot” of the current state of all heaps in a process, grouping like-sized allocations made from the same call stack, and aggregating them in a useful form.

The general principle of operation is that UMDH is typically run two (or more) times. The first run captures a “baseline” snapshot of the process after it has finished initializing. A baseline is needed because there are always expected to be a number of outstanding allocations while the process is running that would not normally be freed until process exit time: for example, any allocations used to build the command line parameter arrays provided to the main function of a C program, or any other application-derived allocations that would be expected to remain checked out for the lifetime of the program.

This first “baseline” snapshot is essentially intended to be a means to filter out all of these expected, long-running allocations that would otherwise show up as useless noise if one were to simply take a single snapshot of the heap after the process had leaked memory.

The second (and potentially subsequent) snapshots are intended to be taken after the process has leaked a noticeable amount of memory. UMDH is then run again in a special mode that is designed to essentially do a logical “diff” between the “baseline” snapshot and the “leaked” snapshot, filtering out any allocations that were present in both of them and returning a list of new, outstanding allocations, which would generally include any leaked heap blocks (although there may well be legitimate outstanding allocations as well, which is why it is important to ensure that the “leaked” snapshot is taken only after a non-trivial amount of memory has been leaked, if at all possible).
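To make the “diff” step concrete, here is a minimal sketch in C of the idea, not UMDH’s actual implementation. It assumes each snapshot has already been reduced to a list of (backtrace, outstanding bytes) pairs; the trace_total structure and both function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One aggregated snapshot entry: all outstanding bytes sharing a backtrace. */
struct trace_total {
    unsigned trace_id;   /* hypothetical id, e.g. the number in "BackTrace01786" */
    size_t   bytes;      /* outstanding bytes attributed to that backtrace */
};

/* Look up the bytes recorded for trace_id in a snapshot (0 if absent). */
static size_t bytes_for(const struct trace_total *snap, size_t n,
                        unsigned trace_id)
{
    for (size_t i = 0; i < n; i++)
        if (snap[i].trace_id == trace_id)
            return snap[i].bytes;
    return 0;
}

/*
 * Write to out[] every backtrace whose outstanding bytes grew between the
 * baseline and the later snapshot, with bytes set to the growth.
 * Returns the number of entries written (at most max_out).
 */
static size_t diff_snapshots(const struct trace_total *base, size_t nbase,
                             const struct trace_total *later, size_t nlater,
                             struct trace_total *out, size_t max_out)
{
    size_t n = 0;

    assert(base != NULL && later != NULL && out != NULL);
    for (size_t i = 0; i < nlater && n < max_out; i++) {
        size_t before = bytes_for(base, nbase, later[i].trace_id);
        if (later[i].bytes > before) {
            out[n].trace_id = later[i].trace_id;
            out[n].bytes = later[i].bytes - before;
            n++;
        }
    }
    return n;
}
```

Allocations that are present and unchanged in both snapshots drop out, which is exactly why the long-lived startup allocations stop showing up as noise.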

Now, this is all well and good, and while UMDH proves to be a very effective tool for tracking down memory leaks with this strategy, taking a “before” and “after” diff of a problem and analyzing the two to determine what’s gone wrong is hardly a new, ground-breaking concept.

While the theory behind UMDH is sound, however, there are some situations where it can work less than optimally. The most common failure case of UMDH in my experience is not actually so much related to UMDH itself, but rather the heap manager instrumentation code that is responsible for logging stack traces in the first place.

As I had previously discussed, the heap manager stack trace instrumentation logic does not have access to symbols, and on x86, “perfect” stack traces are not generally possible, as there is no metadata attached to a particular function (outside of debug symbols) that describes how to unwind past it.

The typical approach taken on x86 is to assume that all functions in the call stack do not use frame pointer omission (FPO) optimizations that allow the compiler to eliminate the usage of ebp for a function entirely, or even repurpose it for a scratch register.
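The following is a toy model in C of an ebp-chain walk, written against a simulated stack rather than real registers so that it runs anywhere. The slot layout (saved ebp followed by the return address), the zero sentinel that ends the chain, and all names and application addresses are assumptions for illustration; the real capture logic lives inside ntdll.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_WORDS  16
#define END_OF_CHAIN 0   /* hypothetical sentinel: a saved ebp of 0 ends the walk */

/*
 * Follow the saved-ebp chain through a simulated stack. "ebp" is an index
 * into stack[]: for a function with a standard prologue, stack[ebp] holds the
 * caller's ebp and stack[ebp + 1] holds the return address.
 * Returns the number of return addresses written to out[].
 */
static size_t walk_frames(const uintptr_t *stack, size_t words,
                          uintptr_t ebp, uintptr_t *out, size_t max_out)
{
    size_t n = 0;

    assert(stack != NULL && out != NULL);
    while (ebp != END_OF_CHAIN && ebp + 1 < words && n < max_out) {
        uintptr_t next = stack[ebp];
        out[n++] = stack[ebp + 1];
        if (next != END_OF_CHAIN && next <= ebp)
            break;                  /* chain must move toward older frames */
        ebp = next;
    }
    return n;
}

/* Toy call chain: heap routine <- malloc <- app_func <- main. */
static size_t demo_walk(int malloc_uses_frame_pointer,
                        uintptr_t *out, size_t max_out)
{
    uintptr_t stack[STACK_WORDS] = {0};

    /* Heap routine's frame: it saved whatever malloc had in ebp. */
    stack[2] = malloc_uses_frame_pointer ? 6 : 0x488; /* frame link, or a size */
    stack[3] = 0x211A179A;                            /* return into malloc */
    /* malloc's frame (only linked in when malloc keeps a frame pointer). */
    stack[6] = 10;
    stack[7] = 0x00401234;          /* return into app_func (made-up address) */
    /* app_func's frame. */
    stack[10] = END_OF_CHAIN;
    stack[11] = 0x00401000;         /* return into main (made-up address) */

    return walk_frames(stack, STACK_WORDS, 2, out, max_out);
}
```

With a well-behaved chain, demo_walk(1, ...) recovers three return addresses. When the middle function has put a non-pointer value into ebp, demo_walk(0, ...) recovers only the return address into that function and then stops, because the next “frame pointer” is garbage.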

Now, most of the libraries that ship with the operating system in recent OS releases have FPO explicitly turned off for x86 builds, with the sole intent of allowing the built-in stack trace instrumentation logic to be able to traverse through system-supplied library functions up through to application code (after all, if every heap stack trace dead-ended at kernel32!HeapAlloc, the whole concept of heap allocation traces would be fairly useless).

Unfortunately, there happens to be a notable exception to this rule, one that actually came around to bite me at work recently. I was attempting to track down a suspected leak with UMDH in one of our programs, and noticed that all of the allocations were grouped into a single stack trace that dead-ended in a rather spectacularly unhelpful way. Digging in a bit deeper, I found that the individual snapshot dumps from UMDH contained scores of allocations with the following backtrace logged:

00000488 bytes in 0x1 allocations
   (@ 0x00000428 + 0x00000018) by: BackTrace01786

        7C96D6DC : ntdll!RtlDebugAllocateHeap+000000E1
        7C949D18 : ntdll!RtlAllocateHeapSlowly+00000044
        7C91B298 : ntdll!RtlAllocateHeap+00000E64
        211A179A : program!malloc+0000007A

This particular outcome happened to be rather unfortunate, as in the specific case of the program I was debugging at work, virtually all memory allocations in the program (including the ones I suspected of leaking) happened to ultimately get funneled through malloc.

Obviously, getting told that “yes, every leaked memory allocation goes through malloc” isn’t really all that helpful if almost every allocation in the program in question happens to go through malloc. The UMDH output raised the question, however, as to why exactly malloc was breaking the stack traces. Digging in a bit deeper, I discovered the following gem while disassembling the implementation of malloc:

0:011> u program!malloc
[f:\sp\vctools\crt_bld\self_x86\crt\src\malloc.c @ 155]:
211a1720 55              push    ebp
211a1721 8b6c2408        mov     ebp,dword ptr [esp+8]
211a1725 83fde0          cmp     ebp,0FFFFFFE0h

In particular, it would appear that the default malloc implementation in the statically linked CRT in Visual C++ 2005 not only doesn’t use a frame pointer, but it trashes ebp as a scratch register (here, using it as an alias register for the first parameter, the count in bytes of memory to allocate). Disassembling the DLL version of the CRT revealed the same problem; ebp was reused as a scratch register.

What does this all mean? Well, anything using malloc that’s built with Visual C++ 2005 won’t be diagnosable with UMDH or anything else that relies on ebp-based stack traces, at least not on x86 builds. Given that many things internally go through malloc, including operator new (at least in the default implementation), this means that in the default configuration, things get a whole lot harder to debug than they should be.

One workaround here would be to build your own copy of the CRT with /Oy- (force frame pointer usage), but I don’t really consider building the CRT a very viable option. That’s a whole lot of manual work to get up and running correctly on every developer’s machine, not to mention the headaches of having to rebuild for every service release.

For operator new, it’s fortunately fairly straightforward to overload it in a supported way so that it is implemented against a different allocation strategy. In the case of malloc, however, things don’t really have such a happy ending; one is either forced to re-alias the name using preprocessor macro hackery to a custom implementation that does not suffer from a lack of frame pointer usage, or otherwise change all references to malloc/free to refer to a custom allocator function (perhaps implemented against the process heap directly instead of the CRT heap à la malloc).
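As a rough sketch of the macro route, something along these lines could live in a shared header. The names traced_alloc, traced_free and make_buffer are hypothetical, the non-Windows branch exists only so the sketch builds elsewhere, and the translation unit is assumed to be compiled with /Oy- so that the wrappers themselves keep an ebp frame.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>
#endif

/*
 * Hypothetical replacement allocator. On Windows it allocates from the
 * process heap directly, bypassing the CRT's malloc entirely.
 */
static void *traced_alloc(size_t size)
{
#ifdef _WIN32
    return HeapAlloc(GetProcessHeap(), 0, size);
#else
    return malloc(size);    /* fallback; the macro below is not defined yet */
#endif
}

static void traced_free(void *p)
{
#ifdef _WIN32
    if (p != NULL)
        HeapFree(GetProcessHeap(), 0, p);
#else
    free(p);
#endif
}

/* The "macro hackery": from this point on, malloc/free mean the wrappers. */
#define malloc(size) traced_alloc(size)
#define free(p)      traced_free(p)

/* Example caller: its source is unchanged but now routes through the wrappers. */
static char *make_buffer(size_t n)
{
    char *buf;

    assert(n > 0);
    buf = malloc(n);
    if (buf != NULL)
        buf[0] = '\0';
    return buf;
}
```

The usual caveats of this kind of aliasing apply: calloc, realloc and friends would need the same treatment, and a block obtained from the real CRT allocator (for instance, one returned by a library routine) must never be handed to the replacement free.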

So, the next time you use UMDH and get stuck scratching your head while trying to figure out why your stack traces are all dead-ending somewhere less than useful, keep in mind that the CRT itself may be to blame, especially if you’re relying on CRT allocators. Hopefully, in a future release of Visual Studio, the folks responsible for turning off FPO in the standard OS libraries can get in touch with the persons responsible for CRT builds and arrange for the same to be done, if not for the entire CRT, then at least for all the code paths in the standard heap routines. Until then, however, these CRT allocator routines remain roadblocks for effective leak diagnosis, at least when using the better tools available for the job (UMDH).

Most Win32 applications are really multithreaded, even if they don’t know it

Monday, November 12th, 2007

Most Win32 programs out there operate with more than one thread in the process at least some of the time, even if the program code doesn’t explicitly create a thread. The reason for this is that a number of system APIs internally create threads for their purposes.

For example, console Ctrl-C/Ctrl-Break events are handled in a new thread by default. (Internally, CSRSS creates a new thread in the address space of the recipient process, which then calls a kernel32 function to invoke the active console handler list. This is one of the reasons why kernel32 is guaranteed to be at the same base address system wide, incidentally, as CSRSS caches the address of the Ctrl-C dispatcher in kernel32 and assumes that it will be valid across all client processes[1].) This can introduce potentially unexpected threading concerns depending on what a console control handler does. For example, if a console control handler writes to the console, this access will not necessarily be synchronized with any I/O the initial thread is doing with the console, unless some form of explicit synchronization is used.

This is one of the reasons why Visual Studio 2005 eschews the single threaded CRT entirely, instead favoring the multithreaded variants. At this point in the game, there is comparatively little benefit to providing a single threaded CRT that will break in strange ways (such as if a console control handler is registered), just to save a minuscule amount of locking overhead. In a program that is primarily single threaded, the critical section package is fast enough that any overhead incurred by the CRT acquiring locks is going to be minuscule compared to the actual processing overhead of whatever operation is being performed.

An even greater subset of Win32 APIs use worker threads internally, though these typically have limited visibility to the main process image. For example, WSAAsyncGetHostByName uses a worker thread to synchronously call gethostbyname and then post the results to the requester via a window message. Even debugging an application typically results in threads being created in the target process for the purpose of debugger break-in processing.

Due to the general trend that most Win32 programs will encounter multiple threads in one fashion or another, other parts of the Win32 programming environment besides the CRT are gradually dropping the last vestiges of their support for purely single threaded operation as well. For instance, the low fragmentation heap (LFH) does not support HEAP_NO_SERIALIZE, which disables the usage of the heap’s built-in critical section. In this case, as with the CRT, the cases where it is always provably safe to go without locks entirely are so limited and provide such a small benefit that it’s just not worth the trouble to take out the call to acquire the internal critical section. (Entering an unowned critical section does not involve a kernel mode transition and is quite cheap. In a primarily single threaded application, any critical sections will by definition almost always be either unowned or owned by the current thread in a recursive call.)

[1] This is not strictly true in some cases with the advent of Wow64, but that is a topic for another day.

The default invalid parameter behavior for the VC8 CRT doesn’t break into the debugger

Thursday, November 8th, 2007

One of the problems that confuses people from time to time here at work is that if you happen to hit a condition that trips the “invalid parameter” handler for VC8, and you’ve got a debugger attached to the process that fails, then the process mysteriously exits without giving the debugger a chance to inspect the condition of the program in question.

For those unfamiliar with the concept, the “invalid parameter” handler is a new addition to the Microsoft CRT, which kills the process if various invalid states are encountered. For example, dereferencing a bogus iterator in a release build might trip the invalid parameter handler if you’re lucky (if not, you might see random memory corruption, of course).

The reason why there is no debugger interaction here is that the default CRT invalid parameter handler (present in invarg.c if you’ve got the CRT source code handy) invokes UnhandledExceptionFilter in an attempt to (presumably) give the debugger a crack at the exception. Unfortunately, in reality, UnhandledExceptionFilter will just return immediately if a debugger is attached to the process, assuming that this will cause the standard SEH dispatcher logic to pass the event to the debugger. Because the default invalid parameter handler doesn’t really go through the SEH dispatcher but is in fact simply a direct call to UnhandledExceptionFilter, this results in no notification to the debugger whatsoever.

This counter-intuitive behavior can be more than a little bit confusing when you’re trying to debug a problem, since from the debugger, all you might see in a case like a bad iterator dereference would be this:

0:000:x86> g
00000000`7759053a c3              ret

If we pull up a stack trace, then things become a bit more informative:

0:000:x86> k

However, while we can get a stack trace for the thread that tripped the invalid parameter event in cases like this with a simple single threaded program, adding multiple threads will throw a wrench into the debuggability of this scenario. For example, with the following simple test program, we might see the following when running the process under the debugger after we continue the initial process breakpoint (this example is being run as a 32-bit program under Vista x64, though the same principle should apply elsewhere):

0:000:x86> g
sub     rsp,48h
0:000> k
Call Site

What happened? Well, the last thread in the process here happened to be the newly created thread instead of the thread that called TerminateProcess. To make matters worse, the other thread (which was the one that caused the actual problem) is already gone, killed by TerminateProcess, and its stack has been blown away. This means that we can’t just figure out what’s happened by asking for a stack trace of all threads in the process:

0:000> ~*k

.  0  Id: 1888.1314 Suspend: -1 Unfrozen
Call Site

Unfortunately, this scenario is fairly common in practice, as most non-trivial programs use multiple threads for one reason or another. If nothing else, many OS-provided APIs internally create or make use of worker threads.

There is a way to make out useful information in a scenario like this, but it is unfortunately not easy to do after the fact, which means that you’ll need to have a debugger attached and at your disposal before the failure happens. The simplest way to catch the culprit red-handed here is to just breakpoint on ntdll!NtTerminateProcess. (A conditional breakpoint could be employed to check for NtCurrentProcess ((HANDLE)-1) in the first parameter if the process frequently calls TerminateProcess, but this is typically not the case and often it is sufficient to simply set a blind breakpoint on the routine.)

For example, in the case of the provided test program, we get much more useful results with the breakpoint in place:

0:000:x86> bp ntdll32!NtTerminateProcess
0:000:x86> g
Breakpoint 0 hit
mov     eax,29h
0:000:x86> k

That’s much more diagnosable than a stack trace for the completely wrong thread.

Note that from an error reporting perspective, it is possible to catch these errors by registering an invalid parameter handler (via _set_invalid_parameter_handler), which is roughly analogous to the mechanism one uses to register a custom handler for pure virtual function call failures.
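A minimal sketch of such a registration might look like the following. The handler name, the hit counter and the choice to break only when a debugger is attached are all assumptions of this example rather than anything the CRT prescribes; the non-Microsoft branches exist only so the code builds with other compilers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <wchar.h>

#ifdef _MSC_VER
#include <windows.h>
#include <intrin.h>
#endif

static unsigned g_invalid_param_hits;   /* illustrative counter, not part of the CRT */

/*
 * Handler with the signature the CRT expects. With a non-debug CRT the
 * string arguments are NULL, so they are deliberately not dereferenced.
 */
static void on_invalid_parameter(const wchar_t *expression,
                                 const wchar_t *function,
                                 const wchar_t *file,
                                 unsigned int line,
                                 uintptr_t reserved)
{
    (void)expression; (void)function; (void)file; (void)line; (void)reserved;
    g_invalid_param_hits++;
#ifdef _MSC_VER
    /* Stop in the offending thread while its stack is still intact. */
    if (IsDebuggerPresent())
        __debugbreak();
#endif
}

/* Call once early in startup. Returns 1 if a handler was installed. */
static int install_invalid_parameter_handler(void)
{
#ifdef _MSC_VER
    _set_invalid_parameter_handler(on_invalid_parameter);
    return 1;
#else
    return 0;   /* the mechanism is specific to the Microsoft CRT */
#endif
}
```

Because the handler runs on the thread that passed the bad argument, breaking (or logging) there sidesteps the problem of the debugger showing the wrong thread after TerminateProcess has already run.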

I tend to prefer debugging with release builds instead of debug builds.

Friday, November 2nd, 2007

One of the things that I find myself espousing both at work and outside of work from time to time is the value of debugging using release builds of programs (for Windows applications, anyways). This may seem contradictory to some at first glance, as one would tend to believe that the debug build is in fact better for debugging (it is named the “debug build”, after all).

However, I tend to disagree with this sentiment, on several grounds:

  1. Debugging on debug builds only is an unrealistic situation. Most of the “interesting” problems that crop up in real life tend to be with release builds on customer sites or production environments. Much of the time, we do not have the luxury of being able to ship out a debug build to a customer or production environment.

    There is no doubt that debugging using the debug build can be easier, but I am of the opinion that it is disadvantageous to be unable to effectively debug release builds. Debugging with release builds all the time ensures that you can do this when you’ve really got no choice, or when it is not feasible to try and repro a problem using a debug build.

  2. Debug builds sometimes interfere with debugging. This is a highly counterintuitive concept initially, one that many people seem to be surprised at. To see what I mean, consider the scenario where one has a random memory corruption bug.

    This sort of problem is typically difficult and time consuming to track down, so one would want to use all available tools to help in this process. One of the most useful tools in the toolkit of any competent Windows debugger should be page heap, which is a special mode of the RTL heap (which implements the Win32 heap as exposed by APIs such as HeapAlloc).

    Page heap places a guard page at the end (or before, depending on its configuration) of every allocation. This guard page is marked inaccessible, such that any attempt to write to an allocation that exceeds the bounds of the allocated memory region will immediately fault with an access violation, instead of leaving the corruption to cause random failures at a later time. In effect, page heap allows one to catch the guilty party “red handed” in many classes of heap corruption scenarios.

    Unfortunately, the debug build greatly diminishes the ability of page heap to operate. This is because when the debug version of the C runtime is used, any memory allocations that go through the CRT (such as new, malloc, and so forth) have special check and fill patterns placed before and after the allocation. These fill patterns are intended to be used to help detect memory corruption problems. When a memory block is returned using an API such as free, the CRT first checks the fill patterns to ensure that they are intact. If a discrepancy is found, the CRT will break into the debugger and notify the user that memory corruption has occurred.

    If one has been following along thus far, it should not be too difficult to see how this conflicts with page heap. The problem lies in the fact that from the heap’s perspective, the debug CRT per-allocation metadata (including the check and fill patterns) is part of the user allocation, and so the special guard page is placed after (or before, if underrun protection is enabled) the fill patterns. This means that some classes of memory corruption bugs will overwrite the debug CRT metadata, but won’t trip page heap up, meaning that the only indication of memory corruption will be when the allocation is released, instead of when the corruption actually occurred.

  3. Local variable and source line stepping are unreliable in release builds. Again, as with the first point, it is dangerous to get into a pattern of relying on these conveniences as they simply do not work correctly (or in the expected fashion) in release builds, after the optimizer has had its way with the program. If you get used to always relying on local variable and source line support, when used in conjunction with debug builds, then you’re going to be in for a rude awakening when you have to debug a release build. More than once at work I’ve been pulled in to help somebody out after they had gone down a wrong path when debugging something because the local variable display showed the wrong contents for a variable in a release build.

    The moral of the story here is to not rely on this information from the debugger, as it is only reliable for debug builds. Even then, local variable display will not work correctly unless you are stepping in source line mode, as within a source line (while stepping in assembly mode), local variables may not be initialized in the way that the debugger expects given the debug information.
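The interaction described in the second point can be modelled with a few lines of C. This is only a sketch: it assumes the debug CRT’s trailing fill is 4 bytes, ignores any alignment padding the heap itself may add, and considers overruns only (not underruns). The enum and function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed size of the debug CRT's fill pattern after the caller's bytes. */
#define CRT_TRAILER_BYTES 4

enum overrun_result {
    OVERRUN_NONE,           /* write stayed inside the caller's buffer */
    OVERRUN_FOUND_AT_FREE,  /* landed in CRT fill; noticed only on free */
    OVERRUN_FAULTS_NOW      /* reached the guard page; faults immediately */
};

/*
 * Classify a one-byte write to buf[index] for a buffer of user_size bytes
 * allocated with page heap enabled. debug_crt is nonzero when the debug CRT
 * has wrapped the request with its own trailer before calling the heap, so
 * the guard page sits after the trailer rather than after the user's bytes.
 */
static enum overrun_result classify_write(size_t user_size, size_t index,
                                          int debug_crt)
{
    size_t heap_block_end;

    assert(user_size > 0);
    heap_block_end = user_size + (debug_crt ? CRT_TRAILER_BYTES : 0);

    if (index < user_size)
        return OVERRUN_NONE;
    if (index < heap_block_end)
        return OVERRUN_FOUND_AT_FREE;
    return OVERRUN_FAULTS_NOW;
}
```

Under these assumptions, a one-byte overrun of a 64-byte buffer faults on the spot in a release build, whereas in a debug build the same write lands in the CRT’s fill pattern and goes unnoticed until the block is freed.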

Now, just to be clear, I’m not saying that anyone should abandon debug builds completely. There are a lot of valuable checks added by debug builds (assertions, the enhanced iterator validation in the VS2005 CRT, and stack variable corruption checks, just to name a few). However, it is important to be able to debug problems with release builds, and it seems to me that always relying on debug builds is detrimental to being able to do this. (Obviously, this can vary, but this is simply speaking on my personal experience.)

When I am debugging something, I typically only use assembly mode and line number information, if available (for manually matching up instructions with source code). Source code is still of course a useful time saver in many instances (if you have it), but I prefer not relying on the debugger to “get it right” with respect to such things, having been burned too many times in the past with incorrect results being returned in non-debug builds.

With a little bit of practice, you can get the same information that you would out of local variable display and the like with some basic reading of disassembly text and examination of the stack and register contents. As an added bonus, if you can do this in debug builds, you should by definition be able to do so in release builds as well, even when the debugger is unable to track locals correctly due to limitations in the debug information format.

Thread Local Storage, part 4: Accessing __declspec(thread) data

Thursday, October 25th, 2007

Yesterday, I outlined how the compiler and linker cooperate to support TLS. However, I didn’t mention just what exactly goes on under the hood when one declares a __declspec(thread) variable and accesses it.

Before the inner workings of a __declspec(thread) variable access can be explained, however, it is necessary to discuss several more special variables in tlssup.c. These special variables are referenced by _tls_used to create the TLS directory for the image.

The first variable of interest is _tls_index, which is implicitly referenced by the compiler in the per-thread storage resolution mechanism any time a thread local variable is referenced (well, almost every time; there’s an exception to this, which I’ll mention later on). _tls_index is also the only variable declared in tlssup.c that uses the default allocation storage class. Internally, it represents the current module’s TLS index. The per-module TLS index is, in principle, similar to a TLS index returned by TlsAlloc. However, the two are not compatible, and there exists significantly more work behind the per-module TLS index and its supporting code. I’ll cover all of that later as well; for now, just bear with me.

The definitions of _tls_start and _tls_end appear as so in tlssup.c:

#pragma data_seg(".tls")

#if defined (_M_IA64) || defined (_M_AMD64)
_CRTALLOC(".tls")
#endif
char _tls_start = 0;

#pragma data_seg(".tls$ZZZ")

#if defined (_M_IA64) || defined (_M_AMD64)
_CRTALLOC(".tls$ZZZ")
#endif
char _tls_end = 0;

This code creates the two variables and places them at the start and end of the “.tls” section. The compiler and linker will automatically assume a default allocation section of “.tls” for all __declspec(thread) variables, such that they will be placed between _tls_start and _tls_end in the final image. The two variables are used to tell the linker the bounds of the TLS storage template section, via the image’s TLS directory (_tls_used).

Now that we know how __declspec(thread) works from a language level, it is necessary to understand the supporting code the compiler generates for an access to a __declspec(thread) variable. This supporting code is, fortunately, fairly straightforward. Consider the following test program:

__declspec(thread) int threadedint = 0;

int __cdecl wmain(int ac,
   wchar_t **av)
{
   threadedint = 42;

   return 0;
}

For x64, the compiler generated the following code:

mov	 ecx, DWORD PTR _tls_index
mov	 rax, QWORD PTR gs:88
mov	 edx, OFFSET FLAT:threadedint
mov	 rax, QWORD PTR [rax+rcx*8]
mov	 DWORD PTR [rdx+rax], 42

Recall that the gs segment register refers to the base address of the TEB on x64. 88 (0x58) is the offset in the TEB for the ThreadLocalStoragePointer member on x64 (more on that later):

   +0x058 ThreadLocalStoragePointer : Ptr64 Void

If we examine the code after the linker has run, however, we’ll notice something strange:

mov     ecx, cs:_tls_index
mov     rax, gs:58h
mov     edx, 4
mov     rax, [rax+rcx*8]
mov     dword ptr [rdx+rax], 2Ah ; 42
xor     eax, eax

If you haven’t noticed it already, the offset of the “threadedint” variable was resolved to a small value (4). Recall that in the pre-link disassembly, the “mov edx, 4” instruction was “mov edx, OFFSET FLAT:threadedint”.

Now, 4 isn’t a very flat address (one would expect an address within the confines of the executable image to be used). What happened?

Well, it turns out that the linker has some tricks up its sleeve that were put into play here. The “offset” of a __declspec(thread) variable is assumed to be relative to the base of the “.tls” section by the linker when it is resolving address references. If one examines the “.tls” section of the image, things begin to make a bit more sense:

0000000001007000 _tls segment para public 'DATA' use64
0000000001007000      assume cs:_tls
0000000001007000     ;org 1007000h
0000000001007000 _tls_start        dd 0
0000000001007004 ; int threadedint
0000000001007004 ?threadedint@@3HA dd 0
0000000001007008 _tls_end          dd 0

The offset of “threadedint” from the start of the “.tls” section is indeed 4 bytes. But all of this still doesn’t explain how the instructions the compiler generated access a variable that is instanced per thread.

The “secret sauce” here lies in the following three instructions:

mov     ecx, cs:_tls_index
mov     rax, gs:58h
mov     rax, [rax+rcx*8]

These instructions fetch ThreadLocalStoragePointer out of the TEB and index it by _tls_index. The resulting pointer is then indexed again with the offset of threadedint from the start of the “.tls” section to form a complete pointer to this thread’s instance of the threadedint variable.

In C, the code that the compiler generated could be visualized as follows:

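// This represents the ".tls" section (type names here are illustrative)
typedef struct _MODULE_TLS_DATA
{
   int tls_start;
   int threadedint;
   int tls_end;
} MODULE_TLS_DATA, *PMODULE_TLS_DATA;

PTEB             Teb;
PMODULE_TLS_DATA TlsData;

Teb     = NtCurrentTeb();
TlsData = Teb->ThreadLocalStoragePointer[ _tls_index ];

TlsData->threadedint = 42;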

This should look familiar if you’ve used explicit TLS before. The typical paradigm for explicit TLS is to place a structure pointer in a TLS slot, and then to access your thread local state, the per thread instance of the structure is retrieved and the appropriate variable is then referenced off of the structure pointer. The difference here is that the compiler and linker (and loader, more on that later) cooperated to save you (the programmer) from having to do all of that explicitly; all you had to do was declare a __declspec(thread) variable and all of this happens magically behind the scenes.

There’s actually an additional curve that the compiler will sometimes throw with respect to how implicit TLS variables work from a code generation perspective. You may have noticed how I showed the x64 version of an access to a __declspec(thread) variable; this is because, by default, x86 builds of a .exe involve a special optimization (/GA (Optimize for Windows Application, quite possibly the worst name for a compiler flag ever)) that eliminates the step of referencing the special _tls_index variable by assuming that it is zero.

This optimization is only possible with a .exe that will run as the main process image. The assumption works in this case because the loader assigns per-module TLS index values on a sequential basis (based on the loaded module list), and the main process image should be the second thing in the loaded module list, after NTDLL (which, now that this optimization is being used, can never have any __declspec(thread) variables, or it would get TLS index zero instead of the main process image). It’s worth noting that in the (extremely rare) case that a .exe exports functions and is imported by another .exe, this optimization will cause random corruption if the imported .exe happens to use __declspec(thread).

For reference, with /GA enabled, the x86 build of the above code results in the following instructions:

mov     eax, large fs:2Ch
mov     ecx, [eax]
mov     dword ptr [ecx+4], 2Ah ; 42

Remember that on x86, fs points to the base address of the TEB, and that ThreadLocalStoragePointer is at offset +0x2C from the base of the x86 TEB.

Notice that there is no reference to _tls_index; the compiler assumes that it will take on the value zero. If one examines a .dll built with the x86 compiler, the /GA optimization is always disabled, and _tls_index is used as expected.
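To see why the zero-index assumption is only safe for the main process image, here is a small portable model of the two instruction sequences. Plain arrays stand in for the per-thread pointer array and the per-module TLS blocks, the offset of 4 comes from the listing above, and every function name is invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Offset of threadedint within the ".tls" template, as in the listing. */
#define THREADEDINT_OFFSET 4

/*
 * General sequence: index the per-thread pointer array by the module's TLS
 * index, then add the variable's offset within that module's block.
 */
static int *tls_address(void **tls_pointers, unsigned long tls_index,
                        size_t offset)
{
    return (int *)((char *)tls_pointers[tls_index] + offset);
}

/* /GA sequence: the same lookup with the index hard-wired to zero. */
static int *tls_address_ga(void **tls_pointers, size_t offset)
{
    return (int *)((char *)tls_pointers[0] + offset);
}

/*
 * Store 42 to "threadedint" on behalf of a module whose real TLS index is
 * real_index, once per sequence, and report whether each store reached that
 * module's own block.
 */
static void demo_store(unsigned long real_index, int *general_ok, int *ga_ok)
{
    int block0[3] = {0, 0, 0};   /* TLS block of the module with index 0 */
    int block1[3] = {0, 0, 0};   /* TLS block of the module with index 1 */
    void *tls_pointers[2] = { block0, block1 };
    int *mine = (real_index == 0) ? block0 : block1;

    assert(sizeof(int) == 4 && real_index < 2);

    *tls_address(tls_pointers, real_index, THREADEDINT_OFFSET) = 42;
    *general_ok = (mine[1] == 42);

    memset(block0, 0, sizeof block0);
    memset(block1, 0, sizeof block1);
    *tls_address_ga(tls_pointers, THREADEDINT_OFFSET) = 42;
    *ga_ok = (mine[1] == 42);
}
```

For a module that really does hold index zero, both sequences agree. For a module with any other index, the shortcut writes into index zero’s block instead, which is the corruption scenario mentioned above for a .exe that is loaded as an import of another .exe.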

The magic behind __declspec(thread) extends beyond just the compiler and linker, however. Something still has to set up the storage for each module’s per-thread state, and that something is the loader. More on how the loader plays a part in this complex process next time.

Thread Local Storage, part 3: Compiler and linker support for implicit TLS

Wednesday, October 24th, 2007

Last time, I discussed the mechanisms by which so-called explicit TLS operates (the TlsGetValue, TlsSetValue and other associated supporting routines).

Although explicit TLS is certainly fairly heavily used, many of the more “interesting” pieces about how TLS works in fact relate to the work that the loader does to support implicit TLS, or __declspec(thread) variables (in CL). While both TLS mechanisms are designed to provide a similar effect, namely the capability to store information on a per-thread basis, many aspects of the implementations of the two different mechanisms are very different.

When you declare a variable with the __declspec(thread) extended storage class, the compiler and linker cooperate to allocate storage for the variable in a special region in the executable image. By convention, all variables with the __declspec(thread) storage class are placed in the .tls section of a PE image, although this is not technically required (in fact, the thread local variables do not even really need to be in their own section, merely contiguous in memory, at least from the loader’s perspective). On disk, this region of memory contains the initializer data for all thread local variables in a particular image. However, this data is never actually modified and references to a particular thread local variable will never refer to an address within this section of the PE image; the data is merely a “template” to be used when allocating storage for thread local variables after a thread has been created.

The compiler and linker also make use of several special variables in the context of implicit TLS support. Specifically, a variable by the name of _tls_used (of the type IMAGE_TLS_DIRECTORY) is created by a portion of the C runtime that is static linked into every program to represent the TLS directory that will be used in the final image (references to this variable should be extern “C” in C++ code for name decoration purposes, and storage for the variable need not be allocated as the supporting CRT stub code already creates the variable). The TLS directory is a part of the PE header of an executable image which describes to the loader how the image’s thread local variables are to be managed. The linker looks for a variable by the name of _tls_used and ensures that in the on-disk image, it overlaps with the actual TLS directory in the final image.

The source code for the particular section of C runtime logic that declares _tls_used lives in the tlssup.c file (which comes with Visual Studio), making the variable pseudo-documented. The standard declaration for _tls_used is as follows:

const IMAGE_TLS_DIRECTORY _tls_used =
{
 (ULONG)(ULONG_PTR) &_tls_start, // start of tls data
 (ULONG)(ULONG_PTR) &_tls_end,   // end of tls data
 (ULONG)(ULONG_PTR) &_tls_index, // address of tls_index
 (ULONG)(ULONG_PTR) (&__xl_a+1), // pointer to callbacks
 (ULONG) 0,                      // size of tls zero fill
 (ULONG) 0                       // characteristics
};

The CRT code also provides a mechanism to allow a program to register a set of TLS callbacks, which are functions with a similar prototype to DllMain that are called when a thread starts or exits (cleanly) in the current process. (These callbacks can even be registered for a main process image, where there is no DllMain routine.) The callbacks are typed as PIMAGE_TLS_CALLBACK, and the TLS directory points to a null-terminated array of callbacks (called in sequence).
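To make the “null-terminated array of callbacks (called in sequence)” concrete, here is a rough, portable model of walking such an array. It is a sketch under assumptions, not the loader's implementation: TlsCallback mirrors the shape of PIMAGE_TLS_CALLBACK using standard types instead of the Windows headers, the reason constants reuse the numeric values of DLL_THREAD_ATTACH and DLL_THREAD_DETACH, and invoke_callbacks, first_callback, second_callback, and g_log are hypothetical names.

```cpp
#include <cstdint>
#include <vector>

// Portable stand-in for PIMAGE_TLS_CALLBACK (handle, reason, reserved).
using TlsCallback = void (*)(void* dll_handle, std::uint32_t reason, void* reserved);

constexpr std::uint32_t kThreadAttach = 2; // same value as DLL_THREAD_ATTACH
constexpr std::uint32_t kThreadDetach = 3; // same value as DLL_THREAD_DETACH

// Records which callbacks ran, in order, so the sequence can be inspected.
std::vector<int> g_log;

void first_callback(void*, std::uint32_t, void*)  { g_log.push_back(1); }
void second_callback(void*, std::uint32_t, void*) { g_log.push_back(2); }

// Walks the array from its first entry until the null terminator, calling
// each entry in sequence. Returns the number of callbacks invoked.
int invoke_callbacks(const TlsCallback* callbacks, void* image, std::uint32_t reason)
{
    int count = 0;
    for (; callbacks != nullptr && *callbacks != nullptr; ++callbacks) {
        (*callbacks)(image, reason, nullptr);
        ++count;
    }
    return count;
}
```

Given an array such as { first_callback, second_callback, nullptr }, the walk invokes the two functions in array order and stops at the terminator, which is why the position of each pointer within the array matters.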

For a typical image, there will be no TLS callbacks (in practice, almost everything uses DllMain to perform per-thread initialization tasks). However, the support is retained and is fully functional. To use the support that the CRT provides for TLS callbacks, one needs to declare a variable that is stored in the specially named “.CRT$XLx” section, where x is a letter between A and Z. For example, one might write the following code:

#pragma section(".CRT$XLY",long,read)

extern "C" __declspec(allocate(".CRT$XLY"))
  PIMAGE_TLS_CALLBACK _xl_y  = MyTlsCallback;

The strange business with the special section names is required because the in-memory ordering of the TLS callback pointers is significant. To understand what is happening with this peculiar-looking declaration, it is first necessary to understand a bit about how the compiler and linker organize data in the final PE image that is produced.

Non-header data in a PE image is placed into one or more sections, which are regions of memory with a common set of attributes (such as page protection). The __declspec(allocate("section-name")) keyword (CL-specific) tells the compiler that a particular variable is to be placed in a specific section in the final executable. The toolchain additionally has support for concatenating similarly named sections into one larger section. This support is activated by suffixing a section name with a $ character followed by any other text. The linker concatenates the resulting section with the section of the same name, truncated at the $ character (inclusive).

The linker orders individual sections alphabetically when concatenating them (due to the usage of the $ character in the section name). This means that in memory (in the final executable image), a variable in the “.CRT$XLB” section will be after a variable in the “.CRT$XLA” section but before a variable in the “.CRT$XLZ” section. The C runtime uses this ordering behavior to create a null-terminated array of function pointers to TLS callbacks (with the pointer stored in the “.CRT$XLZ” section being the null terminator). Thus, in order to ensure that the declared function pointer resides within the confines of the TLS callback array being referenced by _tls_used, it is necessary to place it in a section of the form “.CRT$XLx”.
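The ordering rule can be illustrated with a small model. This is a simplification that sorts contributions by their full section name, which is sufficient when they all share the same “.CRT” prefix; it does not reproduce any real tool's logic. Contribution and merged_order are hypothetical names, and the symbol names in the usage note are only labels.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// One variable placed into a named section, e.g. { ".CRT$XLY", "_xl_y" }.
struct Contribution {
    std::string section;
    std::string symbol;
};

// Returns the symbols in the order they would occupy in the merged section:
// sorted alphabetically by section name, which orders the text after '$'.
// stable_sort keeps the input order for entries that share a section name.
std::vector<std::string> merged_order(std::vector<Contribution> items)
{
    std::stable_sort(items.begin(), items.end(),
        [](const Contribution& a, const Contribution& b) {
            return a.section < b.section;
        });

    std::vector<std::string> order;
    for (const Contribution& item : items)
        order.push_back(item.symbol);
    return order;
}
```

Feeding in entries for “.CRT$XLZ”, “.CRT$XLY”, and “.CRT$XLA” in any order yields the XLA entry first, then XLY, then XLZ, so a callback pointer declared in “.CRT$XLY” ends up between the start of the array and its null terminator.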

The creation of the TLS directory is, however, only one portion of how the compiler and linker work together to support __declspec(thread) variables. Next time, I’ll discuss just how the compiler and linker manage accesses to such variables.

Update: Phil mentions that this support for TLS callbacks does not work before the Visual Studio 2005 release. Be warned if you are still using an old compiler package.