Thread Local Storage, part 1: Overview

Windows, like practically any other mainstream multithreading operating system, provides a mechanism to allow programmers to efficiently store state on a per-thread basis. This capability is typically known as Thread Local Storage, and it’s quite handy in a number of circumstances where global variables might need to be instanced on a per-thread basis.

Although the usage of TLS on Windows is fairly well documented, its implementation details are not (though there is a smattering of third-party documentation floating around out there).

Conceptually, TLS is in principle not all that complicated (famous last words), at least from a high level. The general design is that all TLS accesses go through either a pointer or an array that is present on the TEB, which is a system-defined data structure that is already instanced per thread.

The “per-thread” resolution of the TEB is fairly well documented, but for the benefit of those who are unaware, the general idea is that one of the segment registers (fs on x86, gs on x64) is repurposed by the OS to point to the base address of the TEB for the current thread. This allows, say, an access to fs:[0x0] (or gs:[0x0] on x64) to always access the TEB allocated for the current thread, regardless of other threads in the address space. The TEB really does exist in the flat address space of the process (and indeed there is a field in the TEB that contains its flat virtual address), but the segmentation mechanism is used to provide a convenient way to access the TEB quickly without having to search through a list of thread IDs and TEB pointers (or other relatively slow mechanisms).
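To make the segment trick concrete, here is a small portable C model of it. Nothing in it touches the real TEB: model_teb, model_cpu, and the three helper functions are names made up for this sketch, and thread_id is an illustrative payload rather than the real layout. The one detail borrowed from the real structure is the position of the self pointer, which follows six pointer-sized fields (offset 0x18 on x86, 0x30 on x64). On Windows itself, NtCurrentTeb() from the SDK headers is the supported way to obtain the same pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the start of a TEB. Only the position of the
   self pointer mirrors the real layout: six pointer-sized fields come
   before it, which gives offset 0x18 on x86 and 0x30 on x64. */
struct model_teb {
    void *tib_fields[6];         /* exception list, stack bounds, and so on */
    struct model_teb *self;      /* flat virtual address of this TEB */
    unsigned long thread_id;     /* illustrative payload, not the real layout */
};

/* Hypothetical stand-in for the fs/gs segment base of the running thread. */
struct model_cpu {
    const unsigned char *segment_base;
};

static void model_teb_init(struct model_teb *teb, unsigned long thread_id)
{
    memset(teb, 0, sizeof *teb);
    teb->self = teb;
    teb->thread_id = thread_id;
}

/* Models what happens on a context switch: the segment is repointed at the
   TEB of the thread that is about to run. */
static void model_switch_to(struct model_cpu *cpu, const struct model_teb *teb)
{
    cpu->segment_base = (const unsigned char *)teb;
}

/* Models "mov reg, fs:[0x18]" (or gs:[0x30] on x64): one load at a fixed
   offset from the segment base, with no lookup by thread ID. */
static struct model_teb *model_current_teb(const struct model_cpu *cpu)
{
    struct model_teb *flat;

    assert(cpu->segment_base != NULL);
    memcpy(&flat, cpu->segment_base + offsetof(struct model_teb, self),
           sizeof flat);
    return flat;
}
```

Repointing the model CPU from one TEB to another changes what model_current_teb returns without any search, which is the property the real mechanism relies on.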

On non-x86 and non-x64 architectures, the underlying mechanism by which the TEB is accessed varies, but the general theme is that there is a register of some sort which is always set to the base address of the current thread’s TEB for easy access.

The TEB itself is probably one of the best-documented undocumented Windows structures, primarily because there is type information included for the debugger’s benefit in all recent ntdll and ntoskrnl.exe builds. With this information and a little disassembly work, it is not that hard to understand the implementation behind TLS.

Before we can look at the implementation of how TLS works on Windows, however, it is necessary to know the documented mechanisms to use it. There are two ways to accomplish this task on Windows. The first mechanism is a set of kernel32 APIs (comprising TlsGetValue, TlsSetValue, TlsAlloc, and TlsFree) that allow explicit access to TLS. The usage of the functions is fairly straightforward; TlsAlloc reserves space on all threads for a pointer-sized variable, and TlsGetValue can be used to read this per-thread storage on any thread (TlsSetValue and TlsFree are conceptually similar).
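The shape of the explicit API can be sketched with a portable C model. This is not how kernel32 implements it; it only mirrors the contract described above: one process-wide index, and a separate pointer-sized slot for that index on every thread. All model_ names are invented for this sketch, threads are represented by plain structs passed in by the caller instead of real OS threads, and the bounds and "is this index allocated" checks are a choice of the sketch. The slot count of 64 matches TLS_MINIMUM_AVAILABLE, and the failure value mirrors TLS_OUT_OF_INDEXES.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_TLS_SLOTS 64u                   /* same as TLS_MINIMUM_AVAILABLE */
#define MODEL_TLS_OUT_OF_INDEXES 0xFFFFFFFFu  /* mirrors TLS_OUT_OF_INDEXES */
#define MODEL_MAX_THREADS 8u                  /* arbitrary limit for the sketch */

/* Stand-in for the slot array that lives in each thread's TEB. */
struct model_thread {
    void *tls_slots[MODEL_TLS_SLOTS];
};

/* Process-wide state: which indexes are reserved, and which threads exist. */
struct model_process {
    unsigned char slot_in_use[MODEL_TLS_SLOTS];
    struct model_thread *threads[MODEL_MAX_THREADS];
    unsigned thread_count;
};

static void model_process_init(struct model_process *p)
{
    memset(p, 0, sizeof *p);
}

/* Registers a thread with every slot zeroed. Returns 0 if the table is full. */
static int model_thread_attach(struct model_process *p, struct model_thread *t)
{
    assert(p != NULL && t != NULL);
    if (p->thread_count == MODEL_MAX_THREADS)
        return 0;
    memset(t, 0, sizeof *t);
    p->threads[p->thread_count++] = t;
    return 1;
}

/* Like TlsAlloc: reserve one index that is then usable on every thread. */
static unsigned model_tls_alloc(struct model_process *p)
{
    unsigned i;

    for (i = 0; i < MODEL_TLS_SLOTS; i++) {
        if (!p->slot_in_use[i]) {
            p->slot_in_use[i] = 1;
            return i;
        }
    }
    return MODEL_TLS_OUT_OF_INDEXES;
}

/* Like TlsSetValue: write the calling thread's copy of the slot. */
static int model_tls_set(const struct model_process *p,
                         struct model_thread *current,
                         unsigned index, void *value)
{
    if (index >= MODEL_TLS_SLOTS || !p->slot_in_use[index])
        return 0;
    current->tls_slots[index] = value;
    return 1;
}

/* Like TlsGetValue: read the calling thread's copy of the slot. */
static void *model_tls_get(const struct model_process *p,
                           const struct model_thread *current, unsigned index)
{
    if (index >= MODEL_TLS_SLOTS || !p->slot_in_use[index])
        return NULL;
    return current->tls_slots[index];
}

/* Like TlsFree: release the index and clear the slot on every thread, so
   that the next owner of the index starts from zero everywhere. */
static int model_tls_free(struct model_process *p, unsigned index)
{
    unsigned t;

    if (index >= MODEL_TLS_SLOTS || !p->slot_in_use[index])
        return 0;
    for (t = 0; t < p->thread_count; t++)
        p->threads[t]->tls_slots[index] = NULL;
    p->slot_in_use[index] = 0;
    return 1;
}
```

With two attached threads, storing different pointers under the same index and reading them back yields a different value per thread, which is all the explicit API promises. The index itself is ordinary shared data, typically kept in a global.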

The second mechanism by which TLS can be accessed on Windows is through some special support from the loader (residing in ntdll) and the compiler and linker, which allows “seamless”, implicit usage of thread local variables, just as one would use any global variable, provided that the variables are tagged with __declspec(thread) (when using the Microsoft build utilities). This is more convenient than using the TLS APIs, as one doesn’t need to call a function every time one wants to use a per-thread variable. It also relieves the programmer of having to remember to call TlsAlloc and TlsFree at initialization and deinitialization time, and it makes efficient use of per-thread storage space (implicit TLS operates by allocating, for each thread, a single large chunk of memory whose size is the sum of all per-thread variables, so that only one index into the implicit TLS array is used for all variables in a module).
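The "one chunk per thread, one index per module" layout can be modeled in portable C as follows. This is a sketch, not the loader's code: module_tls_vars, model_thread, and the model_ functions are hypothetical names, threads are plain structs, and the idea that each thread's chunk starts as a copy of a template of initial values is an assumption of the model that goes beyond what this post has covered. With the Microsoft tools, each field of module_tls_vars would simply be a separate global tagged __declspec(thread).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* All of one module's per-thread variables, laid out back to back. */
struct module_tls_vars {
    int request_count;
    char last_error[32];
};

/* Initial values that every thread's private chunk starts from
   (an assumption of this model). */
static const struct module_tls_vars module_tls_template = { 0, "none" };

/* One index per module, no matter how many variables the module declares. */
enum { MODULE_TLS_INDEX = 0, MODEL_MODULE_COUNT = 1 };

/* Stand-in for the per-thread implicit TLS array: one chunk per module. */
struct model_thread {
    void *implicit_tls[MODEL_MODULE_COUNT];
};

/* Done once per thread: allocate the module's chunk and fill it from the
   template. Returns 0 on allocation failure. */
static int model_thread_init_tls(struct model_thread *t)
{
    struct module_tls_vars *chunk = malloc(sizeof *chunk);

    if (chunk == NULL)
        return 0;
    memcpy(chunk, &module_tls_template, sizeof *chunk);
    t->implicit_tls[MODULE_TLS_INDEX] = chunk;
    return 1;
}

/* Done once per thread at exit: release the module's chunk. */
static void model_thread_free_tls(struct model_thread *t)
{
    free(t->implicit_tls[MODULE_TLS_INDEX]);
    t->implicit_tls[MODULE_TLS_INDEX] = NULL;
}

/* What an access amounts to in this model: current thread, then the
   module's chunk, then a fixed offset for the variable (a struct field). */
static struct module_tls_vars *model_tls_vars(struct model_thread *current)
{
    assert(current->implicit_tls[MODULE_TLS_INDEX] != NULL);
    return current->implicit_tls[MODULE_TLS_INDEX];
}
```

Writing model_tls_vars(&a)->request_count on one model thread leaves the same field on another thread untouched, and adding a third variable to the struct grows each thread's chunk without consuming another index.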

With the advantages of implicit TLS, why would anyone use the explicit TLS API? Well, it turns out that prior to Windows Vista, there are some rather annoying limitations baked into the loader’s implicit TLS support. Specifically, implicit TLS does not work when the module using it is not loaded at process initialization time (during static import resolution). In practice, this means that it is typically not usable except by the main process image (.exe) of a process, and any DLL(s) that are guaranteed to be loaded at initialization time (such as DLL(s) that the main process image statically links to).

Next time: Taking a closer look at explicit TLS and how it operates under the hood.

Tags: Internals, TLS

This entry was posted on Monday, October 22nd, 2007 at 7:00 am and is filed under NT Internals, Programming, Windows.

12 Responses to “Thread Local Storage, part 1: Overview”

mxatone says:

October 22, 2007 at 9:25 am

Hi,

Well, TLS is pretty interesting, and I didn’t know we could create a variable automatically using the TLS API. I saw that you can use TLS for anti-debug purposes during binary loading:

(1) TLS-callback

This anti-debug was not so well-known a few years ago. It consists to instruct the PE loader that the first entry point of the program is referenced in a Thread Local Storage entry (10th directory entry number in the PE optional header). By doing so, the program entry-point won’t be executed first. The TLS entry can then perform anti-debug checks in a stealthy way.
Note that in practice, this technique is not widely used.
Though older debuggers (including OllyDbg) are not TLS-aware, counter-measures are quite easy to take, by the means of plugins of custom patcher tools.

from: http://www.securityfocus.com/infocus/1893 (Windows Anti-Debug Reference)

It is or will be useful for Google’s TCMalloc port, which is not finished at all but seems pretty cool:

http://goog-perftools.sourceforge.net/doc/tcmalloc.html

I’ll wait for your next post about it; I have never looked at the internal implementation of TLS. Thank you.

Skywing says:

October 22, 2007 at 10:24 am

The TLS callback doesn’t really change the entrypoint of the program; it just sets up an additional callback (an analogue of DllMain) that is to be called on thread create or thread delete events. It just so happens that, due to the order in which the loader initializes the process, it is called before the main process image entrypoint instead of after.

This is, however, not all that useful as an anti-debug mechanism; at the simplest, one could just attach after that point and not at the start of the program. That and the fact that “program has TLS callbacks” pretty much sticks out as a red flag, as it is very rare that most programs would ever need to use them. (I think I’ve had to use them myself a total of once, in terms of “legitimate” uses. They were handy for hacking around a totally broken appcompat layer for Quake 3 on Vista, but that’s a story for another day.)

Marc Sherman says:

October 22, 2007 at 11:01 am

Skywing,

This is the clearest description of the fs register I have read. That register was always semi-mysterious to me, but this article has made it very clear, thanks.

Looking forward to the rest of this series.

Marc

Chris Clark says:

October 22, 2007 at 12:22 pm

I am not so familiar with TLS but I am wondering about how it works with thread pool threads? With thread pool threads, you are basically running work items on threads that are reused. In this case, I imagine that you cannot use thread local storage to share info “across jobs” because you don’t actually know which thread the next job is going to run on. Is this correct? Not sure it is an interesting thing to talk about and I am certainly not completely informed about it but it popped into mind when reading this. As always, great entry.

Russell Osterlund says:

October 22, 2007 at 1:04 pm

If you take the time to step through the Windows loader, you will see that the loader code calls an internal routine, LdrpInitializeTls, before calling DbgBreakPoint (assuming the IsDebuggerPresent flag has been set). LdrpInitializeTls then iterates through each DLL loaded at that point looking for the presence of a TLS directory entry in the PE file. If one is found, only then will the TLS “magic” happen. This is, I think, the reason for the “annoying limitations” you cited.

Skywing says:

October 22, 2007 at 2:14 pm

Chris: Yeah, you shouldn’t use TLS in work item routines. More specifically, you can’t assume that two calls to a work item routine will originate from the same thread pool thread, so you can’t store persisted information across work item requests in TLS. (You could conceivably use TLS only while you’re in the context of the current work routine, but this is a fairly narrow use case. It is safer to just avoid TLS entirely in work routines.)

Russell: Yes, although things change a bit in Vista (I’ve got a planned post to talk about how things have evolved with how implicit TLS used to work pre-Vista and how it works post-Vista, so I’ll hold off on saying more until those posts go live).

ac says:

October 22, 2007 at 11:10 pm

I hope you will post the hacks you did for Quake 3 as well sometime in the future.

Skywing says:

October 22, 2007 at 11:23 pm

Maybe. It was nothing too special, mostly something to un-break ALT+TAB and fix the habit of the game to do bad things to the display’s gamma correction. It would probably be easier to just recompile the program, as id has released the source code for Quake 3.

Nynaeve » Blog Archive » Thread Local Storage, part 2: Explicit TLS says:

October 23, 2007 at 7:00 am

[…] Nynaeve Adventures in Windows debugging and reverse engineering. « Thread Local Storage, part 1: Overview […]

Nynaeve » Blog Archive » Thread Local Storage, part 8: Wrap-up says:

October 31, 2007 at 11:36 am

[…] Thread Local Storage, part 1: Overview […]

Bobo says:

December 5, 2009 at 7:05 pm

It’s not the TEB, it’s the KTHREAD data structure!

Hugh says:

December 18, 2009 at 7:56 am

TLS is one way of providing per-thread non-volatile storage. Other OSs I have worked on (for example Stratus VOS) allowed one to declare per-thread static variables (a bit like _declspec(thread)).

The problem on Windows is performance: _declspec(thread) seems costly, because each reference to such a variable incurs the overhead of fetching the TLS info.

I recently created some hand crafted code that we use to replace TLS in our own product’s core library. This API provides the same thing (a per-thread pointer to some scratchpad data) but uses over 50% less CPU time (optimized x64 C code).

Although I’m pleased with the algorithm (and it seems very solid) I do think that Microsoft could have done a better job with this, it really should be a lot less costly.

Thanks for an enlightening post though!

Hugh