Posts Tagged ‘Internals’

A catalog of NTDLL kernel mode to user mode callbacks, part 6: LdrInitializeThunk

Wednesday, November 28th, 2007

Previously, I described the mechanism by which the kernel mode to user mode callback dispatcher (KiUserCallbackDispatcher) operates, and how it is utilized by win32k.sys for various window manager related operations.

The next special NTDLL kernel mode to user mode “callback” up on the list is LdrInitializeThunk, which is not so much a callback as the entry point at which all user mode threads begin their execution system-wide. Although the Win32 CreateThread API (and even the NtCreateThread system service that is used to implement the Win32 CreateThread) provide the illusion that a thread begins its execution at a specified start routine (or instruction pointer, in the case of NtCreateThread), this is not truly the case.

CreateThread internally layers in a special kernel32 stub routine (BaseThreadStart or BaseProcessStart) in between the specified thread routine and the actual initial instruction of the thread. The kernel32 stub routine wrappers the call to the user-supplied thread start routine to provide services such a “top-level” SEH frame for the support of UnhandledExceptionFilter.

However, there exists yet another layer of indirection before a thread begins its execution at the specified thread start routine, even beyond the kernel32 stub routine at the start of all Win32 threads (the way the kernel32 stub routine works changes slightly with Windows Vista, though that is outside the scope of this discussion). The presence of this extra layer of indirection can be inferred by examining the documentation on MSDN for DllMain, which states that all threads call out to the DllMain routine at some point during thread start-up. The kernel32 stub routine is not involved in this process, and obviously the user-supplied thread entry point does not have to explicitly attempt to call DllMain for every loaded DLL with DLL_THREAD_ATTACH. This leaves us with the question of who actually arranges for these DllMain calls to happen when a thread begins.

The answer to this question is, of course, the feature routine of this article, LdrInitializeThunk. When a user mode thread is readied to begin initial execution after being created, the initial context that is realized is not the context value supplied to NtCreateThread (which would eventually end up in the user-supplied thread entry point). Instead, execution really begins at LdrInitializeThunk within NTDLL, which is supplied a CONTEXT record that describes the initially requested state of the thread (as supplied to NtCreateThread, or as created by NtCreateThreadEx on Windows Vista). This context record is provided as an argument to LdrInitializeThunk, in order to allow for control to (eventually) be transferred to the user-supplied thread entry point.

When invoked by a new thread, LdrInitializeThunk invokes LdrpInitialize to perform the remainder of the initialization tasks, and then calls upon the NtContinue system service to restore the supplied context record. I have made available a C-like representation of this process for illustration purposes.

LdrpInitialize makes a determination as to whether the process has already been initialized (for the purposes of NTDLL). This step is necessary as LdrInitializeThunk (and by extension, LdrpInitialize) is not only the actual entry point for a new thread in an already initialized process, but it is also the entry point for the initial thread in a completely new process (making it the first piece of code that is run in user mode in a new process). If the process has not already been initialized, then LdrpInitialize performs process initialization tasks by invoking LdrpInitializeProcess (some process initialization is performed inline by LdrpInitialize in the process initialization case as well). If the process is a Wow64 process, then the Wow64 NTDLL is loaded and invoked to initialize the 32-bit NTDLL’s per-process state.

Otherwise, if the process is already initialized, then LdrpInitialize invokes LdrpInitializeThread to perform per-thread initialization, which primarily involves invoking DllMain and TLS callbacks for loaded modules while the loader lock is held. (This is the reason why it is not supported to wait on a new thread to run its thread initialization routine from DllMain, because the new thread will immediately become blocked upon the loader lock, waiting for the thread already in DllMain to complete its processing.) If the process is a Wow64 process, then there is again support for making a call to the 32-bit NTDLL for purposes of running 32-bit per-thread initialization code.

When the required process or thread initialization tasks have been completed, LdrpInitialize returns to LdrInitializeThunk, which then realizes the user-supplied thread start context with a call to the NtContinue system service.

One consequence of this architecture for process and thread initialization is that it becomes (slightly) more difficult to step through process initialization, because the thread that initializes the process is not the first thread created in the process, but rather the first thread that executes LdrInitializeThunk. This means that one cannot simply create a process as suspended and attach a debugger to the process in order to step through process initialization, as the debugger break-in thread will run the very process initialization code that one wishes to step through before executing the break-in thread breakpoint instruction!

There is, fortunately, support for debugging this scenario built in to the Windows debugger package. By setting the debugger to break on the ‘create process’ event, it is possible to manually set a breakpoint on a new process (created by the debugger) or a child process (if child process debugging is enabled) before LdrInitializeThunk is run. This support is activated by breaking on the cpr (Create Process) event. For example, using the following ntsd command line, it is possible to examine the target and set breakpoints before any user mode code runs:

ntsd.exe -xe cpr C:\windows\system32\cmd.exe

Note that as the loaded module list will not have been initialized at this point, symbols will not be functional, so it is necessary to resolve the offset of LdrInitializeThunk manually in order to set a breakpoint. The process can be continued with the typical “g” command.

Update: Pavel Labedinsky points out that you can use “-xe ld:ntdll.dll” to achieve the same effect, but with the benefit that symbols for NTDLL are available. This is a better option as it alleviates the necessity to manually resolve the addresses of locations that you wish to set a breakpoint on..

Next up: Examining RtlUserThreadStart (present on Windows Vista and beyond).

A catalog of NTDLL kernel mode to user mode callbacks, part 5: KiUserCallbackDispatcher

Wednesday, November 21st, 2007

Last time, I briefly outlined the operation of KiRaiseUserExceptionDispatcher, and how it is used by the NtClose system service to report certain classes of handle misuse under the debugger.

All of the NTDLL kernel mode to user mode “callbacks” that I have covered thus far have been, for the most part fairly “passive” in nature. By this, I mean that the kernel does not explicitly call any of these callbacks, at least in the usual nature of making a function call. Instead, all of the routines that we have discussed thus far are only invoked instead of the normal return procedure for a system call or interrupt, under certain conditions. (Conceptually, this is similar in some respects to returning to a different location using longjmp.)

In contrast to the other routines that we have discussed thus far, KiUserCallbackDispatcher breaks completely out of the passive callback model. The user mode callback dispatcher is, as the name implies, a trampoline that is used to make full-fledged calls to user mode, from kernel mode. (It is complemented by the NtCallbackReturn system service, which resumes execution in kernel mode following a user mode callback’s completion. Note that this means that a user mode callback can make auxiliary calls into the kernel without “returning” back to the original kernel mode caller.)

Calling user mode to kernel mode is a very non-traditional approach in the Windows world, and for good reason. Such calls are typically dangerous and need to be implemented very carefully in order to avoid creating any number of system reliability or integrity issues. Beyond simply validating any data returned to kernel mode from user mode, there are a far greater number of concerns with a direct kernel mode to user mode call model as supported by KiUserCallbackDispatcher. For example, a thread running in user mode can be freely suspended, delayed for a very long period of time due to a high priority user thread, or even terminated. These actions mean that any code spanning a call out to user mode must not hold locks, have acquired memory or other resources that might need to be released, or soforth.

From a kernel mode perspective, the way a user mode callback using KiUserCallbackDispatcher works is that the kernel saves the current processor state on the current kernel stack, alters the view of the top of the current kernel stack to point after the saved register state, sets a field in the current thread (CallbackStack) to point to the stack frame containing the saved register state (the previous CallbackStack value is saved to allow for recursive callbacks), and then executes a return to user mode using the standard return mechanism.

The user mode return address is, of course, set to the feature NTDLL routine of this article, KiUserCallbackDispatcher. The way the user mode callback dispatcher operates is fairly simple. First, it indexes into an array stored in the PEB with an argument to the callback dispatcher that is used to select the function to be invoked. Then, the callback routine located in the array is invoked, and provided with a single pointer-sized argument from kernel mode (this argument is typically a structure pointer containing several parameters packaged up into one contiguous block of memory). The actual implementation of KiUserCallbackDispatcher is fairly simple, and I have posted a C representation of it.

In Win32, kernel mode to user mode callbacks are used exclusively by User32 for windowing related aspects, such as calling a window procedure to send a WM_NCCREATE message during the creation of a new window on behalf of a user mode caller that has invoked NtUserCreateWindowEx. For example, during window creation processing, if we set a breakpoint on KiUserCallbackDispatcher, we might see the following:

Breakpoint 1 hit
00000000`77691ff7 488b4c2420  mov rcx,qword ptr [rsp+20h]
0:000> k
RetAddr           Call Site
00000000`775851ca ntdll!KiUserCallbackDispatch
00000000`7758514a USER32!ZwUserCreateWindowEx+0xa
00000000`775853f4 USER32!VerNtUserCreateWindowEx+0x27c
00000000`77585550 USER32!CreateWindowEx+0x3fe
000007fe`fddfa5b5 USER32!CreateWindowExW+0x70
000007fe`fde221d3 ole32!InitMainThreadWnd+0x65
000007fe`fde2150c ole32!wCoInitializeEx+0xfa
00000000`ff7e6db0 ole32!CoInitializeEx+0x18c
00000000`ff7ecf8b notepad!WinMain+0x5c
00000000`7746cdcd notepad!IsTextUTF8+0x24f
00000000`7768c6e1 kernel32!BaseThreadInitThunk+0xd
00000000`00000000 ntdll!RtlUserThreadStart+0x1d

If we step through this call a bit more, we’ll see that it eventually ends up in a function by the name of user32!_fnINLPCREATESTRUCT, which eventually calls user32!DispatchClientMessage with the WM_NCCREATE window message, allowing the window procedure of the new window to participate in the window creation process, despite the fact that win32k.sys handles the creation of a window in kernel mode.

Callbacks are, as previously mentioned, permitted to be nested (or even recursively made) as well. For example, after watching calls to KiUserCallbackDispatcher for a time, we’ll probably see something akin to the following:

Breakpoint 1 hit
00000000`77691ff7 488b4c2420  mov rcx,qword ptr [rsp+20h]
0:000> k
RetAddr           Call Site
00000000`7758b45a ntdll!KiUserCallbackDispatch
00000000`7758b4a4 USER32!NtUserMessageCall+0xa
00000000`7758e55a USER32!RealDefWindowProcWorker+0xb1
000007fe`fca62118 USER32!RealDefWindowProcW+0x5a
000007fe`fca61fa1 uxtheme!_ThemeDefWindowProc+0x298
00000000`7758b992 uxtheme!ThemeDefWindowProcW+0x11
00000000`ff7e69ef USER32!DefWindowProcW+0xe6
00000000`7758e25a notepad!NPWndProc+0x217
00000000`7758cbaf USER32!UserCallWinProcCheckWow+0x1ad
00000000`77584e1c USER32!DispatchClientMessage+0xc3
00000000`77692016 USER32!_fnINOUTNCCALCSIZE+0x3c
00000000`775851ca ntdll!KiUserCallbackDispatcherContinue
00000000`7758514a USER32!ZwUserCreateWindowEx+0xa
00000000`775853f4 USER32!VerNtUserCreateWindowEx+0x27c
00000000`77585550 USER32!CreateWindowEx+0x3fe
00000000`ff7e9525 USER32!CreateWindowExW+0x70
00000000`ff7e6e12 notepad!NPInit+0x1f9
00000000`ff7ecf8b notepad!WinMain+0xbe
00000000`7746cdcd notepad!IsTextUTF8+0x24f
00000000`7768c6e1 kernel32!BaseThreadInitThunk+0xd

This support for recursive callbacks is a large factor in why threads that talk to win32k.sys often have so-called “large kernel stacks”. The kernel mode dispatcher for user mode calls will attempt to convert the thread to a large kernel stack when a call is made, as the typical sized kernel stack is not large enough to support the number of recursive kernel mode to user mode calls present in a many complicated window messaging calls.

If the process is a Wow64 process, then the callback array in the PEB is prepointed to an array of conversion functions inside the Wow64 layer, which map the callback argument to a version compatible with the 32-bit user32.dll, as appropriate.

Next up: Taking a look at LdrInitializeThunk, where all user mode threads really begin their execution.

A catalog of NTDLL kernel mode to user mode callbacks, part 4: KiRaiseUserExceptionDispatcher

Tuesday, November 20th, 2007

The previous post in this series outlined how KiUserApcDispatcher operates for the purposes of enabling user mode APCs. Unlike KiUserExceptionDispatcher, which is expected to modify the return information from the context of an interrupt (or exception) in kernel mode, KiUserApcDispatcher is intended to operate on the return context of an active system call.

In this vein, the third kernel mode to user mode NTDLL “callback” that we shall be investigating, KiRaiseUserExceptionDispatcher, is fairly similar. Although in some respects akin to KiUserExceptionDispatcher, at least relating to raising an exception in user mode on the behalf of a kernel mode caller, the KiRaiseUserExceptionDispatcher “callback” really has more in common with KiUserApcDispatcher from a usage and implementation standpoint. It also so happens that the implementation of this callback routine, which is quite simple, is completely representable in C, as a standard calling convention is used.

KiRaiseUserExceptionDispatcher is used if a system call wishes to raise an exception in user mode instead of simply return an NTSTATUS, as is the standard convention. It simply constructs a standard exception record using a supplied status code (which must be written to a well-known location in the current thread’s TEB beforehand), and passes the exception to RtlRaiseException (the same routine that is used internally by the Win32 RaiseException API).

This is a fairly atypical scenario, as most errors (including bad pointer parameters) will simply result in an appropriate status code being returned by the system call, such as STATUS_ACCESS_VIOLATION.

The one place the does currently use the services of KiRaiseUserExceptionDispatcher is NtClose, the system service responsible for closing handles (which implements the Win32 CloseHandle API). When a debugger is attached to a process and a protected handle (as set by a call to SetHandleInformation with HANDLE_FLAG_PROTECT_FROM_CLOSE) is passed to NtClose, then a STATUS_HANDLE_NOT_CLOSABLE exception is raised to user mode via KiRaiseUserExceptionDispatcher. For example, one might see the following when trying to close a protected handle with a debugger attached:

(2600.1994): Unknown exception - code c0000235 (first chance)
(2600.1994): Unknown exception - code c0000235 (second chance)
00000000`776920e7 8b8424c0000000  mov eax,dword ptr [rsp+0C0h]
0:000> !error c0000235
Error code: (NTSTATUS) 0xc0000235 (3221226037) - NtClose was
called on a handle that was protected from close via
0:000> k
RetAddr           Call Site
00000000`7746dadd ntdll!KiRaiseUserExceptionDispatcher+0x3a
00000000`01001955 kernel32!CloseHandle+0x29
00000000`01001e60 TestApp!wmain+0x35
00000000`7746cdcd TestApp!__tmainCRTStartup+0x120
00000000`7768c6e1 kernel32!BaseThreadInitThunk+0xd
00000000`00000000 ntdll!RtlUserThreadStart+0x1d

The exception can be manually continued in a debugger to allow the program to operate after this occurs, though such a bad handle reference is typically indicative of a serious bug.

A similar behavior is activated if a program is being debugged under Application Verifier and an invalid handle is closed. This behavior of raising an exception on a bad handle closure attempt is intended as a debugging aid, since most programs do not bother to check the return value of CloseHandle.

Incidentally, bad kernel handle closure references like this in kernel mode will result in a bugcheck instead of simply raising an exception that can be caught. Drivers do not have the “luxury” of continuing when they touch a bad handle, except when probing a user handle, of course. (Old versions of Process Explorer used to bring down the box with a bugcheck due to this, for instance, if you tried to close a protected handle. This bug has fortunately since been fixed.)

Next time, we’ll take a look at KiUserCallbackDispatcher, which is used to a great degree by win32k.sys.

A catalog of NTDLL kernel mode to user mode callbacks, part 3: KiUserApcDispatcher

Monday, November 19th, 2007

I previously described the behavior of the kernel mode to user mode exception dispatcher (KiUserExceptionDispatcher). While exceptions are arguably the most commonly seen of the kernel mode to user mode “callbacks” in NTDLL, they are not the only such event.

Much like exceptions that traverse into user mode, user mode APCs are all (with the exception of initial thread startup) funneled through a single dispatcher routine inside NTDLL, KiUserApcDispatcher. KiUserApcDispatcher is responsible for invoking the actual APC routine, re-testing for APC delivery (to allow for the entire APC queue to be drained at once), and returning to the return point of the alertable system call that was interrupted to deliver the user mode APC. From the perspective of the user mode APC dispatcher, the stack is logically arranged as if the APC dispatcher would “return” to the instruction immediately following the syscall instruction in the system call that began the alertable wait.

For example, if we breakpoint on KiUserApcDispatcher and examine the stack (in this case, after an alertable WaitForSingleObjectEx call that has been interrupted to deliver a user mode APC), we might see the following:

Breakpoint 0 hit
00000000`77691f40 488b0c24  mov     rcx,qword ptr [rsp]
0:000> k
RetAddr           Call Site
00000000`776902ba ntdll!KiUserApcDispatcher
00000000`7746d820 ntdll!NtWaitForSingleObject+0xa
00000000`01001994 kernel32!WaitForSingleObjectEx+0x9c
00000000`01001eb0 TestApp!wmain+0x64
00000000`7746cdcd TestApp!__tmainCRTStartup+0x120
00000000`7768c6e1 kernel32!BaseThreadInitThunk+0xd
00000000`00000000 ntdll!RtlUserThreadStart+0x1d

From an implementation standpoint, KiUserApcDispatcher is conceptually fairly simple. I’ve posted a translation of the assembler for those interested. Keep in mind, however, that as with the other kernel mode to user mode callbacks, the routine is actually written in assembler and utilizes constructs not expressible through C, such as a custom calling convention.

Note that the APC routine invoked by KiUserApcDispatcher corresponds roughly to a standard native APC routine, specifically a PKNORMAL_ROUTINE (except on x64, but more on that later). This is not compatible with the Win32 view of an APC routine, which takes a single parameter, as opposed to be three that a KNORMAL_ROUTINE takes. As a result, there is a kernel32 function, BaseDispatchAPC that wrappers all Win32 APCs, providing an SEH frame around the call (and activating the appropriate activation context, if necessary). BaseDispatchAPC also converts from the native APC calling convention into the Win32 APC calling convention.

For Win32 I/O completion routines, a similar wrapper routine (BasepIoCompletion) serves the purpose of converting from the standard NT APC calling convention to the Win32 I/O completion callback calling convention (which primarily includes unpackaging any I/O completion parameters from the IO_STATUS_BLOCK).

With Windows x64, the behavior of KiUserApcDispatcher changes slightly. Specifically, the APC routine invoked has four parameters instead of the standard set of three parameters for a PKNORMAL_ROUTINE. This is still compatible with standard NT APC routines due to a quirk of the x64 calling convention, whereby the first four arguments are passed by register. (This means that internally, any routines with zero through four arguments that fit within the confines of a standard pointer-sized argument slot are “compatible”, at least from a calling convention perspective.)

The fourth parameter added by KiUserApcDispatcher in the x64 case is a pointer to the context record that is to be resumed when the APC dispatching process is finished. This additional argument is used by Wow64.dll if the process is a Wow64 process, as Wow64.dll wrappers all 32-bit APC routines around a thunk routine (Wow64ApcRoutine). Wow64ApcRoutine internally uses this extra PCONTEXT argument to take control of resuming execution after the real APC routine is invoked. Thus in the 64-bit NTDLL Wow64 case, the NtContinue call following the call to the user APC routine never occurs.

Next time, we’ll take a look at the other kernel mode to user mode exception “callback”, KiRaiseUserExceptionDispatcher.

A catalog of NTDLL kernel mode to user mode callbacks, part 2: KiUserExceptionDispatcher

Friday, November 16th, 2007

Yesterday, I listed the set of kernel mode to user mode callback entrypoints (as of Windows Server 2008). Although some of the callbacks share certain similarities in their modes of operation, there remain significant differences between each of them, in terms of both calling convention and what functionality they perform.

KiUserExceptionDispatcher is the routine responsible for calling the user mode portion of the SEH dispatcher. When an exception occurs, and it is an exception that would generate an SEH event, the kernel checks to see whether the exception occurred while running user mode code. If so, then the kernel alters the trap frame on the stack, such that when the kernel returns from the interrupt or exception, execution resumes at KiUserExceptionDispatcher instead of the instruction that raised the fault. The kernel also arranges for several parameters (a PCONTEXT and a PEXCEPTION_RECORD) that describe the state of the machine when the exception occurred to be passed to KiUserExceptionDispatcher upon the return to user mode. (This model of changing the return address for a return from kernel mode to user mode is a common idiom in the Windows kernel for several user mode event notification mechanisms.)

Once the kernel mode stack unwinds and control is transferred to KiUserExceptionDispatcher in user mode, the exception is processed locally via a call to RtlDispatchException, which is the core of the user mode exception dispatcher logic. If the exception was successfully dispatched (that is, an exception handler handled it), the final user mode context is realized with a call to RtlRestoreContext, which simply loads the registers in the given context into the processor’s architectural execution state.

Otherwise, the exception is “re-thrown” to kernel mode for last chance processing via a call to NtRaiseException. This gives the user mode debugger (if any) a final shot at handling the exception, before the kernel terminates the process. (The kernel internally provides the user mode debugger and kernel debugger a first chance shot at such exceptions before arranging for KiUserExceptionDispatcher to be run.)

I have posted a pseudo-C representation of KiUserExceptionDispatcher for the curious. Note that it uses a specialized custom calling convention, and by virtue of this, is actually an assembler function. However, a C representation is often easier to understand.

Not all user mode exceptions originate from kernel mode in this fashion; in many cases (such as with the RaiseException API), the exception dispatching process is originated entirely from user mode and KiUserExceptionDispatcher is not involved.

Incidentally, if one has been following along with some of the recent postings, the reason why the invalid parameter reporting mechanism of the Visual Studio 2005 CRT doesn’t properly break into into an already attached debugger should start to become clear now, given some additional knowledge of how the exception dispatching process works.

Because the VS2005 CRT simulates an exception by building a context and exception record, and passing these to UnhandledExceptionFilter, the normal exception dispatcher logic is not involved. This means that nobody makes a call to the NtRaiseException system service (as would normally be the case if UnhandledExceptionFilter were called as a part of normal exception dispatching), and thus there is no notification sent to the user mode debugger asking it to pre-process the simulated STATUS_INVALID_PARAMETER exception.

Update: Posted representation of KiUserExceptionDispatcher.

Next up: Taking a look at the user mode APC dispatcher (KiUserApcDispatcher).

A catalog of NTDLL kernel mode to user mode callbacks, part 1: Overview

Thursday, November 15th, 2007

As I previously mentioned, NTDLL maintains a set of special entrypoints that are used by the kernel to invoke certain functionality on the behalf of user mode.

In general, the functionality offered by these entrypoints is fairly simple, although having an understanding of how each are used provides useful insight into how certain features (such as user mode APCs) really work “under the hood”.

For the purposes of this discussion, the following are the NTDLL exported entrypoints that the kernel uses to communicate to user mode:

  1. KiUserExceptionDispatcher
  2. KiUserApcDispatcher
  3. KiRaiseUserExceptionDispatcher
  4. KiUserCallbackDispatcher
  5. LdrInitializeThunk
  6. RtlUserThreadStart
  7. EtwpNotificationThread
  8. LdrHotPatchRoutine

(There are other NTDLL exports used by the kernel, but not for direct user mode communication.)

These routines are generally used to inform user mode of a particular event occurring, though the specifics of how each routine is called vary somewhat.

KiUserExceptionDispatcher, KiUserApcDispatcher, and KiRaiseUserExceptionDispatcher are exclusively used when user mode has entered kernel mode, either implicitly, due to a processor interrupt (say, a page fault that will eventually trigger an access violation), or explicitly, due to a system call (such as NtWaitForSingleObject). The mechanism that the kernel uses to invoke these entrypoints is to alter the context that will be realized upon return from kernel mode to user mode. The return to user mode context information (a KTRAP_FRAME) is modified such that when kernel mode returns, instead of returning to the point upon which user mode invoked a kernel mode transition, control is transferred to one of the three dispatcher routines. Additional arguments are supplied to these dispatcher routines as necessary.

KiUserCallbackDispatcher is used to explicitly call out to user mode from kernel mode. This inverted mode of operation is typically discouraged in favor of models such as the pending IRP “inverted call model”. For historical design reasons, however, the Win32 subsystem (win32k.sys) uses this for a number of tasks (such as calling a user mode window procedure in response to a kernel mode window message operation). The user mode callout mechanism is not extensible to support arbitrary user mode callback destinations.

LdrInitializeThunk is the first instruction that any user mode thread runs, before the “actual” thread entrypoint. As such, it is the address at which every user mode thread system-wide begins execution.

RtlUserThreadStart is used on Windows Vista and Windows Server 2008 (and later OS’s) to form the initial entrypoint context for a thread started with NtCreateThreadEx (this is a marked departure from the approach taken by NtCreateThread, wherein the user mode caller supplies the initial thread context).

EtwpNotificationThread and LdrHotPatchRoutine both correspond to a standard thread entrypoint routine. These entrypoints are referenced by user mode threads that are created in the context of a particular process to carry out certain tasks on behalf of the kernel. As the latter two routines are generally only rarely encountered, this series does not describe them in detail.

Despite (or perhaps in spite of) the more or less completely unrealized promise of less reboots for hotfixes with Windows Server 2003, I think I’ve seen a grand total of one or two hotfixes in the entire lifetime of the OS that supported hotpatching. Knowing the effort that must have gone into developing hotpatching support, it is depressing to see security bulletin after security bulletin state “No, this hotfix does not support hotpatching. A reboot will be required.”. That is, however, a topic for another day.

Next up: Examining the operation of KiUserExceptionDispatcher in more detail.

Why are certain DLLs required to be at the same base address system-wide?

Tuesday, November 13th, 2007

There are several Windows DLLs that are, for various reasons, required to be at the same base address system-wide (though for several of these, a case could be made that alternate base addresses could be used in different Terminal Server sessions). Although not explicitly documented by Microsoft (as far as I know), a number of programs rely on these “fixed base” DLLs.

That is not to say that the base addresses of these DLLs cannot change, but that while the system is running, all processes will have these DLLs mapped at the same base address (if they are indeed mapped at all).

The current set of DLLs that require the same base address system-wide includes NTDLL, kernel32, and user32, though the reasons for this requirement vary a bit between each DLL.

NTDLL must be at the same address system wide because there are a number of routines it exports that are used by the kernel to arrange for (indirect) calls to user mode. For example, ntdll!LdrInitializeThunk is the true start address of every user mode thread, and ntdll!KiUserApcDispatcher is used to invoke a user mode APC if one is ready to be processed while a thread is in a user mode wait. The kernel resolves the address of these (and other) “special” exports at system initialization time, and then uses these addresses when it needs to arrange for user mode code execution. Because the kernel caches these resolved function pointers, NTDLL cannot typically be based at a different address from process to process, as the kernel will always reference the same address for these special exports across all processes.

Additionally, some special NTDLL exports are used by user mode code in such a way that they are assumed to be at the same base address cross-process. For example, ntdll!DbgUiRemoteBreakIn is used by the debugger to break in to a process, and the debugger assumes that the local address of DbgUiRemoteBreakIn matches the remote address of DbgUiRemoteBreakIn in the target process.

Kernel32 is required to be at the same base address because there are a number of internal kernel32 routines that, similar to ntdll!DbgUiRemoteBreakIn, are used in cross-process thread injection. One example of this used to be the console control event handler In the case of console events, during kernel32.dll initialization, the address of the Ctrl-C event dispatcher is passed to WinSrv.dll (in CSRSS space).

Originally, WinSrv simply cached the dispatcher pointer after the first process was created (thus requiring kernel32 to be at the same base address across all processes in the session). On modern systems, however, WinSrv now tracks the client’s kernel32 dispatcher pointer on a per-process basis to account for the fact that the dispatcher is at a different address in the 32-bit kernel32 (versus the 64-bit kernel32). Ironically, the developer who made this change to WinSrv actually forgot to add support for using the current process’s dispatcher pointer in several corner cases (such as kernel32!SetLastConsoleActiveEvent and the corresponding winsrv!SrvConsoleNotifyLastClose and winsrv!RemoveConsole CSRSS-side routines). In the cases where WinSrv still incorrectly passes the cached (64-bit) Ctrl-C dispatcher value to CreateRemoteThread, wow64.dll has a special hack (wow64!MapContextAddress64TO32) that cleans up after WinSrv and fixes the thread start address to refer to the 32-bit kernel32.dll.

By the time this change to WinSrv and Ctrl-C processing was made, though, the application compatibility impact of removing the kernel32 base address to be the same system-wide would have been too severe to eliminate the restriction (virtually all third party code injection code now relies heavily on this assumption). Thus, for this (and other) reasons, kernel32 still remains with the restriction that it may not be relocated to a different base address cross-process.

User32 is required to be at the same address cross-process because there is an array of user32 function addresses provided to win32k.sys for the built-in window class window procedures (among other things). This function pointer array is captured via a call to NtUserInitializeClientPfnArrays at session start up time, when WinSrv is initializing win32k during CSRSS initialization. Wow64win.dll, the NtUser/win32k Wow64 support library, provides support for mapping 32-bit to 64-bit (and vice versa) for these function addresses, as necessary for the support of 32-bit processes on 64-bit platforms.

The user32 and kernel32 requirements could arguably be relaxed to only apply within a Terminal Server session, although the Windows XP (and later) cross-session debugging support muddies the waters with respect to kernel32 due to a necessity to support debugger break-in with debuggers that utilize DebugBreak for their break-in threads. (The Wow64 layer provides translation assistance for mapping DebugBreak to a 32-bit address if a 64-bit thread is created at the address of the 64-bit kernel32 DebugBreak export.)

Note that the ASLR support in Windows Vista does not run afoul of these restrictions, as Vista’s ASLR always picks the same base address for a given DLL per operating system boot. Thus, even though, say, NTDLL can have its base address randomized at boot time under Windows Vista, the particular randomized base address that was chosen is still used by all processes until the operating system is restarted.

Update: skape points out that I (somehow) neglected to mention the most important restriction on kernel32 base addressing, that being on Windows Server 2003 and earlier operating systems, internal kernel32 routines are used as the start address of new threads by CreateRemoteThread and CreateProcess.

How does one retrieve the 32-bit context of a Wow64 program from a 64-bit process on Windows Server 2003 x64?

Thursday, November 1st, 2007

Recently, Jimmy asked me what the recommended way to retrieve the 32-bit context of a Wow64 application on Windows XP x64 / Windows Server 2003 x64 was.

I originally responded that the best way to do this was to use Wow64GetThreadContext, but Jimmy mentioned that this doesn’t exist on Windows XP x64 / Windows Server 2003 x64. Sure enough, I checked and it’s really not there, which is rather a bummer if one is trying to implement a 64-bit debugger process capable of debugging 32-bit processes on pre-Vista operating systems.

Normally, I don’t typically recommend using undocumented implementation details in production code, but in this case, there seems to be little choice as there’s no documented mechanism to perform this operation prior to Vista. Because Vista introduces a documented way to perform this task, going an undocumented route is at least slightly less questionable, as there’s an upper bound on what operating systems need to be supported, and major changes to the implementation of things on downlevel operating systems are rarer than with new operating system releases.

Clearly, this is not always the case; Windows XP Service Pack 2 changed an enormous amount of things, for instance. However, as a general rule, service packs tend to be relatively conservative with this sort of thing. That’s not that one has carte blanche with using undocumented implementation details on downlevel platforms, but perhaps one can sleep a bit easier at night knowing that things are less likely to break than in the next Windows release.

I had previously mentioned that the Wow64 layer takes a rather unexpected approach to how to implement GetThreadContext and SetThreadContext. While I mentioned at a high level what was going on, I didn’t really go into the details all that much.

The basic implementation of these routines is to determine whether the thread is running in 64-bit mode or not (determined by examining the SegCs value of the 64-bit context record for the thread as returned by NtGetContextThread). If the thread is running in 64-bit mode, and the thread is a Wow64 thread, then an assumption can be made that the thread is in the middle of a callout to the Wow64 layer (say, a system call).

In this case, the 32-bit context is saved at a well-known location by the process that translates from running in 32-bit mode to running in 64-bit mode for system calls and other voluntary, user mode “32-bit break out” events. Specifically, the Wow64 layer repurposes the second TLS slot of each 64-bit thread (that is, Teb->TlsSlots[ 1 ]) to point to a structure of the following layout:

typedef struct _WOW64_THREAD_INFO
   ULONG UnknownPrefix;
   WOW64_CONTEXT Wow64Context;
   ULONG UnknownSuffix;

(The real structure name is not known..)

Normally, system components do not use the TLS array, but the Wow64 layer is an exception. Because there is not normally any third party 64-bit code running in a Wow64 process, the Wow64 layer is free to do what it wants with the TlsSlots array of the 64-bit TEB for a Wow64 thread. (Each Wow64 thread has its own, separate 32-bit TEB, so this does not interfere with the operation of TLS by the 32-bit program that is currently executing.)

In the case where the requested Wow64 is in a 64-bit Wow64 callout, all one needs to do is to retrieve the base address of the 64-bit TEB of the thread in question, read the second entry in the TlsSlots array, and then read the WOW64_CONTEXT structure out of the memory block referred to by the second 64-bit TLS slot.

The other case that is significant is that where the Wow64 thread is running 32-bit code and is not in a Wow64 callout. In this case, because Wow64 runs x86 code natively, one simply needs to capture the 64-bit context of the desired thread and truncate all of the 64-bit registers to their 32-bit counterparts.

Setting the context of a Wow64 thread works exactly like retrieving the context of a Wow64 thread, except in reverse; one either modifies the 64-bit thread context if the thread is running 32-bit code, or one modifies the saved context record based off of the 64-bit TEB of the desired thread (which will be restored when the thread resumes execution).

I have posted a basic implementation of a version of Wow64­GetThreadContext that operates on pre-Windows-Vista platforms. Note that this implementation is incomplete; it does not translate floating point registers, nor does it only act on the subset of registers requested by the caller in CONTEXT::ContextFlags. The provided code also does not implement Wow64­SetThreadContext; implementing the “set” operation and extending the “get” operation to fully conform to GetThreadContext semantics are left as an exercise for the reader.

This code will operate on Vista x64 as well, although I would strongly recommend using the documented API on Vista and later platforms instead.

Note that the operation of Wow64 on IA64 platforms is completely different from that on x64. This information does not apply in any way to the IA64 version of Wow64.

Thread Local Storage, part 8: Wrap-up

Wednesday, October 31st, 2007

This is the final post in the Thread Local Storage series, which is comprised of the following articles:

  1. Thread Local Storage, part 1: Overview
  2. Thread Local Storage, part 2: Explicit TLS
  3. Thread Local Storage, part 3: Compiler and linker support for implicit TLS
  4. Thread Local Storage, part 4: Accessing __declspec(thread) data
  5. Thread Local Storage, part 5: Loader support for __declspec(thread) variables (process initialization time)
  6. Thread Local Storage, part 6: Design problems with the Windows Server 2003 (and earlier) approach to implicit TLS
  7. Thread Local Storage, part 7: Windows Vista support for __declspec(thread) in demand loaded DLLs
  8. Thread Local Storage, part 8: Wrap-up

By now, much of the inner workings of TLS (both implicit and explicit) on Windows should appear less mysterious, and a number of the seemingly arbitrary restrictions on limitations (maximum counts of explicit TLS slots on various operating systems, and limitations with respect to the usage of __declspec(thread) on demand loaded DLLs). Although many of these things can be (and should) considered implementation details that are subject to change, knowing how things work “under the hood” often comes in useful from time to time. For example, with an understanding of why there’s a hard limit to the number of available explicit TLS slots, the importance of reusing one TLS slots for many variables (by placing them into a structure that is pointed to by the contents of a TLS slot) should become clear.

Many of the details of implicit TLS are actually rather set in stone at this point, due to the fact that the compiler has been emitting code to directly access the ThreadLocalStoragePointer field in the TEB. Interestingly enough, this makes ThreadLocalStoragePointer a “guaranteed portable” part of the TEB, along with the NT_TIB header, despite the fact that the contents between the two are not defined to be portable (and are certainly not across, say, Windows 95).

Most of the inner workings of TLS are fairly straightforward, although there are some clever tricks employed to deal with scenarios such as TLS slots being released while threads are active. Many of the operational details of day to day TLS operation, such as how explicit TLS operates, are significantly different on Windows 95 and other operating systems of the 16-bit Windows lineage, so I would not recommend relying on the details of the implementation of TLS for non-NT-based systems.

Incidentally, most of the operating system itself does not use TLS in the way that it is exposed to third party programs. Instead, many operating system components either have their own dedicated fields in the TEB, or for larger amounts of data that may not need to be allocated for every thread in the system, a pointer field that can be filled with a pointer to a memory block at runtime if desired. For instance, there’s a ReservedForNtRpc field, a number of fields set aside for OpenGL ICDs (so much for Microsoft not supporting OpenGL), a WinSockData field for ws2_32, and many other similar fields for various operating system components.

This doesn’t mean that these components are really getting preferential treatment, as for the most part, an access to such a field in the TEB is in practice not really slower than an access through the documented TLS APIs. The benefit from providing these components with their own dedicated storage in the TEB is that in many cases, these components are already going to be active. If said operating system components used conventional TLS, then this would significantly detract from the already limited number of TLS slots available for use by third party components.

Some components do actually use standard TLS, or at least the space allocated in the TEB for standard TLS slots (though in special circumstances and without going through the standard explicit TLS APIs). For example, the 64-bit portion of the Wow64 layer in a 32-bit process repurposes some of the 64-bit TLS slots (which would normally be completely unused in such a process) for its own internal usage, thereby avoiding the need for dedicated storage in the TEB. That, however, is a story for another day.

Thread Local Storage, part 7: Windows Vista support for __declspec(thread) in demand loaded DLLs

Tuesday, October 30th, 2007

Yesterday, I outlined some of the pitfalls behind the approach that the loader has traditionally taken to implicit TLS, in Windows Server 2003 and earlier releases of the operating system.

With Windows Vista, Microsoft has taken a stab at alleviating some of the issues that make __declspec(thread) unusable for demand loaded DLLs. Although solving the problem may initially appear simple at first (one would tend to think that all that would need to be done would be to track and procesS TLS data for new modules as they’re loaded), the reality of the situation is unfortunately a fair amount more complicated than that.

At heart is the fact that implicit TLS is really only designed from the get-go to support operation at process initialization time. For example, this becomes evident when ones considers what would need to be done to allocate a TLS slot for a new module. This is in and of itself problematic, as the per-module TLS array is allocated at process initialization time, with only enough space for the modules that were present (and using TLS) at that time. Expanding the array is in this case a difficult thing to safely do, considering the code that the compiler generates for accessing TLS data.

The problem resides in the fact that the compiler reads the address of the current thread’s ThreadLocalStoragePointer and then later on dereferences the returned TLS array with the current module’s TLS index. Because all of this is done without synchronization, it is not in general safe to just switch out the old ThreadLocalStoragePointer with a new array and then release the old array from another thread context, as there is no way to ensure that the thread whose TLS array is being modified was not in the middle of accessing the TLS array.

A further difficulty presents itself in that there now needs to be a mechanism to proactively go out and place a new TLS module block into each running thread’s TLS array, as there may be multiple threads active when a module is demand-loaded. This is further complicated by the fact that said modifications are required to be performed before DllMain is called for the incoming module, and while the loader lock is still held by the current thread. This implies that, once again, the alterations to the TLS arrays of other threads will need to be performed by the current thread, without the cooperation of additional threads that are active in the process at the time of the DLL load.

These constraints are responsible for the bulk of the complexity of the new loader code in Windows Vista for TLS-related operations. The general concept behind how the new TLS support operates is as follows:

First, a new module is loaded via LdrLoadDll (which is used to implement LoadLibrary and similar Win32 functions). The loader examines the module to determine if it makes use of implicit TLS. If not, then no TLS-specific handling is performed and the typical loaded module processing occurs.

If an incoming module does make use of TLS, however, then LdrpHandleTlsData (an internal helper routine) is called to initialize support for the new module’s implicit TLS usage. LdrpHandleTlsData determines whether there is room in the ThreadLocalStoragePointer arrays of currently loaded threads for the new module’s TLS slot (with Windows Vista, the array can initially be larger than the total number of modules using TLS at process initialization time, for cheaper expansion of TLS data when a new module using TLS is demand-loaded). Because all running threads will at any given time have the same amount of space in their ThreadLocalStoragePointer, this is easily accomplished by a global variable to keep track of the array length. This variable is the SizeOfBitMap member of LdrpTlsBitmap, an RTL_BITMAP structure.

Depending on whether the existing ThreadLocalStoragePointer arrays are sufficient to contain the new module, LdrpHandleTlsdata allocates room for the TLS variable block for the new module and possibly new TLS arrays to store in the TEB of running threads. After the new data is allocated for each thread for the incoming module, a new process information class (ProcessTlsInformation) is utilized with an NtSetInformationProcess call to ask the kernel for help in switching out TLS data for any threads that are currently running in the process. Conceptually, this behavior is similar to ThreadZeroTlsCell, although its implementation is significantly more complicated. This step does not really appear to need to occur in kernel mode and does introduce significant (arguably unnecessary) complexity, so it is unclear why the designers elected to go this route.

In response to the ProcessTlsInformation request, the kernel enumerates threads in the current process and either swaps out one member of the ThreadLocalStoragePointer array for all threads, or swaps out the entire pointer to the ThreadLocalStoragePointer array itself in the TEB for all threads. The previous values for either the requested TLS index or the entire array pointer are then returned to user mode.

LdrpHandleTlsData then inspects the data that was returned to it by the kernel. Generally, this data represents either a TLS data block for a module that has been since unloaded (which is always safe to immediately free), or it represents an old TLS array for an already running thread. In the latter case, it is not safe to release the memory backing the array, as without the cooperation of the thread in question, there is no way to determine when the thread has released all possible references to the old memory block. Since the code to access the TLS array is hardcoded into every program using implicit TLS by the compiler, for practical purposes there is no particularly elegant way to make this determinatiion.

Because it is not easily possible to determine (prove) when the old TLS array pointer will never again be referenced, the loader enqueues the pointer into a list of heap blocks to be released at thread exit time when the thread that owns the old TLS array performs a clean exit. Thus, the old TLS array pointer (if the TLS array was expanded) is essentially intentionally leaked until the thread exits. This is a fairly minor memory loss in practice, as the array itself is an array of pointers only. Furthermore, the array is expanded in such a way that most of the time, a new module will take an unused slot in the array instead of requiring the TLS array to be reallocated each time. This sort of intentional leak is, once again, necessary due to the design of implicit TLS not being particular conducive to supporting demand loaded modules.

The loader lock itself is used for synchronization with respect to switching out TLS pointers in other threads in the current process. While a thread owns the loader lock, it is guaranteed that no other thread will attempt to modify the TLS array of it (or any other threads). Because the old TLS array pointers are kept if the TLS array is reallocated, there is no risk of touching deallocated memory when the swap is made, even though the threads whose TLS pointers are being swapped have no synchronization with respect to reading the TLS array in their TEBs.

When a module is unloaded, the TLS slot occupied by the module is released back into the TLS slot pool, but the module’s TLS variable space is not immediately freed until either individual threads for which TLS variable space were allocated exit, or a new module is loaded and happens to claim the outgoing module’s previous TLS slot.

For those interested, I have posted my interpretration of the new implicit TLS support in Vista. This code has not been completely tested, though it is expected to be correct enough for purposes of understanding the details of the TLS implementation. In particular, I have not verified every SEH scope in the ProcessTlsInformation implementation; the SEH scope statements (handlers in particular) are in many cases logical extrapolations of what the expected behavior should be in such cases. As always, it should be considered implementation details and subject to change without notice in future operating system releases.

(There also appear to be several unfortunate bugs in the Vista implementation of TLS, mostly related to inconsistent states and potential corruption if heap allocations fail at “bad” points in time. These are commented in the above code.)

The handler for the ProcessTlsInformation process set information class does not appear to be subfunction in reality, but instead a (rather large) case statement in the implementation of NtSetInformationProcess. It is presented as a subfunction for purposes of clarity. For reference, a control flow graph of NtSetInformationProcess is provided, with the basic blocks relevant to the ProcessTlsInformation case statement shaded. I suspect that this information class holds the record for the most convoluted usage of SEH scopes due to its heavy use of dual input/output parameters.

The information class implementation also appears to take many unconventional shortcuts that while technically workable for the use cases, would appear to be rather inconsistent with the general way that most other system calls and information classes are architected. The reasoning behind these inconsistencies is not known (perhaps as a time saver). For example, unlike most other process information classes, the only valid handle that can be used with this information class is NtCurrentProcess(). In other words, the information class handler implementation assumes the caller is the process to be modified.