Archive for November, 2007

A catalog of NTDLL kernel mode to user mode callbacks, part 6: LdrInitializeThunk

Wednesday, November 28th, 2007

Previously, I described the mechanism by which the kernel mode to user mode callback dispatcher (KiUserCallbackDispatcher) operates, and how it is utilized by win32k.sys for various window manager related operations.

The next special NTDLL kernel mode to user mode “callback” up on the list is LdrInitializeThunk, which is not so much a callback as the entry point at which all user mode threads begin their execution system-wide. Although the Win32 CreateThread API (and even the NtCreateThread system service that is used to implement the Win32 CreateThread) provide the illusion that a thread begins its execution at a specified start routine (or instruction pointer, in the case of NtCreateThread), this is not truly the case.

CreateThread internally layers in a special kernel32 stub routine (BaseThreadStart or BaseProcessStart) in between the specified thread routine and the actual initial instruction of the thread. The kernel32 stub routine wrappers the call to the user-supplied thread start routine to provide services such as a “top-level” SEH frame for the support of UnhandledExceptionFilter.

However, there exists yet another layer of indirection before a thread begins its execution at the specified thread start routine, even beyond the kernel32 stub routine at the start of all Win32 threads (the way the kernel32 stub routine works changes slightly with Windows Vista, though that is outside the scope of this discussion). The presence of this extra layer of indirection can be inferred by examining the documentation on MSDN for DllMain, which states that all threads call out to the DllMain routine at some point during thread start-up. The kernel32 stub routine is not involved in this process, and obviously the user-supplied thread entry point does not have to explicitly attempt to call DllMain for every loaded DLL with DLL_THREAD_ATTACH. This leaves us with the question of who actually arranges for these DllMain calls to happen when a thread begins.

The answer to this question is, of course, the feature routine of this article, LdrInitializeThunk. When a user mode thread is readied to begin initial execution after being created, the initial context that is realized is not the context value supplied to NtCreateThread (which would eventually end up in the user-supplied thread entry point). Instead, execution really begins at LdrInitializeThunk within NTDLL, which is supplied a CONTEXT record that describes the initially requested state of the thread (as supplied to NtCreateThread, or as created by NtCreateThreadEx on Windows Vista). This context record is provided as an argument to LdrInitializeThunk, in order to allow for control to (eventually) be transferred to the user-supplied thread entry point.

When invoked by a new thread, LdrInitializeThunk invokes LdrpInitialize to perform the remainder of the initialization tasks, and then calls upon the NtContinue system service to restore the supplied context record. I have made available a C-like representation of this process for illustration purposes.
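
In outline, that flow resembles the following (a rough sketch rather than the linked representation; the exact parameters of LdrInitializeThunk vary between OS versions, and the real routine uses a nonstandard calling convention):

VOID NTAPI LdrInitializeThunk(PCONTEXT Context, PVOID NtdllBase)
{
    //
    // Perform process initialization (for the first thread in the process)
    // or per-thread initialization (for subsequent threads), as appropriate.
    //
    LdrpInitialize(Context, NtdllBase);

    //
    // Realize the initially requested thread context.  NtContinue does not
    // return on success; execution resumes at the register state described
    // by the context record, eventually reaching the user-supplied start
    // address.
    //
    NtContinue(Context, FALSE);
}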

LdrpInitialize makes a determination as to whether the process has already been initialized (for the purposes of NTDLL). This step is necessary as LdrInitializeThunk (and by extension, LdrpInitialize) is not only the actual entry point for a new thread in an already initialized process, but it is also the entry point for the initial thread in a completely new process (making it the first piece of code that is run in user mode in a new process). If the process has not already been initialized, then LdrpInitialize performs process initialization tasks by invoking LdrpInitializeProcess (some process initialization is performed inline by LdrpInitialize in the process initialization case as well). If the process is a Wow64 process, then the Wow64 NTDLL is loaded and invoked to initialize the 32-bit NTDLL’s per-process state.

Otherwise, if the process is already initialized, then LdrpInitialize invokes LdrpInitializeThread to perform per-thread initialization, which primarily involves invoking DllMain and TLS callbacks for loaded modules while the loader lock is held. (This is the reason why it is not supported to wait on a new thread to run its thread initialization routine from DllMain, because the new thread will immediately become blocked upon the loader lock, waiting for the thread already in DllMain to complete its processing.) If the process is a Wow64 process, then there is again support for making a call to the 32-bit NTDLL for purposes of running 32-bit per-thread initialization code.

When the required process or thread initialization tasks have been completed, LdrpInitialize returns to LdrInitializeThunk, which then realizes the user-supplied thread start context with a call to the NtContinue system service.

One consequence of this architecture for process and thread initialization is that it becomes (slightly) more difficult to step through process initialization, because the thread that initializes the process is not the first thread created in the process, but rather the first thread that executes LdrInitializeThunk. This means that one cannot simply create a process as suspended and attach a debugger to the process in order to step through process initialization, as the debugger break-in thread will run the very process initialization code that one wishes to step through before executing the break-in thread breakpoint instruction!

There is, fortunately, support for debugging this scenario built into the Windows debugger package. By setting the debugger to break on the cpr (‘Create Process’) event, it is possible to manually set a breakpoint in a new process (created by the debugger) or a child process (if child process debugging is enabled) before LdrInitializeThunk is run. For example, using the following ntsd command line, it is possible to examine the target and set breakpoints before any user mode code runs:

ntsd.exe -xe cpr C:\windows\system32\cmd.exe

Note that as the loaded module list will not have been initialized at this point, symbols will not be functional, so it is necessary to resolve the offset of LdrInitializeThunk manually in order to set a breakpoint. The process can be continued with the typical “g” command.

Update: Pavel Labedinsky points out that you can use “-xe ld:ntdll.dll” to achieve the same effect, but with the benefit that symbols for NTDLL are available. This is a better option as it alleviates the need to manually resolve the addresses of locations that you wish to set a breakpoint on.
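
For example, the equivalent of the earlier invocation would be something along these lines (the target program being purely illustrative):

ntsd.exe -xe ld:ntdll.dll C:\windows\system32\cmd.exe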

Next up: Examining RtlUserThreadStart (present on Windows Vista and beyond).

A catalog of NTDLL kernel mode to user mode callbacks, part 5: KiUserCallbackDispatcher

Wednesday, November 21st, 2007

Last time, I briefly outlined the operation of KiRaiseUserExceptionDispatcher, and how it is used by the NtClose system service to report certain classes of handle misuse under the debugger.

All of the NTDLL kernel mode to user mode “callbacks” that I have covered thus far have been, for the most part, fairly “passive” in nature. By this, I mean that the kernel does not explicitly call any of these callbacks, at least in the usual sense of making a function call. Instead, all of the routines that we have discussed thus far are only invoked in place of the normal return procedure for a system call or interrupt, under certain conditions. (Conceptually, this is similar in some respects to returning to a different location using longjmp.)

In contrast to the other routines that we have discussed thus far, KiUserCallbackDispatcher breaks completely out of the passive callback model. The user mode callback dispatcher is, as the name implies, a trampoline that is used to make full-fledged calls to user mode, from kernel mode. (It is complemented by the NtCallbackReturn system service, which resumes execution in kernel mode following a user mode callback’s completion. Note that this means that a user mode callback can make auxiliary calls into the kernel without “returning” back to the original kernel mode caller.)

Calling from kernel mode into user mode is a very non-traditional approach in the Windows world, and for good reason. Such calls are typically dangerous and need to be implemented very carefully in order to avoid creating any number of system reliability or integrity issues. Beyond simply validating any data returned to kernel mode from user mode, there are a far greater number of concerns with a direct kernel mode to user mode call model as supported by KiUserCallbackDispatcher. For example, a thread running in user mode can be freely suspended, delayed for a very long period of time due to a high priority user thread, or even terminated. These actions mean that any code spanning a call out to user mode must not hold locks, must not have acquired memory or other resources that might need to be released, and so forth.

From a kernel mode perspective, the way a user mode callback using KiUserCallbackDispatcher works is that the kernel saves the current processor state on the current kernel stack, alters the view of the top of the current kernel stack to point after the saved register state, sets a field in the current thread (CallbackStack) to point to the stack frame containing the saved register state (the previous CallbackStack value is saved to allow for recursive callbacks), and then executes a return to user mode using the standard return mechanism.

The user mode return address is, of course, set to the feature NTDLL routine of this article, KiUserCallbackDispatcher. The way the user mode callback dispatcher operates is fairly simple. First, it indexes into an array stored in the PEB with an argument to the callback dispatcher that is used to select the function to be invoked. Then, the callback routine located in the array is invoked, and provided with a single pointer-sized argument from kernel mode (this argument is typically a structure pointer containing several parameters packaged up into one contiguous block of memory). The actual implementation of KiUserCallbackDispatcher is fairly simple, and I have posted a C representation of it.
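
In rough terms, it reduces to something like the following (a simplified sketch rather than the posted representation; the PKERNEL_CALLBACK_ROUTINE typedef is illustrative, and argument passing details differ between x86 and x64):

typedef NTSTATUS (NTAPI *PKERNEL_CALLBACK_ROUTINE)(PVOID InputBuffer,
                                                   ULONG InputLength);

VOID NTAPI KiUserCallbackDispatcher(ULONG ApiNumber, PVOID InputBuffer,
                                    ULONG InputLength)
{
    NTSTATUS Status;
    PPEB     Peb           = NtCurrentTeb()->ProcessEnvironmentBlock;
    PVOID   *CallbackTable = (PVOID *)Peb->KernelCallbackTable;

    //
    // Select the callback routine from the function pointer array that was
    // registered in the PEB (by user32, in the Win32 case), and invoke it
    // with the single argument block supplied by kernel mode.
    //
    Status = ((PKERNEL_CALLBACK_ROUTINE)CallbackTable[ApiNumber])(
        InputBuffer,
        InputLength);

    //
    // Resume the kernel mode caller.  NtCallbackReturn does not return on
    // success.
    //
    NtCallbackReturn(NULL, 0, Status);
}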

In Win32, kernel mode to user mode callbacks are used exclusively by User32 for windowing related aspects, such as calling a window procedure to send a WM_NCCREATE message during the creation of a new window on behalf of a user mode caller that has invoked NtUserCreateWindowEx. For example, during window creation processing, if we set a breakpoint on KiUserCallbackDispatcher, we might see the following:

Breakpoint 1 hit
ntdll!KiUserCallbackDispatch:
00000000`77691ff7 488b4c2420  mov rcx,qword ptr [rsp+20h]
0:000> k
RetAddr           Call Site
00000000`775851ca ntdll!KiUserCallbackDispatch
00000000`7758514a USER32!ZwUserCreateWindowEx+0xa
00000000`775853f4 USER32!VerNtUserCreateWindowEx+0x27c
00000000`77585550 USER32!CreateWindowEx+0x3fe
000007fe`fddfa5b5 USER32!CreateWindowExW+0x70
000007fe`fde221d3 ole32!InitMainThreadWnd+0x65
000007fe`fde2150c ole32!wCoInitializeEx+0xfa
00000000`ff7e6db0 ole32!CoInitializeEx+0x18c
00000000`ff7ecf8b notepad!WinMain+0x5c
00000000`7746cdcd notepad!IsTextUTF8+0x24f
00000000`7768c6e1 kernel32!BaseThreadInitThunk+0xd
00000000`00000000 ntdll!RtlUserThreadStart+0x1d

If we step through this call a bit more, we’ll see that it eventually ends up in a function by the name of user32!_fnINLPCREATESTRUCT, which eventually calls user32!DispatchClientMessage with the WM_NCCREATE window message, allowing the window procedure of the new window to participate in the window creation process, despite the fact that win32k.sys handles the creation of a window in kernel mode.

Callbacks are, as previously mentioned, permitted to be nested (or even recursively made) as well. For example, after watching calls to KiUserCallbackDispatcher for a time, we’ll probably see something akin to the following:

Breakpoint 1 hit
ntdll!KiUserCallbackDispatch:
00000000`77691ff7 488b4c2420  mov rcx,qword ptr [rsp+20h]
0:000> k
RetAddr           Call Site
00000000`7758b45a ntdll!KiUserCallbackDispatch
00000000`7758b4a4 USER32!NtUserMessageCall+0xa
00000000`7758e55a USER32!RealDefWindowProcWorker+0xb1
000007fe`fca62118 USER32!RealDefWindowProcW+0x5a
000007fe`fca61fa1 uxtheme!_ThemeDefWindowProc+0x298
00000000`7758b992 uxtheme!ThemeDefWindowProcW+0x11
00000000`ff7e69ef USER32!DefWindowProcW+0xe6
00000000`7758e25a notepad!NPWndProc+0x217
00000000`7758cbaf USER32!UserCallWinProcCheckWow+0x1ad
00000000`77584e1c USER32!DispatchClientMessage+0xc3
00000000`77692016 USER32!_fnINOUTNCCALCSIZE+0x3c
00000000`775851ca ntdll!KiUserCallbackDispatcherContinue
00000000`7758514a USER32!ZwUserCreateWindowEx+0xa
00000000`775853f4 USER32!VerNtUserCreateWindowEx+0x27c
00000000`77585550 USER32!CreateWindowEx+0x3fe
00000000`ff7e9525 USER32!CreateWindowExW+0x70
00000000`ff7e6e12 notepad!NPInit+0x1f9
00000000`ff7ecf8b notepad!WinMain+0xbe
00000000`7746cdcd notepad!IsTextUTF8+0x24f
00000000`7768c6e1 kernel32!BaseThreadInitThunk+0xd

This support for recursive callbacks is a large factor in why threads that talk to win32k.sys often have so-called “large kernel stacks”. The kernel mode dispatcher for user mode calls will attempt to convert the thread to a large kernel stack when a call is made, as the typically sized kernel stack is not large enough to support the number of recursive kernel mode to user mode calls present in many complicated window messaging scenarios.

If the process is a Wow64 process, then the callback array in the PEB is repointed to an array of conversion functions inside the Wow64 layer, which map the callback argument to a version compatible with the 32-bit user32.dll, as appropriate.

Next up: Taking a look at LdrInitializeThunk, where all user mode threads really begin their execution.

A catalog of NTDLL kernel mode to user mode callbacks, part 4: KiRaiseUserExceptionDispatcher

Tuesday, November 20th, 2007

The previous post in this series outlined how KiUserApcDispatcher operates for the purposes of enabling user mode APCs. Unlike KiUserExceptionDispatcher, which is expected to modify the return information from the context of an interrupt (or exception) in kernel mode, KiUserApcDispatcher is intended to operate on the return context of an active system call.

In this vein, the third kernel mode to user mode NTDLL “callback” that we shall be investigating, KiRaiseUserExceptionDispatcher, is fairly similar. Although in some respects akin to KiUserExceptionDispatcher, at least relating to raising an exception in user mode on the behalf of a kernel mode caller, the KiRaiseUserExceptionDispatcher “callback” really has more in common with KiUserApcDispatcher from a usage and implementation standpoint. It also so happens that the implementation of this callback routine, which is quite simple, is completely representable in C, as a standard calling convention is used.

KiRaiseUserExceptionDispatcher is used if a system call wishes to raise an exception in user mode instead of simply returning an NTSTATUS, as is the standard convention. It simply constructs a standard exception record using a supplied status code (which must be written to a well-known location in the current thread’s TEB beforehand), and passes the exception to RtlRaiseException (the same routine that is used internally by the Win32 RaiseException API).
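
A rough C rendition would look something like the following (a sketch; the TEB field access and the exception record fields shown are illustrative of the internal layout rather than a guaranteed interface):

NTSTATUS NTAPI KiRaiseUserExceptionDispatcher(VOID)
{
    EXCEPTION_RECORD ExceptionRecord;

    //
    // Build an exception record around the status code that the kernel
    // stashed in the TEB before returning to user mode.
    //
    RtlZeroMemory(&ExceptionRecord, sizeof(ExceptionRecord));

    ExceptionRecord.ExceptionCode    = NtCurrentTeb()->ExceptionCode;
    ExceptionRecord.ExceptionFlags   = 0;
    ExceptionRecord.ExceptionAddress = (PVOID)&KiRaiseUserExceptionDispatcher;

    //
    // Hand the exception off to the regular user mode exception dispatching
    // logic (the same path used by the Win32 RaiseException API).
    //
    RtlRaiseException(&ExceptionRecord);

    //
    // The status code ultimately becomes the return value of the system
    // call that requested that the exception be raised.
    //
    return ExceptionRecord.ExceptionCode;
}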

This is a fairly atypical scenario, as most errors (including bad pointer parameters) will simply result in an appropriate status code being returned by the system call, such as STATUS_ACCESS_VIOLATION.

The one place that does currently use the services of KiRaiseUserExceptionDispatcher is NtClose, the system service responsible for closing handles (which implements the Win32 CloseHandle API). When a debugger is attached to a process and a protected handle (as set by a call to SetHandleInformation with HANDLE_FLAG_PROTECT_FROM_CLOSE) is passed to NtClose, then a STATUS_HANDLE_NOT_CLOSABLE exception is raised to user mode via KiRaiseUserExceptionDispatcher. For example, one might see the following when trying to close a protected handle with a debugger attached:

(2600.1994): Unknown exception - code c0000235 (first chance)
(2600.1994): Unknown exception - code c0000235 (second chance)
ntdll!KiRaiseUserExceptionDispatcher+0x3a:
00000000`776920e7 8b8424c0000000  mov eax,dword ptr [rsp+0C0h]
0:000> !error c0000235
Error code: (NTSTATUS) 0xc0000235 (3221226037) - NtClose was
called on a handle that was protected from close via
NtSetInformationObject.
0:000> k
RetAddr           Call Site
00000000`7746dadd ntdll!KiRaiseUserExceptionDispatcher+0x3a
00000000`01001955 kernel32!CloseHandle+0x29
00000000`01001e60 TestApp!wmain+0x35
00000000`7746cdcd TestApp!__tmainCRTStartup+0x120
00000000`7768c6e1 kernel32!BaseThreadInitThunk+0xd
00000000`00000000 ntdll!RtlUserThreadStart+0x1d
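
For reference, a minimal program along these lines (a sketch, with error handling omitted) will reproduce the above when run under a debugger:

#include <windows.h>

int wmain(void)
{
    HANDLE Event = CreateEventW(NULL, TRUE, FALSE, NULL);

    //
    // Mark the handle as protected from close.
    //
    SetHandleInformation(Event, HANDLE_FLAG_PROTECT_FROM_CLOSE,
                         HANDLE_FLAG_PROTECT_FROM_CLOSE);

    //
    // With a debugger attached, this close attempt raises the
    // STATUS_HANDLE_NOT_CLOSABLE (0xC0000235) exception via
    // KiRaiseUserExceptionDispatcher instead of simply failing.
    //
    CloseHandle(Event);

    return 0;
}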

The exception can be manually continued in a debugger to allow the program to operate after this occurs, though such a bad handle reference is typically indicative of a serious bug.

A similar behavior is activated if a program is being debugged under Application Verifier and an invalid handle is closed. This behavior of raising an exception on a bad handle closure attempt is intended as a debugging aid, since most programs do not bother to check the return value of CloseHandle.

Incidentally, bad handle closures like this in kernel mode will result in a bugcheck instead of simply raising an exception that can be caught. Drivers do not have the “luxury” of continuing when they touch a bad handle, except when probing a user handle, of course. (Old versions of Process Explorer used to bring down the box with a bugcheck due to this, for instance, if you tried to close a protected handle. This bug has fortunately since been fixed.)

Next time, we’ll take a look at KiUserCallbackDispatcher, which is used to a great degree by win32k.sys.

A catalog of NTDLL kernel mode to user mode callbacks, part 3: KiUserApcDispatcher

Monday, November 19th, 2007

I previously described the behavior of the kernel mode to user mode exception dispatcher (KiUserExceptionDispatcher). While exceptions are arguably the most commonly seen of the kernel mode to user mode “callbacks” in NTDLL, they are not the only such event.

Much like exceptions that traverse into user mode, user mode APCs are all (with the exception of initial thread startup) funneled through a single dispatcher routine inside NTDLL, KiUserApcDispatcher. KiUserApcDispatcher is responsible for invoking the actual APC routine, re-testing for APC delivery (to allow for the entire APC queue to be drained at once), and returning to the return point of the alertable system call that was interrupted to deliver the user mode APC. From the perspective of the user mode APC dispatcher, the stack is logically arranged as if the APC dispatcher would “return” to the instruction immediately following the syscall instruction in the system call that began the alertable wait.

For example, if we breakpoint on KiUserApcDispatcher and examine the stack (in this case, after an alertable WaitForSingleObjectEx call that has been interrupted to deliver a user mode APC), we might see the following:

Breakpoint 0 hit
ntdll!KiUserApcDispatcher:
00000000`77691f40 488b0c24  mov     rcx,qword ptr [rsp]
0:000> k
RetAddr           Call Site
00000000`776902ba ntdll!KiUserApcDispatcher
00000000`7746d820 ntdll!NtWaitForSingleObject+0xa
00000000`01001994 kernel32!WaitForSingleObjectEx+0x9c
00000000`01001eb0 TestApp!wmain+0x64
00000000`7746cdcd TestApp!__tmainCRTStartup+0x120
00000000`7768c6e1 kernel32!BaseThreadInitThunk+0xd
00000000`00000000 ntdll!RtlUserThreadStart+0x1d

From an implementation standpoint, KiUserApcDispatcher is conceptually fairly simple. I’ve posted a translation of the assembler for those interested. Keep in mind, however, that as with the other kernel mode to user mode callbacks, the routine is actually written in assembler and utilizes constructs not expressible through C, such as a custom calling convention.
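
With that caveat in mind, the x64 flavor reduces to roughly the following (a sketch rather than the posted translation; in reality the APC routine address, its arguments, and the context record are laid out on the stack rather than passed as ordinary parameters, and the PKNORMAL_ROUTINE_EX typedef is purely illustrative):

typedef VOID (NTAPI *PKNORMAL_ROUTINE_EX)(PVOID NormalContext,
                                          PVOID SystemArgument1,
                                          PVOID SystemArgument2,
                                          PCONTEXT Context);

VOID KiUserApcDispatcher(PVOID NormalContext, PVOID SystemArgument1,
                         PVOID SystemArgument2,
                         PKNORMAL_ROUTINE_EX NormalRoutine, PCONTEXT Context)
{
    //
    // Invoke the APC routine.  On x64, a pointer to the context record to
    // be resumed rides along as an extra (fourth) argument; more on this
    // below.
    //
    NormalRoutine(NormalContext, SystemArgument1, SystemArgument2, Context);

    //
    // Resume execution at the interrupted alertable system call.  The
    // second argument asks the kernel to re-test for pending APCs, allowing
    // the entire APC queue to be drained before the wait resumes.
    //
    NtContinue(Context, TRUE);
}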

Note that the APC routine invoked by KiUserApcDispatcher corresponds roughly to a standard native APC routine, specifically a PKNORMAL_ROUTINE (except on x64, but more on that later). This is not compatible with the Win32 view of an APC routine, which takes a single parameter, as opposed to the three that a KNORMAL_ROUTINE takes. As a result, there is a kernel32 function, BaseDispatchAPC, that wrappers all Win32 APCs, providing an SEH frame around the call (and activating the appropriate activation context, if necessary). BaseDispatchAPC also converts from the native APC calling convention into the Win32 APC calling convention.
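
For comparison, the two callback shapes look roughly like this (PAPCFUNC is the documented Win32 prototype used by QueueUserAPC; the native prototype corresponds to what the kernel actually delivers):

//
// Native NT APC routine: three pointer-sized arguments.
//
typedef VOID (NTAPI *PKNORMAL_ROUTINE)(PVOID NormalContext,
                                       PVOID SystemArgument1,
                                       PVOID SystemArgument2);

//
// Win32 APC routine (PAPCFUNC): a single ULONG_PTR argument.
// kernel32!BaseDispatchAPC bridges the two, wrapping the call in an SEH
// frame (and activation context) as described above.
//
typedef VOID (CALLBACK *PAPCFUNC)(ULONG_PTR dwData);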

For Win32 I/O completion routines, a similar wrapper routine (BasepIoCompletion) serves the purpose of converting from the standard NT APC calling convention to the Win32 I/O completion callback calling convention (which primarily includes unpackaging any I/O completion parameters from the IO_STATUS_BLOCK).

With Windows x64, the behavior of KiUserApcDispatcher changes slightly. Specifically, the APC routine invoked has four parameters instead of the standard set of three parameters for a PKNORMAL_ROUTINE. This is still compatible with standard NT APC routines due to a quirk of the x64 calling convention, whereby the first four arguments are passed by register. (This means that internally, any routines with zero through four arguments that fit within the confines of a standard pointer-sized argument slot are “compatible”, at least from a calling convention perspective.)

The fourth parameter added by KiUserApcDispatcher in the x64 case is a pointer to the context record that is to be resumed when the APC dispatching process is finished. This additional argument is used by Wow64.dll if the process is a Wow64 process, as Wow64.dll wrappers all 32-bit APC routines around a thunk routine (Wow64ApcRoutine). Wow64ApcRoutine internally uses this extra PCONTEXT argument to take control of resuming execution after the real APC routine is invoked. Thus in the 64-bit NTDLL Wow64 case, the NtContinue call following the call to the user APC routine never occurs.

Next time, we’ll take a look at the other kernel mode to user mode exception “callback”, KiRaiseUserExceptionDispatcher.

A catalog of NTDLL kernel mode to user mode callbacks, part 2: KiUserExceptionDispatcher

Friday, November 16th, 2007

Yesterday, I listed the set of kernel mode to user mode callback entrypoints (as of Windows Server 2008). Although some of the callbacks share certain similarities in their modes of operation, there remain significant differences between each of them, in terms of both calling convention and what functionality they perform.

KiUserExceptionDispatcher is the routine responsible for calling the user mode portion of the SEH dispatcher. When an exception occurs, and it is an exception that would generate an SEH event, the kernel checks to see whether the exception occurred while running user mode code. If so, then the kernel alters the trap frame on the stack, such that when the kernel returns from the interrupt or exception, execution resumes at KiUserExceptionDispatcher instead of the instruction that raised the fault. The kernel also arranges for several parameters (a PCONTEXT and a PEXCEPTION_RECORD) that describe the state of the machine when the exception occurred to be passed to KiUserExceptionDispatcher upon the return to user mode. (This model of changing the return address for a return from kernel mode to user mode is a common idiom in the Windows kernel for several user mode event notification mechanisms.)

Once the kernel mode stack unwinds and control is transferred to KiUserExceptionDispatcher in user mode, the exception is processed locally via a call to RtlDispatchException, which is the core of the user mode exception dispatcher logic. If the exception was successfully dispatched (that is, an exception handler handled it), the final user mode context is realized with a call to RtlRestoreContext, which simply loads the registers in the given context into the processor’s architectural execution state.

Otherwise, the exception is “re-thrown” to kernel mode for last chance processing via a call to NtRaiseException. This gives the user mode debugger (if any) a final shot at handling the exception, before the kernel terminates the process. (The kernel internally provides the user mode debugger and kernel debugger a first chance shot at such exceptions before arranging for KiUserExceptionDispatcher to be run.)

I have posted a pseudo-C representation of KiUserExceptionDispatcher for the curious. Note that it uses a specialized custom calling convention, and by virtue of this, is actually an assembler function. However, a C representation is often easier to understand.
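
In rough outline (a from-memory sketch rather than the posted representation; the exception record and context actually reside on the stack rather than being passed as ordinary arguments), the control flow resembles:

VOID KiUserExceptionDispatcher(PEXCEPTION_RECORD ExceptionRecord,
                               PCONTEXT Context)
{
    //
    // Run the user mode portion of SEH dispatching.
    //
    if (RtlDispatchException(ExceptionRecord, Context)) {
        //
        // A handler dealt with the exception; load the final (possibly
        // modified) context into the processor state.  Does not return.
        //
        RtlRestoreContext(Context, ExceptionRecord);
    }

    //
    // Otherwise, re-raise the exception to kernel mode for last chance
    // processing (FALSE indicates that this is no longer a first chance
    // exception).
    //
    NtRaiseException(ExceptionRecord, Context, FALSE);
}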

Not all user mode exceptions originate from kernel mode in this fashion; in many cases (such as with the RaiseException API), the exception dispatching process is originated entirely from user mode and KiUserExceptionDispatcher is not involved.

Incidentally, if one has been following along with some of the recent postings, the reason why the invalid parameter reporting mechanism of the Visual Studio 2005 CRT doesn’t properly break into an already attached debugger should start to become clear now, given some additional knowledge of how the exception dispatching process works.

Because the VS2005 CRT simulates an exception by building a context and exception record, and passing these to UnhandledExceptionFilter, the normal exception dispatcher logic is not involved. This means that nobody makes a call to the NtRaiseException system service (as would normally be the case if UnhandledExceptionFilter were called as a part of normal exception dispatching), and thus there is no notification sent to the user mode debugger asking it to pre-process the simulated STATUS_INVALID_PARAMETER exception.

Update: Posted representation of KiUserExceptionDispatcher.

Next up: Taking a look at the user mode APC dispatcher (KiUserApcDispatcher).

A catalog of NTDLL kernel mode to user mode callbacks, part 1: Overview

Thursday, November 15th, 2007

As I previously mentioned, NTDLL maintains a set of special entrypoints that are used by the kernel to invoke certain functionality on the behalf of user mode.

In general, the functionality offered by these entrypoints is fairly simple, although having an understanding of how each is used provides useful insight into how certain features (such as user mode APCs) really work “under the hood”.

For the purposes of this discussion, the following are the NTDLL exported entrypoints that the kernel uses to communicate to user mode:

  1. KiUserExceptionDispatcher
  2. KiUserApcDispatcher
  3. KiRaiseUserExceptionDispatcher
  4. KiUserCallbackDispatcher
  5. LdrInitializeThunk
  6. RtlUserThreadStart
  7. EtwpNotificationThread
  8. LdrHotPatchRoutine

(There are other NTDLL exports used by the kernel, but not for direct user mode communication.)

These routines are generally used to inform user mode of a particular event occurring, though the specifics of how each routine is called vary somewhat.

KiUserExceptionDispatcher, KiUserApcDispatcher, and KiRaiseUserExceptionDispatcher are exclusively used when user mode has entered kernel mode, either implicitly, due to a processor interrupt (say, a page fault that will eventually trigger an access violation), or explicitly, due to a system call (such as NtWaitForSingleObject). The mechanism that the kernel uses to invoke these entrypoints is to alter the context that will be realized upon return from kernel mode to user mode. The return to user mode context information (a KTRAP_FRAME) is modified such that when kernel mode returns, instead of returning to the point upon which user mode invoked a kernel mode transition, control is transferred to one of the three dispatcher routines. Additional arguments are supplied to these dispatcher routines as necessary.

KiUserCallbackDispatcher is used to explicitly call out to user mode from kernel mode. This inverted mode of operation is typically discouraged in favor of models such as the pending IRP “inverted call model”. For historical design reasons, however, the Win32 subsystem (win32k.sys) uses this for a number of tasks (such as calling a user mode window procedure in response to a kernel mode window message operation). The user mode callout mechanism is not extensible to support arbitrary user mode callback destinations.

LdrInitializeThunk is the first instruction that any user mode thread runs, before the “actual” thread entrypoint. As such, it is the address at which every user mode thread system-wide begins execution.

RtlUserThreadStart is used on Windows Vista and Windows Server 2008 (and later OS’s) to form the initial entrypoint context for a thread started with NtCreateThreadEx (this is a marked departure from the approach taken by NtCreateThread, wherein the user mode caller supplies the initial thread context).

EtwpNotificationThread and LdrHotPatchRoutine both correspond to a standard thread entrypoint routine. These entrypoints are referenced by user mode threads that are created in the context of a particular process to carry out certain tasks on behalf of the kernel. As the latter two routines are generally only rarely encountered, this series does not describe them in detail.

Despite (or perhaps in spite of) the more or less completely unrealized promise of fewer reboots for hotfixes with Windows Server 2003, I think I’ve seen a grand total of one or two hotfixes in the entire lifetime of the OS that supported hotpatching. Knowing the effort that must have gone into developing hotpatching support, it is depressing to see security bulletin after security bulletin state “No, this hotfix does not support hotpatching. A reboot will be required.” That is, however, a topic for another day.

Next up: Examining the operation of KiUserExceptionDispatcher in more detail.

Why do many elevation operations fail if a network share is involved?

Wednesday, November 14th, 2007

One fairly annoying aspect of elevation on Windows Vista is that one often cannot, in general, successfully elevate to perform many tasks when they involve a network share.

For example, if one tries to copy a file from a UNC share that one has provided explicit credentials to into a directory that requires administrative access (say, %systemroot%), the operation will often ultimately fail with an error message akin to the following:

[Folder Access Denied]

You don’t have permission to copy files to this location over the network. You can copy files to the Documents folder and then move them to this location.

[Cancel]

This message is somewhat confusing at first glance, considering that to get to this point, one had to elevate to an administrator in the first place!

What actually happens under the hood here is that the code that is elevated to perform the file copy loses access to the network share at the same time as it gains access to the desired destination directory on the local computer. The reason behind this is, however, slightly complicated, and to be perfectly fair, I’m not sure that I could really phrase a better (concise) error message in this case myself. Nevertheless, I shall attempt to explain the underlying cause behind the failure in this case.

At a high level, this problem occurs because explicit network share credentials are stored on a per logon session basis in Windows, and by virtue of switching to the elevation token (whether that be a different user entirely, if one is logged on as a “plain user”, or simply the elevated “shadow token” if one is logged on as a non-elevated administrator), the code performing the file copying runs under a different logon session than the one in which network share credentials were supplied.

A token is associated with a logon session by means of an associated attribute called the authentication ID, which represents the LUID of the LSA logon session to which the token is attached. An LSA logon session is, confusingly enough, completely different from a Terminal Server session (though a TS session will, typically, have one or more LSA logon sessions associated with it). LSA logon sessions are generally created when a token is generated by LSA in response to an authentication attempt, such as a LogonUser call. More than one token can share a particular logon session; in the common case, the shell and most other parts of the end user Windows desktop all run under the context of a single LSA logon session.
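
One can observe the authentication ID for a given token with GetTokenInformation; a minimal sketch (error handling abbreviated) follows:

#include <windows.h>
#include <stdio.h>

int wmain(void)
{
    HANDLE           Token;
    TOKEN_STATISTICS Stats;
    DWORD            Returned;

    if (!OpenProcessToken(GetCurrentProcess(), TOKEN_QUERY, &Token))
        return 1;

    if (GetTokenInformation(Token, TokenStatistics, &Stats, sizeof(Stats),
                            &Returned)) {
        //
        // Two processes whose tokens report different authentication IDs
        // belong to different LSA logon sessions, and thus do not share
        // network share credentials.
        //
        printf("Authentication ID: %08lx:%08lx\n",
               Stats.AuthenticationId.HighPart,
               Stats.AuthenticationId.LowPart);
    }

    CloseHandle(Token);
    return 0;
}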

Elevation changes this typical behavior of everything on the user’s desktop running under the same logon session, as code that has been elevated runs under a different security context. Conceptually, this process is in many ways very similar to what happens when one starts a program under alternate credentials using Run As (which, by the way, suffers a similar limitation with respect to network share access).

The result is that while the user may have provided credentials to the shell for a network share access, these credentials are attached to a specific logon session. Since the elevated code has its own logon session (with its own network share session credential “namespace”), the only credentials that will typically be available to it are the implicit NTLM or Kerberos credentials corresponding to the elevated user account. Given that most of the time, elevation is to a local computer account (an administrator), this account will typically not have any privileges on the network, unless there exists an account under the same name with the same password as the elevated account on the remote machine in question.

The error dialog already presents one workaround to this problem, which is to copy the data in question to a local drive and then elevate (with the understanding that a local computer administrator will likely have access to any location on the computer that one might choose to copy the file(s) in question to). Another workaround is to elevate a command prompt instance, use net use to provide explicit network credentials to the remote network share, and then perform the operation from the command prompt (or start a program from the command prompt to do the operation for you).
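
For example, from within the elevated command prompt (the share, account, and file names here being purely illustrative), one might do something along these lines before retrying the copy:

net use \\server\share /user:DOMAIN\username
copy \\server\share\file.txt %systemroot%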

Elevation of actual programs residing on a network share is handled in a slightly better fashion, in that the consent UI will prompt for network credentials after asking for elevation credentials or elevation consent. This support is, however, not enabled for elevated file copy, or many other shell operations that have automatic elevation.

Why are certain DLLs required to be at the same base address system-wide?

Tuesday, November 13th, 2007

There are several Windows DLLs that are, for various reasons, required to be at the same base address system-wide (though for several of these, a case could be made that alternate base addresses could be used in different Terminal Server sessions). Although not explicitly documented by Microsoft (as far as I know), a number of programs rely on these “fixed base” DLLs.

That is not to say that the base addresses of these DLLs cannot change, but that while the system is running, all processes will have these DLLs mapped at the same base address (if they are indeed mapped at all).

The current set of DLLs that require the same base address system-wide includes NTDLL, kernel32, and user32, though the reasons for this requirement vary a bit between each DLL.

NTDLL must be at the same address system wide because there are a number of routines it exports that are used by the kernel to arrange for (indirect) calls to user mode. For example, ntdll!LdrInitializeThunk is the true start address of every user mode thread, and ntdll!KiUserApcDispatcher is used to invoke a user mode APC if one is ready to be processed while a thread is in a user mode wait. The kernel resolves the address of these (and other) “special” exports at system initialization time, and then uses these addresses when it needs to arrange for user mode code execution. Because the kernel caches these resolved function pointers, NTDLL cannot typically be based at a different address from process to process, as the kernel will always reference the same address for these special exports across all processes.

Additionally, some special NTDLL exports are used by user mode code in such a way that they are assumed to be at the same base address cross-process. For example, ntdll!DbgUiRemoteBreakIn is used by the debugger to break in to a process, and the debugger assumes that the local address of DbgUiRemoteBreakIn matches the remote address of DbgUiRemoteBreakIn in the target process.
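
The same-base-address assumption can be illustrated with a sketch of how a debugger-style break-in thread might be injected (a simplified approximation; the helper name is illustrative and error handling is omitted):

#include <windows.h>

BOOL BreakInToProcess(HANDLE Process)
{
    //
    // Resolve DbgUiRemoteBreakin in the *local* NTDLL and use that address
    // as the thread start routine in the *target* process.  This only works
    // because NTDLL is mapped at the same base address in every process.
    //
    PVOID  BreakIn = (PVOID)GetProcAddress(GetModuleHandleW(L"ntdll.dll"),
                                           "DbgUiRemoteBreakin");
    HANDLE Thread;

    if (!BreakIn)
        return FALSE;

    Thread = CreateRemoteThread(Process, NULL, 0,
                                (LPTHREAD_START_ROUTINE)BreakIn, NULL, 0,
                                NULL);

    if (!Thread)
        return FALSE;

    CloseHandle(Thread);
    return TRUE;
}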

Kernel32 is required to be at the same base address because there are a number of internal kernel32 routines that, similar to ntdll!DbgUiRemoteBreakIn, are used in cross-process thread injection. One example of this used to be the console control event handler. In the case of console events, during kernel32.dll initialization, the address of the Ctrl-C event dispatcher is passed to WinSrv.dll (in CSRSS space).

Originally, WinSrv simply cached the dispatcher pointer after the first process was created (thus requiring kernel32 to be at the same base address across all processes in the session). On modern systems, however, WinSrv now tracks the client’s kernel32 dispatcher pointer on a per-process basis to account for the fact that the dispatcher is at a different address in the 32-bit kernel32 (versus the 64-bit kernel32). Ironically, the developer who made this change to WinSrv actually forgot to add support for using the current process’s dispatcher pointer in several corner cases (such as kernel32!SetLastConsoleActiveEvent and the corresponding winsrv!SrvConsoleNotifyLastClose and winsrv!RemoveConsole CSRSS-side routines). In the cases where WinSrv still incorrectly passes the cached (64-bit) Ctrl-C dispatcher value to CreateRemoteThread, wow64.dll has a special hack (wow64!MapContextAddress64TO32) that cleans up after WinSrv and fixes the thread start address to refer to the 32-bit kernel32.dll.

By the time this change to WinSrv and Ctrl-C processing was made, though, the application compatibility impact of removing the requirement that the kernel32 base address be the same system-wide would have been too severe (virtually all third party code injection code now relies heavily on this assumption). Thus, for this (and other) reasons, kernel32 still retains the restriction that it may not be relocated to a different base address cross-process.

User32 is required to be at the same address cross-process because there is an array of user32 function addresses provided to win32k.sys for the built-in window class window procedures (among other things). This function pointer array is captured via a call to NtUserInitializeClientPfnArrays at session start up time, when WinSrv is initializing win32k during CSRSS initialization. Wow64win.dll, the NtUser/win32k Wow64 support library, provides support for mapping 32-bit to 64-bit (and vice versa) for these function addresses, as necessary for the support of 32-bit processes on 64-bit platforms.

The user32 and kernel32 requirements could arguably be relaxed to only apply within a Terminal Server session, although the Windows XP (and later) cross-session debugging support muddies the waters with respect to kernel32 due to a necessity to support debugger break-in with debuggers that utilize DebugBreak for their break-in threads. (The Wow64 layer provides translation assistance for mapping DebugBreak to a 32-bit address if a 64-bit thread is created at the address of the 64-bit kernel32 DebugBreak export.)

Note that the ASLR support in Windows Vista does not run afoul of these restrictions, as Vista’s ASLR always picks the same base address for a given DLL per operating system boot. Thus, even though, say, NTDLL can have its base address randomized at boot time under Windows Vista, the particular randomized base address that was chosen is still used by all processes until the operating system is restarted.

Update: skape points out that I (somehow) neglected to mention the most important restriction on kernel32 base addressing: on Windows Server 2003 and earlier operating systems, internal kernel32 routines are used as the start address of new threads created by CreateRemoteThread and CreateProcess.

Most Win32 applications are really multithreaded, even if they don’t know it

Monday, November 12th, 2007

Most Win32 programs out there operate with more than one thread in the process at least some of the time, even if the program code doesn’t explicitly create a thread. The reason for this is that a number of system APIs internally create threads for their purposes.

For example, console Ctrl-C/Ctrl-Break events are handled in a new thread by default. (Internally, CSRSS creates a new thread in the address space of the recipient process, which then calls a kernel32 function to invoke the active console handler list. This is one of the reasons why kernel32 is guaranteed to be at the same base address system wide, incidentally, as CSRSS caches the address of the Ctrl-C dispatcher in kernel32 and assumes that it will be valid across all client processes[1].) This can introduce potentially unexpected threading concerns depending on what a console control handler does. For example, if a console control handler writes to the console, this access will not necessarily be synchronized with any I/O the initial thread is doing with the console, unless some form of explicit synchronization is used.
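
This is easy to observe directly: a handler registered with SetConsoleCtrlHandler reports a different thread ID than the initial thread, as in this small sketch:

#include <windows.h>
#include <stdio.h>

static BOOL WINAPI CtrlHandler(DWORD CtrlType)
{
    //
    // This runs on a thread injected into the process by CSRSS, not on the
    // initial thread; any console output here races with the initial
    // thread's I/O unless explicitly synchronized.
    //
    printf("Ctrl event %lu delivered on thread %lu\n",
           CtrlType, GetCurrentThreadId());
    return TRUE;
}

int wmain(void)
{
    printf("Initial thread: %lu\n", GetCurrentThreadId());
    SetConsoleCtrlHandler(CtrlHandler, TRUE);

    //
    // Press Ctrl-C within 30 seconds to see the handler run on a different
    // thread.
    //
    Sleep(30000);
    return 0;
}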

This is one of the reasons why Visual Studio 2005 eschews the single threaded CRT entirely, instead favoring the multithreaded variants. At this point in the game, there is comparatively little benefit to providing a single threaded CRT that will break in strange ways (such as if a console control handler is registered), just to save a minuscule amount of locking overhead. In a program that is primarily single threaded, the critical section package is fast enough that any overhead incurred by the CRT acquiring locks is going to be minuscule compared to the actual processing overhead of whatever operation is being performed.

An even greater number of Win32 APIs use worker threads internally, though these typically have limited visibility to the main process image. For example, WSAAsyncGetHostByName uses a worker thread to synchronously call gethostbyname and then post the results to the requester via a window message. Even debugging an application typically results in threads being created in the target process for the purpose of debugger break-in processing.

Due to the general trend that most Win32 programs will encounter multiple threads in one fashion or another, other parts of the Win32 programming environment besides the CRT are gradually dropping the last vestiges of their support for purely single threaded operation as well. For instance, the low fragmentation heap (LFH) does not support HEAP_NO_SERIALIZE, which disables the usage of the heap’s built-in critical section. In this case, as with the CRT, the cases where it is always provably safe to go without locks entirely are so limited and provide such a small benefit that it’s just not worth the trouble to take out the call to acquire the internal critical section. (Entering an unowned critical section does not involve a kernel mode transition and is quite cheap. In a primarily single threaded application, any critical sections will by definition almost always be either unowned or owned by the current thread in a recursive call.)

[1] This is not strictly true in some cases with the advent of Wow64, but that is a topic for another day.

Viridian guest hypercall interface published

Friday, November 9th, 2007

Recently, Microsoft made a rather uncharacteristic move and (mostly) freely published the specifications for the Viridian hypercall interface (otherwise known as “Windows Server virtualization”). Publishing this documentation is, to be clear, a great thing for Microsoft to have done (in my mind, anyway).

The hypercall interface is in some respects analogous to the “native API” of the Windows kernel. Essentially, the hypercall interface is the mechanism by which a privileged, virtualization-aware component running in a hypervisor partition can request assistance from the hypervisor for a particular task. In that respect, a hypercall is to a hypervisor as a system call is to an operating system kernel.

It’s important to note that the documentation attempts to outline the hypercall interface from the perspective of documenting what one would need to implement a compatible hypervisor, and not from the perspective of how the Microsoft hypervisor implements said hypercall interface. However, it’s still worth a read and provides valuable insight into many aspects of how Viridian is architected at a high level.

I’m still working on digesting the whole specification (as the document is 241 pages long), but one thing that caught my eye was that there is special support in the hypervisor for debugging (in other words, kernel debugging). This support is implemented in the form of the HvPostDebugData, HvRetrieveDebugData, and HvResetDebugSession hypercalls (documented in the hypercall specification).

While I’m certainly happy to see that Microsoft is considering kernel debugging when it comes to the Viridian hypervisor, some aspects of how the Viridian hypercall interface works seem rather odd to me. After (re)reading the documentation for the debugging hypercalls a couple of times, I arrived at the conclusion that the Viridian debugging support is more oriented towards simply virtualizing and multiplexing an unreliable physical debugger link. The goal of this approach would seem to me to be that multiple partitions (operating system instances running under the hypervisor) would share the same physical connection between the physical machine hosting the hypervisor and the kernel debugger machine. Additionally, the individual partitions would be insulated from what actual physical medium the kernel debugger connection operates over (for example, 1394 or serial cable), such that only one kernel debugger transport module is needed per partition, regardless of what physical connection is used to connect the partition to the kernel debugger.

While this is a huge step forward from where shipping virtualization products are today with respect to kernel debugging (serial port debugging only), I think that this approach still falls short of completely ideal. There are a number of aspects of the debugging hypercalls that still carry much of the baggage of a physical, machine-to-machine kernel debugging interface, baggage that is arguably unnecessary and undesirable from a virtualization perspective. Besides the possibility of further improving the performance of the virtualized kernel debugger connection, it is possible to support partition-to-partition kernel debugging in a more convenient fashion than Viridian presently supports.

The debugging hypercalls, as currently defined, are in fact very much reminiscent of how I originally implemented VMKD. The hypercalls define an interface for a partition to send large chunks of kernel debugger data over as discrete units, without any guarantee of reception. Additionally, they provide a mechanism to notify the hypervisor that the partition is polling for kernel debugger data, so that the hypervisor can take action to reduce the resource consumption of the partition while it is awaiting new data (thus alleviating the CPU spin issue that one often runs into while broken into the kernel debugger with existing virtualization solutions, VMKD notwithstanding of course).

The original approach that I took to VMKD is fairly similar to this. I essentially replaced the serial port I/O instructions in kdcom.dll with a mechanism that buffered data up until a certain point, and then transmitted (or received) data to the VMM in a discrete unit. Like the Viridian approach to debugging, this greatly reduces the number of VM exits (as compared to a traditional virtual serial port) and provides the VMM with an opportunity to reduce the CPU usage of the guest while it is awaiting new kernel debugger data.

However, I believe that it’s possible to improve upon the Viridian debugging hypercalls in much the same way as I improved upon VMKD before I arrived at the current release version. For instance, by dispensing with the provision that data posted to the debugging interface will not be reliably delivered, and enforcing several additional requirements on the debugger protocol, it is possible to further improve performance of partition kernel debugging. The suggested additional debugger protocol requirements include stipulating that data transmitted or received by the debugging hypercalls consists of discrete protocol data units, and that both ends of the kernel debugger connection be able to recover if an unexpected discrete PDU is received after a guest (or kernel debugger) reset.

These restrictions would further reduce VM exits by moving any data retransmit and recovery procedures outside of the partition being debugged. Furthermore, with the ability to reliably and transactionally transmit and receive (or fail in a transacted fashion) as a function of the debugging hypercall itself, there is no longer a necessity for the hypervisor to ever schedule a partition that is frozen waiting for kernel debugger data until new data is available (or a transactional failure, such as a partition-defined timeout, occurs). (This is, essentially, the approach that VMKD currently takes.)

In actuality, I believe that it should be possible to implement all of the above improvements by moving the Viridian debugging support out of the hypervisor and into the parent partition for a debuggee partition, with the parent partition being responsible for making hypercalls to set up a shared memory mapping for data transfer (HvMapGpaPages) and allow for event-driven communication with the debuggee partition (HvCreatePort and related APIs) that could be used to request that debugger command data in the shared memory region be processed. Above and beyond performance implications, this approach has the added advantage of more easily supporting partition-to-partition debugging (unless I’ve missed something in the documentation, the Viridian debugging hypercalls do not provide any mechanism to pass debugging data from one partition to another for processing).

Additionally, this approach would also completely eliminate the need to provide any specialized kernel debugging support at all in the hypervisor microkernel, instead moving this support into a parent (or the root) partition, leaving it to that partition to deal with the particulars of data transfer. If that partition, or another partition on the same physical computer as the debuggee partition is acting as the debugger, then data can be “transferred” using the shared memory mapping. Otherwise, the parent (or root) partition can implement whatever reliable transport mechanism it desires for the kernel debugger data (say, a TCP connection to a remote kernel debugger over an IP-based network). Thus, this proposed approach could potentially not only open up additional remote kernel debugger transport options, but also reduce code complexity of the hypervisor itself (which I would like to think is almost always a desirable thing, as non-existent code doesn’t have security holes, and the hypervisor is the absolute most trusted (software) component of the system when it is used).

Given that Viridian has some time yet before RTM, perhaps if we keep our fingers crossed, we’ll yet see some further improvements to the Viridian kernel debugging scene.