Archive for the ‘Windows’ Category

A catalog of NTDLL kernel mode to user mode callbacks, part 4: KiRaiseUserExceptionDispatcher

Tuesday, November 20th, 2007

The previous post in this series outlined how KiUserApcDispatcher operates for the purposes of enabling user mode APCs. Unlike KiUserExceptionDispatcher, which is expected to modify the return information from the context of an interrupt (or exception) in kernel mode, KiUserApcDispatcher is intended to operate on the return context of an active system call.

In this vein, the third kernel mode to user mode NTDLL “callback” that we shall be investigating, KiRaiseUserExceptionDispatcher, is fairly similar. Although in some respects akin to KiUserExceptionDispatcher, at least in that it raises an exception in user mode on behalf of a kernel mode caller, the KiRaiseUserExceptionDispatcher “callback” really has more in common with KiUserApcDispatcher from a usage and implementation standpoint. It also happens that the implementation of this callback routine, which is quite simple, is completely representable in C, as a standard calling convention is used.

KiRaiseUserExceptionDispatcher is used when a system call wishes to raise an exception in user mode instead of simply returning an NTSTATUS, as is the standard convention. It constructs a standard exception record using a supplied status code (which must be written to a well-known location in the current thread’s TEB beforehand), and passes the exception to RtlRaiseException (the same routine that is used internally by the Win32 RaiseException API).
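Given the standard calling convention, the routine can be sketched in C along the following lines (an illustrative reconstruction based on the description above, not the verbatim NTDLL source; ExceptionCode is an internal TEB field that does not appear in the public SDK headers):

NTSTATUS NTAPI KiRaiseUserExceptionDispatcher(VOID)
{
    EXCEPTION_RECORD ExceptionRecord;

    // Build a standard exception record around the status code that the
    // kernel stored in the current thread's TEB prior to returning.
    ExceptionRecord.ExceptionCode    = NtCurrentTeb()->ExceptionCode;
    ExceptionRecord.ExceptionFlags   = 0;
    ExceptionRecord.ExceptionRecord  = NULL;
    ExceptionRecord.ExceptionAddress = (PVOID)KiRaiseUserExceptionDispatcher;
    ExceptionRecord.NumberParameters = 0;

    // Hand the exception off to the regular user mode exception dispatching
    // logic (the same path used by the Win32 RaiseException API).
    RtlRaiseException(&ExceptionRecord);

    return ExceptionRecord.ExceptionCode;
}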

This is a fairly atypical scenario, as most errors (including bad pointer parameters) will simply result in an appropriate status code being returned by the system call, such as STATUS_ACCESS_VIOLATION.

The one place that does currently use the services of KiRaiseUserExceptionDispatcher is NtClose, the system service responsible for closing handles (which implements the Win32 CloseHandle API). When a debugger is attached to a process and a protected handle (as set by a call to SetHandleInformation with HANDLE_FLAG_PROTECT_FROM_CLOSE) is passed to NtClose, a STATUS_HANDLE_NOT_CLOSABLE exception is raised to user mode via KiRaiseUserExceptionDispatcher. For example, one might see the following when trying to close a protected handle with a debugger attached:

(2600.1994): Unknown exception - code c0000235 (first chance)
(2600.1994): Unknown exception - code c0000235 (second chance)
ntdll!KiRaiseUserExceptionDispatcher+0x3a:
00000000`776920e7 8b8424c0000000  mov eax,dword ptr [rsp+0C0h]
0:000> !error c0000235
Error code: (NTSTATUS) 0xc0000235 (3221226037) - NtClose was
called on a handle that was protected from close via
NtSetInformationObject.
0:000> k
RetAddr           Call Site
00000000`7746dadd ntdll!KiRaiseUserExceptionDispatcher+0x3a
00000000`01001955 kernel32!CloseHandle+0x29
00000000`01001e60 TestApp!wmain+0x35
00000000`7746cdcd TestApp!__tmainCRTStartup+0x120
00000000`7768c6e1 kernel32!BaseThreadInitThunk+0xd
00000000`00000000 ntdll!RtlUserThreadStart+0x1d

The exception can be manually continued in a debugger to allow the program to operate after this occurs, though such a bad handle reference is typically indicative of a serious bug.
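To observe the behavior firsthand, it suffices to protect a handle and then close it with a debugger attached; a minimal sketch:

#include <windows.h>

int wmain(void)
{
    HANDLE Event = CreateEventW(NULL, FALSE, FALSE, NULL);

    // Mark the handle as protected from close.
    SetHandleInformation(Event, HANDLE_FLAG_PROTECT_FROM_CLOSE,
                         HANDLE_FLAG_PROTECT_FROM_CLOSE);

    // With a debugger attached, this raises STATUS_HANDLE_NOT_CLOSABLE
    // (0xC0000235) via KiRaiseUserExceptionDispatcher rather than simply
    // failing with an error return.
    CloseHandle(Event);

    return 0;
}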

A similar behavior is activated if a program is being debugged under Application Verifier and an invalid handle is closed. This behavior of raising an exception on a bad handle closure attempt is intended as a debugging aid, since most programs do not bother to check the return value of CloseHandle.

Incidentally, bad handle closure references like this in kernel mode will result in a bugcheck instead of simply raising an exception that can be caught. Drivers do not have the “luxury” of continuing when they touch a bad handle, except when probing a user handle, of course. (Old versions of Process Explorer used to bring down the box with a bugcheck due to this, for instance, if you tried to close a protected handle. This bug has fortunately since been fixed.)

Next time, we’ll take a look at KiUserCallbackDispatcher, which is used to a great degree by win32k.sys.

A catalog of NTDLL kernel mode to user mode callbacks, part 3: KiUserApcDispatcher

Monday, November 19th, 2007

I previously described the behavior of the kernel mode to user mode exception dispatcher (KiUserExceptionDispatcher). While exceptions are arguably the most commonly seen of the kernel mode to user mode “callbacks” in NTDLL, they are not the only such event.

Much like exceptions that traverse into user mode, user mode APCs are all (with the exception of initial thread startup) funneled through a single dispatcher routine inside NTDLL, KiUserApcDispatcher. KiUserApcDispatcher is responsible for invoking the actual APC routine, re-testing for APC delivery (to allow for the entire APC queue to be drained at once), and returning to the return point of the alertable system call that was interrupted to deliver the user mode APC. From the perspective of the user mode APC dispatcher, the stack is logically arranged as if the APC dispatcher would “return” to the instruction immediately following the syscall instruction in the system call that began the alertable wait.

For example, if we breakpoint on KiUserApcDispatcher and examine the stack (in this case, after an alertable WaitForSingleObjectEx call that has been interrupted to deliver a user mode APC), we might see the following:

Breakpoint 0 hit
ntdll!KiUserApcDispatcher:
00000000`77691f40 488b0c24  mov     rcx,qword ptr [rsp]
0:000> k
RetAddr           Call Site
00000000`776902ba ntdll!KiUserApcDispatcher
00000000`7746d820 ntdll!NtWaitForSingleObject+0xa
00000000`01001994 kernel32!WaitForSingleObjectEx+0x9c
00000000`01001eb0 TestApp!wmain+0x64
00000000`7746cdcd TestApp!__tmainCRTStartup+0x120
00000000`7768c6e1 kernel32!BaseThreadInitThunk+0xd
00000000`00000000 ntdll!RtlUserThreadStart+0x1d

From an implementation standpoint, KiUserApcDispatcher is conceptually fairly simple. I’ve posted a translation of the assembler for those interested. Keep in mind, however, that as with the other kernel mode to user mode callbacks, the routine is actually written in assembler and utilizes constructs not expressible through C, such as a custom calling convention.

Note that the APC routine invoked by KiUserApcDispatcher corresponds roughly to a standard native APC routine, specifically a PKNORMAL_ROUTINE (except on x64, but more on that later). This is not compatible with the Win32 view of an APC routine, which takes a single parameter, as opposed to the three that a KNORMAL_ROUTINE takes. As a result, there is a kernel32 function, BaseDispatchAPC, that wraps all Win32 APCs, providing an SEH frame around the call (and activating the appropriate activation context, if necessary). BaseDispatchAPC also converts from the native APC calling convention into the Win32 APC calling convention.
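For reference, the two views of an APC routine look roughly like this (PAPCFUNC is the public Win32 definition used with QueueUserAPC; PKNORMAL_ROUTINE is as it appears in the DDK headers):

// Native NT APC routine: three pointer-sized context arguments.
typedef VOID (NTAPI *PKNORMAL_ROUTINE)(
    PVOID NormalContext,
    PVOID SystemArgument1,
    PVOID SystemArgument2
    );

// Win32 APC routine: a single caller-supplied parameter.
typedef VOID (CALLBACK *PAPCFUNC)(
    ULONG_PTR dwParam
    );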

For Win32 I/O completion routines, a similar wrapper routine (BasepIoCompletion) serves the purpose of converting from the standard NT APC calling convention to the Win32 I/O completion callback calling convention (which primarily involves unpacking any I/O completion parameters from the IO_STATUS_BLOCK).

With Windows x64, the behavior of KiUserApcDispatcher changes slightly. Specifically, the APC routine invoked has four parameters instead of the standard set of three parameters for a PKNORMAL_ROUTINE. This is still compatible with standard NT APC routines due to a quirk of the x64 calling convention, whereby the first four arguments are passed by register. (This means that internally, any routines with zero through four arguments that fit within the confines of a standard pointer-sized argument slot are “compatible”, at least from a calling convention perspective.)

The fourth parameter added by KiUserApcDispatcher in the x64 case is a pointer to the context record that is to be resumed when the APC dispatching process is finished. This additional argument is used by Wow64.dll if the process is a Wow64 process, as Wow64.dll wraps all 32-bit APC routines in a thunk routine (Wow64ApcRoutine). Wow64ApcRoutine internally uses this extra PCONTEXT argument to take control of resuming execution after the real APC routine is invoked. Thus, in the 64-bit NTDLL Wow64 case, the NtContinue call following the call to the user APC routine never occurs.
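In effect, then, the routine invoked by the x64 dispatcher has a shape along the following lines (the type name here is invented purely for illustration; NTDLL declares no such type):

// Hypothetical prototype for illustration: a PKNORMAL_ROUTINE plus the
// extra context argument supplied by the x64 KiUserApcDispatcher.
typedef VOID (NTAPI *PKNORMAL_ROUTINE_X64)(
    PVOID NormalContext,
    PVOID SystemArgument1,
    PVOID SystemArgument2,
    PCONTEXT Context    // context to resume (via NtContinue) after the APC
    );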

Next time, we’ll take a look at the other kernel mode to user mode exception “callback”, KiRaiseUserExceptionDispatcher.

A catalog of NTDLL kernel mode to user mode callbacks, part 2: KiUserExceptionDispatcher

Friday, November 16th, 2007

Yesterday, I listed the set of kernel mode to user mode callback entrypoints (as of Windows Server 2008). Although some of the callbacks share certain similarities in their modes of operation, there remain significant differences between each of them, in terms of both calling convention and what functionality they perform.

KiUserExceptionDispatcher is the routine responsible for calling the user mode portion of the SEH dispatcher. When an exception occurs, and it is an exception that would generate an SEH event, the kernel checks to see whether the exception occurred while running user mode code. If so, then the kernel alters the trap frame on the stack, such that when the kernel returns from the interrupt or exception, execution resumes at KiUserExceptionDispatcher instead of the instruction that raised the fault. The kernel also arranges for several parameters (a PCONTEXT and a PEXCEPTION_RECORD) that describe the state of the machine when the exception occurred to be passed to KiUserExceptionDispatcher upon the return to user mode. (This model of changing the return address for a return from kernel mode to user mode is a common idiom in the Windows kernel for several user mode event notification mechanisms.)

Once the kernel mode stack unwinds and control is transferred to KiUserExceptionDispatcher in user mode, the exception is processed locally via a call to RtlDispatchException, which is the core of the user mode exception dispatcher logic. If the exception was successfully dispatched (that is, an exception handler handled it), the final user mode context is realized with a call to RtlRestoreContext, which simply loads the registers in the given context into the processor’s architectural execution state.

Otherwise, the exception is “re-thrown” to kernel mode for last chance processing via a call to NtRaiseException. This gives the user mode debugger (if any) a final shot at handling the exception, before the kernel terminates the process. (The kernel internally provides the user mode debugger and kernel debugger a first chance shot at such exceptions before arranging for KiUserExceptionDispatcher to be run.)

I have posted a pseudo-C representation of KiUserExceptionDispatcher for the curious. Note that it uses a specialized custom calling convention, and by virtue of this, is actually an assembler function. However, a C representation is often easier to understand.
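In outline, the posted representation boils down to something like the following (a simplified sketch; details vary across OS versions and architectures):

// Simplified sketch of KiUserExceptionDispatcher; the real routine is
// assembler with a custom calling convention.
VOID KiUserExceptionDispatcher(
    PEXCEPTION_RECORD ExceptionRecord,
    PCONTEXT Context)
{
    if (RtlDispatchException(ExceptionRecord, Context)) {
        // An exception handler handled the exception; load the final
        // context into the processor's architectural execution state.
        RtlRestoreContext(Context, NULL);
    } else {
        // Unhandled: re-throw to kernel mode for last chance processing
        // (debugger notification, then process termination).
        NtRaiseException(ExceptionRecord, Context, FALSE);
    }

    // Neither path above should ever return here.
}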

Not all user mode exceptions originate from kernel mode in this fashion; in many cases (such as with the RaiseException API), the exception dispatching process is originated entirely from user mode and KiUserExceptionDispatcher is not involved.

Incidentally, if one has been following along with some of the recent postings, the reason why the invalid parameter reporting mechanism of the Visual Studio 2005 CRT doesn’t properly break into an already attached debugger should start to become clear now, given some additional knowledge of how the exception dispatching process works.

Because the VS2005 CRT simulates an exception by building a context and exception record, and passing these to UnhandledExceptionFilter, the normal exception dispatcher logic is not involved. This means that nobody makes a call to the NtRaiseException system service (as would normally be the case if UnhandledExceptionFilter were called as a part of normal exception dispatching), and thus there is no notification sent to the user mode debugger asking it to pre-process the simulated STATUS_INVALID_PARAMETER exception.

Update: Posted representation of KiUserExceptionDispatcher.

Next up: Taking a look at the user mode APC dispatcher (KiUserApcDispatcher).

A catalog of NTDLL kernel mode to user mode callbacks, part 1: Overview

Thursday, November 15th, 2007

As I previously mentioned, NTDLL maintains a set of special entrypoints that are used by the kernel to invoke certain functionality on behalf of user mode.

In general, the functionality offered by these entrypoints is fairly simple, although having an understanding of how each is used provides useful insight into how certain features (such as user mode APCs) really work “under the hood”.

For the purposes of this discussion, the following are the NTDLL exported entrypoints that the kernel uses to communicate to user mode:

  1. KiUserExceptionDispatcher
  2. KiUserApcDispatcher
  3. KiRaiseUserExceptionDispatcher
  4. KiUserCallbackDispatcher
  5. LdrInitializeThunk
  6. RtlUserThreadStart
  7. EtwpNotificationThread
  8. LdrHotPatchRoutine

(There are other NTDLL exports used by the kernel, but not for direct user mode communication.)

These routines are generally used to inform user mode of a particular event occurring, though the specifics of how each routine is called vary somewhat.

KiUserExceptionDispatcher, KiUserApcDispatcher, and KiRaiseUserExceptionDispatcher are exclusively used when user mode has entered kernel mode, either implicitly, due to a processor interrupt (say, a page fault that will eventually trigger an access violation), or explicitly, due to a system call (such as NtWaitForSingleObject). The mechanism that the kernel uses to invoke these entrypoints is to alter the context that will be realized upon return from kernel mode to user mode. The return to user mode context information (a KTRAP_FRAME) is modified such that when kernel mode returns, instead of returning to the point upon which user mode invoked a kernel mode transition, control is transferred to one of the three dispatcher routines. Additional arguments are supplied to these dispatcher routines as necessary.

KiUserCallbackDispatcher is used to explicitly call out to user mode from kernel mode. This inverted mode of operation is typically discouraged in favor of models such as the pending IRP “inverted call model”. For historical design reasons, however, the Win32 subsystem (win32k.sys) uses this for a number of tasks (such as calling a user mode window procedure in response to a kernel mode window message operation). The user mode callout mechanism is not extensible to support arbitrary user mode callback destinations.

LdrInitializeThunk is the first routine that any user mode thread runs, before the “actual” thread entrypoint. As such, it is the address at which every user mode thread system-wide begins execution.

RtlUserThreadStart is used on Windows Vista and Windows Server 2008 (and later OS’s) to form the initial entrypoint context for a thread started with NtCreateThreadEx (this is a marked departure from the approach taken by NtCreateThread, wherein the user mode caller supplies the initial thread context).

EtwpNotificationThread and LdrHotPatchRoutine both correspond to a standard thread entrypoint routine. These entrypoints are referenced by user mode threads that are created in the context of a particular process to carry out certain tasks on behalf of the kernel. As the latter two routines are only rarely encountered, this series does not describe them in detail.

Despite the more or less completely unrealized promise of fewer reboots for hotfixes with Windows Server 2003, I think I’ve seen a grand total of one or two hotfixes in the entire lifetime of the OS that supported hotpatching. Knowing the effort that must have gone into developing hotpatching support, it is depressing to see security bulletin after security bulletin state “No, this hotfix does not support hotpatching. A reboot will be required.” That is, however, a topic for another day.

Next up: Examining the operation of KiUserExceptionDispatcher in more detail.

Why do many elevation operations fail if a network share is involved?

Wednesday, November 14th, 2007

One fairly annoying aspect of elevation on Windows Vista is that one often cannot successfully elevate to perform tasks that involve a network share.

For example, if one tries to copy a file from a UNC share that one has provided explicit credentials to into a directory that requires administrative access (say, %systemroot%), the operation will often ultimately fail with an error message akin to the following:

[Folder Access Denied]

You don’t have permission to copy files to this location over the network. You can copy files to the Documents folder and then move them to this location.

[Cancel]

This message is somewhat confusing at first glance, considering that to get to this point, one had to elevate to an administrator in the first place!

What actually happens under the hood here is that the code that is elevated to perform the file copy loses access to the network share at the same time as it gains access to the desired destination directory on the local computer. The reason behind this is, however, slightly complicated, and to be perfectly fair, I’m not sure that I could really phrase a better (concise) error message in this case myself. Nevertheless, I shall attempt to explain the underlying cause behind the failure in this case.

At a high level, this problem occurs because explicit network share credentials are stored on a per logon session basis in Windows, and by virtue of switching to the elevation token (whether that be a different user entirely, if one is logged on as a “plain user”, or simply the elevated “shadow token” if one is logged on as a non-elevated administrator), the code performing the file copying runs under a different logon session than the one in which network share credentials were supplied.

A token is associated with a logon session by means of an associated attribute called the authentication ID, which represents the LUID of the LSA logon session to which the token is attached. An LSA logon session is, confusingly enough, completely different from a Terminal Server session (though a TS session will, typically, have one or more LSA logon sessions associated with it). LSA logon sessions are generally created when a token is generated by LSA in response to an authentication attempt, such as a LogonUser call. More than one token can share a particular logon session; in the common case, the shell and most other parts of the end user Windows desktop all run under the context of a single LSA logon session.
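This attribute is easy to observe directly; the following sketch prints the authentication ID (the LSA logon session LUID) of the current process token, and running it both elevated and non-elevated from the same desktop shows two different logon sessions:

#include <windows.h>
#include <stdio.h>

int wmain(void)
{
    HANDLE Token;
    TOKEN_STATISTICS Stats;
    DWORD ReturnedLength;

    if (!OpenProcessToken(GetCurrentProcess(), TOKEN_QUERY, &Token))
        return 1;

    // TokenStatistics includes the authentication ID (logon session LUID)
    // that ties this token to its LSA logon session.
    if (GetTokenInformation(Token, TokenStatistics, &Stats, sizeof(Stats),
                            &ReturnedLength)) {
        printf("Authentication ID: %08lx-%08lx\n",
               (unsigned long)Stats.AuthenticationId.HighPart,
               (unsigned long)Stats.AuthenticationId.LowPart);
    }

    CloseHandle(Token);
    return 0;
}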

Elevation changes this typical behavior of everything on the user’s desktop running under the same logon session, as code that has been elevated runs under a different security context. Conceptually, this process is in many ways very similar to what happens when one starts a program under alternate credentials using Run As (which, by the way, suffers a similar limitation with respect to network share access).

The result is that while the user may have provided credentials to the shell for a network share access, these credentials are attached to a specific logon session. Since the elevated code has its own logon session (with its own network share session credential “namespace”), the only credentials that will typically be available to it are the implicit NTLM or Kerberos credentials corresponding to the elevated user account. Given that most of the time, elevation is to a local computer account (an administrator), this account will typically not have any privileges on the network, unless there exists an account under the same name with the same password as the elevated account on the remote machine in question.

The error dialog already presents one workaround to this problem, which is to copy the data in question to a local drive and then elevate (with the understanding that a local computer administrator will likely have access to any location on the computer that one might choose to copy the file(s) in question to). Another workaround is to elevate a command prompt instance, use net use to provide explicit network credentials to the remote network share, and then perform the operation from the command prompt (or start a program from the command prompt to do the operation for you).
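For example, from an elevated command prompt (the server, share, and account names below are placeholders):

net use \\server\share /user:DOMAIN\user
copy \\server\share\file.txt %systemroot%\

Because the net use call is made from within the elevated logon session, the resulting share credentials are visible to the subsequent elevated copy operation.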

Elevation of actual programs residing on a network share is handled in a slightly better fashion, in that the consent UI will prompt for network credentials after asking for elevation credentials or elevation consent. This support is, however, not enabled for elevated file copy, or many other shell operations that have automatic elevation.

Why are certain DLLs required to be at the same base address system-wide?

Tuesday, November 13th, 2007

There are several Windows DLLs that are, for various reasons, required to be at the same base address system-wide (though for several of these, a case could be made that alternate base addresses could be used in different Terminal Server sessions). Although not explicitly documented by Microsoft (as far as I know), a number of programs rely on these “fixed base” DLLs.

That is not to say that the base addresses of these DLLs cannot change, but that while the system is running, all processes will have these DLLs mapped at the same base address (if they are indeed mapped at all).

The current set of DLLs that require the same base address system-wide includes NTDLL, kernel32, and user32, though the reasons for this requirement vary a bit between each DLL.

NTDLL must be at the same address system wide because there are a number of routines it exports that are used by the kernel to arrange for (indirect) calls to user mode. For example, ntdll!LdrInitializeThunk is the true start address of every user mode thread, and ntdll!KiUserApcDispatcher is used to invoke a user mode APC if one is ready to be processed while a thread is in a user mode wait. The kernel resolves the address of these (and other) “special” exports at system initialization time, and then uses these addresses when it needs to arrange for user mode code execution. Because the kernel caches these resolved function pointers, NTDLL cannot typically be based at a different address from process to process, as the kernel will always reference the same address for these special exports across all processes.

Additionally, some special NTDLL exports are used by user mode code in such a way that they are assumed to be at the same base address cross-process. For example, ntdll!DbgUiRemoteBreakIn is used by the debugger to break in to a process, and the debugger assumes that the local address of DbgUiRemoteBreakIn matches the remote address of DbgUiRemoteBreakIn in the target process.
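A sketch of how such a break-in might be implemented, leaning on the same-base guarantee (error handling abbreviated; the process handle is assumed to have the access rights needed for thread creation):

#include <windows.h>

BOOL BreakInToProcess(HANDLE Process)
{
    // Resolve DbgUiRemoteBreakIn in the local NTDLL; the same address is
    // assumed valid in the target, since NTDLL shares a base system-wide.
    PVOID BreakIn = (PVOID)GetProcAddress(GetModuleHandleW(L"ntdll.dll"),
                                          "DbgUiRemoteBreakIn");
    HANDLE Thread;

    if (!BreakIn)
        return FALSE;

    Thread = CreateRemoteThread(Process, NULL, 0,
                                (LPTHREAD_START_ROUTINE)BreakIn, NULL,
                                0, NULL);
    if (!Thread)
        return FALSE;

    CloseHandle(Thread);
    return TRUE;
}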

Kernel32 is required to be at the same base address because there are a number of internal kernel32 routines that, similar to ntdll!DbgUiRemoteBreakIn, are used in cross-process thread injection. One example of this used to be the console control event handler. In the case of console events, during kernel32.dll initialization, the address of the Ctrl-C event dispatcher is passed to WinSrv.dll (in CSRSS space).

Originally, WinSrv simply cached the dispatcher pointer after the first process was created (thus requiring kernel32 to be at the same base address across all processes in the session). On modern systems, however, WinSrv now tracks the client’s kernel32 dispatcher pointer on a per-process basis to account for the fact that the dispatcher is at a different address in the 32-bit kernel32 (versus the 64-bit kernel32). Ironically, the developer who made this change to WinSrv actually forgot to add support for using the current process’s dispatcher pointer in several corner cases (such as kernel32!SetLastConsoleActiveEvent and the corresponding winsrv!SrvConsoleNotifyLastClose and winsrv!RemoveConsole CSRSS-side routines). In the cases where WinSrv still incorrectly passes the cached (64-bit) Ctrl-C dispatcher value to CreateRemoteThread, wow64.dll has a special hack (wow64!MapContextAddress64TO32) that cleans up after WinSrv and fixes the thread start address to refer to the 32-bit kernel32.dll.

By the time this change to WinSrv and Ctrl-C processing was made, though, the application compatibility impact of relaxing the requirement that kernel32 be at the same base address system-wide would have been too severe for the restriction to be eliminated (virtually all third party code injection code now relies heavily on this assumption). Thus, for this (and other) reasons, kernel32 still remains with the restriction that it may not be relocated to a different base address cross-process.

User32 is required to be at the same address cross-process because there is an array of user32 function addresses provided to win32k.sys for the built-in window class window procedures (among other things). This function pointer array is captured via a call to NtUserInitializeClientPfnArrays at session start up time, when WinSrv is initializing win32k during CSRSS initialization. Wow64win.dll, the NtUser/win32k Wow64 support library, provides support for mapping 32-bit to 64-bit (and vice versa) for these function addresses, as necessary for the support of 32-bit processes on 64-bit platforms.

The user32 and kernel32 requirements could arguably be relaxed to only apply within a Terminal Server session, although the Windows XP (and later) cross-session debugging support muddies the waters with respect to kernel32 due to a necessity to support debugger break-in with debuggers that utilize DebugBreak for their break-in threads. (The Wow64 layer provides translation assistance for mapping DebugBreak to a 32-bit address if a 64-bit thread is created at the address of the 64-bit kernel32 DebugBreak export.)

Note that the ASLR support in Windows Vista does not run afoul of these restrictions, as Vista’s ASLR always picks the same base address for a given DLL per operating system boot. Thus, even though, say, NTDLL can have its base address randomized at boot time under Windows Vista, the particular randomized base address that was chosen is still used by all processes until the operating system is restarted.

Update: skape points out that I (somehow) neglected to mention the most important restriction on kernel32 base addressing: on Windows Server 2003 and earlier operating systems, internal kernel32 routines are used as the start address of new threads created by CreateRemoteThread and CreateProcess.

Most Win32 applications are really multithreaded, even if they don’t know it

Monday, November 12th, 2007

Most Win32 programs out there operate with more than one thread in the process at least some of the time, even if the program code doesn’t explicitly create a thread. The reason for this is that a number of system APIs internally create threads for their purposes.

For example, console Ctrl-C/Ctrl-Break events are handled in a new thread by default. (Internally, CSRSS creates a new thread in the address space of the recipient process, which then calls a kernel32 function to invoke the active console handler list. This is one of the reasons why kernel32 is guaranteed to be at the same base address system wide, incidentally, as CSRSS caches the address of the Ctrl-C dispatcher in kernel32 and assumes that it will be valid across all client processes[1].) This can introduce potentially unexpected threading concerns depending on what a console control handler does. For example, if a console control handler writes to the console, this access will not necessarily be synchronized with any I/O the initial thread is doing with the console, unless some form of explicit synchronization is used.
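A quick way to see this firsthand is to log the thread ID from a console control handler; a minimal sketch:

#include <windows.h>
#include <stdio.h>

// The console control handler runs on a thread injected by CSRSS, not on
// the initial thread, so the two thread IDs printed will differ.
BOOL WINAPI CtrlHandler(DWORD CtrlType)
{
    if (CtrlType == CTRL_C_EVENT) {
        printf("Ctrl-C handled on thread %lu\n", GetCurrentThreadId());
        return TRUE;  // handled; don't terminate the process
    }

    return FALSE;
}

int wmain(void)
{
    SetConsoleCtrlHandler(CtrlHandler, TRUE);

    printf("Initial thread is %lu; press Ctrl-C...\n",
           GetCurrentThreadId());
    Sleep(30000);

    return 0;
}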

This is one of the reasons why Visual Studio 2005 eschews the single threaded CRT entirely, instead favoring the multithreaded variants. At this point in the game, there is comparatively little benefit to providing a single threaded CRT that will break in strange ways (such as if a console control handler is registered), just to save a minuscule amount of locking overhead. In a program that is primarily single threaded, the critical section package is fast enough that any overhead incurred by the CRT acquiring locks is going to be minuscule compared to the actual processing overhead of whatever operation is being performed.

An even greater number of Win32 APIs use worker threads internally, though these typically have limited visibility to the main process image. For example, WSAAsyncGetHostByName uses a worker thread to synchronously call gethostbyname and then post the results to the requester via a window message. Even debugging an application typically results in threads being created in the target process for the purpose of debugger break-in processing.

Due to the general trend that most Win32 programs will encounter multiple threads in one fashion or another, other parts of the Win32 programming environment besides the CRT are gradually dropping the last vestiges of their support for purely single threaded operation as well. For instance, the low fragmentation heap (LFH) does not support HEAP_NO_SERIALIZE, which disables the usage of the heap’s built-in critical section. In this case, as with the CRT, the cases where it is always provably safe to go without locks entirely are so limited and provide such a small benefit that it’s just not worth the trouble to take out the call to acquire the internal critical section. (Entering an unowned critical section does not involve a kernel mode transition and is quite cheap. In a primarily single threaded application, any critical sections will by definition almost always be either unowned or owned by the current thread in a recursive call.)
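The LFH restriction is easy to demonstrate; in a sketch like the following, the HeapSetInformation call that would enable the LFH is expected to fail on a heap created with HEAP_NO_SERIALIZE:

#include <windows.h>
#include <stdio.h>

int wmain(void)
{
    ULONG LfhMode = 2;  // compatibility mode 2 selects the LFH
    HANDLE Heap = HeapCreate(HEAP_NO_SERIALIZE, 0, 0);

    if (!Heap)
        return 1;

    // The LFH depends on the heap's internal locking, so enabling it on a
    // HEAP_NO_SERIALIZE heap should be refused.
    if (!HeapSetInformation(Heap, HeapCompatibilityInformation, &LfhMode,
                            sizeof(LfhMode)))
        printf("LFH not enabled (error %lu), as expected.\n",
               GetLastError());

    HeapDestroy(Heap);
    return 0;
}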

[1] This is not strictly true in some cases with the advent of Wow64, but that is a topic for another day.

Viridian guest hypercall interface published

Friday, November 9th, 2007

Recently, Microsoft made a rather uncharacteristic move and (mostly) freely published the specifications for the Viridian hypercall interface (otherwise known as “Windows Server virtualization”). Publishing this documentation is, to be clear, a great thing for Microsoft to have done (in my mind, anyway).

The hypercall interface is in some respects analogous to the “native API” of the Windows kernel. Essentially, the hypercall interface is the mechanism by which a privileged, virtualization-aware component running in a hypervisor partition can request assistance from the hypervisor for a particular task. In that respect, a hypercall is to a hypervisor what a system call is to an operating system kernel.

It’s important to note that the documentation attempts to outline the hypercall interface from the perspective of documenting what one would need to implement a compatible hypervisor, and not from the perspective of how the Microsoft hypervisor implements said hypercall interface. However, it’s still worth a read and provides valuable insight into many aspects of how Viridian is architected at a high level.

I’m still working on digesting the whole specification (as the document is 241 pages long), but one thing that caught my eye was that there is special support in the hypervisor for debugging (in other words, kernel debugging). This support is implemented in the form of the HvPostDebugData, HvRetrieveDebugData, and HvResetDebugSession hypercalls (documented in the hypercall specification).

While I’m certainly happy to see that Microsoft is considering kernel debugging when it comes to the Viridian hypervisor, some aspects of how the Viridian hypercall interface works seem rather odd to me. After (re)reading the documentation for the debugging hypercalls a couple of times, I arrived at the conclusion that the Viridian debugging support is more oriented towards simply virtualizing and multiplexing an unreliable physical debugger link. The goal of this approach would seem to me to be that multiple partitions (operating system instances running under the hypervisor) would share the same physical connection between the physical machine hosting the hypervisor and the kernel debugger machine. Additionally, the individual partitions would be insulated from what actual physical medium the kernel debugger connection operates over (for example, 1394 or serial cable), such that only one kernel debugger transport module is needed per partition, regardless of what physical connection is used to connect the partition to the kernel debugger.

While this is a huge step forward from where shipping virtualization products are today with respect to kernel debugging (serial port debugging only), I think that this approach still falls short of ideal. There are a number of aspects of the debugging hypercalls that still carry much of the baggage of a physical, machine-to-machine kernel debugging interface, baggage that is arguably unnecessary and undesirable from a virtualization perspective. Besides the possibility of further improving the performance of the virtualized kernel debugger connection, it is possible to support partition-to-partition kernel debugging in a more convenient fashion than Viridian presently supports.

The debugging hypercalls, as currently defined, are in fact very much reminiscent of how I originally implemented VMKD. The hypercalls define an interface for a partition to send large chunks of kernel debugger data over as discrete units, without any guarantee of reception. Additionally, they provide a mechanism to notify the hypervisor that the partition is polling for kernel debugger data, so that the hypervisor can take action to reduce the resource consumption of the partition while it is awaiting new data (thus alleviating the CPU spin issue that one often runs into while broken into the kernel debugger with existing virtualization solutions, VMKD notwithstanding of course).

The original approach that I took to VMKD is fairly similar to this. I essentially replaced the serial port I/O instructions in kdcom.dll with a mechanism that buffered data up until a certain point, and then transmitted (or received) data to the VMM in a discrete unit. Like the Viridian approach to debugging, this greatly reduces the number of VM exits (as compared to a traditional virtual serial port) and provides the VMM with an opportunity to reduce the CPU usage of the guest while it is awaiting new kernel debugger data.

However, I believe that it’s possible to improve upon the Viridian debugging hypercalls in much the same way as I improved upon VMKD before I arrived at the current release version. For instance, by dispensing with the provision that data posted to the debugging interface will not be reliably delivered, and enforcing several additional requirements on the debugger protocol, it is possible to further improve the performance of partition kernel debugging. The suggested additional debugger protocol requirements include stipulating that the data transmitted or received by the debugging hypercalls constitutes discrete protocol data units, and that both ends of the kernel debugger connection be able to recover if an unexpected discrete PDU is received after a guest (or kernel debugger) reset.

These restrictions would further reduce VM exits by moving any data retransmit and recovery procedures outside of the partition being debugged. Furthermore, with the ability to reliably and transactionally transmit and receive (or fail in a transacted fashion) as a function of the debugging hypercall itself, there is no longer a necessity for the hypervisor to ever schedule a partition that is frozen waiting for kernel debugger data until new data is available (or a transactional failure occurs, such as a partition-defined timeout). (This is, essentially, the approach that VMKD currently takes.)

In actuality, I believe that it should be possible to implement all of the above improvements by moving the Viridian debugging support out of the hypervisor and into the parent partition for a debuggee partition, with the parent partition being responsible for making hypercalls to set up a shared memory mapping for data transfer (HvMapGpaPages) and allow for event-driven communication with the debuggee partition (HvCreatePort and related APIs) that could be used to request that debugger command data in the shared memory region be processed. Above and beyond performance implications, this approach has the added advantage of more easily supporting partition-to-partition debugging (unless I’ve missed something in the documentation, the Viridian debugging hypercalls do not provide any mechanism to pass debugging data from one partition to another for processing).

Additionally, this approach would also completely eliminate the need to provide any specialized kernel debugging support at all in the hypervisor microkernel, instead moving this support into a parent (or the root) partition, leaving it to that partition to deal with the particulars of data transfer. If that partition, or another partition on the same physical computer as the debuggee partition, is acting as the debugger, then data can be “transferred” using the shared memory mapping. Otherwise, the parent (or root) partition can implement whatever reliable transport mechanism it desires for the kernel debugger data (say, a TCP connection to a remote kernel debugger over an IP-based network). Thus, this proposed approach could potentially not only open up additional remote kernel debugger transport options, but also reduce the code complexity of the hypervisor itself (which I would like to think is almost always a desirable thing, as non-existent code doesn’t have security holes, and the hypervisor is the absolute most trusted (software) component of the system when it is used).

Given that Viridian has some time yet before RTM, perhaps if we keep our fingers crossed, we’ll yet see some further improvements to the Viridian kernel debugging scene.

The default invalid parameter behavior for the VC8 CRT doesn’t break into the debugger

Thursday, November 8th, 2007

One of the problems that confuses people from time to time here at work is that if you happen to hit a condition that trips the “invalid parameter” handler for VC8, and you’ve got a debugger attached to the process that fails, then the process mysteriously exits without giving the debugger a chance to inspect the condition of the program in question.

For those unfamiliar with the concept, the “invalid parameter” handler is a new addition to the Microsoft CRT, which kills the process if various invalid states are encountered. For example, dereferencing a bogus iterator in a release build might trip the invalid parameter handler if you’re lucky (if not, you might see random memory corruption, of course).

The reason why there is no debugger interaction here is that the default CRT invalid parameter handler (present in invarg.c if you’ve got the CRT source code handy) invokes UnhandledExceptionFilter in an attempt to (presumably) give the debugger a crack at the exception. Unfortunately, in reality, UnhandledExceptionFilter will just return immediately if a debugger is attached to the process, assuming that this will cause the standard SEH dispatcher logic to pass the event to the debugger. Because the default invalid parameter handler doesn’t really go through the SEH dispatcher but is in fact simply a direct call to UnhandledExceptionFilter, this results in no notification to the debugger whatsoever.

This counter-intuitive behavior can be more than a little bit confusing when you’re trying to debug a problem, since from the debugger, all you might see in a case like a bad iterator dereference would be this:

0:000:x86> g
ntdll!NtTerminateProcess+0xa:
00000000`7759053a c3              ret

If we pull up a stack trace, then things become a bit more informative:

0:000:x86> k
RetAddr           
ntdll32!ZwTerminateProcess+0x12
kernel32!TerminateProcess+0x20
MSVCR80!_invoke_watson+0xe6
MSVCR80!_invalid_parameter_noinfo+0xc
TestApp!wmain+0x10
TestApp!__tmainCRTStartup+0x10f
kernel32!BaseThreadInitThunk+0xe
ntdll32!_RtlUserThreadStart+0x23

However, while we can get a stack trace for the thread that tripped the invalid parameter event in cases like this with a simple single threaded program, adding multiple threads will throw a wrench into the debuggability of this scenario. For example, with the following simple test program, we might see results like this when running the process under the debugger after continuing the initial process breakpoint (this example is being run as a 32-bit program under Vista x64, though the same principle should apply elsewhere):

0:000:x86> g
ntdll!RtlUserThreadStart:
sub     rsp,48h
0:000> k
Call Site
ntdll!RtlUserThreadStart

What happened? Well, the last thread in the process here happened to be the newly created thread instead of the thread that called TerminateProcess. To make matters worse, the other thread (which was the one that caused the actual problem) is already gone, killed by TerminateProcess, and its stack has been blown away. This means that we can’t just figure out what’s happened by asking for a stack trace of all threads in the process:

0:000> ~*k

.  0  Id: 1888.1314 Suspend: -1 Unfrozen
Call Site
ntdll!RtlUserThreadStart

Unfortunately, this scenario is fairly common in practice, as most non-trivial programs use multiple threads for one reason or another. If nothing else, many OS-provided APIs internally create or make use of worker threads.

There is a way to make out useful information in a scenario like this, but it is unfortunately not easy to do after the fact, which means that you’ll need to have a debugger attached and at your disposal before the failure happens. The simplest way to catch the culprit red-handed here is to just breakpoint on ntdll!NtTerminateProcess. (A conditional breakpoint could be employed to check for NtCurrentProcess ((HANDLE)-1) in the first parameter if the process frequently calls TerminateProcess, but this is typically not the case and often it is sufficient to simply set a blind breakpoint on the routine.)
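For reference, such a conditional breakpoint might be expressed along these lines in WinDbg (an untested sketch for a 32-bit target, where the first parameter lives at esp+4 and NtCurrentProcess is the handle value 0xffffffff):

0:000:x86> bp ntdll32!NtTerminateProcess "j (poi(esp+4) == 0xffffffff) ''; 'gc'"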

For example, in the case of the provided test program, we get much more useful results with the breakpoint in place:

0:000:x86> bp ntdll32!NtTerminateProcess
0:000:x86> g
Breakpoint 0 hit
ntdll32!ZwTerminateProcess:
mov     eax,29h
0:000:x86> k
RetAddr           
ntdll32!ZwTerminateProcess
kernel32!TerminateProcess+0x20
MSVCR80!_invoke_watson+0xe6
MSVCR80!_invalid_parameter_noinfo+0xc
TestApp!wmain+0x2e
TestApp!__tmainCRTStartup+0x10f
kernel32!BaseThreadInitThunk+0xe
ntdll32!_RtlUserThreadStart+0x23

That’s much more diagnosable than a stack trace for the completely wrong thread.

Note that from an error reporting perspective, it is possible to catch these errors by registering an invalid parameter handler (via _set_invalid_parameter_handler), which is roughly analogous to the mechanism one uses to register a custom handler for pure virtual function call failures.
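A minimal sketch of registering such a handler, breaking into an attached debugger so that the failure can be inspected instead of silently terminating the process:

#include <windows.h>
#include <intrin.h>
#include <stdlib.h>

// Custom invalid parameter handler. Note that the string parameters are
// only populated by debug builds of the CRT.
void __cdecl MyInvalidParameterHandler(const wchar_t *Expression,
                                       const wchar_t *Function,
                                       const wchar_t *File,
                                       unsigned int Line,
                                       uintptr_t Reserved)
{
    if (IsDebuggerPresent())
        __debugbreak();

    // If the handler returns, the CRT function that detected the invalid
    // parameter typically fails with EINVAL rather than killing the process.
}

int wmain(void)
{
    _set_invalid_parameter_handler(MyInvalidParameterHandler);

    // ... code that might trip the invalid parameter handler ...

    return 0;
}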

Linux features that I’d like to see in Windows: iptables

Wednesday, November 7th, 2007

One of the things that I really miss about Linux-based boxen when I’m working with Windows from time to time is the fact that the built-in Windows firewall capabilities are just downright anemic when compared to the power and flexibility of iptables.

Sure, there’s Windows Firewall, RRAS, Advanced TCP/IP Filtering (which is anything but advanced), and IPSec Policies that come with Windows and allow you to firewall things off. Unfortunately, while Windows Firewall and RRAS (with respect to “Basic Firewall” in Windows Server 2003) do a passable job as an inbound host firewall, there is really just nothing that comes with Windows that is reasonably good at managing a complicated network (e.g. multi-machine) firewall.

RRAS has built-in static packet filtering, but it’s downright ridiculously limited given that it is ostensibly oriented towards network administrators (who should, theoretically, know what they’re doing). You essentially have the option of creating either an allow list with a default deny, or a deny list with a default allow, and that’s it. (There’s also not really any support for stateful packet filtering in this mode of RRAS, as is available from Basic Firewall, although you can at least differentiate between established and non-established TCP packets. Barely.)

IPSec Policies are slightly less limited than RRAS static packet filtering, but they’re still nowhere near expressive enough for any sort of non-trivial network firewall configuration. You can at least mix and match allow and deny rules, but the ordering is only based on netmask and, as far as I know, is otherwise not user controllable.

Iptables and ip_conntrack, on the other hand, are highly expressive and allow one to comparatively easily create rules that are either downright impossible or extremely difficult to do (e.g. requiring convoluted use of both RRAS static packet filtering and IPSec Policies) in any manageable fashion with the built-in Windows firewall tools. As an added bonus, they also have highly flexible NAT capabilities built in that easily integrate and cooperate with firewall rules.

Now, it’s not really a matter of there being anything that is technically wrong or deficient with the Windows networking stack that would prevent there being a reasonably high quality firewall, but more that just nobody has gone out and done it and shipped it with the platform. (No, I don’t count the “personal firewall” type things that ship with XYZ AV/”Home Security” product as anywhere in this category. I don’t trust those far enough to not create security holes, much less act competently as a firewall.)

There are a number of various third party firewall packages out there, but I tend to be fairly suspicious of installing third party code on my boxes in general, much less third party kernel level code that is facing the network outside of any firewall or packet filtering. Most of them don’t seem to have anywhere near the sort of capabilities that iptables provides, anyway.