Archive for the ‘Debugging’ Category

Fast kernel debugging for VMware, part 6: Roadmap to Future Improvements

Thursday, October 11th, 2007

Yesterday’s article described how VMKD currently communicates with DbgEng.dll in order to complete the high-speed connection between a local kernel debugger and the KD stub code running in a VM. At this point, VMKD is essentially operational, with significant improvements over conventional virtual serial port kernel debugging.

That is not to say, however, that nothing remains that could be improved in VMKD. There are a number of areas where significant steps forward could be taken with respect to either performance or end user experience, given a native (in the OS and in VMMs) implementation of the basic concepts behind VMKD. For example, despite the greatly accelerated data rate of VMKD-style kernel debugging, the 1394 kernel debugger transport still outpaces it for writing dump files. (Practically speaking, all operations except writing dump files are much faster on VMKD when compared to 1394.)

This is because the 1394 KD transport can “cheat” when it comes to physical memory reads. As the reader may or may not be aware, 1394 essentially provides an interface to directly access the target’s raw physical memory. DbgEng takes advantage of this capability, and overrides the normal functionality for reading physical memory on the target. Where all other transports send a multitude of DbgKdReadPhysicalMemoryApi packets to the target computer, requesting chunks of physical memory 4000 bytes at a time (4000 bytes is the maximum size of a KD packet across any transport), the 1394 KD client in DbgEng simply pulls the target computer’s physical memory directly “off the wire”, without needing to invoke the DbgKdReadPhysicalMemoryApi request for every 4000 bytes.

This optimization turns out to yield very large performance improvements with respect to reading physical memory, as a request to write a dump file is at heart essentially just a large memcpy request, asking to copy the entire contents of physical memory of the target computer to the debugger so that the data can be written to a file. The 1394 KD client approach greatly reduces the amount of code that needs to run for every 4000 bytes of memory, especially in the VM case where every KD request and response pair involves separate VM exits and all the code that such operations entail, on top of all the processing logic guest-side when handling the DbgKdReadPhysicalMemoryApi request and sending the response data.
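To get a rough feel for how much per-packet overhead is at stake, the following back-of-the-envelope sketch counts how many DbgKdReadPhysicalMemoryApi round trips are needed to pull a given amount of physical memory 4000 bytes at a time. The helper name and the 512 MB figure are illustrative assumptions, not anything taken from DbgEng.

```c
#include <assert.h>
#include <stdint.h>

/* Maximum KD packet payload quoted in the text. */
#define KD_MAX_CHUNK 4000u

/* Hypothetical helper: number of read requests needed to copy
 * total_bytes of physical memory in chunk-sized pieces. */
static uint64_t kd_read_requests(uint64_t total_bytes, uint32_t chunk)
{
    assert(chunk > 0);
    return (total_bytes + chunk - 1) / chunk;   /* round up */
}
```

For a VM with 512 MB of RAM, kd_read_requests(512ull << 20, KD_MAX_CHUNK) works out to 134218 requests, each of which costs at least one request/response round trip through the guest.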

The same sort of optimization can of course be done in principle for virtual machine kernel debugging, but DbgEng lacks a pluggable interface to perform the highly optimized transfer of raw physical memory contents across the wire. One optimization that could be done without the assistance of DbgEng would be to locally interpret the DbgKdReadPhysicalMemoryApi request VMM-side and handle it without ever passing the request on to the guest-side code, but even this is suboptimal as it introduces an (admittedly short for a local KD process) round trip for every 4000 bytes of physical memory. If the DbgEng team stepped up to the plate and provided such an extensible interface, it would be much easier to provide the sort of speeds that one sees with writing dumps based on local KD.

Another enhancement that could be done Microsoft-side would be a better interface for replacing KD transport modules. Right now, due to the fact that ntoskrnl is statically linked to KDCOM.DLL, the OS loader has a hardcoded hack that interprets the KD type in the OS loader options, loads one of the (hardcoded filenames) “kdcom.dll”, “kd1394.dll”, or “kdusb2.dll” modules, and inserts it into the loaded module list under the name “kdcom.dll”. Additionally, the KD transport module appears to be guarded by PatchGuard on Windows x64 editions (at least from the standpoint of PatchGuard 3), and on Windows Vista, Winload.exe enforces a signature check on the KD transport module. These checks are, unfortunately, not particularly conducive to allowing a third party to easily plug themselves into the KD transport path. (Unless virtualization vendors standardize on a way to signal the VMM that the guest wants attention, each virtualization platform is likely to need some slightly different code to effect a VM exit on each KdSendPacket and KdReceivePacket operation.)

Similarly, there are a number of enhancements that virtualization platform vendors could make VMM-side to make the VMKD-style approach more performant. For example, documented pluggable interfaces for communicating with the guest would be a huge step forward (although the virtualization vendor could just implement the whole KD transport replacement themselves instead of relying on a third party solution). VMware appears to be exploring this approach with VMCI, although this interface is unfortunately not supported on VMware Server or any other platforms besides VMware Workstation 6 to the best of my knowledge. Additionally, VMM authors are in the best position to provide documented and supported interfaces to allow pluggable code designed to interface with a VMM to directly access the register, physical, and virtual memory contexts of a given VM.

Virtualization vendors are also in a better position to integrate the installation and activation process for VMM plugins than a third party operating with no support or documentation. For example, the clumsy vmxinject.exe approach that VMKD takes to load its plugin code into the VMware VMM could be completely eliminated by a native architecture for installing, configuring, and loading VMM plugins (VMCI promises to take care of some of this, though not entirely to the extent that I’d hope).

I would strongly encourage Microsoft and virtualization vendors to work together on this front, as at least from the debugging experience (which is a non-trivial, popular use of virtual machines), there’s a significant potential for a better customer experience in the VM kernel debugging arena with a little cooperation here and there. VMKD is essentially a proof of concept showing that fast kernel debugging is absolutely technically possible for Windows virtual machines. Furthermore, with “inside knowledge” of either the kernel or the VMM, it would likely be trivial to implement the sort of pluggable interfaces that would have made the development and testing of VMKD a virtual walk in the park. In other words, if VMKD can be done without help from either Microsoft or VMware, it should be simple for virtualization vendors and Microsoft to implement similar functionality if they work together.

Next time: Parting shots, and thoughts on other improvements beyond simply fast kernel debugging in the virtualization space.

Fast kernel debugging for VMware, part 5: Bridging the Gap to DbgEng.dll

Wednesday, October 10th, 2007

The previous article in the virtualized kernel debugging series described some of the details behind how VMKD communicates with the outside world via some modifications to the VMware VMM.

Although getting from the operating system kernel to the outside world is certainly an important milestone with respect to improved virtual machine kernel debugging, it is still necessary to somehow connect the last set of dots between the modified VMware VMM and the debugger (DbgEng.dll). The debugger doesn’t really have native support for what VMKD attempts to do. Like the KD transport module code that runs in kernel mode on the target, DbgEng expects to be talking to a hardware interface (through a user mode accessible API) in order to send and receive data from the target.

There is some support in DbgEng that can be salvaged to communicate with the VMM-side portion of VMKD, which is the “native” support for debugging over named pipe (or TCP, the latter being apparently functional but completely undocumented, which is perhaps unsurprising as there’s no public server end for that as there is for named pipe kernel debugging). However, there’s a problem with using this part of DbgEng’s support for kernel debugging. Although it does allow us to easily talk to DbgEng without having to patch it (a definite plus, as patching two completely isolated programs from two different vendors is a recipe for an extremely brittle program), this support for named pipe or TCP transports for KD is not without its downsides.

Specifically, the named pipe and TCP transport logic is essentially a bolt-on, after-the-fact addition to the serial port (KDCOM) support in DbgEng. (This is why, among other things, kernel debugging over named pipe always starts with kd -k com:pipe,….) What this means in terms of VMKD is that DbgEng expects that anything that it would be speaking to over named pipe or TCP is going to be implementing the low-level KDCOM framing protocol, with the high level KD protocol running on top of that. The KDCOM framing protocol is unfortunately a fairly unwieldy and messy protocol (it is designed to work with minimal dependencies and without a reliable way of knowing whether the remote end of the serial port is even connected, much less receiving data).

Essentially, the KDCOM framing protocol is to TCP as the high level KD protocol is to, say, HTTP in terms of networking protocols. KDCOM takes care of all the low level goo with respect to retransmits, acknowledgments, and everything else about establishing and maintaining a reliable connection with the remote end of the kernel debugger connection. While KDCOM is nowhere near as complex as TCP (thankfully!), it is not without its share of nuances, and it is certainly not publicly documented. (There was originally some partial documentation of an ancient version of the KDCOM protocol released in the NT 3.51 DDK in terms of source code to a partially operational kernel debugger client, with additional aspects covered in the Windows 2000 DDK. There is unfortunately no mention at all of any of this in any recent DDKs or WDKs, as KDCOM has long since disappeared into “no longer documented land”, an irritating habit of many old kernel APIs.)

The fact that there is no way to directly inject high level KD protocol data into DbgEng (aside from patching non-exported internal routines in the debugger, which is certainly not desirable from a future compatibility standpoint) presents a rather troublesome problem. By virtue of taking the approach of replacing KdSendPacket and KdReceivePacket in the guest, the code that was formerly responsible for maintaining the server-end of the low-level KDCOM link is no longer in use. That is, the data coming out of the kernel is raw high-level KD protocol data and not KDCOM data, and yet DbgEng can only interpret KDCOM-framed data over TCP or named pipe.

The solution that I ended up developing to counteract this problem, while a logical consequence of this limitation, is nonetheless unwieldy. In order to communicate with DbgEng, VMKD essentially had to re-implement the entire low-level KDCOM framing protocol so that KD packets can be transferred to and received from an unmodified DbgEng using the already-existing KDCOM over named pipe support. This approach entailed a rather unfortunate amount of extra baggage that needed to be carried around as many features of the KDCOM protocol are unnecessary in light of the new environment that local virtual machine kernel debugging presents.
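The general shape of such a framing layer can be sketched as follows. The layout here is purely illustrative: the field names, the leader value, and the additive checksum are assumptions chosen for the example, and none of it is claimed to match the real KDCOM wire format, which is not publicly documented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical frame header; NOT the real KDCOM layout. */
typedef struct frame_header {
    uint32_t leader;     /* marks the start of a frame (assumed value) */
    uint16_t type;       /* high-level KD packet type */
    uint16_t length;     /* payload bytes following the header */
    uint32_t id;         /* sequence number used for acks and resends */
    uint32_t checksum;   /* additive checksum over the payload */
} frame_header;

#define FRAME_LEADER 0x4B444652u   /* arbitrary tag for this sketch */

static uint32_t frame_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += data[i];
    return sum;
}

/* Wrap a raw high-level KD payload in a frame. Returns the number of
 * bytes written, or 0 if the output buffer is too small. */
static size_t frame_encode(uint16_t type, uint32_t id,
                           const uint8_t *payload, uint16_t len,
                           uint8_t *out, size_t out_cap)
{
    frame_header h;
    assert(out != NULL && (payload != NULL || len == 0));
    if (out_cap < sizeof h + len)
        return 0;
    h.leader = FRAME_LEADER;
    h.type = type;
    h.length = len;
    h.id = id;
    h.checksum = frame_checksum(payload, len);
    memcpy(out, &h, sizeof h);
    if (len != 0)
        memcpy(out + sizeof h, payload, len);
    return sizeof h + len;
}

/* Returns 1 if the frame is intact, 0 if the receiver would need to
 * ask the sender for a resend. */
static int frame_valid(const uint8_t *in, size_t in_len)
{
    frame_header h;
    if (in_len < sizeof h)
        return 0;
    memcpy(&h, in, sizeof h);
    if (h.leader != FRAME_LEADER || in_len < sizeof h + h.length)
        return 0;
    return frame_checksum(in + sizeof h, h.length) == h.checksum;
}
```

Most of the "extra baggage" mentioned above lives around code like frame_valid: on a lossy serial line a failed check triggers a resend, whereas on a local named pipe it should essentially never fail.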

In the interests of both reducing the complexity of the kernel mode code running in guest operating systems and improving the performance of VMKD (with respect to any possible “overhead traffic”, such as retransmits or resynchronization related requests), the KDCOM framing protocol is implemented host-side in the logic that is running in the modified VMM. This approach also had the additional advantage (during the development process) of the KDCOM framing logic being relatively easily debuggable with a conventional debugger on the host machine. This is in stark contrast to almost all of the guest-side kernel mode driver, which by virtue of being right in the middle of the communication path with the kernel debugger itself, happens to be unfortunately immune to being debugged via the conventional kernel debugger. (A VMM-implemented OS-agnostic debugger, similar in principle to a hardware debugger (ICE), could conceivably have been used to debug the guest-side driver logic if necessary, but putting the KDCOM framing code in user mode host-side is simply much more convenient.)

The KDCOM framing code in the host-side portion of VMKD fulfills the last major piece of the project. At this point, it is now possible to get kernel debugger send and receive requests into and out of the kernel with an accelerated interface that is optimized for execution in a VM, and successfully transport this data to and from DbgEng for consumption by the kernel debugger.

One alternative that I explored to intercepting execution at KdSendPacket and KdReceivePacket guest-side was to simply hook the internal KDCOM routines for sending and receiving characters from the serial port. This approach, while saving the trouble of having to reimplement KDCOM essentially from the ground up, proved problematic and less reliable than I had initially hoped (I suspect timing and buffering differences from a real 16-byte-buffering UART were at fault for the reliability issues this approach encountered). Furthermore, such an approach was in general somewhat less performant than the KdSendPacket and KdReceivePacket solution, as all of the KDCOM “meta-traffic” with respect to resends, acknowledgments, and the like needed to traverse the guest to host boundary instead of being confined to just the host as in the model that VMKD uses currently.

Next time: Future directions that could be taken on VMKD’s general approach to improving the kernel debugging experience with respect to virtual machines.

Fast kernel debugging for VMware, part 4: Communicating with the VMware VMM

Tuesday, October 9th, 2007

Yesterday, I outlined some of the general principles behind how guest to host communication in VMs works, and why the virtual serial port isn’t really all that great of a way to talk to the outside world from a VM. Keeping this information in mind, it should be possible to do much better in a VM, but it is first necessary to develop a way to communicate with the outside world from within a VMware guest.

It turns out, as previously mentioned, that there happen to be a lot of things already built in to VMware that need to escape from the guest in order to notify the host of some special event. Aside from the enhanced (virtualization-aware) hardware drivers that ship with VMware Tools (the VMware virtualization-aware addon package for VMware guests), for example, there are a number of “convenience features” that utilize specialized back-channel communication interfaces to talk to code running in the VMM.

While not publicly documented by VMware, these interfaces have been reverse engineered and pseudo-documented publicly by enterprising third parties. It turns out that VMware has a generalized interface (a “fake” I/O port) that can be accessed to essentially call a predefined function running in the VMM, which performs the requested task and returns to the VM. This “fake” I/O port does not correspond to how other I/O ports work (in particular, additional registers are used). Virtually all (no pun intended) of the VMware Tools “convenience features”, from mouse pointer tracking to host to guest time synchronization, use the VMware I/O port to perform their magic.

Because there is already information publicly available regarding the I/O port, and because many of the tasks performed using it are relatively easy to find host-side in terms of the code that runs, the I/O port is an attractive target for a communication mechanism. The mechanisms by which to use it guest-side have been publicly documented enough to be fairly easy to use from a code standpoint. However, there’s still the problem of what happens once the I/O port is triggered, as there isn’t exactly a built-in command that does anything like take data and magically send it to the kernel debugger.

For this, as alluded to previously, it is necessary to do a bit of poking around in the VMware VMM in order to locate a handler for an I/O port command that would be feasible to replace for purposes of shuttling data in and out of the VM to the kernel debugger. Although the VMware Tools I/O port interface is likely not designed (VMM-side, anyway) for high performance, high speed data transfers (at least compared to the mechanisms that, say, the virtualization-aware NIC driver might use), it is at the very least orders of magnitude better than the virtual serial port, certainly enough to provide serious performance improvements with respect to kernel debugging, assuming all goes according to plan.

Looking through the list of I/O port commands that have been publicly documented (if somewhat unofficially), there are one or two that could possibly be replaced without any real negative impacts on the operation of the VM itself. One of these commands (0x12) is designed to pop-up the “Operating System Not Found” dialog. This command is actually used by the VMware BIOS code in the VM if it can’t find a bootable OS, and not typically by VMware Tools itself. Since any VM that anyone would possibly be kernel debugging must by definition have a bootable operating system installed, axing the “OS Not Found” dialog is certainly no great loss for that case. As an added bonus, because this I/O port command displays UI and accesses string resources, the handler for it happened to be fairly easy to locate in the VMM code.

In terms of the VMM code, the handler for the OS Not Found dialog command looks something like so:

int __cdecl OSNotFoundHandler()
{
   if (!IsVMInPrivilegedMode()) /* CPL=0 check */
   {
     log error message;
     return -1;
   }

   load string resources;
   display message box;

   return somefunc();
}

Our mission here is really just to patch out the existing code with something that knows how to take data from the guest and move it to the kernel debugger, and vice versa. A naive approach might be to try to access the guest’s registers and use them to convey the data words to transfer (it would appear that many of the I/O port handlers do have access to the guest’s registers, as many of the I/O port commands modify the register data of the guest), but this approach would incur a large number of VM exits and therefore be suboptimal.

A better approach would be to create some sort of shared memory region in the VM and then simply use the I/O port command as a signal that data is ready to be sent or that the VM is now waiting for data to be received. (The execution of the VM, or at least the current virtual CPU, appears to be suspended while the I/O port handler is running. In the case of the kernel debugger, all but one CPU would be halted while a KdSendPacket or KdReceivePacket call is being made, making the call essentially one that blocks execution of the entire VM until it returns.)

There’s a slight problem with this approach, however. There needs to be a way to communicate the address of the shared memory region from the guest to the modified VMM code, and then the modified VMM code needs to be able to translate the address supplied by the guest to an address in the VMM’s address space host-side. While the VMware VMM most assuredly has some sort of capability to do this, finding it and using it would make the (already somewhat invasive) patches to the VMM even more likely to break across VMware versions, making such an address translation approach undesirable from the perspective of someone doing this without the help of the actual vendor.

There is, however, a less sophisticated approach that can be taken: The code running in the guest can simply allocate non-paged physical memory, fill it with a known value, and then have the host-side code (in the VMM) simply scan the entire VMM virtual address space for the known value set by the guest in order to locate the shared memory region in host-relative virtual address space. The approach is slow and about the farthest thing from elegant, but it does work (and it only needs to be done once per boot, assuming the VMM doesn’t move pinned physical pages around in its virtual address space). Even if the VMM does occasionally move pages around, it is possible to compensate for this and, assuming such moves are infrequent, still achieve acceptable performance.
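The scan itself is conceptually simple, and a minimal sketch of it follows. It makes several assumptions that are not spelled out in the text: the region is page-aligned, the signature sits at the very start of a page, and the VMM's address space is stood in for by one flat readable buffer (a real scan would have to walk only committed, readable regions). The function and constant names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE_BYTES 4096u   /* assumed x86 page size */

/* Walk [base, base + len) one page at a time and return the first page
 * that begins with the guest-written signature, or NULL if none does. */
static uint8_t *find_shared_region(uint8_t *base, size_t len,
                                   const uint8_t *sig, size_t sig_len)
{
    assert(sig_len > 0 && sig_len <= PAGE_SIZE_BYTES);
    for (size_t off = 0; off + PAGE_SIZE_BYTES <= len; off += PAGE_SIZE_BYTES) {
        if (memcmp(base + off, sig, sig_len) == 0)
            return base + off;
    }
    return NULL;
}
```

A longer and less predictable signature reduces the odds of a false match, though as noted below it does nothing to stop a guest process that deliberately plants the same bytes.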

The astute reader might note that this introduces a slight hole whereby a user mode caller in the VM could spoof the signature used to locate the shared memory block and trick the VMM-side code into talking to it instead of the KD logic running in kernel mode (after creating the spoofed signature, a malicious user mode process would wait for the kernel mode code to try and contact the VMM, and hope that its spoofed region would be located first). This could certainly be solved by tighter integration with the VMM (and could be quite easily eliminated by having the guest code pass an address in a register which the VMM could translate instead of doing a string hunt through virtual address space), but in the interest of maintaining broad compatibility across VMware VMMs, I have not chosen to go this route for the initial release.

As it turns out, spoofing the link with the kernel debugger is not really all that much of a problem here, as due to the way VMKD is designed, it is up to the guest-side code to actually act on the data that is moved into the shared memory region, and a non-privileged user mode process would have limited ability to do so. It could certainly attempt to confuse the kernel debugger, however.

After the guest-side virtual address of the shared memory region is established, the guest and the host-side code running in the VMM can now communicate by filling the shared memory region with data. The guest can then send the I/O port command in order to tell the host-side code to send the data in the shared memory region, and/or to wait for and copy in data sent by a remote kernel debugger to the code running in the guest.

With this model, the guest is entirely responsible for driving the kernel debugger connection in that the VMM code is not allowed to touch the shared memory region unless it has exclusive access (which is true if and only if the VM is currently waiting on an I/O port call to the patched handler in the VMM). However, as the low-level KD data transmission model is synchronous and not event-driven, this does not pose a problem for our purposes, thus allowing a fairly simple and yet relatively performant mechanism to connect the KD stub in the kernel to the actual kernel debugger.
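The ownership rule can be made concrete with a small single-threaded simulation. Everything here is a stand-in: the region layout, the owner flag, and the function names are assumptions for illustration, and the "I/O port call" is modeled as an ordinary function call during which the guest is considered suspended.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SHARED_CAPACITY 4096u

typedef enum { OWNER_GUEST, OWNER_HOST } region_owner;

/* Hypothetical layout of the shared memory region. */
typedef struct shared_region {
    region_owner owner;
    uint32_t request_len;
    uint8_t data[SHARED_CAPACITY];
} shared_region;

/* Stand-in for the patched I/O port handler. It may only touch the
 * region while the guest is blocked in the call (host owns it). */
static uint32_t host_handle_call(shared_region *r, uint8_t *to_debugger)
{
    assert(r->owner == OWNER_HOST);
    memcpy(to_debugger, r->data, r->request_len);
    return r->request_len;
}

/* Guest side: fill the region, then "trigger the I/O port", which
 * blocks until the handler returns. */
static uint32_t guest_send(shared_region *r, const uint8_t *pkt,
                           uint32_t len, uint8_t *to_debugger)
{
    uint32_t sent;
    assert(r->owner == OWNER_GUEST && len <= SHARED_CAPACITY);
    memcpy(r->data, pkt, len);
    r->request_len = len;
    r->owner = OWNER_HOST;     /* guest is now suspended in the call */
    sent = host_handle_call(r, to_debugger);
    r->owner = OWNER_GUEST;    /* call returned; guest resumes */
    return sent;
}
```

Because the handoff is strictly synchronous, no locking is needed in this sketch: at any instant exactly one side is allowed to touch the buffer.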

Now that data can be both received from the guest and sent to the guest by means of the I/O port interface combined with the shared memory region, all that remains is to interface the debugger (DbgEng.dll) with the patched I/O port handler running in the VMM.

It turns out that there are a couple of twists relating to this final step that make it more difficult than one might initially expect, however. Expect to see details on that (and more) in the next installment of the VMKD series…

Fast kernel debugging for VMware, part 3: Guest to Host Communication Overview

Monday, October 8th, 2007

In the previous installment of this series, I outlined the basic architecture of the KD transport module interface from the perspective of the kernel. This article focuses upon the next step in the fast VM KD project, which is getting data out of the VM (or in and out of VMs in general, when dealing with virtualization).

At this point, it’s all but decided that the way to gain access to KD traffic kernel-side is to either intercept or replace KdSendPacket and KdReceivePacket, or some dependency of those routines through which KD data traffic flows. However, once access to KD traffic is gained, there is still the matter of getting data out of the VM and to the host (and from there to the kernel debugger), as well as moving data from the kernel debugger to the host and then the VM. To better understand how it is possible to improve upon a virtual serial port in terms of kernel debugging in a VM, however, it is first necessary to gain a basic working understanding of how VMMs virtualize hardware devices.

While a conventional kernel debugger transport module uses a physical I/O device that is connected via cable to the kernel debugger computer, this is essentially an obsolete way of approaching the matter for a virtual machine, as the VMM running on the host has direct access to the guest’s memory. Furthermore, practically all VMMs (of which VMware is certainly no exception) typically have a “back-channel” communication mechanism by which the guest can communicate with the host and vice versa, without going through the confines of hardware interfaces that were designed and implemented for physical devices. For example, the accelerated video and network drivers that most virtualization products ship typically use such a back-channel interface to move data into and out of the guest in an efficient fashion.

The typical optimization that is done with these accelerated drivers is to buffer data in some sort of shared or pre-registered memory region until a complete “transaction” of some sort is finished, following which the guest (or host) is notified that there is data to be processed. This approach is typically taken because it is advantageous to avoid any VM exit, that is, any event that diverts control flow away from the guest code and into code living in the VMM. Such actions are necessary because the VM by design does not have truly direct access to physical hardware devices (and indeed in many cases, such as with the NICs that are virtualized by most VM software, the hardware interface presented to the guest is likely to be completely incompatible with those in use in reality on the physical host for such devices).

A VMM handles this sort of case by arranging to gain control when a dangerous or privileged operation, such as a hardware access, is being attempted. Now, although this does allow the VMM to emulate the operation and return control to the guest after performing the requested operation, many pre-existing hardware interfaces are not typically all that optimal for direct emulation by a VMM. The reason for this, it turns out, is typically that a VM exit is a fairly expensive operation (relative to many types of direct hardware access on a physical machine, such as a port I/O attempt), and many pre-existing hardware interfaces tend to assume that talking to the hardware is a fairly high-speed operation. Although emerging virtualization technologies (such as TLB tagging) are working to reduce the cost of VM exits, reducing them is definitely a good thing for performance, especially with current-generation virtualization hardware and software. To provide an example of how VM exits can incur a non-trivial performance overhead, with a serial port, one typically needs to perform an I/O access on every character, which means that a VM exit is being performed on every single character sent or received, at least in the fashion that KDCOM.DLL uses the serial port.

Furthermore, other peculiarities of hardware interfaces, such as a requirement to, say, busy-spin for a certain number of microseconds after making a particular request to ensure that the hardware has had time to process the request, are also typically undesirable in a guest VM. (The fact that a VM exit happens on every character isn’t really the only performance drag in the case of a virtual serial port, either. There is also the fact that serial ports are only designed to operate at a set speed, and the VMware serial port will attempt to restrain itself to the baud rate. Together, these two aspects of how the virtual serial port functions conspire to drag down the performance of virtual serial port kernel debugging to the familiar low speeds that one might expect with kernel debugging over physical serial ports, if even that level of performance is reached.)

The strategy of buffering data until a complete “transaction” is ready to be sent or received is, as previously mentioned, one common way to improve the performance of virtual hardware device I/O; for example, a VM-optimized NIC driver might use a shared memory region to wait until complete packets are ready to be sent or received, and then use a VMM back-channel interface to arrange for the data to be processed in an event-driven fashion.

This is the general approach that would be ideal to take with the VMKD project, that is, to buffer data until a complete KD packet (or at least a sizable amount of data) is available, and then transmit (or receive) the data. Additionally, the programmed baud rate is really not something that VMKD needs to restrict itself to; the KD protocol doesn’t really have any inherent speed limits other than the fact that KDCOM.DLL is designed to talk to a real serial port, which does not have unlimited transfer speed.
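The difference between the two strategies can be illustrated with a toy simulation that simply counts how many times control would leave the guest. The counter and the two stand-in primitives are hypothetical; a real serial driver would incur even more exits than shown, since it also polls status registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static unsigned long vm_exits;   /* simulated VM exit counter */

/* Stand-in for an emulated port write: every call traps to the VMM. */
static void emulated_port_write(uint8_t byte)
{
    (void)byte;
    vm_exits++;
}

/* Stand-in for a back-channel "data ready" signal: a single trap. */
static void backchannel_signal(void)
{
    vm_exits++;
}

/* Serial-style transmit: one port write per character. */
static void send_serial_style(const uint8_t *pkt, size_t len)
{
    for (size_t i = 0; i < len; i++)
        emulated_port_write(pkt[i]);
}

/* Buffered transmit: fill shared memory, then signal once. */
static void send_buffered(uint8_t *shared, size_t cap,
                          const uint8_t *pkt, size_t len)
{
    assert(len <= cap);
    memcpy(shared, pkt, len);   /* ordinary memory writes: no exit */
    backchannel_signal();
}
```

In this model a full 4000-byte KD packet costs 4000 exits when sent serial-style and a single exit when buffered.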

In fact, if both KdSendPacket and KdReceivePacket were completely reimplemented guest-side, all the anachronisms associated with the aging serial port hardware interface could be completely discarded and the host-side transport logic for the VM could be freed to consider the most optimal mechanism to move data into and out of the VM. This is in principle much like how enhanced network card drivers provided by many virtualization vendors for their guests operate. To accomplish this, however, a workable method for communicating with the outside world from the perspective of the guest is needed.

Next time: The specifics of one approach for getting data into and out of VMware guests.

Fast kernel debugging for VMware, part 2: KD Transport Module Interface

Friday, October 5th, 2007

In the last post in this series, I outlined some of the basic ideas behind my project to speed kernel debugging on VMware. This posting expands upon some of the details of the kernel debugger API interface itself (from a kernel perspective).

The low level (transport or framing) portion of the KD interface is, as previously mentioned, abstracted from the kernel through a uniform interface. This interface consists of a standard set of routines that are exported from a particular KD transport module DLL, which are used to both communicate with the remote kernel debugger and notify the transport module of certain events (such as power transitions) so that the KD I/O hardware can be programmed appropriately. The following routines comprise this interface:

  1. KdD0Transition
  2. KdD3Transition
  3. KdDebuggerInitialize0
  4. KdDebuggerInitialize1
  5. KdReceivePacket
  6. KdRestore
  7. KdSave
  8. KdSendPacket

Most of these routines are related to power transition or KD enable/disable events (or system initialization). For the purposes of VMKD, these events are not really of a whole lot of interest as we don’t have any real hardware to program.

However, there are two APIs that are relevant: KdReceivePacket and KdSendPacket. These two routines are responsible for all of the communication between the kernel itself and the remote kernel debugger.

Although the KD transport module interface is not officially documented (or supported), it is not difficult to create prototypes for the exports that are relevant. The following are the prototypes that I have created for KdReceivePacket and KdSendPacket:

KD_RECV_CODE
KdReceivePacket(
	__in ULONG PacketType,
	__inout_opt PKD_BUFFER PacketData,
	__inout_opt PKD_BUFFER SecondaryData,
	__out_opt PULONG PayloadBytes,
	__inout_opt PKD_CONTEXT KdContext
	);

VOID
KdSendPacket(
	__in ULONG PacketType,
	__in PKD_BUFFER PacketData,
	__in_opt PKD_BUFFER SecondaryData,
	__inout PKD_CONTEXT KdContext
	);

typedef struct _KD_BUFFER
{
	USHORT  Length;
	USHORT  MaximumLength;
	PUCHAR  Data;
} KD_BUFFER, * PKD_BUFFER;

typedef struct _KD_CONTEXT
{
	ULONG   RetryCount;
	BOOLEAN BreakInRequested;
} KD_CONTEXT, * PKD_CONTEXT;

typedef enum _KD_RECV_CODE
{
	KD_RECV_CODE_OK       = 0,
	/* ... additional (non-success) codes ... */
} KD_RECV_CODE, * PKD_RECV_CODE;

At a high level, both of these routines operate synchronously and only return once the operation either times out or the data has been successfully transmitted or received by the remote end of the KD connection. Both routines are normally expected to be called with the kernel’s internal KD lock held and all but the current processor halted.

In general, both routines normally guarantee successful transmission or reception of data before they return. There are a couple of one-off exceptions to this rule, however:

  • KdSendPacket normally guarantees that the data has been received and acknowledged by the remote end. However, if the request being sent is a debug print notification, a symbol load notification, or a .kdfiles request, KdSendPacket contains a special exemption that allows the transmission to time out and silently fail after several attempts. This is because these requests can happen normally and don’t always warrant a debugger break in. (For example, a debug print doesn’t halt the system until you attach the kernel debugger because of this exemption.)
  • KdReceivePacket will keep trying to receive the packet until it either times out (i.e. there is no activity on the KD link for a specific number of read attempts), successfully receives a good packet and acknowledges it, or receives a resend or resynchronize request (the latter two being specific to the KD protocol module and its internal framing protocol used “on the wire”).
  • KdReceivePacket supports a special type of request by the caller to check if there is a debugger present and requesting a break in at the instant in time when it is called (returning immediately with the result). This mode is used by the kernel on the system timer tick to periodically check if the kernel debugger is requesting to break in.

While the system is running normally, the kernel periodically invokes KdReceivePacket with the special mode that directs the KD transport to check for an immediate break in request. If no break in request is currently detected, then execution continues normally; otherwise, the kernel enters into a loop where it waits for commands from the remote kernel debugger and acts upon any received requests, sending the responses back to the remote kernel debugger via KdSendPacket.

After the system is in the break-in receive loop, KdReceivePacket is called repeatedly while the rest of the system is halted and the KD lock is held. Commands that are successfully received are dispatched to the appropriate high level KD protocol request handler, and any response data is transmitted back to the kernel debugger, following which the receive loop continues so that the next request can be handled (unless the last request was one to resume execution, of course). The kernel can also enter the break-in receive loop in response to an exception, which is how tracing and breakpoint events are signalled to the remote kernel debugger.
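The polling check and the break-in receive loop described above can be sketched against a mock transport. This is only an illustration of the control flow: the packet type and command constants, the `mock_link` structure, and the `mock_receive`/`mock_send` helpers are all made up for this example and do not correspond to the real (undocumented) kernel values or routines.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real constants and kernel logic are not shown here. */
typedef enum { RECV_OK = 0, RECV_TIMEOUT = 1 } recv_code;
enum { PKT_POLL_BREAKIN = 1, PKT_COMMAND = 2 };    /* made-up packet types */
enum { CMD_READ_MEMORY = 10, CMD_CONTINUE = 11 };  /* made-up commands */

typedef struct {
    const int *script;          /* commands the mock debugger will send */
    size_t     count;
    size_t     next;
    int        break_in_pending;
    int        responses_sent;
} mock_link;

/* Mock receive: in poll mode, report immediately whether a break-in is
   pending; otherwise hand back the next scripted command, if any. */
static recv_code mock_receive(mock_link *link, int packet_type, int *command)
{
    if (packet_type == PKT_POLL_BREAKIN)
        return link->break_in_pending ? RECV_OK : RECV_TIMEOUT;
    if (link->next >= link->count)
        return RECV_TIMEOUT;
    *command = link->script[link->next++];
    return RECV_OK;
}

/* Mock send: just counts the responses handed to the "debugger". */
static void mock_send(mock_link *link)
{
    link->responses_sent++;
}

/* One simulated timer tick: poll for a break-in request and, if one is
   pending, service commands until the debugger asks to continue.
   Returns the number of commands handled. */
int timer_tick(mock_link *link)
{
    int handled = 0;
    int command = 0;

    assert(link != NULL);
    if (mock_receive(link, PKT_POLL_BREAKIN, &command) != RECV_OK)
        return 0;                       /* no break-in: keep running */

    link->break_in_pending = 0;
    for (;;) {
        if (mock_receive(link, PKT_COMMAND, &command) != RECV_OK)
            break;                      /* mock only: script exhausted */
        handled++;
        if (command == CMD_CONTINUE)
            break;                      /* resume normal execution */
        mock_send(link);                /* reply, then wait for the next request */
    }
    return handled;
}
```

With a script of two read requests followed by a continue request, a tick with no break-in pending handles nothing, while a tick with a break-in pending handles all three commands and sends two responses.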

Given all of this, it’s not difficult to see that reimplementing KdSendPacket and KdReceivePacket (or capturing the serial I/O that KDCOM does in its implementation of these routines) guest-side is the clear way to go for high speed kernel debugging for VMs. (Among other things, at the KdSendPacket and KdReceivePacket level, the contents of the high level KD protocol are essentially all but transparent to the transport module, which means that the high level KD protocol is free to continue to evolve without much chance of serious compatibility breaks with the low level KD transport modules.)

However, things are unfortunately not quite as simple as they might seem on that front. More on that in a future post.

Fast kernel debugging for VMware, part 1: Overview

Thursday, October 4th, 2007

Note: If you are just looking for the high speed VMware kernel debugging program, then you can find it here. This post series outlines the basic design principles behind VMKD.

One of the interesting talks at Blue Hat was one about virtualization and security. There’s a lot of good stuff that was touched on (such as the fact that VMware still implements a hub, meaning that VMs on the same VMnet can still sniff each other’s traffic).

Anyways, watching the talk got me thinking again about how much kernel debugging VMs is a slow and painful experience today, especially if you’ve used 1394 debugging frequently.

While kd-over-1394 is quite fast (as is local kd), you can’t do that in any virtualization software that I’m aware of today (none of them virtualize 1394, and furthermore, as far as I know none of them even support USB2 debugging either, even VMware Workstation 6).

This means that if you’re kernel debugging a VM in today’s world, you’re pretty much left high and dry and have to use the dreaded virtual serial port approach. Because the virtual serial port has to act like a real serial port, it’s also slow, just like a real serial port (otherwise, timings get off and programs that talk to the serial port break all over the place). This means that although you might be debugging a completely local VM, you’re still throttled to serial port speeds (115200bps). Although it should certainly technically be possible to do better in a VM, none of the virtualization vendors support the virtual hardware required for the other, faster KD transports.

However, that got me thinking a bit. Windows doesn’t really need a virtual serial port or a virtual 1394 port to serve as a kernel debugging target because of any intrinsic, special property of serial or 1394 or USB2. Those interfaces are really just mechanisms to move bits from the target computer to the debugger computer and back again, while requiring minimal interaction with the rest of the system (it is important that the kernel debugger code in the target computer be as minimalistic and self-contained as possible or many situations where you just can’t debug code because it is used by the kernel debugger itself would start cropping up – this is why there isn’t a TCP transport for kernel debugging, among other things).

Now with a VM (as opposed to a physical computer), getting bits to and from an external kernel debugger and the kernel running in the VM is really quite easy. After all, the VM monitor can just directly read and write from the VM’s physical memory, just like that, without a need for indirecting through a real (or virtual) I/O interconnect interface.

So I got to thinking that it should theoretically be possible to write a kernel debugger transport module that, instead of talking to a serial, 1394, or USB2 port, talks to the local VMM and asks it to copy memory to and from the outside world (and thus the kernel debugger). After the data is safely out of the VM, it can be transported over to the kernel debugger (and back) with the mechanism of choice.

It turns out that Windows KD support is implemented in a way that is fairly conducive to this approach. The KD protocol is divided up into essentially two different parts. There’s the high level half, which is essentially a command set that allows the kernel debugger to request that the KD stub in the kernel perform an operation (like change the active register set, set a breakpoint, write memory, or so forth). The high level portion of the KD protocol sits on top of what I call the low level (or framing) portion of the KD protocol, which is a (potentially hardware dependent) transport interface that provides for reliable delivery of high level KD requests and responses between the KD program on a remote computer and the KD stub in the kernel of the target computer.

The low level KD protocol is abstracted out via a set of kernel debugger protocol modules (which are simple kernel mode DLLs) that are used by the kernel to talk to the various pieces of hardware that are supported for kernel debugging. For example, there is a module to talk to the serial port (kdcom.dll), and a module to talk to the 1394 controller (kd1394.dll).

These modules export a uniform API that essentially allows the kernel to request reliable (“mostly reliable”) transport of a high level KD request (say a notification that an exception has occurred) from the kernel to the KD program, and back again.

This interface is fortunate from the perspective of someone who might want to, say, develop a high speed kernel debugger module for a VM running under a known VM monitor. Such a KD protocol module could take advantage of the fact that it knows that it’s running under a specific VM monitor, and use the VM monitor’s built-in VM exit / VM enter capabilities to quickly tell the VM monitor to copy data into and out of the VM. (Most VMs have some sort of “backdoor” interface for optimized drivers and enhanced guest capabilities, such as a way for the guest to tell the host when its mouse pointer has left the guest’s screen. For example, in the case of VMware, there is a “VMware Tools” program that you can install which provides this capability through a special “backdoor” interface that allows the VM to request a “VM exit” for the purposes of having the VM monitor perform a specialized task.)

Next time: Examining the KD module interface, and more.

Common WinDbg problems and solutions

Monday, September 24th, 2007

When you’re debugging a program, the last thing you want to have to deal with is the debugger not working properly. It’s always frustrating to get sidetracked on secondary problems when you’re trying to focus on tracking down a bug, and especially so when problems with your debugger cause you to lose a repro or burn excessive amounts of time waiting around for the debugger to finish doing who knows what that is taking forever.
This is something that I get a fair amount of questions about from time to time, and so I’ve compiled a short list of some common issues that one can easily get tripped up by (and how to avoid or solve them).

  1. I’m using ntsd and I can’t get symbols to load, or most of the debugger extension commands (!commands) don’t work. This usually means that you launched the ntsd that ships with the operating system (prior to Windows Vista), which is much older than the one shipping with the debugger package. Because it is in the system directory, it will be in your executable search path.

    To fix this problem, use the ntsd executable in the debugger installation directory.

  2. WinDbg takes a very long time to process module load events, and it is using max processor time (spinning) on one CPU. This typically happens if you have many unqualified breakpoints that track module load events (created via bu) saved in your workspace. This problem is especially noticeable when you are working with programs that have a very large number of decorated C++ symbols, such as debug builds of programs that make heavy use of the STL or other template classes. Unqualified breakpoints are expensive in general due to forcing immediate symbol loads of all modules, but moreover they also force the debugger to undecorate and perform pattern matches against every symbol in a module that is being loaded, for every unresolved breakpoint.

    If you allow a large number of unqualified breakpoints to become saved in a default workspace, this can make the debugger appear to be extremely slow no matter what program you are debugging.

    To avoid getting bitten by this problem, don’t use unqualified breakpoints (breakpoints without a modulename! prefix on their address expression) unless absolutely necessary. Also, it’s typically a good idea to clear all your breakpoints before you save your workspace if you don’t need them to be saved for your next debugging session with that debugger workspace (by default, bu breakpoints are persisted in the debugger workspace, unlike bp breakpoints which go away after every debugging session). If you are in the habit of saving the workspace every time you attach to a running process, and you often use bu breakpoints, this will tend to clutter up the user default workspace and can quickly lead to very poor debugger performance if you’re not careful.

    You can use the bc command to delete breakpoints (bc * to remove all breakpoints), although you will need to save the workspace to persist the changes. If the problem has gotten to the point where it’s not possible to even get past module loading in a reasonable amount of time so that you can use bc * to clear out saved breakpoints, you can remove the contents of the HKCU\Software\Microsoft\Windbg\Workspaces registry key and subkeys to return WinDbg to a pristine state. This will wipe out your saved debugger window positions and other saved debugger settings, so use it as a last resort.

  3. WinDbg takes a very long time to process module load events, but it is not consuming a lot of processor time. This typically means that your symbol path includes either a broken HTTP symbol store link or a broken UNC symbol store path. A non-responsive path in your symbol path will cause any operation that tries to load symbols for a module to take a long time to complete as a network timeout will be occurring over and over again.

    Use !sym noisy, followed by .reload /f to determine what part of your symbol path is not working correctly. Then, fix or remove the offending part of the symbol path.

    This problem can also occur when you are debugging a program that is in the packet path for packets destined to a location on the symbol path. In this case, the typical workaround I recommend is to set an empty symbol path, attach to the process in question, write a dump file, and then detach from the process. Then, restore the normal symbol path and open the dump file in the debugger, and issue a .reload /f command to force all symbols to be pre-cached ahead of time. After all symbols are pre-cached in the downstream store cache, change the symbol path to only reference the downstream store cache location and not any UNC or HTTP symbol server paths, and attach the debugger to the process in the packet path for symbol server access.

  4. WinDbg refuses to load symbols for a module that I know the symbol server has symbols for. This issue can occur if WinDbg has previously tried (and failed) to download symbols for a module. There appears to be a bug in dbghelp’s symbol server support which can sometimes result in partially downloaded PDB files being left in the downstream store cache. If this happens, future attempts to access symbols for the module will fail with an error saying that symbols for the module cannot be found.

    If you turn on noisy symbol loading (!sym noisy), a more descriptive error is typically given. If you see a complaint about E_PDB_CORRUPT, then you are probably falling victim to this issue. The debugger output that indicates this problem would look like something along the lines of this:

    DBGHELP: c:\symbols\ntdll.pdb\2744327E50A64B24A87BDDCFC7D435A02\ntdll.pdb - E_PDB_CORRUPT

    If you encounter this problem, simply delete the .pdb named in the error message and retry loading symbols via the .reload /f <modulename> command.

  5. WinDbg hangs and never comes back when I attach to a specific process, such as an svchost instance. If you’re sure that you aren’t experiencing a problem with a broken symbol path or unqualified module load tracking breakpoints being saved in your workspace, and the debugger never comes back when attaching to a certain process (or almost always hangs after the first command when attaching to the process in question), then the process you are debugging may be in a code path responsible for symbol loading.

    This problem is especially common if you are debugging an svchost instance, as there are a lot of important but unrelated pieces of code running in the various svchost instances, some of which are critical for network symbol server support to work. If you are debugging a process in the critical path for network symbol server support, and you have a symbol path with a network component set, then you may cause the debugger to deadlock (hang forever) the first time you try and load symbols.

    One example of a situation that can cause this is if you are debugging code in the same svchost instance as the DNS cache service. In this case, when you try to load symbols and you have an HTTP symbol server link in your symbol path, the debugger will deadlock because it will try and make an RPC call to the DNS cache service when it tries to resolve the hostname of the server referenced in your symbol path. Because the DNS cache service will never respond until the debugger resumes the process, and the debugger will never resume the process until it gets a response from the RPC request to the DNS cache service, your debugging session will hang indefinitely.

    Note that if you are simply debugging something in the packet path of a symbol server store, you will typically see the debugger become unresponsive for long periods of time but not hang completely. This is because the debugger can handle network timeouts (if somewhat slowly) and will eventually fail the request to the network symbol path. However, if the debugger tries to make an IPC request of some sort to the process being debugged, and the IPC request doesn’t have any built-in timeout (most local IPC mechanisms do not), then the debugger session will be lost for good.

    This problem can be worked around in the same way that I typically recommend dealing with slow module loading or failed symbol server accesses when the program is in the packet path for a symbol server referenced in the symbol path. Specifically, it is possible to pre-cache all symbols for the process by creating a dump of the process from a debugger instance with an empty symbol path, and then detaching and opening the dump with the full symbol path and forcing a download of all symbols. Then, start a debugging session on the live process with a symbol path that references only the local downstream store into which the symbols were downloaded, in order to prevent any dangerous network accesses from happening.

    Another common way to get yourself into this sort of debugger deadlock problem is to use the clipboard to paste into WinDbg while you are debugging a program that has placed something into the clipboard. This results in a similar deadlock as WinDbg may get blocked on a DDE request to the clipboard owner, which will never respond by virtue of being debugged. In that case, the workaround is simply to be careful about copying or pasting text into or out of WinDbg.

  6. Remote debugging with -remote or .server is flaky or stops working properly after a while. This can happen if all debuggers in the session aren’t running the same debugger version.

    Make sure that all peers in the remote debugging scenario are using the (same) latest debugger version. If you mix and match debugger versions with -remote, things will often break in strange and hard to diagnose ways in my experience (there doesn’t seem to be a whole lot of graceful support for backwards or forwards compatibility with respect to the debugger remoting protocol).

    Also, several recent releases of the debugger package didn’t work at all in remote debugging mode on Windows 2000. This is, as far as I know, fixed in the latest release.
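Several of the workarounds above (items 3 and 5 in particular) come down to making sure the symbol path contains no network components before attaching to a sensitive process. As a rough sketch of that check, the following C helper counts the semicolon-separated symbol path elements that mention an HTTP(S) URL or a UNC path. The function names are hypothetical, the matching is a simple case-sensitive substring search, and this is not how the debugger itself parses symbol paths.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if one symbol path element mentions a network location
   (an http:// or https:// URL, or a UNC path starting with two
   backslashes), 0 otherwise. Simplification: case-sensitive substring
   match anywhere in the element, so "HTTP://..." would be missed. */
static int element_is_network(const char *elem, size_t len)
{
    static const char *const markers[] = { "http://", "https://", "\\\\" };

    for (size_t m = 0; m < sizeof markers / sizeof markers[0]; m++) {
        size_t mlen = strlen(markers[m]);
        for (size_t i = 0; i + mlen <= len; i++) {
            if (strncmp(elem + i, markers[m], mlen) == 0)
                return 1;
        }
    }
    return 0;
}

/* Counts the ';'-separated elements of a symbol path that reference the
   network. A result of 0 suggests the path only uses local directories. */
int count_network_elements(const char *symbol_path)
{
    int count = 0;
    const char *p = symbol_path;

    assert(symbol_path != NULL);
    while (*p != '\0') {
        size_t len = strcspn(p, ";");   /* length of the current element */
        if (len > 0 && element_is_network(p, len))
            count++;
        p += len;
        if (*p == ';')
            p++;                        /* skip the separator */
    }
    return count;
}
```

For example, a path consisting only of a local downstream store directory yields 0, while a `srv*` element that chains a local cache to an HTTP symbol server counts as one network element.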

Most of these problems are simple to fix or avoid once you know what to look for (although they can certainly burn a lot of time if you’re caught unaware, having done that myself while learning about these “gotchas”).

If you’re experiencing a weird WinDbg problem, you should also not be shy about debugging the malfunctioning debugger instance itself. Often, taking a stack trace of all threads in the problematic debugger instance will be enough to give you an idea of what sort of problem is holding things up (remember that the Microsoft public symbol server has symbols for the debugger binaries as well as the OS binaries).

Useful debugger commands: .writemem and .readmem

Thursday, September 20th, 2007

From time to time, it can be useful to save a chunk of memory for whatever reason when you’re debugging a program. For instance, you might need to capture a long buffer argument to a function for later analysis (perhaps with a custom analysis tool outside the scope of the debugger).

There are a couple of options built in to the debugger to do this. For example, if you just want to save the contents of memory for later perusal, you could always write a complete minidump of the target. However, this has a few downsides; for one, unless you build in dump file processing capability into your analysis program, dump files are typically going to be less than easily accessible to simple analysis tools. (Although one could write a program utilizing MiniDumpReadDumpStream, this is more work than necessary.)

Furthermore, complete dumps tend to be large, and in the case of a kernel debugger connection over serial port, it can take many hours to save a kernel memory dump just to gain access to a comparatively small region of memory.
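To put a rough number on "many hours", here is a back-of-the-envelope calculation in C. It assumes the 115200bps serial rate and 8N1 framing (ten bits on the wire per data byte), and it ignores all KD protocol overhead, so real transfers will be slower. The function name and the 512 MiB example size are my own choices for illustration.

```c
#include <assert.h>

/* Seconds needed to move `bytes` over a serial link running at `baud`
   bits per second, assuming 8N1 framing (1 start + 8 data + 1 stop bit
   per byte) and no protocol overhead or retransmissions. */
double serial_transfer_seconds(unsigned long long bytes, unsigned long baud)
{
    const double bits_per_byte = 10.0;

    assert(baud > 0);
    return (double)bytes * bits_per_byte / (double)baud;
}
```

Under these assumptions the link moves at most 11520 bytes per second, so `serial_transfer_seconds(512ULL * 1024 * 1024, 115200)` comes out to roughly 46,600 seconds, or about 13 hours, for a 512 MiB image.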

Instead of writing a dump file, another option is to use one of the display memory commands to save the contents of memory to a debugger log file. For instance, one might use “db address len”, write it to a log file, and parse the output. This is much less time-consuming than a kernel memory dump over kd, and in some cases it might be desirable to have the data formatted as a plain text hex dump (as db provides), but if one just wants the raw memory contents, that too is less than ideal.

Fortunately, there’s a third option: the .writemem command, which as the name implies, writes an arbitrary memory range to a file in raw binary form. There are two arguments, a filename and a range. For instance, one usage might be:

.writemem C:\\Users\\User\\Stack.bin @rsp L1000

This command would write 0x1000 bytes of stack to the file. (Remember that address ranges may include a space-delimited component to specify the length.)

The command works on all targets, including when one is using the kernel debugger, making it the command of choice for writing out arbitrary chunks of memory.

There also exists a command to perform the inverse operation, .readmem, which takes the same arguments, but instead reads memory from the file given and writes it to the specified address range. This can be useful for anything between substituting large arguments to a function out at run-time to applying large patches to replace non-trivial sections of code as a whole.

Furthermore, because the memory image format used by both commands is just the raw bits from the target, it becomes easy to work with the written out data with a standard hex editor, or even a disassembler. (For instance, another common use case of .writemem, when dealing with self-modifying code, is to write the code out to a file after it has been finalized, and then load the resulting raw memory image up as raw opcodes in a more full-featured disassembler than the debugger.)
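Since the file has no header or framing, a custom analysis tool only needs to read the bytes back in. A minimal sketch in C follows; `load_raw_image` is a hypothetical helper name, not part of any debugger API, and it simply slurps the whole file into a heap buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Reads an entire raw memory image into a heap buffer.
   On success, returns the buffer (the caller frees it) and stores the
   number of bytes read in *size_out. Returns NULL if the file cannot be
   opened, a read error occurs, or memory runs out. */
unsigned char *load_raw_image(const char *path, size_t *size_out)
{
    assert(path != NULL && size_out != NULL);

    FILE *fp = fopen(path, "rb");       /* binary mode: no newline translation */
    if (fp == NULL)
        return NULL;

    unsigned char *data = NULL;
    size_t size = 0, capacity = 0;

    for (;;) {
        if (size == capacity) {         /* buffer full: double it */
            size_t new_capacity = capacity ? capacity * 2 : 4096;
            unsigned char *grown = realloc(data, new_capacity);
            if (grown == NULL) {
                free(data);
                fclose(fp);
                return NULL;
            }
            data = grown;
            capacity = new_capacity;
        }
        size_t got = fread(data + size, 1, capacity - size, fp);
        size += got;
        if (got == 0)
            break;                      /* end of file or read error */
    }

    if (ferror(fp)) {                   /* distinguish error from EOF */
        free(data);
        data = NULL;
        size = 0;
    }
    fclose(fp);
    *size_out = size;
    return data;
}
```

Offset 0 in the returned buffer corresponds to the first address of the range that was given to .writemem, so offsets in the file map directly onto offsets from that starting address.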

You can write a (complete) minidump of a process from TaskMgr in Vista

Friday, August 10th, 2007

One of the often overlooked enhancements that was made to Windows with the release of Vista is the capability to write a complete (large) minidump describing the state of a process from Task Manager. To use this functionality, switch to the Processes tab in Task Manager, access the right click (context) menu for a process that your user account has access to, and select Create Dump File.

Although not as handy as having an “in-box” debugger (ntsd was most regrettably removed from the Vista distribution on the grounds that too many users were getting it and the (typically more up to date) DTW distribution of ntsd confused), Microsoft has thrown the developer crowd at least something of a bone with the dump file support in Task Manager. (It’s at least easier to talk a non-developer through getting a dump via Task Manager than via ntsd, or so one would suppose.)

The create dump file option writes a full minidump out to %temp%\exename.dmp. The dump is large and describes a fairly complete state of the process, so it would be a good idea to compress it before transfer. (I don’t know of any option to generate summary dumps that ships with the OS. However, to be honest, unless space is of an extreme concern, I don’t know why anyone would want to write a summary dump if they are manually gathering a dump – full state information is definitely worth a few minutes wait of transfer time on a typical cable/DSL connection.)

While I’d still prefer having a full debugger shipped with the OS (still powerful, even without symbol support), the new Task Manager support is definitely better than nothing. Though, I still object to the line of thought that it’s better to remove developer tools from the default install because people might accidentally run an older version that shipped with the OS instead of the newer version they installed. Honestly, if a person is enough of a developer to understand how to work ntsd, they had better damn well be able to know which version they are starting (which is really not that much of a feat, considering that the start up banner for ntsd prints out the version number on the first line). If someone is really having that much trouble with launching the wrong version of the debugger, in my expert opinion that is going to be the least of their problems in effectively debugging a problem.

(</silly_rant> – still slightly annoyed at losing out on the debuggers on the default Vista install [yep, I used them!])

Handy debugger commands: !uniqstack

Thursday, August 2nd, 2007

Oftentimes while analyzing more complicated crash or hang problems, one needs to take a survey of what the overall state of a particular program is in order to get a better handle on the problem. A typical way to do this is to grab a call stack of all threads, such as via “~*kv”, or similar commands. While this works, there is often a lot of noise in the output.

Part of the problem is that in most non-trivial programs, you’ll tend to find a lot of worker threads that are for the most part “uninteresting” as to the overall state of the program and whatever issue you are trying to diagnose. For example, RPC likes to create several worker threads, several of which might all be blocked waiting for a new work item to pick up. Or a program could use the thread pooling APIs, or even have its own from-scratch thread pool.

All of these extra worker threads sitting around doing nothing are a nuisance, because they clutter up a summary of all thread stacks unnecessarily and can make it more difficult to find the threads that we’re really interested in.

One useful debugger command that can help cut through the mess in cases like this is !uniqstack. It works similarly to “~*k”, except that it filters the output to only include call stacks that are not exact duplicates of each other (in terms of return addresses on the stack). After all thread stacks are listed, the extension displays a count of duplicate call stacks and enumerates each thread that had a duplicate stack.

Since most of these duplicate call stacks are not interesting to us anyway, this can often help gain some better visibility into a program with many worker threads.
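The filtering idea is easy to illustrate outside the debugger. The sketch below treats each thread's stack as an array of return addresses and marks a stack as a duplicate when an earlier thread has exactly the same sequence. The `thread_stack` layout, the `MAX_FRAMES` limit, and the function names are made up for this example; I am not claiming this is how the extension is implemented, only that it shows the same comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_FRAMES 8   /* arbitrary cap for this illustration */

/* A thread's call stack reduced to its return addresses (hypothetical layout). */
typedef struct {
    unsigned long long frames[MAX_FRAMES];
    size_t depth;
} thread_stack;

/* Two stacks match only if they have the same depth and the same
   return address in every frame. */
static int same_stack(const thread_stack *a, const thread_stack *b)
{
    return a->depth == b->depth &&
           memcmp(a->frames, b->frames, a->depth * sizeof a->frames[0]) == 0;
}

/* For each thread i, sets first_seen[i] to the index of the earliest
   thread with an identical stack (i itself if the stack is new).
   Returns the number of unique stacks. Quadratic in the thread count,
   which is fine for a sketch. */
size_t unique_stacks(const thread_stack *stacks, size_t count, size_t *first_seen)
{
    size_t unique = 0;

    assert(stacks != NULL && first_seen != NULL);
    for (size_t i = 0; i < count; i++) {
        first_seen[i] = i;
        for (size_t j = 0; j < i; j++) {
            if (same_stack(&stacks[i], &stacks[j])) {
                first_seen[i] = first_seen[j];
                break;
            }
        }
        if (first_seen[i] == i)
            unique++;
    }
    return unique;
}
```

Given four threads where three share one stack and the fourth has a shorter one, this reports two unique stacks and maps the duplicates back to the first thread that had that stack.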