Archive for December, 2006

Think before you optimize

Friday, December 29th, 2006

“Premature optimization is the root of all evil” is a famous quote in computer science, and it absolutely holds true. Before optimizing a problem, you must make sure that you are optimizing the bottleneck, and that your optimization doesn’t actually make things worse.

These rules may seem obvious, but not everyone adheres to them; you’d be surprised how many newsgroup postings I see where people are asking how to solve the wrong problem because they didn’t take the time to profile their program and locate their real bottleneck.

One example of this kind of premature (or perhaps just not-entirely-thought-through) optimization that bothers me on a daily basis is in the Microsoft Terminal Server client (mstsc.exe). Terminal Server is a remote windowing protocol, and its designers took great pains to keep it responsive for users. In most cases, improving responsiveness over the network involves minimizing the amount of data sent from the server to the client. In this spirit, the designers of Terminal Server implemented an innocent-seeming optimization, wherein the Terminal Server client detects when it has been minimized. When this occurs, the Terminal Server client sends a special message to the server asking that it stop sending window updates to the client. When the user restores the Terminal Server client window, the server resynchronizes with the client.

This may seem like a clever little optimization with no downsides at first, but it turns out that it actually worsens the user experience (at least in my opinion) when you look at things a little more closely. First, consider how Terminal Server resynchronizes with the client when the client indicates that it wants to receive windowing data again. In its user interface design, Windows follows the model of not saving data that can be recalculated on demand. In many ways, this is a perfectly valid model, and there are a number of good reasons for it (especially given that as you open more windows, it becomes non-trivially expensive to cache bitmap data for every window on the screen, all the more so on the very low-end systems that 16-bit Windows had to run on). As a result, when Windows wants to retrieve the contents of a window on screen, the typical course of action is that a WM_PAINT message is sent to the window. This message asks the window to draw itself into a device context, or a storage area where the bits can then be transferred to the screen, a printer, or any other visual display device.
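To make the repaint model above concrete, here is a minimal sketch of a typical Win32 WM_PAINT handler (illustrative only; the window class registration and message loop are omitted, and the window procedure name and text are just placeholders):

#include <windows.h>

LRESULT CALLBACK ExampleWndProc(HWND hwnd, UINT msg, WPARAM wParam, LPARAM lParam)
{
    switch (msg)
    {
    case WM_PAINT:
        {
            PAINTSTRUCT ps;
            HDC         hdc = BeginPaint(hwnd, &ps);

            //
            // Redraw the invalidated region into the device context that
            // Windows hands us; the window contents are recreated on
            // demand rather than cached anywhere.
            //
            FillRect(hdc, &ps.rcPaint, (HBRUSH)(COLOR_WINDOW + 1));
            TextOut(hdc, 10, 10, TEXT("Repainted on demand"), 19);

            EndPaint(hwnd, &ps);
        }
        return 0;

    default:
        return DefWindowProc(hwnd, msg, wParam, lParam);
    }
}

Every time the window needs to be shown again (including, as described below, when Terminal Server asks windows to repaint during a resynchronization), this code runs again from scratch.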

If you’ve been paying attention, then you might be seeing where this starts to go wrong with Terminal Server. When you restore a minimized Terminal Server client window, the client asks the server to resynchronize. This is necessary because the server has since stopped sending updates to the client, which means that the client has to assume that its display data is now stale. In order to do this resynchronization, Terminal Server has to figure out what contents have changed on the overall desktop bitmap that describes the entire screen. Terminal Server is free to cache the entire contents of a session’s interactive desktop as a whole (and indeed this is necessary so that during resynchronization, the entire desktop doesn’t have to be transferred as a bitmap to the client). However, it still needs to compare the last copy of the bitmap that was sent to a client with the “current” view of the desktop. In order to do that, Terminal Server essentially does something along the lines of asking each visible window on the desktop to paint itself. Then, Terminal Server can update the client with new display data for each window.

The problem here is that many programs don’t repaint themselves very gracefully. Many programs have unpleasant tendencies like triggering multiple draw operations over the same region before the end result is achieved, something that manifests itself as a slightly annoying flicker when a window repaints. Even Microsoft programs exhibit this problem; for instance, Visual Studio 2005 tends to do this, as does Internet Explorer when drawing pages with foreground images overlaid on background images.

Now, while this may be a minor annoyance when working locally, it turns out to be a big problem when a program is running over Terminal Server. What would otherwise be an innocuous flicker over the course of a couple of milliseconds on a “glass terminal” display turns into multiple draw commands being sent and realized over the network to the Terminal Server client. This translates to bandwidth waste as redundant draw commands are transmitted, and even worse, a lack of responsiveness when restoring a minimized Terminal Server client window (due to having to wait on the programs on Terminal Server to finish updating themselves in the resynchronization process). If you have several programs running on the Terminal Server, this can correspond to three or four seconds of waiting before the Terminal Server session is responsive to input from the client.

While this is annoying in and of itself, it may still not seem all that bad. After all, this problem only happens if you minimize and restore a window, and you generally don’t just minimize and restore windows all the time, right? It turns out that with the Terminal Server client, most people do just that, if they are working in fullscreen mode. Remember that fullscreen Terminal Server obscures the task bar on the physical client computer, and in many cases, results in task switching keystrokes such as Alt+Tab or the Windows key being sent to the remote Terminal Server session and not the physical client system. In order to switch to another program on the physical client computer, then, one needs to either minimize the Terminal Server window or (perhaps temporarily) take it out of fullscreen mode. At least for me, if I want to switch to a program on the physical client system, the logical choice is to hit the minimize button on the Terminal Server client info bar at the top of the fullscreen Terminal Server client window. Unfortunately, that little minimize button invokes the clever redraw optimization that stops the server from updating the client. This means that when I switch back to the Terminal Server session, I need to wait several seconds while programs running in the Terminal Server session finish redrawing themselves and transmitting the draw operations to the client (which is especially painful if you are dealing with bitmaps, such as Internet Explorer on a page with foreground images overlaying a background image).

As a result, thanks to somebody’s “clever optimization”, my Terminal Server sessions now take several seconds to come back when I switch away from them to work on something locally (perhaps to copy and paste some text from the remote system to my computer) and then switch back.

Now, Terminal Server is a great example of a highly optimized program on the whole (and it’s absolutely usable, don’t get me wrong about that). It beats the pants off of VNC and any of the other remote windowing systems that I have ever used any day of the week, for one. However, this just goes to show that even with the best of intentions, one little optimization can blow up in unintended (negative) ways if you are not careful.

Oh, and if you run into this little annoyance as frequently as I do, there is one thing that you may be able to do that alleviates it (at least looking to the future, anyway). When using the Windows Vista (or later) Terminal Server client to connect to a Windows Vista or Windows Server “Longhorn” Terminal Server (or Remote Desktop), you can prevent this lack of responsiveness when restoring minimized Terminal Server windows by enabling desktop composition on the Terminal Server connection. This may seem a bit counter-intuitive at first (enabling 3D transparent windows would sure make you think that a lot more data would need to be transferred, thus slowing down the experience as a whole), but if you are on a high-bandwidth, low-latency link to the target computer, it turns out that desktop composition improves responsiveness when restoring minimized Terminal Server windows. This is because with desktop composition enabled, Windows breaks from the traditional model of not saving data that you can recalculate. Instead, with desktop composition enabled, Windows will save the contents of all windows on the screen for future reference, so that if Windows needs to access the bits of a window, it doesn’t need to ask that window to redraw. (This allows all sorts of neat tricks, such as how you can have a window appearing to be drawn twice with the new Alt+Tab window on Windows Vista, with the live preview, without a major performance hit – try it out with a 3D game in windowed mode to see what I mean). Because of this caching of window data, when resynchronizing with the client after a minimize and restore operation, the server end of Terminal Server doesn’t need to ask every program to redraw itself; all it needs to do is fetch the bits out of the cache that is created for each window by desktop composition (and thus the differences sent to the client will only show “real” differences, not multiple layers of a redraw operation. Try this with an Internet Explorer window open on a page with foreground images overlaying background images, and the difference is immediately visible between Terminal Server with desktop composition enabled and Terminal Server without desktop composition.) This means that there are no more painful multi-step-redraw operations that are visible in real time on the client, at least when it comes to pathological bitmap drawing cases, such as Internet Explorer (and no annoying flicker in the less severe cases).
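For what it’s worth, if you save your connection settings to an .rdp file, the same option can be toggled there. The following is a sketch from memory of the relevant settings as written out by the Vista-era client; treat the exact option names as an assumption and verify them against what mstsc.exe actually saves for you (the server name is a placeholder, and the 32-bit color depth is there because composition generally requires it):

full address:s:myserver.example.com
screen mode id:i:2
session bpp:i:32
allow desktop composition:i:1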

Upgrades, and taking the plunge to x64 full-time

Wednesday, December 27th, 2006

I’ve finally gotten around to getting a new computer: an XPS M1710 with a 2.33GHz Core 2 Duo processor, 2GB RAM, and a GeForce 7950 Go GTX (512MB RAM). It’s been working fairly well so far performance-wise (and with dual core, my computer was even responsive during VS2005SP1 setup!). There were a couple of nice unexpected upgrades that I happened to get in the process: the new laptop has a built-in smart card reader (good-bye clunky USB smart card reader), and the LCD backlight brightness is much improved over my Inspiron 9300 (a relatively dim backlight and a glossy display made that laptop difficult to read sometimes, depending on outside lighting conditions). Also, the programmatic LCD brightness API seems to be functional with the new video card (I’ve got some plans for that, primarily relating to replicating the neat MacBook Pro behavior of automatically dimming the display when idle – assuming no program has put the system in continuous display mode, like with video playback or slideshow presentations).

Having a dual core chip on my main workstation is definitely a noticeable benefit for me; the difference is immediately visible when running two intensive programs at the same time (which is a common occurrence for me, a multimon addict). If you do serious multitasking on one computer, then you’re a prime candidate for seeing real benefit from dual core, in terms of keeping the system responsive while running more than one intensive program simultaneously (provided your programs aren’t all I/O-bound).

Perhaps the most fundamental change, however, is that I decided to go with Windows Vista x64 as the operating system. This means that for all general-purpose computing, I am effectively cutting over to a 64-bit platform entirely (in the past, I’ve only really used x64 for development and not as a general-purpose platform). This includes the whole gamut, from development and debugging to games and email and whatnot. I’ve already tried Vista out as the operating system on my primary workstation and it worked out well enough, but taking the leap to 64-bit is another story entirely. Since there’s no backwards compatibility with 32-bit drivers, this is a bit more of a change than just switching to Vista as far as having working hardware goes.

Things ended up much better than I thought, however. Vista ended up shipping with native x64 drivers for almost all of my hardware (even the Bluetooth 2.0 card and draft 802.11n WLAN card – seems the people at Broadcom have been busy working on their drivers lately). The only pieces of hardware that didn’t have native drivers out of the box were several of the non-SD memory card reader devices (e.g. Picture Card), which I don’t really happen to care about, and the video card. Getting the video card to work was a bit of a hassle; I first grabbed NVIDIA’s latest official Vista x64 beta drivers (release 96.85), but these turned out to be fairly buggy (strange graphical corruption in fullscreen mode in Neverwinter Nights 2). Eventually, I dug up a copy of some more recent Vista x64 drivers (release 97.46), which were a bit more stable (unfortunately, I had to go manually install the drivers as the inf didn’t properly list my card as supported). That little bit of unpleasantness aside, the Vista x64 experience has been pretty good so far. World of Warcraft and Neverwinter Nights 2 run fine under Wow64, and all of the usual development tools (VS2005 and friends) work as well, though I had prior experience getting VS2005 working under Windows Server 2003 x64.

The only piece of hardware that I’m still not sure about on the new laptop is the ExpressCard slot (which might be the Ricoh 1180:0843 device that I haven’t been able to identify). I don’t see this as an immediate problem, however; the only generally available ExpressCards that I have seen thus far are EVDO/EDGE/HSDPA modems, and since Bluetooth works out of the box, I can get EVDO/1x modem capability just fine using my cell phone instead of a dedicated ExpressCard modem.

While there have been a couple of rough edges (having to manually override PnP to install new video card drivers, and a couple of installers that were confused about the difference between c:\windows\system32 and c:\windows\syswow64), it seems that native x64 Windows support on real hardware (as opposed to just running in VMs) has come a long way recently. If you are thinking about trying x64 Windows outside of a VM, though, make sure that you have relatively recent hardware. Since you won’t be able to rely on backwards compatibility with old 32-bit drivers that may have been released many years ago, your hardware vendors will in general need to have been actively supporting that hardware fairly recently for there to be a shot at it having x64 drivers.

All in all, things look good so far with the new hardware (and running everything under x64). I’ll post back with any showstoppers I encounter with doing x64 full-time, but things are looking good thus far.

A case study in bad end user setup experience: The Visual Studio 2005 SP1 installer

Wednesday, December 20th, 2006

Yesterday, I decided to install Visual Studio 2005 SP1 on two of the systems that I do major development work on. This turned out to be a painful mess that ate several hours of my time that day. If you ever decide to install VS 2005 SP1, plan to have an hour or two of time to waste ahead of you… (at least in my case, I learned from my mistake of expecting a reasonable setup user experience the first time around and was better prepared for the second system that I had to install SP1 on).

Specifically, I ran into several different issues during setup. First of all, it was slow, and by slow, I mean slow. It took something on the order of 15-20 minutes just to get through the Windows Installer initialization dialog, and during this time it was pegging CPU and I/O on my box, making it fairly unusable. The computer that I was installing VS 2005 SP1 on may not be the latest and greatest, but I should hope that it would be possible to design a setup package that doesn’t totally destroy the performance of a 1.5-year-old top-of-the-line laptop for 15 minutes during setup initialization. (It’s worth noting that the original installer for VS 2005 didn’t have this problem, nor did the installers for the Vista Platform SDK, the Vista WDK, Office 2007, or any of the other programs I’ve ever installed on my box.)

As if making my system unusable for 15 minutes wasn’t bad enough, there is a nice little confirmation dialog box that asks if you really want to install VS 2005 SP1, which appears only after you have spent 15+ minutes staring at a dialog that doesn’t update with progress information in any meaningful fashion. (Or, more likely, after you have given up waiting and gone to do something else while setup finished doing its thing. This is particularly frustrating because, given a setup program that is just sitting there grinding away without any UI feedback, as an end user you are very likely to go read a book, strike up a conversation, or do something else besides stare at your screen for minutes on end with nothing visibly happening. It would be okay if setup prompted you to continue before it decided to waste 15+ minutes of your time. As it stands, resist the temptation to leave your computer completely unattended, or you’ll just come back to a dialog box that was preventing the bulk of setup from completing, meaning you’ll have even more wasted time waiting for setup to finish ahead of you.)

By this point, I was already thinking that this setup experience was downright terrible, but the worst was yet to come in this setup story of sorrow. It turns out that setup (undocumentedly) likes to use ~2.5GB or so on your system partition (even if all of your Visual Studio / Platform SDK products are installed on a different partition). Unfortunately, nobody bothered to mention this anywhere in the release notes at the time I started my installation, and setup doesn’t even bother to check for the amount of free space ahead of time. To make matters even worse, you are most likely to run into this problem after the 15+ minute initialization period, which makes it all the more painful if it causes your setup instance to completely fail, as described below. It turns out that if setup runs out of space at a particular moment in time, you may get stuck at an abort/retry dialog asking you to free more space. However, in at least one instance of running out of space (which I hit several times in the course of trying to free up enough space on the partition that setup shouldn’t have even been touching), setup gets broken and will just refuse to continue to copy files. If you click “retry”, you’ll simply get the same out-of-space dialog again (regardless of whether you have actually freed up the necessary space), and your only option from that point forward is to cancel setup (which takes another 10 minutes to finish rolling back) and restart it (which will, again, involve at least 15+ minutes of waiting before a confirmation dialog). It should be noted that the second time setup failed due to running out of space, it did properly recover after I told it to retry, so it seems that only certain parts of setup get into this non-recoverable state where an out-of-space error is really fatal.

After repeatedly hitting the out of space error, I was rather interested to find out just why setup needed 2.5GB of space on my system partition despite the fact that VS 2005 was installed on a different partition. It turns out that most of the space requirement comes from the fact that the ~450MB installer makes no less than three copies of itself in your %TEMP% directory during setup (look for 450MB .msp files in %TEMP% while it’s running), even if the original .msi was present on a local hard drive and not a removable storage medium. Furthermore, setup will then need to make a fourth copy of the .msi in your %SystemRoot%\Installer directory for permanent storage in case you need to make modifications to your VS 2005 SP1 install without having the original .msi present. Now, I am certainly no expert in MSI-based installers, but it seems to me that making four copies of a 450MB installer file in your system partition is just ever so slightly excessive…

There are certainly some cool features and bug fixes in VS 2005 SP1 (especially relating to PGO), such as an ASLR-aware linker, that make it a compelling upgrade, but I give the setup experience a failing grade. It’s by far the worst that I have ever seen from any shipping Microsoft product to date.

Programming against the x64 exception handling support, part 2: A description of the new unwind APIs

Tuesday, December 19th, 2006

Last time, I described many of the structures and prototypes necessary to program against the new x64 exception handling (EH) support. This posting continues that series, and describes how to manually initiate an unwind of procedure frames (and when and why you might want to do this).

Because x64 has built-in support for data-driven unwinding, there are a great many interesting things that you can do with unwinding functions at arbitrary points in execution. Unlike on x86, you don’t have to assume that all functions use a frame pointer (which is typically not the case in many programs), and you don’t need to call code with a certain register context set up in a particular way (with the right local variables at the right displacements from the stack pointer) in order to initiate an unwind of a function that had registered an unwind or exception handler.

If you’ve been reading some of my recent postings about performing stack traces on x86, then one of the first things that might come to mind is designing an approach that can create a “perfect” call stack in all situations without symbols. This data-driven unwind approach has other benefits, however, beyond simply being able to take accurate call stacks at arbitrary points in the execution process. For instance, there are particularly interesting benefits as far as instrumentation and code analysis go (such as an improved ability to programmatically detect most functions in an image with a great deal of certainty based on unwind data), and there are interesting implications for techniques such as on-the-fly function patching and modification as well.

First things first, however. The initial step is to get familiar with the new unwinding APIs that Win64 exposes on x64. Although these APIs can be manually duplicated by explicitly parsing all of the unwind information, I would recommend calling the APIs directly instead of doing all of the work to manually emulate unwinds yourself. The reason I make that recommendation is that while the unwind metadata is documented, there is still a significant amount of work involved in reimplementing the unwind logic from scratch, and the unwind APIs themselves are (mostly) documented on MSDN and thus unlikely to change.

There are several APIs in particular that you’ll frequently find yourself using for unwind support on x64. These APIs are available in both user mode and kernel mode, and (aside from a lack of support for dynamically generated unwind data) the two operating environments use exactly the same semantics for unwinding. Thus, for the most part, you can interact with unwind metadata in the same fashion in both user mode and kernel mode.

  1. RtlLookupFunctionEntry: The first API that you’ll likely end up calling for any unwind-related operation is RtlLookupFunctionEntry. This routine is the basis of all unwind operations, in that it allows the caller to translate a raw 64-bit RIP value into two important values: an image base for any associated image in the address space of the caller, and a pointer to the RUNTIME_FUNCTION structure associated with the RIP value passed in. In virtually all cases on x64, you’ll be able to retrieve a valid RUNTIME_FUNCTION structure for the current RIP value. The exception to this rule relates to what are known as leaf functions, or functions that make no direct modifications to the stack pointer (or any nonvolatile registers) and do not call any subfunctions. For these leaf functions only, the compiler may omit unwind metadata. To handle this case, it is typical to read the first ULONG64 at the current RSP value (i.e. the return address of the current leaf function); this address can then be passed to RtlLookupFunctionEntry. Because leaf functions do not touch any nonvolatile registers, alter the stack pointer, or call any subfunctions, they can be safely skipped in the unwind process in this fashion. (Virtually all functions in a given x64 binary are non-leaf functions (otherwise known as frame functions), that is, functions that do not meet all of the previously described criteria. In either case, however, the restrictions on leaf functions mean that they do not impact the ability to perform complete unwinds despite the lack of unwind metadata associated with them.)

    The typical use case for RtlLookupFunctionEntry is simply to retrieve the function entry for the currently executing function. (For leaf functions, it may be necessary to retrieve the function entry for the caller, if there is no unwind metadata for the current function, as described above.) The PRUNTIME_FUNCTION returned is then typically passed to one of the “high level” unwind support routines, although if necessary, it can be interpreted manually (this is typically not required, however). A bare-bones sketch of a stack walk built on RtlLookupFunctionEntry and RtlVirtualUnwind appears after this list.

  2. RtlVirtualUnwind: The RtlVirtualUnwind API is the core of the Win64 x64 unwind support. This API implements the lowest level interface exposed for interacting with unwind metadata through a RUNTIME_FUNCTION. In particular, it implements all of the code necessary to interpret UNWIND_CODEs and adjust the stack and nonvolatile register context according to the unwind information specified via a RUNTIME_FUNCTION. It also has logic to locate and execute exception or unwind handlers for a given function.

    RtlVirtualUnwind provides the infrastructure upon which higher level exception and unwind handling support is implemented. It exposes the concept of a virtual unwind (as one might guess, given the routine’s name). The virtual unwind concept is one that is entirely new to x64 (and IA64), and does not exist in any form on x86. This is due entirely to the fact that IA64 and x64 have data-driven unwind support, while x86 has code-driven unwind support.

    The distinction is important in that on x64 and IA64, it is possible to simulate an unwind, at an arbitrary point in time, without running code with potentially unknown side effects (or unknown entry conditions, as with x86 exception or unwind handlers that utilize local variables). This is accomplished by interpreting the unwind codes described by a RUNTIME_FUNCTION and associated UNWIND_INFO blocks. This is the essence of what a virtual unwind is: a simulated unwind operation that can operate on an arbitrary, isolated register context without affecting (or otherwise impacting) the actual realized state of the program. In its purest form, a virtual unwind can be accomplished by invoking RtlVirtualUnwind with a register context that you wish to have the unwind applied to, and the UNW_FLAG_NHANDLER flag value for the HandlerType parameter (which suppresses the invocation of any unwind or exception handlers registered by the function).

    This is a very powerful capability indeed, as it allows for a much more complete and thorough traversal of call frames than ever possible on x86. With the ability to describe and undo the changes to nonvolatile registers given an initial register context and stack, virtual unwinding allows programmatic, completely-reliable access to not only the return address, stack frame, and arguments of arbitrary functions at any point in an active call stack, but also access to nonvolatile register values at any point in a call stack. If you have ever debugged optimized code where parameter values and intermediate values are frequently only present in registers, then you can immediately see how valuable this particular benefit of virtual unwinding is to debugging (it is important to note that as volatile registers are not saved anywhere, it is not necessarily possible to reconstruct their values at any point in the call frame).

    It is also possible to use RtlVirtualUnwind to effect a “realized” unwind, and indeed, RtlVirtualUnwind is the cornerstone on which the rest of the unwinding architecture in Win64 x64 is built. By directing RtlVirtualUnwind to call unwind (or exception) handlers, as appropriate, and then further altering the returned context (such as by specifying a return value), it is possible to perform a complete “realized” unwind from a procedure at an arbitrary point in execution.

  3. RtlUnwindEx: RtlUnwindEx supplants the RtlUnwind API that exists on x86 for purposes of implementing a “hard unwind” that alters the realized execution state of the program. RtlUnwindEx is a natural extension of RtlUnwind that includes support for features new to 64-bit exception handling support. Unlike RtlUnwind, it can operate on a register context other than the current register context.

    RtlUnwindEx implements an unwind that calls all of the unwind handlers necessary to unwind to a particular point. It also adjusts the register context based on the unwind metadata at the given procedure frame being unwound. Internally, RtlUnwindEx is essentially implemented as a wrapper that calls RtlVirtualUnwind and registered unwind handlers as necessary for each frame in between the active frame and the target frame. It also houses all of the logic necessary to deal with some of the other subtleties of unwinding, such as detection of a bogus stack pointer value in the passed-in register context.

    RtlUnwindEx is useful if you need to execute a complete unwind (and only a complete unwind) of a particular procedure frame or set of procedure frames. In most cases where you would be doing this, it is usually sufficient to rely on the language-level exception handling support, so I consider RtlUnwindEx relatively uninteresting (at least when compared to RtlVirtualUnwind). Many of the more interesting use cases for directly calling the x64 exception handling support thus require the use of RtlVirtualUnwind directly (although selectively unwinding past certain procedure frames with complete support for calling unwind handlers is made easier by direct usage of RtlUnwindEx).

  4. RtlCaptureStackBackTrace: The RtlCaptureStackBackTrace routine is essentially a high-level implementation of a stack walking routine that utilizes the lower level unwind support (in particular, RtlVirtualUnwind). Unlike StackWalk64, RtlCaptureStackBackTrace is very lightweight and does not use symbols (it is implemented entirely with the unwind metadata present on x64). As such, it does not exist on x86. It is, however, handy for quickly capturing stack traces (and can be used in both user mode and kernel mode in the same fashion). RtlCaptureStackBackTrace does not return non-volatile register contexts for each frame being traced, however, so if you require this functionality, then you would need to implement your own stack trace mechanism on top of RtlVirtualUnwind. (It is worth noting that this sort of mechanism is essentially what functionality like handle tracing and page heap tracing is built on top of, to give you an idea of how useful it can be.) If you only need return addresses for each frame, however, then RtlCaptureStackBackTrace is an excellent API to consider if you need to log stack traces at periodic locations in your own programs for later analysis (especially since it doesn’t require anything as invasive as loading symbols).
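As promised above, here is a bare-bones sketch (x64 only) of a stack walk built directly on these routines. The function name is my own, and error handling, the history table, and nonvolatile context pointers are all omitted for brevity:

#include <windows.h>
#include <stdio.h>

void DumpCallStack(void)
{
    CONTEXT           Context;
    ULONG64           ImageBase;
    ULONG64           EstablisherFrame;
    PVOID             HandlerData;
    PRUNTIME_FUNCTION RuntimeFunction;
    ULONG             Frame;

    RtlCaptureContext(&Context);

    for (Frame = 0; Frame < 64 && Context.Rip != 0; Frame += 1)
    {
        printf("%02lu: %p\n", Frame, (PVOID)Context.Rip);

        RuntimeFunction = RtlLookupFunctionEntry(Context.Rip,
                                                 &ImageBase,
                                                 NULL);

        if (RuntimeFunction == NULL)
        {
            //
            // No unwind metadata; by convention this must be a leaf
            // function, so simply "pop" the return address off the stack.
            //
            Context.Rip  = *(PULONG64)Context.Rsp;
            Context.Rsp += sizeof(ULONG64);
            continue;
        }

        //
        // Virtually unwind to the caller's frame without invoking any
        // unwind or exception handlers (UNW_FLAG_NHANDLER).
        //
        RtlVirtualUnwind(UNW_FLAG_NHANDLER,
                         ImageBase,
                         Context.Rip,
                         RuntimeFunction,
                         &Context,
                         &HandlerData,
                         &EstablisherFrame,
                         NULL);
    }
}

If all you need is the list of return addresses, RtlCaptureStackBackTrace does essentially the same walk for you; the value of doing it by hand, as described above, is that the full nonvolatile register context is available at every frame.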

That’s all there is in this posting. More details on how to use the new unwind support next time…

Vista ASLR is not on by default for image base addresses

Saturday, December 16th, 2006

This little tidbit seems to be missed in all of the press about Vista’s ASLR implementation: Vista ASLR (when speaking of randomizing image base addresses) does not apply to image bases by default. This is a sacrifice for application compatibility’s sake, in an effort to make fewer programs break “out of the box” on Vista. Most notably, this is the case even for images with base relocations.

Unfortunately, the mechanism to mark an executable image as “ASLR aware” (such that it can be freely rebased by Vista’s ASLR) is not at present documented. Furthermore, the linker version that is included with Visual Studio 2005 and the Windows Vista Platform SDK does not support the option necessary to mark an image as ASLR aware (though you could technically modify the image by hand with a hex editor or the like to enable it).

The WDK linker does support the new ASLR-enabling linker option, however (though it too does not appear to document it anywhere). You can find references to this new linker option in makefile.new:

!if defined(NO_DYNAMICBASE)
DYNAMICBASE_FLAG=
!else
! if $(_NT_TOOLS_VERSION) >= 0x800
DYNAMICBASE_FLAG=/dynamicbase
! else
DYNAMICBASE_FLAG=
! endif
!endif

Passing /dynamicbase to the WDK version of link.exe (8.00.50727.215) or later will set the 0x40 DllCharacteristics value in the PE header of the output binary. This corresponds to a newly-defined constant which is at present only documented briefly in the WDK version of ntimage.h:

#define IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE 0x0040
     // DLL can move.

If this flag is set, then the base address of an image can be randomized by Vista’s ASLR; if the flag is clear, then no ASLR-style randomization is performed on the image base address of that image (it is important to note that heap and stack allocations are still randomized in this case – it is only the image base address that is not randomized).
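As a quick aside, you can check whether a given image has this flag set with nothing more than the standard PE header definitions. Here’s a minimal sketch (no validation of the headers is performed, and the #define is only there in case your SDK headers do not include the constant yet):

#include <windows.h>
#include <stdio.h>

#ifndef IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE
#define IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE 0x0040
#endif

//
// Given the base address of a mapped image, report whether the
// "DLL can move" flag is set in its PE header.
//
BOOL IsImageAslrAware(PVOID ImageBase)
{
    PIMAGE_DOS_HEADER DosHeader = (PIMAGE_DOS_HEADER)ImageBase;
    PIMAGE_NT_HEADERS NtHeaders = (PIMAGE_NT_HEADERS)
        ((PUCHAR)ImageBase + DosHeader->e_lfanew);

    return (NtHeaders->OptionalHeader.DllCharacteristics &
            IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE) != 0;
}

int main(void)
{
    //
    // ntdll is built with /dynamicbase on Vista, so this should print 1
    // there (and 0 on down-level platforms).
    //
    printf("%d\n", IsImageAslrAware(GetModuleHandle(TEXT("ntdll.dll"))));
    return 0;
}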

Now, virtually all of the Microsoft PE images that ship with the operating system are built with /dynamicbase, so they will take full advantage of Vista’s ASLR with respect to image base randomization. However, third party (ISV)-built programs will not, by default, gain all the benefits of ASLR due to this application compatibility sacrifice. This is where the potential problem is, as effectively all existing third party PE images will need to be recompiled to enable ASLR on image base addresses. (Technically, you could use link /edit with the WDK linker to do this without a rebuild, or hex edit binaries, but this is not a real solution in my mind. In Microsoft’s defense, many third-party .exe files are built without base relocations, which means that even if Microsoft had enabled ASLR by default, many third party programs would still not be getting the full benefit. This does not, however, mean that I fully agree with their decision…)

I can understand where Microsoft is coming from with respect to application compatibility and ASLR’s impact on poorly written programs (of which there is an abundance in the Windows world), but it is a bit unfortunate that there is no real way to administratively enable ASLR globally, or at least administratively make it an opt-out instead of an opt-in setting.

So, if you are an ISV, here’s a heads-up to be on the lookout for a link.exe version shipping with Visual Studio that supports /dynamicbase. When such becomes available, I would highly recommend enabling /dynamicbase for all of your projects (so long as you aren’t doing anything terribly stupid in your programs, enabling image base randomization should be fairly harmless in most cases). You should also build your .exe files with /FIXED:NO such that they contain a relocation section. This, when combined with /dynamicbase, will allow your .exe files to be randomized by ASLR (just the same as with DLLs that have relocation information and are built with /dynamicbase).
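For the command-line inclined, once you have a linker that understands the option, this amounts to something along the lines of the following (a sketch with a placeholder source file; for real projects, add the equivalent switches to your project’s additional linker options):

cl myapp.c /link /DYNAMICBASE /FIXED:NO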

Update: Visual Studio 2005 SP1 has shipped. This update to Visual Studio includes a newer version of the linker, which supports the /dynamicbase option described above. So, be sure to rebuild your programs with /dynamicbase and /fixed:no with VS 2005 SP1 in order to take full advantage of ASLR on Vista.

Programming against the x64 exception handling support, part 1: Definitions for x64 versions of exception handling support

Wednesday, December 13th, 2006

This is a series dealing with how to use the new x64 exception handling support from a programmatic perspective (that is, how to write programs that take advantage of the new support, rather than how to understand it while reverse engineering or disassembling something; those topics have already been covered on this site).

To get started with programming against the new x64 EH support, you’ll need to have the structure and prototype definitions for the standard x64 EH related functions and structures. One’s first instinct here is to go to MSDN. Be warned that if you are dealing with the low-level SEH routines (such as RtlUnwindEx), the documentation on MSDN is still missing or wrong for x64. For the most part, excepting RtlVirtualUnwind (which is actually correctly documented now), the exception handler support is only properly documented for IA64 (so don’t be surprised if things don’t work out how you would hope when calling RtlUnwindEx with the MSDN prototype).

For a recent project, I had to do some in-depth work with the inner workings of exception handling support on x64. So, if you’ve ever had to deal with the low-level EH internals on x64 and have been frustrated by documentation on MSDN that is either incomplete or just plain wrong, here are some of the things that I have run into along the way that are missing or incorrect on MSDN relating to x64 EH support:

  1. When processing an UNWIND_INFO structure, if the UNW_FLAG_CHAININFO flag is set, then there is an additional undocumented possibility for how unwind information can be chained. Specifically, if the low bit is set in the UnwindInfoAddress of the IMAGE_RUNTIME_FUNCTION_ENTRY structure referred to by the parent UNWIND_INFO structure, UnwindInfoAddress is actually the RVA of another IMAGE_RUNTIME_FUNCTION_ENTRY structure after clearing the low bit (instead of the RVA of an UNWIND_INFO structure). This is used to help more efficiently chain exception data across a binary with minimal waste of space (credits go to skape for telling me about this).
  2. The prototype on MSDN for RtlUnwindEx is only for IA64 and does not apply to x64. The correct prototype is something more along the lines of this:
    VOID
    NTAPI
    RtlUnwindEx(
       __in_opt ULONG64               TargetFrame,
       __in_opt ULONG64               TargetIp,
       __in_opt PEXCEPTION_RECORD     ExceptionRecord,
       __in     PVOID                 ReturnValue,
       __out    PCONTEXT              OriginalContext,
       __in_opt PUNWIND_HISTORY_TABLE HistoryTable
       );
  3. MSDN’s definition of DISPATCHER_CONTEXT (a structure that is passed to the language specific handler) is incomplete. There are some additional fields beyond HandlerData, which is the last field documented in MSDN. You can see this if you disassemble _C_specific_handler, which uses the undocumented ScopeIndex field. Additional credits go to Alex Ionescu for information on a couple of the undocumented DISPATCHER_CONTEXT fields. Here’s the correct definition of this structure for x64:
    typedef struct _DISPATCHER_CONTEXT {
        ULONG64               ControlPc;
        ULONG64               ImageBase;
        PRUNTIME_FUNCTION     FunctionEntry;
        ULONG64               EstablisherFrame;
        ULONG64               TargetIp;
        PCONTEXT              ContextRecord;
        PEXCEPTION_ROUTINE    LanguageHandler;
        PVOID                 HandlerData;
        PUNWIND_HISTORY_TABLE HistoryTable;
        ULONG                 ScopeIndex;
        ULONG                 Fill0;
    } DISPATCHER_CONTEXT, *PDISPATCHER_CONTEXT;
  4. Not all of the flags passed to an exception handler (primarily relating to unwinding) are properly documented on MSDN. These additional flags are included in winnt.h, however, and are actually the same for both x86 and x64. Here’s a listing of the missing flags that apply to the ExceptionFlags member of the EXCEPTION_RECORD structure (only the EXCEPTION_NONCONTINUABLE flag value is documented on MSDN):
    #define EXCEPTION_NONCONTINUABLE   0x0001
    #define EXCEPTION_UNWINDING        0x0002
    #define EXCEPTION_EXIT_UNWIND      0x0004
    #define EXCEPTION_STACK_INVALID    0x0008
    #define EXCEPTION_NESTED_CALL      0x0010
    #define EXCEPTION_TARGET_UNWIND    0x0020
    #define EXCEPTION_COLLIDED_UNWIND  0x0040
    #define EXCEPTION_UNWIND           0x0066

    In particular, EXCEPTION_UNWIND is a bitmask of all of the flags that are used to signify an unwind operation. This is probably the most interesting bitmask/flag to you, as you’ll need it if you are distinguishing an exception from an unwind operation from the perspective of an exception handler (a small sketch of such a check appears after this list).

  5. The definition for the C scope-table information emitted by CL for __try/__except/__finally and implicit exception handlers is not documented. Here’s the definition of the scope table used for C exception handling support:
    typedef struct _SCOPE_TABLE {
    	ULONG Count;
    	struct
    	{
    		 ULONG BeginAddress;
    		 ULONG EndAddress;
    		 ULONG HandlerAddress;
    		 ULONG JumpTarget;
    	} ScopeRecord[ 1 ];
     } SCOPE_TABLE, *PSCOPE_TABLE;
    

    This structure was briefly documented in a beta release of the WDK, although it has since disappeared from the RTM build. The ScopeRecord field describes a variable-sized array whose length is given by the Count field.
    You’ll need this structure definition if you are interacting with _C_specific_handler, or implementing assembler routines that are intended to use _C_specific_handler as their language specific handler.
    All of the above addresses are RVAs. BeginAddress and EndAddress are the RVAs of the range for which the current scope record is effective. HandlerAddress is the RVA of a C-specific exception handler (more on that below) that implements the __except filter routine in C exception support, or the hardcoded value 0x1 to indicate that the __except filter unconditionally accepts the exception (this is also set to 0x1 for a __finally block). The JumpTarget member is the RVA of the body of an __except block (or a __finally block), to which control is transferred if the C exception handler indicates that the exception should be handled.

  6. The C exception handler routine whose RVA is given by the HandlerAddress of the C scope table for a code block is defined as follows:
    typedef
    LONG
    (NTAPI * PC_LANGUAGE_EXCEPTION_HANDLER)(
       __in    PEXCEPTION_POINTERS    ExceptionPointers,
       __in    ULONG64                EstablisherFrame
       );

    The ExceptionPointers argument is the familiar EXCEPTION_POINTERS structure that the GetExceptionInformation macro returns. The EstablisherFrame argument contains the stack pointer value for the routine associated with the C exception handler in question at the point at which the exception occurred. (If the exception occurred in a subfunction called by the function at which the exception is now being inspected, then the stack pointer should be relative to the point just after the call to the faulting function was made.) The EstablisherFrame argument is typically used to allow transparent access to the local variables of the current function from within the exception filter, even though technically the exception filter is not part of the current function but actually a completely different function itself. This is the mechanism by which you can access local variables within an __except expression.
    The function definition deserves a bit more explanation than just the parameter value meanings, however, as it is really dual-purpose. There are two modes for this routine, exception handling mode and unwind handling mode. If the low byte of the ExceptionPointers argument is set to the hardcoded value 0x1, then the handler is being called for an unwind operation. In this case, the rest of the ExceptionPointers argument is meaningless, and only the EstablisherFrame argument holds a meaningful value. In addition, when operating in unwind mode, the return value of the exception handler routine is ignored (the compiler often doesn’t even initialize it for that code path). In exception handling mode (where the ExceptionPointers argument’s low byte is not equal to the hardcoded value 0x1), both arguments are significant, and the return value is also used. In this case, the return value is one of the familiar EXCEPTION_EXECUTE_HANDLER, EXCEPTION_CONTINUE_SEARCH, and EXCEPTION_CONTINUE_EXECUTION constants that are returned by an __except filter expression. If EXCEPTION_EXECUTE_HANDLER is returned, then control will eventually be transferred to the JumpTarget member of the current scope table entry.

  7. The definition of the UNWIND_HISTORY_TABLE structure (and associated substructures) for x64 is as follows (this structure is used as a cache to speed up repeated exception handling lookups, and is typically optional as far as usage with RtlUnwindEx goes – though certainly recommended from a performance perspective):
    #define UNWIND_HISTORY_TABLE_SIZE 12
    
    typedef struct _UNWIND_HISTORY_TABLE_ENTRY {
            ULONG64           ImageBase;
            PRUNTIME_FUNCTION FunctionEntry;
    } UNWIND_HISTORY_TABLE_ENTRY,
    *PUNWIND_HISTORY_TABLE_ENTRY;
    
    #define UNWIND_HISTORY_TABLE_NONE 0
    #define UNWIND_HISTORY_TABLE_GLOBAL 1
    #define UNWIND_HISTORY_TABLE_LOCAL 2
    
    typedef struct _UNWIND_HISTORY_TABLE {
            ULONG                      Count;
            UCHAR                      Search;
            ULONG64                    LowAddress;
            ULONG64                    HighAddress;
            UNWIND_HISTORY_TABLE_ENTRY
               Entry[ UNWIND_HISTORY_TABLE_SIZE ];
    } UNWIND_HISTORY_TABLE, *PUNWIND_HISTORY_TABLE;
  8. There are inconsistencies regarding the usage of RUNTIME_FUNCTION and IMAGE_RUNTIME_FUNCTION_ENTRY in various places in the documentation. These two structures are synonymous on x64 and may be used interchangeably.
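To tie points 4 and 6 together, here is a minimal sketch (under my own naming; MyLanguageHandler is not a system routine) of a language-specific handler of the sort you might register for a hand-written assembler function, showing how the EXCEPTION_UNWIND mask distinguishes the unwind pass from exception dispatching:

#include <windows.h>

//
// Value taken from the flag listing above, in case your headers lack it.
//
#ifndef EXCEPTION_UNWIND
#define EXCEPTION_UNWIND 0x0066
#endif

EXCEPTION_DISPOSITION
NTAPI
MyLanguageHandler(
    PEXCEPTION_RECORD ExceptionRecord,
    PVOID             EstablisherFrame,
    PCONTEXT          ContextRecord,
    PVOID             DispatcherContext
    )
{
    UNREFERENCED_PARAMETER(EstablisherFrame);
    UNREFERENCED_PARAMETER(ContextRecord);
    UNREFERENCED_PARAMETER(DispatcherContext);

    if (ExceptionRecord->ExceptionFlags & EXCEPTION_UNWIND)
    {
        //
        // Unwind pass: this frame is being unwound.  A real handler would
        // perform its __finally-style cleanup work here.
        //
    }
    else
    {
        //
        // Dispatch pass: an exception is searching for a handler.  A real
        // handler would decide whether to handle the exception here.
        //
    }

    return ExceptionContinueSearch;
}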

Most of the other x64 exception handling information on the latest version of MSDN is correct (specifically, parts dealing with function tables, such as RtlLookupFunctionTableEntry.) Remember that the MSDN documentation also includes IA64 definitions on the same page, though (and the IA64 definition is typically the one presented at the top with all of the arguments explained, where you would expect it). You’ll typically need to scroll through the remarks section to find information on the x64 versions of these routines. Be wary of using your locally installed Platform SDK help with the functions that are correctly documented on MSDN, though, as to my knowledge only the very latest SDK version (e.g. the Vista SDK) actually has correct information for the x64 exception handling support; older versions, such as the Platform SDK that shipped with Visual Studio 2005, only include IA64 information for routines like RtlVirtualUnwind or RtlLookupFunctionTableEntry. In general, anywhere you see a reference to a FRAME_POINTERS or Gp structure or value in the documentation, this is a good hint that the documentation is talking exclusively about IA64 and does not directly apply to x64.

That’s all for this installment. More on how to use this information from a programmatic perspective next time…

Debugger internals: How loaded module names are communicated to the debugger

Monday, December 11th, 2006

If you’ve ever used the Win32 debugging API, you’ll notice that the WaitForDebugEvent routine, when returning a LOAD_DLL_DEBUG_EVENT style of event, gives you the address of an optional debuggee-relative string pointer containing the name of the DLL that is being loaded. In case you’ve ever wondered just where that string comes from, you’ll be comforted to know that this mechanism for communicating module name strings to the remote debugger is built upon a giant hack.
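To make that concrete, here is a bare-bones sketch of the debugger side of things: pulling the image name out of a LOAD_DLL_DEBUG_EVENT. The function name is mine, hProcess is assumed to be a handle to the debuggee with PROCESS_VM_READ access (e.g. saved from the CREATE_PROCESS_DEBUG_EVENT), and error handling is kept to a minimum:

#include <windows.h>
#include <stdio.h>

void PrintLoadedDllName(HANDLE hProcess, const LOAD_DLL_DEBUG_INFO *LoadDll)
{
    PVOID RemoteNamePtr = NULL;
    WCHAR NameW[MAX_PATH] = { 0 };
    CHAR  NameA[MAX_PATH] = { 0 };

    //
    // lpImageName, if non-NULL, points (in the debuggee's address space)
    // at a pointer to the name string; hence the double read.
    //
    if (LoadDll->lpImageName == NULL)
        return;

    if (!ReadProcessMemory(hProcess, LoadDll->lpImageName, &RemoteNamePtr,
                           sizeof(RemoteNamePtr), NULL) ||
        RemoteNamePtr == NULL)
    {
        return;
    }

    //
    // A robust debugger would read the string in small chunks to avoid
    // faulting past the end of the debuggee's buffer; a single fixed-size
    // read is good enough for illustration.
    //
    if (LoadDll->fUnicode)
    {
        ReadProcessMemory(hProcess, RemoteNamePtr, NameW,
                          sizeof(NameW) - sizeof(WCHAR), NULL);
        printf("DLL loaded: %S\n", NameW);
    }
    else
    {
        ReadProcessMemory(hProcess, RemoteNamePtr, NameA,
                          sizeof(NameA) - sizeof(CHAR), NULL);
        printf("DLL loaded: %s\n", NameA);
    }
}

You would call this from your WaitForDebugEvent loop when dwDebugEventCode is LOAD_DLL_DEBUG_EVENT, passing DebugEvent.u.LoadDll.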

To give a bit of background information on how loading of DLLs works, most of the heavy lifting with respect to loading DLLs (referred to as “mapping an image”) is done by the memory manager subsystem in kernel mode – specifically, in the “MiMapViewOfImageSection” internal routine. This routine is responsible for taking a section object (known as a file mapping object in the Win32 world) that represents a PE image on disk, and setting up the in-memory layout of the PE image in the specified process address space (in the case of Win32, always the address space of the caller). This includes setting up PE image subsections with the correct alignment, zero-filling “bss”-style sections, and setting up the protections of each PE image subsection. It is also responsible for supplying the “magic” necessary to allow shared PE subsections to work. All of this behavior is controlled by the SEC_IMAGE flag being passed to NtMapViewOfSection (this behavior is visible from Win32 by passing SEC_IMAGE to MapViewOfFile, and can be used to achieve the same result of “just” mapping an image in-memory without going through the loader). Internally, the loader routine in NTDLL (LdrLoadDll and its associated subfunctions, which are called by the LoadLibrary family of routines in kernel32) utilizes NtMapViewOfSection to create the in-memory layout of the DLL being requested. After performing this task, the user-mode NTDLL-based loader then performs tasks such as applying base relocations, resolving imports to other modules (and loading dependent modules if necessary), allocating TLS data slots, making DLL initializer callouts, and so forth.
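As a small aside, the SEC_IMAGE behavior mentioned above is easy to see for yourself from Win32. The following sketch (my own helper name, with minimal error handling) maps a PE file with the loader’s in-memory layout, but without ever involving the loader itself, so no imports are resolved and no DllMain is called:

#include <windows.h>

PVOID MapImageNoLoader(LPCWSTR ImagePath)
{
    HANDLE File;
    HANDLE Section;
    PVOID  View;

    File = CreateFileW(ImagePath, GENERIC_READ, FILE_SHARE_READ,
                       NULL, OPEN_EXISTING, 0, NULL);

    //
    // SEC_IMAGE asks the memory manager to lay the file out as an
    // executable image (aligned subsections, per-section protections)
    // rather than as a flat data mapping.
    //
    Section = CreateFileMappingW(File, NULL, PAGE_READONLY | SEC_IMAGE,
                                 0, 0, NULL);

    View = MapViewOfFile(Section, FILE_MAP_READ, 0, 0, 0);

    //
    // The view keeps the section alive; unmap it with UnmapViewOfFile
    // when you are done with it.
    //
    CloseHandle(Section);
    CloseHandle(File);

    return View;
}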

Now, the way that the debugger is notified of module load events is via a kernel mode hook that is called by NtMapViewOfSection (DbgkMapViewOfSection). This hook is responsible for detecting if a debugger (user mode or kernel mode) is present, and if so, forwarding the event to the debugger.

This is all well and good, but there’s a catch here. Both the user mode and kernel mode debuggers display the full path name to the DLL being loaded, but we’re now at the wrong level of abstraction, so to speak, to retrieve this information. All MiMapViewOfSection has is a handle to a section object (in actuality, a PSECTION_OBJECT and not a handle at this point). Now, the section object *does* have a reference to the PFILE_OBJECT associated with the file backing the section object (the reference is stored in the CONTROL_AREA of the section object), but there isn’t necessarily a good way to get the original filename that was passed to LoadLibrary out of the FILE_OBJECT (for starters, at this point, that path has already been converted to a native path instead of a Win32 path, and there is some potential ambiguity when trying to convert native paths back to Win32 paths).

To work around this little conundrum, the solution the developers chose is to temporarily borrow a field of the NT_TIB portion of the TEB of the calling thread for use as a way to signal the name of a DLL that is being loaded (if SEC_IMAGE is being passed to NtMapViewOfSection). Specifically, NT_TIB.ArbitraryUserPointer is temporarily replaced with a string pointer (in Windows NT, this is always a Unicode string) to the original filename passed to LdrLoadDll. Normally, the ArbitraryUserPointer field is reserved exclusively for use by user mode as a sort of “free TLS slot” that is available at a known location for every thread. Although this particular value is rarely used in Windows, the loader does make the effort to preserve its value across calls to LdrLoadDll. This works (since the loader knows that none of the code that it is calling will use NT_TIB.ArbitraryUserPointer), so long as you don’t have cross-thread accesses to a different thread’s NT_TIB.ArbitraryUserPointer (to date, I have never seen a program that tries to do this – and a good thing too, or it would randomly fail when DLLs are being loaded). Because the original value of NT_TIB.ArbitraryUserPointer is restored, the calling thread is typically none the wiser that this substitution has been performed.
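Just to illustrate where the borrowed field lives, here is a trivial snippet (purely illustrative; as noted above, well-behaved code shouldn’t try to make clever use of this) that reads the current thread’s ArbitraryUserPointer. Since NT_TIB is the first member of the TEB, the TEB pointer can simply be cast:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    //
    // NT_TIB is the first member of the TEB, so the cast is safe.
    //
    PNT_TIB Tib = (PNT_TIB)NtCurrentTeb();

    printf("ArbitraryUserPointer = %p\n", Tib->ArbitraryUserPointer);
    return 0;
}

During a call to LoadLibrary on the same thread, this field transiently holds the DLL name string pointer described above; at any other time it holds whatever the application last stored there (usually nothing).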

Disassembling the part of the NTDLL loader responsible for mapping the DLL into the address space via NtMapViewOfSection (a subroutine named “LdrpMapViewOfDllSection” on Windows Vista), we can see this behavior in action:

ntdll!LdrpMapViewOfDllSection:
[...]
;
; Find the TEB address for the current thread.
; esi = NtCurrentTeb()->NtTib.Self
;
77f0e2ee 648b3518000000  mov     esi,dword ptr fs:[18h]
77f0e2f5 8365fc00        and     dword ptr [ebp-4],0
77f0e2f9 57              push    edi
77f0e2fa bf00000020      mov     edi,20000000h
77f0e2ff 857d18          test    dword ptr [ebp+18h],edi
77f0e302 c745f804000000  mov     dword ptr [ebp-8],4
77f0e309 0f85ce700400    jne     LdrpMapViewOfDllSection+0x26

ntdll!LdrpMapViewOfDllSection+0x42:
77f0e30f 8b4514          mov     eax,dword ptr [ebp+14h]
;
; Save away the previous ArbitraryUserPointer value.
;
; ebx = Teb->NtTib.ArbitraryUserPointer
77f0e312 8b5e14          mov     ebx,dword ptr [esi+14h]
77f0e315 6a04            push    4
77f0e317 ff7518          push    dword ptr [ebp+18h]
;
; Set the ArbitraryUserPointer value to the string pointer
; referring to the DLL name passed to LdrLoadDll.
; Teb->NtTib.ArbitraryUserPointer = (PVOID)DllNameString;
; 
77f0e31a 894614          mov     dword ptr [esi+14h],eax
77f0e31d 6a01            push    1
77f0e31f ff7510          push    dword ptr [ebp+10h]
77f0e322 33c0            xor     eax,eax
77f0e324 50              push    eax
77f0e325 50              push    eax
77f0e326 50              push    eax
77f0e327 ff750c          push    dword ptr [ebp+0Ch]
77f0e32a 6aff            push    0FFFFFFFFh
77f0e32c ff7508          push    dword ptr [ebp+8]
;
; Call NtMapViewOfSection to map the image and perform the
; debugger notification.
;
77f0e32f e830180300      call    NtMapViewOfSection
77f0e334 857d18          test    dword ptr [ebp+18h],edi
77f0e337 5f              pop     edi
;
; Restore the previous value of
; Teb->NtTib.ArbitraryUserPointer.
;
77f0e338 895e14          mov     dword ptr [esi+14h],ebx
77f0e33b 5e              pop     esi
77f0e33c 894514          mov     dword ptr [ebp+14h],eax
77f0e33f 5b              pop     ebx
77f0e340 0f85bc700400    jne     LdrpMapViewOfDllSection+0x75

Sure enough, the user mode loader uses the current thread’s NT_TIB.ArbitraryUserPointer to communicate the DLL name string pointer (in this context, the “eax” value loaded into NT_TIB.ArbitraryUserPointer is the dll name string.) We can easily verify this in the debugger:

Breakpoint 0 hit
eax=0017ecfc ebx=00000000 ecx=0017ecd8
edx=774951b4 esi=c0000135 edi=0017ed80
eip=773fe2e5 esp=0017ec10 ebp=0017ed18
iopl=0         nv up ei pl zr na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b
gs=0000             efl=00000246
ntdll!LdrpMapViewOfDllSection:
773fe2e5 8bff            mov     edi,edi
0:000> g 773fe31a 
eax=001db560 ebx=00000000 ecx=0017ecd8
edx=774951b4 esi=7ffdf000 edi=20000000
eip=773fe31a esp=0017ebf0 ebp=0017ec0c
iopl=0         nv up ei pl zr na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b
gs=0000             efl=00000246
ntdll!LdrpMapViewOfDllSection+0x4d:
773fe31a 894614          mov     dword ptr [esi+14h],eax
0:000> du @eax
001db560  "C:\\Windows\\system32\\CLBCatQ.DLL"

Looking in the kernel, we can clearly see the call to DbgkMapViewOfSection:

ntoskrnl!NtMapViewOfSection+0x21a:
0060a9b6 50              push    eax
0060a9b7 8b55e0          mov     edx,dword ptr [ebp-20h]
0060a9ba 8b4dd8          mov     ecx,dword ptr [ebp-28h]
0060a9bd e86e1c0100      call    ntoskrnl!DbgkMapViewOfSection

Additionally, we can see the references to NT_TIB in DbgkMapViewOfSection:

ntoskrnl!DbgkMapViewOfSection+0x65:
;
; Load eax with the address of the current thread's
; KTHREAD object.
;
; Here, fs refers to the KPCR.
;    +0x120 PrcbData         : _KPRCB
;  (in KPRCB)
;    +0x004 CurrentThread    : Ptr32 _KTHREAD
;
0061c695 64a124010000    mov     eax,dword ptr fs:[00000124h]
;
; Load esi with the address of the current thread's
; user mode PTEB.
;
; Here, we have the following layout in KTHREAD:
;    +0x084 Teb              : Ptr32 Void
;
0061c69b 8bb084000000    mov     esi,dword ptr [eax+84h]
0061c6a1 eb02            jmp     DbgkMapViewOfSection+0x75
ntoskrnl!DbgkMapViewOfSection+0x75:
0061c6a5 3bf3            cmp     esi,ebx
0061c6a7 7421            je      DbgkMapViewOfSection+0x9a
0061c6a9 3b8a44010000    cmp     ecx,dword ptr [edx+144h]
0061c6af 7519            jne     DbgkMapViewOfSection+0x9a
0061c6b1 56              push    esi
0061c6b2 e82c060200      call    DbgkpSuppressDbgMsg
0061c6b7 85c0            test    eax,eax
0061c6b9 0f85bf000000    jne     DbgkMapViewOfSection+0x144
0:000> u
ntoskrnl!DbgkMapViewOfSection+0x8f:
;
; Recall that 14 is the offset of the
; ArbitraryUserPointer member in NT_TIB,
; and that NT_TIB is the first member of TEB.
;
;    +0x000 NtTib            : _NT_TIB
;  (in NT_TIB)
;    +0x014 ArbitraryUserPointer : Ptr32 Void
;
0061c6bf 83c614          add     esi,14h
;
; [ebp-90h] now holds the address of the current thread's
; NtCurrentTeb()->NtTib.ArbitraryUserPointer.
;
0061c6c2 89b570ffffff    mov     dword ptr [ebp-90h],esi

Such is the story of how the filename that you pass to LoadLibrary ends up being communicated to the debugger, in a rather roundabout and hackish way.
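
Incidentally, the field is easy to poke at from user mode yourself. Here’s a minimal sketch of my own (not part of the loader) that reads and writes the current thread’s ArbitraryUserPointer via NtCurrentTeb(); the string stored into it is just a placeholder:

#include <windows.h>
#include <stdio.h>

int
__cdecl
wmain(
   int ac,
   wchar_t** av
   )
{
   //
   // The TEB begins with an NT_TIB, so the cast below is safe.
   //
   PNT_TIB Tib = (PNT_TIB)NtCurrentTeb();

   printf("ArbitraryUserPointer = %p\n",
      Tib->ArbitraryUserPointer);

   //
   // Anything may be stored here; the loader temporarily stashes the
   // DLL name string pointer and restores the old value afterwards.
   //
   Tib->ArbitraryUserPointer = (PVOID)L"example.dll";

   printf("ArbitraryUserPointer = %p\n",
      Tib->ArbitraryUserPointer);

   return 0;
}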

It is also worth noting that the kernel cannot trust the user-mode-supplied filename when it opens the file handle to the DLL that gets passed to the debugger process, because it opens that handle with ZwOpenFile, which bypasses the normal security checks. As a result, the kernel needs to retrieve the filename by querying the section’s associated PFILE_OBJECT anyway, although for a different purpose than providing the filename to the debugger.

An introduction to kernrate (the Windows kernel profiler)

Thursday, December 7th, 2006

One useful utility for tracking down performance problems that you might not have heard of until now is kernrate, the Windows kernel profiler. This utility currently ships with the Windows Server 2003 Resource Kit Tools package (though you can use kernrate on Windows XP as well) and is freely downloadable. You’ll have to match the version of kernrate to your processor architecture, so if you are running an x64 edition of Windows, you’ll have to dig up an x64 build of kernrate (the one that ships with the Srv03 resource kit tools is x86); KrView (see below) ships with an x64-compatible version of kernrate.

Kernrate requires that you hold SeSystemProfilePrivilege (the “profile system performance” privilege, which is typically only granted to administrators), so in most cases you will need to be a local administrator on your system in order to use it. This privilege gates access to the (undocumented) profile object system services. These APIs allow programmatic sampling of the instruction pointer at certain intervals (typically, a profiler program selects the timer interrupt as the source for instruction pointer sampling). This lets you get a feel for what the system is doing over time, which is in turn useful for identifying the cause of performance issues where a particular operation appears to be processor bound and is taking longer than you would like.
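
If you ever need to do the equivalent in your own profiling code (kernrate presumably does something along these lines internally), enabling the privilege programmatically is straightforward. Here’s a minimal sketch of my own, assuming the account actually holds the privilege (link with advapi32.lib):

#include <windows.h>

//
// Attempt to enable SeSystemProfilePrivilege for the current process.
// Returns TRUE on success, FALSE if the privilege could not be enabled.
//
BOOL
EnableProfilePrivilege(
   void
   )
{
   HANDLE           Token;
   TOKEN_PRIVILEGES Privileges;
   BOOL             Ok;

   if (!OpenProcessToken(GetCurrentProcess(),
      TOKEN_ADJUST_PRIVILEGES | TOKEN_QUERY,
      &Token))
      return FALSE;

   Privileges.PrivilegeCount           = 1;
   Privileges.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;

   if (!LookupPrivilegeValue(NULL,
      SE_SYSTEM_PROFILE_NAME,
      &Privileges.Privileges[0].Luid))
   {
      CloseHandle(Token);
      return FALSE;
   }

   //
   // AdjustTokenPrivileges can "succeed" without enabling anything,
   // so the last error status must be checked as well.
   //
   Ok = AdjustTokenPrivileges(Token,
      FALSE,
      &Privileges,
      0,
      NULL,
      NULL) && GetLastError() == ERROR_SUCCESS;

   CloseHandle(Token);

   return Ok;
}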

There are a multitude of options that you can give kernrate (and you are probably best served by experimenting with them a bit on your own), so I’ll just cover the common ones that you’ll need to get started (use “kernrate -?” to get a list of all supported options).

Kernrate can be used to profile both user mode and kernel mode performance issues. By default, it operates only on kernel mode code, but you can override this via the -a (and -av) options, which cause kernrate to include user mode code in its profiling operations in addition to kernel mode code. Additionally, by default, kernrate operates over the entire system at once; to get meaningful results with profiling user mode code, you’ll want to specify a process (or group of processes) to profile, with the “-p pid” and/or “-n process-name” arguments. (The process name is the first 8 characters of a process’s main executable filename.)

To terminate collection of profiling data, use Ctrl-C. (On pre-Windows-Vista systems, where you might be running kernrate.exe via runas, remember that Ctrl-C does not work on console processes started via runas.) Additionally, you can use the “-s seconds” argument to specify that profiling should be automagically stopped after a given number of seconds have elapsed.
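
Putting the options mentioned so far together, a hypothetical invocation that profiles both user mode and kernel mode code for a particular process, and stops itself after 30 seconds, might look like so (the process name “myapp” is just a placeholder; check “kernrate -?” for the exact option syntax of your kernrate build):

kernrate -a -n myapp -s 30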

If you run kernrate on kernel mode code only, or just specify a process (or group of processes) as described above, you’ll notice that you get a whole lot of general system-wide output (information about interrupt counts, global processor time usage, context switch counts, I/O operation counts) in addition to output about which modules used a noteworthy amount of processor time. Here’s an example output of running kernrate on just the kernel on my system, as described above (including just the module totals):

D:\Programs\Utilities>kernrate
Kernrate User-Specified Command Line:
kernrate


Kernel Profile (PID = 0): Source= Time,
Using Kernrate Default Rate of 25000 events/hit
Starting to collect profile data

***> Press ctrl-c to finish collecting profile data
===> Finished Collecting Data, Starting to Process Results

------------Overall Summary:--------------

[...]

OutputResults: KernelModuleCount = 153
Percentage in the following table is based on
the Total Hits for the Kernel

Time   197 hits, 25000 events per hit --------
 Module    Hits   msec  %Total  Events/Sec
intelppm     67        980    34 %     1709183
ntkrnlpa     52        981    26 %     1325178
win32k       35        981    17 %      891946
hal          19        981     9 %      484199
dxgkrnl       6        980     3 %      153061
nvlddmkm      6        980     3 %      153061
fanio         3        981     1 %       76452
bcm4sbxp      2        981     1 %       50968
portcls       2        980     1 %       51020
STAC97        2        980     1 %       51020
bthport       1        981     0 %       25484
BTHUSB        1        981     0 %       25484
Ntfs          1        980     0 %       25510

Using kernrate in this fashion is a good first step towards profiling a performance problem (especially if you are working with someone else’s program), as it quickly allows you to narrow a processor hog down to a particular module. While this is useful as a first step, however, it doesn’t really give you a whole lot of information about what specific code in a particular module is taking up all that processor time.

To dig in deeper as to the cause of the problem (beyond just tracing it to a particular module), you’ll need to use the “-z module-name” option. This option tells kernrate to “zoom in” on a particular module; that is, for the given module, kernrate will track instruction pointer locations within the module to individual functions. This level of granularity is often what you’ll need for tracking down a performance issue (at least as far as profiling is concerned). You can repeat the “-z” option multiple times to “zoom in” to multiple modules (useful if the problem you are tracking down involves high processor usage across multiple DLLs or binaries).
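
For instance, a hypothetical run that profiles a user mode process and zooms in on both its main executable and ntdll might look like so (again, “myapp” is just a placeholder process and module name):

kernrate -a -n myapp -z myapp -z ntdll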

Because kernrate resolves instruction pointer samples to a finer granularity than modules when the “-z” option is used, you’ll need to tell it how to load symbols for the affected modules (otherwise, the granularity of the profiler output will typically be very poor, often restricted to just exported functions). There are two ways to do this. First, you can use the “-j symbol-path” command line option, which tells kernrate to pass a particular symbol path to DbgHelp for use when loading symbols. I recommend the second option, however, which is to configure your _NT_SYMBOL_PATH beforehand so that it points to a valid DbgHelp symbol path. This relieves you of having to manually give kernrate a symbol path every time you run it.
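
If you haven’t set _NT_SYMBOL_PATH up already, a typical value that combines a local downstream store with the public Microsoft symbol server looks something like the following (the local cache directory is just an example; use whatever location you prefer):

set _NT_SYMBOL_PATH=srv*C:\SymbolCache*http://msdl.microsoft.com/download/symbols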

Continuing with the example I gave above, we might be interested in just what the “win32k” module (the Win32 kernel mode support driver for USER/GDI) was doing that took up 17% of the processor time spent in kernel mode on my system (for the interval that I was profiling). To do that, we can use the following command line (the output has been truncated to include only the information that we are interested in):

D:\Programs\Utilities>kernrate -z win32k

Kernrate User-Specified Command Line:
kernrate -z win32k


Kernel Profile (PID = 0): Source= Time,
Using Kernrate Default Rate of 25000 events/hit
CallBack: Finished Attempt to Load symbols for
90a00000 \SystemRoot\System32\win32k.sys

Starting to collect profile data

***> Press ctrl-c to finish collecting profile data
===> Finished Collecting Data, Starting to Process Results

------------Overall Summary:--------------

[...]

OutputResults: KernelModuleCount = 153
Percentage in the following table is based on the
Total Hits for the Kernel

Time   2465 hits, 25000 events per hit --------
 Module      Hits   msec  %Total  Events/Sec
ntkrnlpa     1273      14799    51 %     2150483
win32k        388      14799    15 %      655449
intelppm      263      14799    10 %      444286
hal           236      14799     9 %      398675
bcm4sbxp       66      14799     2 %      111494
spsys          55      14799     2 %       92911
nvlddmkm       48      14799     1 %       81086
STAC97         31      14799     1 %       52368

[...]


===> Processing Zoomed Module win32k.sys...


----- Zoomed module win32k.sys (Bucket size = 16 bytes,
Rounding Down) --------
Percentage in the following table is based on the
Total Hits for this Zoom Module

Time   388 hits, 25000 events per hit --------
 Module                  Hits   msec  %Total  Events/Sec
xxxInternalDoPaint         44      14799    10 %       74329
XDCOBJ::bSaveAttributes    20      14799     4 %       33786
DelayedDestroyCacheDC      20      14799     4 %       33786
HANDLELOCK::vLockHandle    15      14799     3 %       25339
mmxAlphaPerPixelOnly       15      14799     3 %       25339
XDCOBJ::RestoreAttributes  13      14799     2 %       21960
DoTimer                    12      14799     2 %       20271
_SEH_prolog4               11      14799     2 %       18582
memmove                     9      14799     2 %       15203
_GetDCEx                    6      14799     1 %       10135
HmgLockEx                   6      14799     1 %       10135
XDCOBJ::bCleanDC            5      14799     1 %        8446
XEPALOBJ::ulIndexToRGB      5      14799     1 %        8446
HmgShareCheckLock           4      14799     0 %        6757
RGNOBJ::bMerge              4      14799     0 %        6757

[...]

This should give you a feel for the kind of information that you’ll get from kernrate. Although the examples I gave were profiling kernel mode code, the whole process works the same way for user mode if you use the “-p” or “-n” options as I mentioned earlier. In conjunction with a debugger, the information that kernrate gives you can often be a great help in narrowing down CPU usage performance problems (or at the very least point you in the general direction as to where you’ll need to do further research).

There are also a variety of other options that are available in kernrate, such as features for gathering information about “hot” locks that have a high degree of contention, and support for launching new processes under the profiler. There is also support for outputting the raw sampled profile data, which can be used to graph the output (such as you might see used with tools like KrView).

Although kernrate doesn’t have all the “bells and whistles” of some of the high-end profiling tools (like Intel’s VTune), it’s often enough to get the job done, and it’s available to you at no extra cost (and can be quickly and easily deployed to help find the source of a problem). I’d highly recommend giving it a shot if you are trying to analyze a performance problem and don’t already have a profiling solution that you are using.

Frame pointer omission (FPO) optimization and consequences when debugging, part 2

Wednesday, December 6th, 2006

This series is about frame pointer omission (FPO) optimization and how it impacts the debugging experience.

  1. Frame pointer omission (FPO) and consequences when debugging, part 1.
  2. Frame pointer omission (FPO) and consequences when debugging, part 2.

Last time, I outlined the basics as to just what FPO does, and what it means in terms of generated code when you compile programs with or without FPO enabled. This article builds on the last, and lays out just what the impacts of having FPO enabled (or disabled) are when you end up having to debug a program.

For the purposes of this article, consider the following example program, with several do-nothing functions that shuffle stack arguments around and call each other. (I have disabled global optimizations and function inlining for this example.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <intrin.h>

__declspec(noinline)
void
f3(
   int* c,
   char* b,
   int a
   )
{
   *c = a * 3 + (int)strlen(b);

   __debugbreak();
}

__declspec(noinline)
int
f2(
   char* b,
   int a
   )
{
   int c;

   f3(
      &c,
      b + 1,
      a - 3);

   return c;
}

__declspec(noinline)
int
f1(
   int a,
   char* b
   )
{
   int c;
   
   c = f2(
      b,
      a + 10);

   c ^= (int)rand();

   return c + 2 * a;
}

int
__cdecl
wmain(
   int ac,
   wchar_t** av
   )
{
   int c;

   c = f1(
      (int)rand(),
      "test");

   printf("%d\\n",
      c);

   return 0;
}

If we run the program and break into the debugger at the hardcoded breakpoint, with symbols loaded, everything is as one might expect:

0:000> k
ChildEBP RetAddr  
0012ff3c 010015ef TestApp!f3+0x19
0012ff4c 010015fe TestApp!f2+0x15
0012ff54 0100161b TestApp!f1+0x9
0012ff5c 01001896 TestApp!wmain+0xe
0012ffa0 77573833 TestApp!__tmainCRTStartup+0x10f
0012ffac 7740a9bd kernel32!BaseThreadInitThunk+0xe
0012ffec 00000000 ntdll!_RtlUserThreadStart+0x23

Since we have symbols loaded, we’ll get a reasonable call stack regardless of whether FPO optimization is turned on or off. The story is different, however, if we do not have symbols loaded. Looking at the same program with FPO optimizations enabled and symbols not loaded, we get somewhat of a mess if we ask for a call stack:

0:000> k
ChildEBP RetAddr  
WARNING: Stack unwind information not available.
Following frames may be wrong.
0012ff4c 010015fe TestApp+0x15d8
0012ffa0 77573833 TestApp+0x15fe
0012ffac 7740a9bd kernel32!BaseThreadInitThunk+0xe
0012ffec 00000000 ntdll!_RtlUserThreadStart+0x23

Comparing the two call stacks, we have lost three of the call frames entirely from the output. The only reason we got anything even slightly reasonable at all is that WinDbg’s stack trace mechanism has some intelligent heuristics for guessing the location of call frames on a stack even when frame pointer information is not available.

If we look back at how call stacks are set up when frame pointers are used (from the previous article), a program trying to walk the stack on x86 without symbols works by treating the stack as a sort of linked list of call frames. Recall the layout of the stack when a frame pointer is used:

[ebp-01]   Last byte of the last local variable
[ebp+00]   Old ebp value
[ebp+04]   Return address
[ebp+08]   First argument...

This means that if we are trying to perform a stack walk without symbols, the way to go is to assume that ebp points to a “structure” that looks something like this:

typedef struct _CALL_FRAME
{
   struct _CALL_FRAME* Next;
   void*               ReturnAddress;
} CALL_FRAME, * PCALL_FRAME;

Note how this corresponds to the stack layout relative to ebp that I described above.

A very simple stack walk function designed to walk frames that are compiled with frame pointer usage might then look like so (using the _AddressOfReturnAddress intrinsic to find “ebp”, assuming that the old ebp is 4 bytes before the address of the return address):

#include <windows.h>

LONG
StackwalkExceptionHandler(
   PEXCEPTION_POINTERS ExceptionPointers
   )
{
   if (ExceptionPointers->ExceptionRecord->ExceptionCode
      == EXCEPTION_ACCESS_VIOLATION)
      return EXCEPTION_EXECUTE_HANDLER;

   return EXCEPTION_CONTINUE_SEARCH;
}

void
stackwalk(
   void* ebp
   )
{
   PCALL_FRAME frame = (PCALL_FRAME)ebp;

   printf("Trying ebp %p\\n",
      ebp);

   __try
   {
      for (unsigned i = 0;
          i < 100;
          i++)
      {
         if ((ULONG_PTR)frame & 0x3)
         {
            printf("Misaligned frame\\n");
            break;
         }

         printf("#%02lu %p  [@ %p]\\n",
            i,
            frame,
            frame->ReturnAddress);

         frame = frame->Next;
      }
   }
   __except(StackwalkExceptionHandler(
      GetExceptionInformation()))
   {
      printf("Caught exception\\n");
   }
}

#pragma optimize("y", off)
__declspec(noinline)
void printstack(
   )
{
   void* ebp = (ULONG*)_AddressOfReturnAddress()
     - 1;

   stackwalk(
      ebp);
}
#pragma optimize("", on)

If we recompile the program, disable FPO optimizations, and insert a call to printstack inside the f3 function, the console output is something like so:

Trying ebp 0012FEB0
#00 0012FEB0  [@ 0100185C]
#01 0012FED0  [@ 010018B4]
#02 0012FEF8  [@ 0100190B]
#03 0012FF2C  [@ 01001965]
#04 0012FF5C  [@ 01001E5D]
#05 0012FFA0  [@ 77573833]
#06 0012FFAC  [@ 7740A9BD]
#07 0012FFEC  [@ 00000000]
Caught exception

In other words, without using any symbols, we have successfully performed a stack walk on x86.

However, this all breaks down when a function somewhere in the call stack does not use a frame pointer (i.e. was compiled with FPO optimizations enabled). In that case, the assumption that ebp always points to a CALL_FRAME structure is no longer valid, and the call stack is either cut short or completely wrong (especially if the function in question repurposed ebp for some use other than as a frame pointer). Although it is possible to use heuristics to try to guess what really is a return address record on the stack, this is nothing more than an educated guess, and it tends to be at least slightly wrong (and typically misses one or more frames entirely).
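
To give a feel for what such a heuristic might look like, here is a deliberately crude sketch of my own (this is not what WinDbg actually does, just the general idea): scan the stack between the current frame and the stack base recorded in the TEB, and flag anything that points into committed, executable memory as a possible return address.

#include <windows.h>
#include <stdio.h>
#include <intrin.h>

__declspec(noinline)
void
guessstack(
   void
   )
{
   //
   // NtCurrentTeb()->NtTib gives us the bounds of the current thread's
   // stack, so we know where to stop scanning.
   //
   PNT_TIB Tib   = (PNT_TIB)NtCurrentTeb();
   void**  Slot  = (void**)_AddressOfReturnAddress();
   void**  Limit = (void**)Tib->StackBase;

   for (; Slot < Limit; Slot++)
   {
      MEMORY_BASIC_INFORMATION Mbi;

      //
      // Anything that points into committed, executable memory is a
      // candidate return address (with plenty of false positives).
      //
      if (VirtualQuery(*Slot,
            &Mbi,
            sizeof(Mbi))            &&
         Mbi.State == MEM_COMMIT    &&
         (Mbi.Protect & (PAGE_EXECUTE | PAGE_EXECUTE_READ |
            PAGE_EXECUTE_READWRITE | PAGE_EXECUTE_WRITECOPY)))
      {
         printf("possible return address %p (stack slot %p)\n",
            *Slot,
            Slot);
      }
   }
}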

Now, you might be wondering why you might care about doing stack walk operations without symbols. After all, you have symbols for the Microsoft binaries that your program will be calling (such as kernel32) available from the Microsoft symbol server, and you (presumably) have private symbols corresponding to your own program for use when you are debugging a problem.

Well, the answer to that is that you will end up needing to record stack traces without symbols in the course of normal debugging for a wide variety of problems. The reason for this is that there is a lot of support baked into NTDLL (and NTOSKRNL) to assist in debugging a class of particularly insidious problems: handle leaks (and other problems where the wrong handle value is getting closed somewhere and you need to find out why), memory leaks, and heap corruption.

These (very useful!) debugging features offer options that allow you to configure the system to log a stack trace on each heap allocation, heap free, or each time a handle is opened or closed. Now the way these features work is that they will capture the stack trace in real time as the heap operation or handle operation happens, but instead of trying to break into the debugger to display the results of this output (which is undesirable for a number of reasons), they save a copy of the current stack trace in-memory and then continue execution normally. To display these saved stack traces, the !htrace, !heap -p, and !avrf commands have functionality that locates these saved traces in-memory and prints them out to the debugger for you to inspect.
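
For reference, the heap-related facilities are typically switched on with gflags.exe (which ships with the Debugging Tools for Windows). For example, something along these lines enables the user mode stack trace database and page heap for a particular image (the image name is, of course, a placeholder), after which commands like !heap -p can show you the saved traces:

gflags -i myapp.exe +ust +hpa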

However, NTDLL/NTOSKRNL needs a way to create these stack traces in the first place, so that it can save them for later inspection. There are a couple of requirements here:

  1. The functionality to capture stack traces must not rely on anything layered above NTDLL or NTOSKRNL. This already means that anything as complicated as downloading and loading symbols via DbgHelp is instantly out of the picture, as those functions are layered far above NTDLL / NTOSKRNL (and indeed, they must call into the same functions that would be logging stack traces in the first place in order to find symbols).
  2. The functionality must work when symbols for everything on the call stack are not even available to the local machine. For instance, these pieces of functionality must be deployable on a customer computer without giving that computer access to your private symbols in some fashion. As a result, even if there was a good way to locate symbols where the stack trace is being captured (which there isn’t), you couldn’t even find the symbols if you wanted to.
  3. The functionality must work in kernel mode (for saving handle traces), as handle tracing is partially managed by the kernel itself and not just NTDLL.
  4. The functionality must use a minimum amount of memory to store each stack trace, as operations like heap allocation, heap deallocation, handle creation, and handle closure are extremely frequent operations throughout the lifetime of the process. As a result, options like just saving the entire thread stack for later inspection when symbols are available cannot be used, since that would be prohibitively expensive in terms of memory usage for each saved stack trace.

Given all of these restrictions, the code responsible for saving stack traces needs to operate without symbols, and it must furthermore be able to save stack traces in a very concise manner (without using a great deal of memory for each trace).

As a result, on x86, the stack trace saving code in NTDLL and NTOSKRNL assumes that all functions in the call frame use frame pointers. This is the only realistic option for saving stack traces on x86 without symbols, as there is insufficient information baked into each individual compiled binary to reliably perform stack traces without assuming the use of a frame pointer at each call site. (The 64-bit platforms that Windows supports solve this problem with the use of extensive unwind metadata, as I have covered in a number of past articles.)
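
Incidentally, the capture primitive itself is exposed to Win32 programs as RtlCaptureStackBackTrace (aliased as CaptureStackBackTrace in recent SDK headers), which hands back a compact array of raw return addresses plus an optional hash, with no symbols involved; on x86 it is subject to the same frame pointer requirement. A minimal sketch of using it to log the current thread’s stack:

#include <windows.h>
#include <stdio.h>

//
// Capture a compact trace of the current thread's stack, the same sort
// of cheap trace that the heap and handle tracing facilities store.
// The raw return addresses can be resolved against symbols later on
// (for instance with the debugger's "ln" command).
//
void
logstack(
   void
   )
{
   PVOID  Frames[ 32 ];
   ULONG  Hash;
   USHORT Captured;
   USHORT i;

   Captured = CaptureStackBackTrace(1,   // Skip logstack itself.
      sizeof(Frames) / sizeof(Frames[0]),
      Frames,
      &Hash);

   for (i = 0; i < Captured; i += 1)
      printf("#%02u %p\n",
         i,
         Frames[ i ]);
}

Since the saved trace is nothing more than a small array of pointers, it is cheap enough to record on every heap or handle operation, which is exactly the property these facilities depend on.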

So, the stack traces logged by pageheap and by handle tracing are how stack traces without symbols end up mattering to you, the developer with symbols for all of your binaries, when you are trying to debug a problem. If you make sure to disable FPO optimization on all of your code, then you’ll be able to use tools like pageheap’s stack tracing on heap operations, UMDH (the user mode dump heap utility), and handle tracing to track down heap-related and handle-related problems. The best part of these features is that you can even deploy them on a customer site without having to install a full debugger (or run your program under a debugger), and only later take a minidump of your process for examination in the lab. All of them rely on FPO optimizations being disabled (at least on x86), though, so remember to turn FPO optimizations off on your release builds for increased debuggability of these tough-to-find problems in the field.
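
If you’re building with the Microsoft compiler, the switch in question is /Oy- (disable frame pointer omission); for example, something along the lines of the following (the source file name is a placeholder):

cl /Zi /O2 /Oy- myprogram.cpp

The explicit /Oy- matters because the usual release optimization switches (/O2 and /Ox) turn frame pointer omission on by default on x86.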

Use a custom symbol server in conjunction with IDA with Vladimir Scherbina’s IDA plugin

Tuesday, December 5th, 2006

Vladimir Scherbina has recently released a useful IDA plugin that enhances IDA’s now built-in support for loading symbols via the symbol server by allowing custom symbol server paths. This is something I personally have been wanting for some time; IDA’s PDB loading mechanism overrides _NT_SYMBOL_PATH with a hardcoded value pointing to the Microsoft symbol server. This breaks my little trick for injecting symbol server support into programs that do not already support it, which is fairly annoying. Now, with Vladimir’s plugin, you can have IDA use a custom symbol server without having to hack the PDB plugin and change its hardcoded string constant for the Microsoft symbol server path. (Plus, you can have IDA use your local downstream store cache as well – the lack of which is another disadvantage of how IDA normally loads symbols via PDB.)