Archive for the ‘Windows’ Category

Fast kernel debugging for VMware, part 1: Overview

Thursday, October 4th, 2007

Note: If you are just looking for the high speed VMware kernel debugging program, then you can find it here. This post series outlines the basic design principles behind VMKD.

One of the more interesting talks at Blue Hat was one about virtualization and security. There’s a lot of good stuff that was touched on (such as the fact that VMware still implements a hub, meaning that VMs on the same VMnet can still sniff each other’s traffic).

Anyways, watching the talk got me thinking again about what a slow and painful experience kernel debugging VMs is today, especially if you’ve used 1394 debugging frequently.

While kd-over-1394 is quite fast (as is local kd), you can’t do that in any virtualization software that I’m aware of today (none of them virtualize 1394, and furthermore, as far as I know none of them even support USB2 debugging either, even VMware Workstation 6).

This means that if you’re kernel debugging a VM in today’s world, you’re pretty much left high and dry and have to use the dreaded virtual serial port approach. Because the virtual serial port has to act like a real serial port, it is just as slow as one (otherwise, timings get off and programs that talk to the serial port break all over the place). So although you might be debugging a completely local VM, you’re still throttled to serial port speeds (115200bps). Although it should certainly be technically possible to do better in a VM, none of the virtualization vendors support the virtual hardware required for the other, faster KD transports.

However, that got me thinking a bit. Windows doesn’t really need a virtual serial port or a virtual 1394 port to serve as a kernel debugging target because of any intrinsic, special property of serial or 1394 or USB2. Those interfaces are really just mechanisms to move bits from the target computer to the debugger computer and back again, while requiring minimal interaction with the rest of the system (it is important that the kernel debugger code in the target computer be as minimalistic and self-contained as possible or many situations where you just can’t debug code because it is used by the kernel debugger itself would start cropping up – this is why there isn’t a TCP transport for kernel debugging, among other things).

Now with a VM (as opposed to a physical computer), getting bits to and from an external kernel debugger and the kernel running in the VM is really quite easy. After all, the VM monitor can just directly read and write from the VM’s physical memory, just like that, without a need for indirecting through a real (or virtual) I/O interconnect interface.

So I got to be thinking that it should theoretically be possible to write a kernel debugger transport module that instead of talking to a serial, 1394, or USB2 port, talks to the local VMM and asks it to copy memory to and from the outside world (and thus the kernel debugger). After the data is safely out of the VM, it can be transported over to the kernel debugger (and back) with the mechanism of choice.
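The core idea can be sketched with a toy model. Everything here (the class names, the fixed “mailbox” address, the packet contents) is invented purely for illustration; a real implementation would go through the VMM’s privileged guest-memory interfaces rather than anything resembling Python:

```python
# Toy model of why a VMM-backed KD transport can be fast: the VM monitor
# can copy bytes straight out of (and into) the guest's physical memory,
# with no serial-port pacing in between. All names here are hypothetical.

class ToyVmm:
    def __init__(self, ram_size):
        self.guest_ram = bytearray(ram_size)  # the VM's "physical memory"

    # VMM-side primitives: direct reads/writes of guest RAM.
    def read_guest(self, phys_addr, length):
        return bytes(self.guest_ram[phys_addr:phys_addr + length])

    def write_guest(self, phys_addr, data):
        self.guest_ram[phys_addr:phys_addr + len(data)] = data

class ToyKdTransport:
    """Moves KD packets between the debugger and the guest via the VMM,
    using a fixed guest-physical mailbox instead of a virtual UART."""
    MAILBOX = 0x1000  # hypothetical shared buffer address

    def __init__(self, vmm):
        self.vmm = vmm

    def debugger_send(self, packet):
        # Host side: drop the packet directly into guest memory.
        self.vmm.write_guest(self.MAILBOX, packet)

    def guest_receive(self, length):
        # Guest stub side: in reality this would involve a VM exit;
        # here it is just a direct read of the mailbox.
        return self.vmm.read_guest(self.MAILBOX, length)

vmm = ToyVmm(0x10000)
kd = ToyKdTransport(vmm)
kd.debugger_send(b"\x30\x30\x30\x30 read-memory request")
print(kd.guest_receive(4))  # -> b'0000'
```

The point of the sketch is simply that the “wire” disappears: the cost of moving a KD packet becomes a memory copy plus a VM exit, not a 115200bps drip-feed.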

It turns out that Windows KD support is implemented in a way that is fairly conducive to this approach. The KD protocol is divided up into essentially two different parts. There’s the high level half, which is essentially a command set that allows the kernel debugger to request that the KD stub in the kernel perform an operation (like change the active register set, set a breakpoint, write memory, or so forth). The high level portion of the KD protocol sits on top of what I call the low level (or framing) portion of the KD protocol, which is a (potentially hardware dependent) transport interface that provides for reliable delivery of high level KD requests and responses between the KD program on a remote computer and the KD stub in the kernel of the target computer.

The low level KD protocol is abstracted out via a set of kernel debugger protocol modules (which are simple kernel mode DLLs) that are used by the kernel to talk to the various pieces of hardware that are supported for kernel debugging. For example, there is a module to talk to the serial port (kdcom.dll), and a module to talk to the 1394 controller (kd1394.dll).

These modules export a uniform API that essentially allows the kernel to request reliable (“mostly reliable”) transport of a high level KD request (say a notification that an exception has occurred) from the kernel to the KD program, and back again.
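The division of labor can be sketched as follows. This is a deliberately simplified toy: the real framing protocol’s packet format, sequence IDs, and acknowledgement/resend logic all differ, and the names below are invented for illustration:

```python
# Toy sketch of the KD protocol split: a high-level command layer that
# knows nothing about the wire, on top of a swappable low-level framing
# layer that provides (mostly) reliable delivery. A kdcom.dll-style
# module would implement the framing layer over a serial port.

def checksum(data):
    return sum(data) & 0xFFFFFFFF  # simple additive checksum

class FramingLayer:
    """Low-level half: frame and verify packets over some byte pipe."""
    def __init__(self):
        self.pipe = []  # stand-in for the physical transport

    def send(self, payload):
        self.pipe.append((len(payload), checksum(payload), payload))

    def receive(self):
        length, csum, payload = self.pipe.pop(0)
        if len(payload) != length or checksum(payload) != csum:
            raise IOError("bad frame: would NAK and ask for a resend")
        return payload

class KdStub:
    """High-level half: a command set, independent of the transport."""
    def __init__(self, framing):
        self.framing = framing

    def report_exception(self, code):
        self.framing.send(b"EXCEPTION:%08x" % code)

link = FramingLayer()
stub = KdStub(link)
stub.report_exception(0xC0000005)   # access violation notification
print(link.receive())               # -> b'EXCEPTION:c0000005'
```

Because the high-level half only ever talks to the framing layer’s send/receive surface, swapping the serial-port framing module for a VMM-backed one leaves everything above it untouched.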

This interface is fortunate from the perspective of someone who might want to, say, develop a high speed kernel debugger module for a VM running under a known VM monitor. Such a KD protocol module could take advantage of the fact that it knows that it’s running under a specific VM monitor, and use the VM monitor’s built-in VM exit / VM enter capabilities to quickly tell the VM monitor to copy data into and out of the VM. (Most VMs have some sort of “backdoor” interface for optimized drivers and enhanced guest capabilities, such as a way for the guest to tell the host when its mouse pointer has left the guest’s screen. For example, in the case of VMware, there is a “VMware Tools” program that you can install which provides this capability through a special “backdoor” interface that allows the VM to request a “VM exit” for the purposes of having the VM monitor perform a specialized task.)

Next time: Examining the KD module interface, and more.

Common WinDbg problems and solutions

Monday, September 24th, 2007

When you’re debugging a program, the last thing you want to have to deal with is the debugger not working properly. It’s always frustrating to get sidetracked by secondary problems when you’re trying to focus on tracking down a bug, and especially so when problems with your debugger cause you to lose a repro or burn excessive amounts of time waiting around for the debugger to finish doing who knows what.
This is something that I get a fair amount of questions about from time to time, and so I’ve compiled a short list of some common issues that one can easily get tripped up by (and how to avoid or solve them).

  1. I’m using ntsd and I can’t get symbols to load, or most of the debugger extension commands (!commands) don’t work. This usually means that you launched the ntsd that ships with the operating system (prior to Windows Vista), which is much older than the one shipping with the debugger package. Because it is in the system directory, it will be in your executable search path.

    To fix this problem, use the ntsd executable in the debugger installation directory.

  2. WinDbg takes a very long time to process module load events, and it is using max processor time (spinning) on one CPU. This typically happens if you have many unqualified breakpoints that track module load events (created via bu) saved in your workspace. This problem is especially noticeable when you are working with programs that have a very large number of decorated C++ symbols, such as debug builds of programs that make heavy use of the STL or other template classes. Unqualified breakpoints are expensive in general due to forcing immediate symbol loads of all modules, but moreover they also force the debugger to undecorate and perform pattern matches against every symbol in a module that is being loaded, for every unresolved breakpoint.

    If you allow a large number of unqualified breakpoints to become saved in a default workspace, this can make the debugger appear to be extremely slow no matter what program you are debugging.

    To avoid getting bitten by this problem, don’t use unqualified breakpoints (breakpoints without a modulename! prefix on their address expression) unless absolutely necessary. Also, it’s typically a good idea to clear all your breakpoints before you save your workspace if you don’t need them to be saved for your next debugging session with that debugger workspace (by default, bu breakpoints are persisted in the debugger workspace, unlike bp breakpoints which go away after every debugging session). If you are in the habit of saving the workspace every time you attach to a running process, and you often use bu breakpoints, this will tend to clutter up the user default workspace and can quickly lead to very poor debugger performance if you’re not careful.

    You can use the bc command to delete breakpoints (bc * to remove all breakpoints), although you will need to save the workspace to persist the changes. If the problem has gotten to the point where it’s not possible to even get past module loading in a reasonable amount of time so that you can use bc * to clear out saved breakpoints, you can remove the contents of the HKCU\Software\Microsoft\Windbg\Workspaces registry key and subkeys to return WinDbg to a pristine state. This will wipe out your saved debugger window positions and other saved debugger settings, so use it as a last resort.
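    For instance, the qualified and unqualified forms, and the cleanup commands, look like this (the module and symbol names here are hypothetical):

    ```
    0:000> bu myapp!CMainFrame::OnCreate   $$ qualified: cheap to resolve
    0:000> bu CMainFrame::OnCreate         $$ unqualified: matched against every module load
    0:000> bl                              $$ list current breakpoints
    0:000> bc *                            $$ clear them all before saving the workspace
    ```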

  3. WinDbg takes a very long time to process module load events, but it is not consuming a lot of processor time. This typically means that your symbol path includes either a broken HTTP symbol store link or a broken UNC symbol store path. A non-responsive path in your symbol path will cause any operation that tries to load symbols for a module to take a long time to complete, as a network timeout will be occurring over and over again.

    Use !sym noisy, followed by .reload /f to determine what part of your symbol path is not working correctly. Then, fix or remove the offending part of the symbol path.

    This problem can also occur when you are debugging a program that is in the packet path for packets destined to a location on the symbol path. In this case, the typical workaround I recommend is to set an empty symbol path, attach to the process in question, write a dump file, and then detach from the process. Then, restore the normal symbol path and open the dump file in the debugger, and issue a .reload /f command to force all symbols to be pre-cached ahead of time. After all symbols are pre-cached in the downstream store cache, change the symbol path to only reference the downstream store cache location and not any UNC or HTTP symbol server paths, and attach the debugger to the process in the packet path for symbol server access.
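    The dump-based workaround boils down to a short command sequence (the paths here are examples):

    ```
    $$ 1: attach with an empty symbol path, capture a dump, detach
    0:000> .sympath ""
    0:000> .dump /ma C:\Temp\app.dmp
    0:000> .detach

    $$ 2: open the dump offline and pre-cache all symbols
    0:000> .sympath SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols
    0:000> .reload /f

    $$ 3: attach to the live process using only the local downstream store
    0:000> .sympath C:\Symbols
    ```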

  4. WinDbg refuses to load symbols for a module that I know the symbol server has symbols for. This issue can occur if WinDbg has previously tried (and failed) to download symbols for a module. There appears to be a bug in dbghelp’s symbol server support which can sometimes result in partially downloaded PDB files being left in the downstream store cache. If this happens, future attempts to access symbols for the module will fail with an error saying that symbols for the module cannot be found.

    If you turn on noisy symbol loading (!sym noisy), a more descriptive error is typically given. If you see a complaint about E_PDB_CORRUPT, then you are probably falling victim to this issue. The debugger output indicating this problem would look something like this:

    DBGHELP: c:\symbols\ntdll.pdb\2744327E50A64B24A87BDDCFC7D435A02\ntdll.pdb – E_PDB_CORRUPT

    If you encounter this problem, simply delete the .pdb named in the error message and retry loading symbols via the .reload /f <modulename> command.

  5. WinDbg hangs and never comes back when I attach to a specific process, such as an svchost instance. If you’re sure that you aren’t experiencing a problem with a broken symbol path or unqualified module load tracking breakpoints being saved in your workspace, and the debugger never comes back when attaching to a certain process (or almost always hangs after the first command when attaching to the process in question), then the process you are debugging may be in a code path responsible for symbol loading.

    This problem is especially common if you are debugging an svchost instance, as there are a lot of important but unrelated pieces of code running in the various svchost instances, some of which are critical for network symbol server support to work. If you are debugging a process in the critical path for network symbol server support, and you have a symbol path with a network component set, then you may cause the debugger to deadlock (hang forever) the first time you try and load symbols.

    One example of a situation that can cause this is if you are debugging code in the same svchost instance as the DNS cache service. In this case, when you try to load symbols and you have an HTTP symbol server link in your symbol path, the debugger will deadlock because it will try and make an RPC call to the DNS cache service when it tries to resolve the hostname of the server referenced in your symbol path. Because the DNS cache service will never respond until the debugger resumes the process, and the debugger will never resume the process until it gets a response from the RPC request to the DNS cache service, your debugging session will hang indefinitely.

    Note that if you are simply debugging something in the packet path of a symbol server store, you will typically see the debugger become unresponsive for long periods of time but not hang completely. This is because the debugger can handle network timeouts (if somewhat slowly) and will eventually fail the request to the network symbol path. However, if the debugger tries to make an IPC request of some sort to the process being debugged, and the IPC request doesn’t have any built-in timeout (most local IPC mechanisms do not), then the debugger session will be lost for good.

    This problem can be worked around similarly to how I typically recommend users deal with slow module loading or failed symbol server accesses with a program in the packet path for a symbol server referenced in the symbol path. Specifically, it is possible to pre-cache all symbols for the process by creating a dump of the process from a debugger instance with an empty symbol path, and then detaching and opening the dump with the full symbol path and forcing a download of all symbols. Then, start a debugging session on the live process with a symbol path that references only the local downstream store into which symbols were downloaded, in order to prevent any dangerous network accesses from happening.

    Another common way to get yourself into this sort of debugger deadlock problem is to use the clipboard to paste into WinDbg while you are debugging a program that has placed something into the clipboard. This results in a similar deadlock as WinDbg may get blocked on a DDE request to the clipboard owner, which will never respond by virtue of being debugged. In that case, the workaround is simply to be careful about copying or pasting text into or out of WinDbg.

  6. Remote debugging with -remote or .server is flaky or stops working properly after a while. This can happen if all debuggers in the session aren’t running the same debugger version.

    Make sure that all peers in the remote debugging scenario are using the (same) latest debugger version. If you mix and match debugger versions with -remote, things will often break in strange and hard to diagnose ways in my experience (there doesn’t seem to be a whole lot of graceful support for backwards or forwards compatibility with respect to the debugger remoting protocol).

    Also, several recent releases of the debugger package didn’t work at all in remote debugging mode on Windows 2000. This is, as far as I know, fixed in the latest release.

Most of these problems are simple to fix or avoid once you know what to look for (although they can certainly burn a lot of time if you’re caught unaware, as I’ve found myself while learning about these “gotchas”).

If you’re experiencing a weird WinDbg problem, you should also not be shy about debugging the malfunctioning debugger instance itself. Often, taking a stack trace of all threads in the problematic debugger instance will be enough to give you an idea of what sort of problem is holding things up (remember that the Microsoft public symbol server has symbols for the debugger binaries as well as the OS binaries).

Useful debugger commands: .writemem and .readmem

Thursday, September 20th, 2007

From time to time, it can be useful to save a chunk of memory for whatever reason when you’re debugging a program. For instance, you might need to capture a long buffer argument to a function for later analysis (perhaps with a custom analysis tool outside the scope of the debugger).

There are a couple of options built in to the debugger to do this. For example, if you just want to save the contents of memory for later perusal, you could always write a complete minidump of the target. However, this has a few downsides; for one, unless you build dump file processing capability into your analysis program, dump files are typically going to be less than easily accessible to simple analysis tools. (Although one could write a program utilizing MiniDumpReadDumpStream, this is more work than necessary.)

Furthermore, complete dumps tend to be large, and in the case of a kernel debugger connection over serial port, it can take many hours to save a kernel memory dump just to gain access to a comparatively small region of memory.

Instead of writing a dump file, another option is to use one of the display memory commands to save the contents of memory to a debugger log file. For instance, one might use “db address len”, write it to a log file, and parse the output. This is much less time-consuming than a kernel memory dump over kd, and in some cases the plain-text hex dump that db provides might be desirable, but if one just wants the raw memory contents, that too is less than ideal.
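For example, to capture a hex dump of a buffer to a text file (the address, length, and path here are hypothetical):

```
0:000> .logopen C:\Temp\buffer.txt
0:000> db 00401000 L200
0:000> .logclose
```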

Fortunately, there’s a third option: the .writemem command, which as the name implies, writes an arbitrary memory range to a file in raw binary form. There are two arguments, a filename and a range. For instance, one usage might be:

.writemem C:\Users\User\Stack.bin @rsp L1000

This command would write 0x1000 bytes of stack to the file. (Remember that address ranges may include a space-delimited component to specify the length.)

The command works on all targets, including when one is using the kernel debugger, making it the command of choice for writing out arbitrary chunks of memory.

There also exists a command to perform the inverse operation, .readmem, which takes the same arguments, but instead reads memory from the file given and writes it to the specified address range. This can be useful for anything between substituting large arguments to a function out at run-time to applying large patches to replace non-trivial sections of code as a whole.
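For instance, to substitute the contents of a buffer with previously captured data (again, the path, address, and length are hypothetical):

```
0:000> .readmem C:\Temp\Buffer.bin 00401000 L200
```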

Furthermore, because the memory image format used by both commands is just the raw bits from the target, it becomes easy to work with the written out data with a standard hex editor, or even a disassembler. (For instance, another common use of .writemem when dealing with self-modifying code is to write the code out to a file after it has been finalized, and then load the resulting raw memory image up as raw opcodes in a more full-featured disassembler than the debugger.)

Never, ever, EVER wake a computer from suspend without user consent

Thursday, September 13th, 2007

I am not a happy camper.

Today, I got in to work and unpacked my laptop from my laptop bag and discovered that it had gone into hibernation due to a critically low battery event. That was fairly strange, because last night I had suspended my laptop (fully charged) and placed it into my laptop bag. Somehow, it managed to consume a full battery charge between my putting it into my bag last night, and my getting in to work.

This is obviously not good, because it meant that the laptop had to have been powered on while in my laptop bag for it to have possibly used that much battery power (a night in suspend is a drop in the bucket as far as battery life is concerned). Let me tell you a thing or two about laptop bags: they’re typically padded (in other words, also typically insulated to a degree) and generally don’t have a whole lot of ventilation potential designed into them, at least as far as laptop compartments go. Running a laptop in a laptop bag for a protracted period of time is a bad thing, that’s for certain.

Given this, I was not all that happy to discover that my laptop had indeed been resumed and had been running overnight in my laptop bag until the battery got low enough for it to go into emergency hibernate mode. (Fortunately, it appears to have sustained no permanent damage from this event. This time….)

So, I set out to find out the culprit. The event log showed the system clearly waking just seconds after the clock hit 3AM local time. Hmm. Well, that’s a rather interesting thing, because Windows Update is by default set to install updates at 3AM local time, and this Tuesday was Patch Tuesday. A quick examination of the Windows Update log (%SystemRoot%\WindowsUpdate.log) confirmed this fact:

2007-09-13 03:00:11:521 408 f4c AU The machine was woken up by Windows Update

Great. Well, according to the event logs, it ran for another 1 hour and 45 minutes or so before the battery got sufficiently low for it to go into hibernate. So, apparently, Windows Update woke my laptop, on battery, in my laptop bag to install updates. It didn’t even bother to suspend it after the fact (gee, thanks), leaving the system to keep running until either it ran out of battery or something broke due to running in a confined place with zero ventilation. Fortunately, I lucked out and the former happened this time.

But that’s not all. This is actually not the first time this has happened to me. In fact, on August 29 (last month), I got woken up at 3AM because Windows Update decided to install some updates in the middle of the night. That time, it apparently needed to reboot after installing updates, and I got woken up by the boot sound (thanks for that, Windows Update!). At the time, I wrote it off as “intended” behavior, as the system happened to be plugged in to wall power overnight and a friend pointed out to me that Windows Update states that while plugged in, it will resume a computer to install updates and put it back into suspend afterwards.

Well, that’s fine and all (aside from the waking me up at 3AM part, which sucked, but I suppose it was documented to be that way). Actually, I’ll take that back, my computer waking me up in the middle of the night to automatically install updates is far from fine, but that pales in comparison to what happened the second time around. The powering the system on while it was on battery power to install updates, however, is a completely different story indeed.

This is, in my opinion, a spectacular failure of the “left hand not knowing what the right is doing” sort at Microsoft. One of the really great things about Windows Vista was that it was taking back power management from all the uncooperative programs out there. Except for, I suppose, Windows Update.

Consider that for a portable (laptop/notebook) computer, it is often the case that it’s downright dangerous to just wake the computer at unexpected times. For example, what if I was on an airplane during takeoff and Windows Update decided that, in its vast and amazing knowledge of what’s best for the world, it would just power on my laptop and enable Bluetooth, 802.11, etc. Or say my laptop was sitting in its laptop bag (somewhere I often leave it overnight to save myself the trouble of putting it there in the morning before I go to work), and it powers on to do intensive tasks like install updates and reboot with no ventilation to the system, and overheats (suffering irreparable physical damage as a result). Oh, wait, that is what happened… except that the laptop in question survived, this time around. I wonder if everyone else with a laptop running Windows Vista will be so lucky (or if I’ll be so lucky next time).

What if I was for some reason carrying my laptop in my laptop bag and, say, walking, going up a flight of stairs, running, whatnot, and Windows Update decided that it was so amazingly cool, that it would power on my computer without asking me and the hard drive crashed from being spun up while being moved under unsafe (for a hard drive) conditions?

In case whoever is responsible for this (amazingly negligent) piece of code in Windows Update ever reads this, let me spell it out for you. Read my lips (or words): It is unacceptable to wake my computer up without asking me. U-N-A-C-C-E-P-T-A-B-L-E, in case that was not clear. I don’t care what the circumstances are. It’s kind of like Fight Club: Rule no. 1 is that you do not do that, ever, no matter what the circumstances are. And I really dare anyone at Microsoft to say that by accepting the default settings for Windows Update, I was consenting to it running my laptop for prolonged periods of time in a laptop bag. Yeah, I thought so…

You can absolutely bet that I’ll be filing a support incident about this while I do my best to have this code, how shall I say, permanently evicted from Windows Update, on behalf of all laptop owners. I can see how someone might have thought that it would be a cool idea to install updates even if you suspend your computer at night (which, suspend being the default in Vista, would happen often). However, it just completely, jaw-droppingly drops the ball with portable computers (laptops), in the absolute worst way possible. That is, one that can result in physical damage to the end user’s computer and/or permanent data loss. This is just so obvious to me that I literally could not believe what had happened when I woke up, that something that ships by default and is turned on by default with Vista would do something so completely stupid, so irresponsible, so negligent.

I was pretty happy with the power management improvements in Windows Vista up until now. Windows Update just completely spoiled the party.

The worst thing is that if the consequences of resuming laptops without asking weren’t already so blindingly obvious, this topic (forcing a system resume programmatically) comes up on the Microsoft development newsgroups from time to time, and it’s always shot down immediately because of the danger (yes, danger) of unexpected programmatic resumes with laptop computers.

(Sorry if this posting comes off as a bit of a rant. However, I don’t think anyone could argue that Windows Update automatically powering on a laptop in the scenarios I listed above is in any way redeemable.)

Things I learned while poking around with Exchange 2007

Tuesday, August 28th, 2007

(Warning: Long post about Exchange 2007 setup woes ahead.)

Recently, I’ve had the chance (or the misfortune, some might say) to spend some time poking around in an Exchange 2007 configuration. Here’s a brief list of some of the more annoying things I’ve run into along the way, and how I resolved them (in no particular order):

  1. Offline Address Book (OAB) doesn’t work in RPC-HTTP (Outlook Anywhere) mode unless you have autodiscovery set up. This one took me several weeks to figure out. It was a pretty annoying problem because at first I didn’t even know what the problem was – when I configured Outlook 2007 to talk to Exchange in native RPC mode, and then configured it to work over Outlook Anywhere (RPC-HTTP), some indeterminate time after that I would occasionally get complaints from Outlook that send/receive had failed because an item was not found. Now, I enabled full Outlook debug logging, but of course, there was absolutely no mention of this anywhere at all in Outlook’s logs on disk – in fact, there was barely anything useful there at all. Only by searching in Google for a long time did I learn that it might be related to the OAB. After learning this, I narrowed it down to the OAB by determining that doing a send/receive just for purposes of downloading the address book would always break (you can do this from the send/receive menu).

    However, fixing the problem was quite another story. The thing that was most frustrating is that, of course, RPC-HTTP runs over SSL and so you can’t do a packet capture of what was going on. So I tried checking IIS logs, but there weren’t any hits outside of the RPC-HTTP proxy URL (nothing at all apparently related to OAB). Most of the information I had found on Google / Microsoft.com related to things like the OAB distribution URL not being set, the OAB virtual directory not being created in IIS, and a variety of other problems, all of which I could rule out as not applicable to me.

    The whole time, the Exchange address book worked fine over OWA too. And to make things even stranger, I seem to recall that it magically appeared to work if I switched Outlook off of RPC-HTTP and back to direct RPC connectivity.

    It turns out that the real problem here was that I didn’t have autodiscovery working correctly for one of the mail domains in the Exchange environment. I only thought to look at autodiscovery after reading this post about somebody else’s OWA woes. Apparently, Outlook wanted to talk to https://example.com/autodiscover or https://autodiscover.example.com/autodiscover (where “example.com” is the mail domain in use) in order to determine the OAB download URL. This explained the lack of hits in my IIS logs, as the mail server for this domain happened to be on a completely different box from anything else on that domain, so hits on example.com/autodiscover would never show up. Because the mail server wasn’t even on that domain, I decided to just go with autodiscover.example.com. However, this presented a problem, as I would need to acquire another cert and another IP address for that domain, just for Outlook to not complain about the OAB periodically. Ugh!

    After doing that (I ended up using a subject alternate name (SAN) certificate), the OAB magically began working in Outlook 2007. Hooray.

  2. Don’t specify a list of domain names (-DomainName) for New-ExchangeCertificate in quotes if you are requesting a SAN certificate. I spent about 10 minutes staring at my command line wondering just what was possibly wrong with it before I learned that the way New-ExchangeCertificate is written, it expects a list of command line arguments and not a command line argument that is a list (subtle distinction, eh?). This one turned out to be my not paying close enough attention to the documentation examples.
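    To illustrate the distinction (the domain names are placeholders, and parameters other than -DomainName are just for context): quoting the list turns it into a single string argument, while the unquoted form is parsed as a list of arguments.

    ```powershell
    # Wrong: one string argument that happens to contain a comma
    New-ExchangeCertificate -GenerateRequest -DomainName "mail.example.com, autodiscover.example.com"

    # Right: a list of two arguments
    New-ExchangeCertificate -GenerateRequest -DomainName mail.example.com, autodiscover.example.com
    ```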
  3. Exchange will not accept a certificate with an “E=” in the subject name field for the TLS listeners for IMAP4 / POP3 / SMTP. This one ate up a good chunk of time trying to work through as well. I had filled in an email field out of habit when requesting a cert for Exchange from the domain CA, and everything else in the world besides Exchange had no problem with it. That is to say, IIS liked it, browsers liked it, RDP-SSL liked it, and pretty much everything else I tried with it worked. However, as soon as I gave it to Exchange to use for IMAP4 / POP3 / SMTP, it would barf with an extremely unhelpful (and totally misleading!) event log message:

    A certificate for the hostname “example.com” could not be found. SSL or TLS encryption cannot be made to the IMAP service.

    Which was of course, completely wrong. The certificate was there in the cert store for the computer account, and nothing else had any trouble recognizing it. Even the Exchange console recognized it just fine, but the service just would not take it on start.

    Now, to make things even worse, the Event ID of that event log message happened to be “2007”. Try searching in Google for “Exchange 2007 event id 2007” and you’ll see what a wonderful thing that is for getting useful information on the subject. (Hint: You’ll get pages talking about any Exchange 2007 event log message.)

    Finally, I ended up taking the “sledgehammer approach” and just made a new cert, without an “E=” in the subject name, and it magically worked. Grrrrr…

  4. Something doesn’t work right with the “-MemberOfGroup” filter when used in conjunction with an email address policy (EAP). For some reason, I could never get this to work. The bizarre thing was, the exact same filter string would work great for an address book policy. Furthermore, when dumping the EAP out, if I ran the LDAP query that the filter OPATH got translated into in the AD management console, it returned the expected results. Even more baffling, if I used any other custom filter besides “-MemberOfGroup”, the EAP would work, which didn’t make any sense at all given that the exact same OPATH filter worked fine with an address book policy. I never got to the bottom of this, unfortunately, and finally gave up and used a filter off of one of the AD properties for a user instead (which, by the way, worked fine as either a custom or a precanned filter).

    I’m guessing that this one has got to be something misconfigured or broken on my end, but for the life of me I have absolutely no idea what. The particular Exchange install was a fresh one on a completely clean Active Directory, pretty much the simplest case possible (both of which were, as far as I know, done by the book).

  5. Exchange requires that the root certificate authority issuing a non-self-signed certificate be trusted in order for the certificate to be used for IMAP4 / POP3 / SMTP. Another fun one: unlike everything else in Windows (including RDP-SSL, IIS, etc), Exchange barfs on a certificate for usage as a server if the root CA is not trusted. If you used an external CA to save yourself the headache of setting up a domain CA just for Exchange testing, make sure that Exchange is configured to trust it, or IMAP4 / POP3 / SMTP will all fall over when given a cert issued by that CA. This holds true for the Hub Transport, Client Access, and Edge Transport roles in my observation.
  6. The Exchange management console and Exchange command shell use a ton of memory. In my experience, the management console (MMC applet) consumed something on the order of 150-200MB* of commit by the time it finished loading, if you watch the memory counters closely. The command shell (PowerShell-based) clocked in at a cool 80MB or so. Wow. Whatever happened to Bill Gates’ quote on memory usage:

    For DOS LM 1.0, the redirector took up 64K of RAM.

    And Bill went ballistic.

    “What do you mean 64K? When we wrote BASIC, it only took up 8K of RAM. What the f*k do you think idiots think you’re doing? Is this thing REALLY 8 F*ing BASIC’s?”

    Yeah, I’m a programmer, and I realize that more complicated software tends to make memory/performance trade-offs. Still, it’s a mail server management UI and a command shell. I find it rather amazing that Visual Studio 2005 Team Suite, in all of its glory and .NET-ness, still manages to clock in at less memory usage (after loading a project) than the MMC console for a mail server. Oh, and to add insult to injury, the MMC GUI can’t even do about half of the administrative tasks out there, despite its humongous memory footprint (to be fair, the GUI is supposed to be enhanced to cover most of the “missing spots” in the SP1 timeframe, from what I can see). Times change, I guess…

    * Note: To be fair (and more precise), dumping the address space and totalling MEM_COMMIT | MEM_PRIVATE regions turned up ~160MB of unshared memory after spawning one instance of the MMC console when the commit charge for the process was ~200MB.

  7. The Exchange 2007 management console has friendly, owner-drawn windows with pretty gradient custom background bitmaps. In other words, the GUI looks nice and polished… until you try and use it over an Internet link instead of a LAN. Then you end up waiting 10 seconds for the “New Mailbox” page to blit its pretty gradient background over to mstsc.exe, block by block. Whoops. To Microsoft’s credit, there is an option (View->Visual Effects->Never) to turn this off. Too bad that it seems to be stuck on maximum graphical awesomeness (otherwise read as excruciating slowness over Terminal Server) by default, and that the option is arguably rather well hidden.

Okay, enough bashing on Exchange (and to be fair, the last two items are more things that annoyed me about Exchange than something I’d consider a “problem” in some sense of the word). I like it for all of the cool things it offers, and knowing what I do now, I could probably have gotten a simple Exchange 2007 configuration running in half the time or less than it took the first time around. Furthermore, I’m sure that most of the other large, competing integrated messaging solutions also have their fair share of skeletons in the closet, too.

But that doesn’t change the fact that despite going over the documentation available to me, setting up Exchange was an exercise in banging my head against bizarre problem after bizarre problem, while intermittently either waiting on pretty dialogs to take forever to transfer over RDP (before I figured out how to “de-prettyify” the GUI), or waiting for the MMC console to finish loading.

Yes, yes, I know, I should be using the command shell to do all of the work instead of the clunky GUI. Sorry, but I don’t fancy typing out long strings like “First Storage Group”, “CN=First Storage Group,CN=InformationStore”, etc over and over and over again until my fingers bleed. Sometimes, having a GUI is a good thing. Both tools are useful; sometimes the GUI is more convenient, and sometimes the command line is. On the plus side, the fact that Exchange 2007 appears to completely support full administration exclusively via a scriptable command line is a big step forward.

Enough of my Exchange meanderings for now. Back to more regular topics next time…

PatchGuard v3 has no relation to “Purple Pill”

Thursday, August 16th, 2007

One of the things that seems to have been making the rounds lately is some confusion over whether the recent announcement that Kernel Patch Protection has been updated (“PatchGuard v3”) is a response to Alex Ionescu‘s “Purple Pill”. For those that missed the news recently, Purple Pill is a program that Alex wrote (and briefly posted) that uses a bug in a common ATI driver to gain code execution in kernel mode and from there load a driver image, bypassing the normal Kernel Mode Code Signing (KMCS, otherwise known as CI – Code Integrity) checks.

Now, KMCS and PatchGuard are designed to achieve different goals. PatchGuard is intended, for better or worse, to prevent third party code from doing things like hooking system calls, hooking interrupts belonging to other code, performing code patches on kernel exports, and the like. Historically, many third party drivers have done this in dangerous and incorrect ways (something that appears to be common among personal security software suites). With the move to 64-bit (which requires a recompile of all kernel mode code), Microsoft took the opportunity to say “enough is enough” and outlaw things like system call hooking, with this ban being enforced by what has come to be known as PatchGuard.

Now, while it is debatable whether many of the things that PatchGuard blocks truly cannot be done correctly, it is a sad fact that the vast majority of third party drivers out there which do such dangerous deeds are buggy (and often introduce security holes or crash – bluescreen – bugs in the process). PatchGuard itself is an obfuscation-protected system for periodically verifying the integrity of various key kernel modules in-memory, as well as the integrity of certain important kernel global variables and processor registers (e.g. a select few MSRs).

On the other hand, KMCS is intended to address a completely different problem. KMCS enforces a policy that every driver that is loaded on Vista x64 (and later Windows versions) must be signed with a valid code signing certificate issued by an approved (cross-certified) CA. There is nothing more and nothing less to KMCS than that; the entirety of KMCS can be summed up as a mandatory digital signature check on loadable kernel modules. Unlike PatchGuard, KMCS has no way (nor does it try) to enforce any sort of restrictions on what sort of actions code that passes the signature check can take. That’s not to say that there aren’t potential consequences for signing a driver that does something Microsoft doesn’t like, as they could always blacklist a given driver (or signing key) if they so desired. However, KMCS itself doesn’t have a “magic crystal ball” that would allow it to automatically determine whether a given driver is “good” or “bad”. Rather, as long as the driver has a valid signature and hasn’t been revoked, it is simply permitted to be loaded and executed.

In other words, there is nothing in PatchGuard that has any bearing on Purple Pill. Alex’s program essentially exploits a bug in a (perfectly legitimate) third party driver with a valid signature. Unless his program goes out of its way to do something that would attract the attention of PatchGuard (which would not be necessary to achieve the task of simply mapping code into kernel mode and calling an entrypoint on the loaded image), PatchGuard can be said to be completely unrelated to Purple Pill. And although I wasn’t involved in the creation of Purple Pill, I can tell you that there ought to have been no need to do anything that would have aroused PatchGuard’s ire in order to accomplish what Alex’s program does.

Furthermore, it’s actually a fact that PatchGuard v3 has been available for a lot longer than Purple Pill has (Windows Server 2008 – formerly Windows Server Codename “Longhorn” – has shipped PatchGuard v3 since at least Beta 3), further dispelling the myth that the two are connected in any way. Microsoft simply chose this “Patch Tuesday” to publicly announce PatchGuard v3 and begin pushing it out to users via Windows Update. In this case, the coincidence is just that – a coincidence. You can easily verify this yourself, as the code that I posted for disabling PatchGuard v2 doesn’t in fact work on Windows Server 2008 Beta 3; you’ll get a bugcheck 0x109 (CRITICAL_STRUCTURE_CORRUPTION) within a few minutes of trying to bypass it using the method I provided an implementation for in that paper.

It can sometimes be confusing just what the relation between KMCS and PatchGuard is, especially because both are at present only implemented on x64 systems, and, at least in PatchGuard’s case, PatchGuard on Vista (for some reason) seemed to get more press than PatchGuard on Windows Server 2003 (despite the fact that they are essentially the same). However, with some careful consideration as to what the two technologies do, it’s clear that there isn’t really a direct relation.

You can write a (complete) minidump of a process from TaskMgr in Vista

Friday, August 10th, 2007

One of the often overlooked enhancements that was made to Windows with the release of Vista is the capability to write a complete (large) minidump describing the state of a process from Task Manager. To use this functionality, switch to the Processes tab in Task Manager, open the right click (context) menu for a process to which your user account has access, and select Create Dump File.

Although not as handy as having an “in-box” debugger (ntsd was most regrettably removed from the Vista distribution, on the grounds that too many users were confusing it with the typically more up to date Debugging Tools for Windows distribution of ntsd), Microsoft has thrown the developer crowd at least something of a bone with the dump file support in Task Manager. (It’s at least easier to talk a non-developer through getting a dump via Task Manager than via ntsd, or so one would suppose.)

The create dump file option writes a full minidump out to %temp%\exename.dmp. The dump is large and describes a fairly complete state of the process, so it would be a good idea to compress it before transfer. (I don’t know of any option to generate summary dumps that ships with the OS. However, to be honest, unless space is of an extreme concern, I don’t know why anyone would want to write a summary dump if they are manually gathering a dump – full state information is definitely worth a few extra minutes of transfer time on a typical cable/DSL connection.)
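Once you have the .dmp file in hand, it can be opened directly in the Debugging Tools for Windows debuggers; for instance (the dump path and file name here are illustrative):

```shell
# Open the dump in cdb, run the canned triage analysis, then quit
cdb -z "%TEMP%\myapp.dmp" -c "!analyze -v; q"
```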

While I’d still prefer having a full debugger shipped with the OS (still powerful, even without symbol support), the new Task Manager support is definitely better than nothing. Though, I still object to the line of thought that it’s better to remove developer tools from the default install because people might accidentally run an older version that shipped with the OS instead of the newer version they installed. Honestly, if a person is enough of a developer to understand how to work ntsd, they had damn well better know which version they are starting (which is really not that much of a feat, considering that the start up banner for ntsd prints out the version number on the first line). If someone is really having that much trouble with launching the wrong version of the debugger, in my expert opinion that is going to be the least of their worries when it comes to effectively debugging anything.

(</silly_rant> – still slightly annoyed at losing out on the debuggers on the default Vista install [yep, I used them!])

How least privilege is that service, anyway (or much ado about impersonation) – part 2

Wednesday, August 8th, 2007

Last time, I described some of the details behind impersonation (including a very brief overview of some of the dangers of using it improperly, and how to use impersonation safely via Security Quality of Service). I also mentioned that it might be possible to go from a compromised LocalService / NetworkService to full LocalSystem access in some circumstances. This article expands upon that concept, and ties together just what impersonation means for low-privileged services and the clients that talk to them.

As I mentioned before, impersonation is not really broken by design as it might appear at first glance, and it is in fact possible to use it correctly via setting up a correct SQOS when making calls out to impersonation-enabled IPC servers. Many things do correctly use impersonation, in fact, just to be clear about that. (Then again, many things also use strcpy correctly (e.g. after an explicit length check). It’s just the ones that don’t which get all the bad press…)

That being said, as with strcpy, it can be easy to misuse impersonation, often with dangerous consequences. As they say, the devil is often in the details. The fact is that there exist a great many things out there which simply don’t use impersonation correctly. Whether this is just due to how they are designed or to ignorance of the sensitive nature of allowing an untrusted server to impersonate one’s security context is debatable (though in many cases I would tend towards the latter), but for now, many programs just plain get impersonation wrong.

Microsoft is certainly not ignorant of the matter (for example, David’s post describes how Office wraps CreateFile to ensure that it never gets tricked into allowing impersonation: through the proper use of SQOS, the wrapper by default doesn’t permit remote servers to fully impersonate the caller). However, the defaults remain permissive, at least as far as APIs that connect to impersonation-enabled servers (e.g. named pipes, RPC, LPC) go, and by default allow full impersonation by the server unless the client program explicitly specifies a different SQOS. Even the best people make mistakes, and Microsoft is certainly no exception to this rule – programmers are, after all, only human.

In retrospect, if the security system were first being designed in today’s day and age instead of back in the days of NT 3.1, I’m sure the designers would have chosen a less security sensitive default, but the reality is that the default can probably never change due to massive application compatibility issues.

Back to the issue of LocalService / NetworkService and svchost, however. Many of these low privileged services have “careless” clients that connect to impersonation-enabled IPC servers with high privileges, even in Windows Server 2008. Moreover, due to the fact that LocalService / NetworkService isolation is from a security standpoint all but paper thin prior to Windows Server 2003 (and better, though only in the non-shared-process, that is, non-svchost case in Vista and Windows server 2008), the sort of additive attack surface problem I described in the previous article comes into play. To give a basic example, try attaching to the svchost that runs the “LocalServiceNetworkRestricted” service group, including Eventlog, Dhcp (the DHCP client), and several other services (in Windows Server 2008) and setting the following breakpoint (be sure to disable HTTP symbol server access beforehand or you’ll deadlock the debugger and have to reboot – another reason why I dislike svchost services in general):

bp RPCRT4!RpcImpersonateClient "kv ; gu ; !token ; g"

Then, wait for an event log message to be written to the system event log (a fairly regular occurrence, though if you want you can use msg.exe to send a TS message from any account if you don’t want to wait, which will result in the message being logged to the system event log immediately courtesy of the hard error logging facility). You’ll see something like this:

Call Site
RPCRT4!RpcImpersonateClient
wevtsvc!EvtCheckAccess+0x68
wevtsvc!LegacyAccessCheck+0x15e
wevtsvc!ElfrReportEventW+0x2b2
RPCRT4!Invoke+0x65
[…]

TS Session ID: 0x1
User: S-1-5-18 (LocalSystem)
Groups:
00 S-1-5-32-544 (Administrators)
Attributes – Default Enabled Owner
01 S-1-1-0
Attributes – Mandatory Default Enabled
02 S-1-5-11
Attributes – Mandatory Default Enabled
03 S-1-16-16384 (System Integrity)
Attributes – GroupIntegrity GroupIntegrityEnabled
Primary Group: S-1-5-18
Privs:
00 0x000000002 SeCreateTokenPrivilege Attributes –
[…]
Auth ID: 0:3e7 (SYSTEM_LUID)
Impersonation Level: Impersonation
TokenType: Impersonation

Again, checking winnt.h, it is immediately obvious that the eventlog service (which runs as LocalService) gets RPC requests from LocalSystem at System integrity level, with the caller enabling full impersonation and transferring all of its far-reaching privileges, many of them alone enough to completely compromise the system. Now, this in and of itself might not be so bad, if not for the fact that a bunch of other “non-privileged” services share the same effective security context as eventlog (thanks to the magic of svchost), such as the DHCP client, significant parts of the Windows Audio subsystem (in Vista, or in Srv08 if you enable audio), and various other services (such as the Security Center / Peer Networking Identity Manager / Peer Name Resolution Protocol services, at least in Vista).

What does all of this really mean? Well, several of those above services are network facing (the DHCP client certainly is), and they all likely expose some sort of user-facing IPC interface as well. Due to the fact that they all share the same security context as eventlog, a compromise in any one of those services could trivially be escalated to LocalSystem by an attacker who is the least bit clever. And that is how a hypothetical vulnerability in a non-privileged network-facing service like the DHCP client might get blown up into a LocalSystem compromise, thanks to svchost.

(Actually, even the DHCP client alone seems to expose its own RPC interface that is periodically connected to by LocalSystem processes who allow impersonation, so in the case of an (again hypothetical) vulnerability in the DHCP client service, one wouldn’t even need to go to the trouble of attacking eventlog, as the current service would be enough.)

The point of this series is not to point fingers at the DHCP / Windows Audio / Eventlog (or other) services, however, but rather to point out that many of the so-called “low privileged” services are not actually as low privileged as one might think in the current implementation. Much of the fault here actually lies more with the programs that connect to these low-privileged services with full impersonation enabled than with the services themselves, but the end result is that many of these services are not nearly as privilege-isolated as we would prefer to believe, due to the fact that they are called irresponsibly (e.g. somebody forgot to fill out a SECURITY_QUALITY_OF_SERVICE and accepted the defaults, which amounts to completely trusting the other end of the IPC call).

The problem is even more common when it comes to third party software. I would put forth that at least some of this is a documentation / knowledge transfer problem. For example, how many times have you seen SECURITY_QUALITY_OF_SERVICE mentioned in the MSDN documentation? The CreateFile documentation mentions SQOS-related attributes (impersonation restriction flags) in passing, with only a hint of the trouble you’re setting yourself up for by accepting the defaults. This is especially insidious with CreateFile, as unlike other impersonation-enabled APIs, it’s comparatively easy to sneak a “bad” filename that points to a dangerous, custom impersonation-enabled named pipe server anywhere a user-specified file is opened and then written to (other impersonation attack approaches typically require that an existing, “well-known” address / name for an IPC server be compromised, as opposed to the luxury of being able to set up a completely new IPC server with a unique name that a program might be tricked into connecting to).

The take-home for this series is then to watch out when you’re connecting to a remote service that allows impersonation. Unless absolutely necessary you should specify the minimum level of impersonation (e.g. SecurityIdentification) instead of granting full access (e.g. SecurityImpersonation, the default). And if you must allow the remote service to use full impersonation, be sure that you aren’t creating a “privilege inversion” where you are transferring high privileges to an otherwise low-privileged, network-facing service.

Oh, and just to be clear, this doesn’t mean that eventlog (or the other services mentioned) are full of security holes out of the box. You’ll note that I explicitly used a hypothetical vulnerability in the DHCP client service for my example attack scenario. The impersonation misuse does, however, mean that much of the work that has been done in terms of isolating services into their own compartmentalized security contexts isn’t exactly the bulletproof wall one would hope for, for most LocalService / NetworkService processes (and especially those sharing the same address space). Ironically, though, in certain respects it is the callers of these services that share part or all of the blame (depending on whether the service really requires the ability to fully impersonate its clients like that or not).

As a result, at least from an absolute security perspective, I would consider eventlog / DHCP / AudioSrv (and friends) just as “LocalSystem” in Windows Server 2008 as they were back in Windows 2000, because the reality is that if any of those services are compromised, in today’s (and tomorrow’s, with respect to Windows Server 2008, at least judging from the Beta 3 timeframe) implementation, the attacker can elevate themselves to LocalSystem if they are sufficiently clever. That’s not to say that all the work that’s been done since Windows 2000 is wasted, but rather that we’re hardly “all of the way there” yet.

How least privilege is that service, anyway (or much ado about impersonation) – part 1

Monday, August 6th, 2007

Previously, I discussed some of the pitfalls with LocalService and NetworkService, especially with respect to pre-Windows-Vista platforms (e.g. Windows Server 2003).

As I alluded to in that post, however, a case can be made that there still exists a less than ideal circumstance even in Vista / Srv08 with respect to how the least-privilege service initiative has actually panned out. Much of this is due to the compromise between performance and security that was made with packing many unrelated (or loosely related) services into a single svchost process. On Windows Server 2003 and prior platforms, the problem is somewhat more exacerbated, as it is much easier for services running in different processes to directly interfere with each other while running as LocalService / NetworkService, due to a loose setting of the owner field in the process object DACL for LocalService / NetworkService processes.

As it relates to Vista, and downlevel platforms as well, many of these “low privileged” services that run as LocalService or NetworkService are really highly privileged processes in disguise. For a moment, assume that this is actually a design requirement (which is arguable in many cases) and that some of these “low privileged” services are required to be given dangerous abilities (where “dangerous” is defined in this context as something that could be leveraged to take control of the system or otherwise elevate privileges). A problem then occurs when services that really don’t need to be given high privileges are mixed into the same security context (either the same process in Vista, or the same account (LocalService / NetworkService) in downlevel systems). This is due to the obvious fact that mixing code with multiple privilege levels in a way such that code at a lower privilege level can interfere with code at a higher privilege level represents a clear break in the security model.

Now, some of you are by this point probably thinking “What’s he talking about? LocalService and NetworkService aren’t the same thing as LocalSystem.”, and to a certain extent, that is true, at least when you consider a trivial service on its own. There are many ways in which these “low privileged” services are actually more privileged than might meet the eye at first glance, however. Many of these “low privileged” services happen to work in such a way that although they don’t explicitly run with System/Administrator privileges, if the service “really wanted to”, it could get high privileges with the resources made available to it. Now, in my mind, when you are considering whether two things are equivalent in privilege level, the proper thing to do is to consider the maximum privilege set a particular program could run with, which might not be the same as the privilege set of the account the process runs as.

To see what I mean, it’s necessary to understand an integral part of the Windows security model called impersonation. Impersonation allows one program to “borrow” the security context of another program, with that program’s consent, for purposes of performing a particular operation on behalf of that program. This is classically described as a higher privileged server that, when receiving a request from a client, obtains the privileges of the client and uses them to carry out the task in question. Thus, impersonation is often seen as a way to ensure that a high privileged server is not “tricked” into doing a dangerous operation on behalf of a client, because it “borrows” the client’s security context for the duration of the operation it is performing on behalf of that client. In other words, for the duration of that operation, the server’s privileges are effectively the same as the client’s, which often results in the server appearing to “drop privileges” temporarily.

Now, impersonation is an important part of practically all of the Windows IPC (inter-process communication) mechanisms, such as named pipes, RPC, LPC, and so forth. Aside from the use of impersonation to authoritatively identify the identity of a caller for purposes of access checks, many services use impersonation to “drop privileges” for an operation to the same level of security as a caller (such that if a “plain user” tries to make a call that does some operation requiring administrative privileges, even if the service was running with those administrative privileges originally, any resources accessed by the server during impersonation are treated as if the “plain user” accessed them. This prevents the “plain user” from successfully performing the dangerous / administrative task without the proper privileges being granted to it).

All this is well and fine, but there’s a catch: impersonation can also effectively elevate the privileges of a thread, not just lower them. Therein lies the rub with svchosts and the LocalService / NetworkService accounts, since while these accounts may appear to be unprivileged at first glance, many of them operate RPC or LPC or named pipe servers that privileged clients connect to on a regular basis. If one of these “low privileged” services is then compromised, although an attacker might not immediately be able to gain control of the system by elevating themselves to LocalSystem privileges, with a bit of patience he or she can still reach the same effect. Specifically, the attacker need only take over the server part of the RPC / LPC / other IPC interface, wait for an incoming request, then impersonate the caller. If the caller happens to be highly privileged, then poof! All of a sudden the lowly, unprivileged LocalService / NetworkService just gained administrative access to the box.

Of course, the designers of the NT security system foresaw this problem a mile away, and built features into the security system to prevent its abuse. Foremost, the caller gets to determine to what extent the server can impersonate it, via the use of a little-known attribute known as the Security Quality of Service (SQOS). Via the SQOS attribute (which can be set on any call that enables a remote server to impersonate one’s security context), a client can specify that a server can query its identity but not use that identity to perform access or privilege checks, as opposed to giving the server free rein over its identity once it connects. Secondly, in the post-Windows 2000 era, Microsoft restricted impersonation to accounts with a newly-created privilege, SeImpersonatePrivilege, which is by default only granted to LocalService / NetworkService / LocalSystem and administrative accounts.

So, impersonation isn’t really broken by design (and indeed, David LeBlanc has an excellent article describing how, if used correctly, impersonation isn’t the giant security hole that you might think it is).

That being said, impersonation can still be very dangerous if misused (like many other aspects of the security system).

Coming up: A look at how all of this impersonation nonsense applies to LocalService / NetworkService (and svchost processes) in Windows (past and future versions).

“These aren’t the network connections you’re looking for”

Saturday, August 4th, 2007

It seems that, among other things, my Srv08 Beta 3 box is practicing its Jedi mind tricks whenever I try to reconnect a broken RDP session to it:

These aren't the network connections you're looking for

I guess it really doesn’t want me to be able to get back on after a connectivity interruption. I certainly hope this bug doesn’t make it out to RTM, as seamless RDP session reconnects are one of my favorite RDP features.

After all, a user shouldn’t really have to think about it when they switch between network connections, in my opinion. Programs should really be designed to more seamlessly resume across minor connectivity hiccups than is commonplace nowadays, especially with mobile Internet connectivity becoming mainstream. I kind of like this over conventional SSH/screen, in that at least for RDP, session reconnect doesn’t require any user interaction. I suppose with an autoreconnecting SSH client and some scriptable commands to run screen immediately on connecting one could hack up a similar end user experience, though I doubt it would be as seamless. One thing that SSH/screen have on RDP is better shadowing support, allowing for more than one session to shadow a particular target session at the same time in particular.

There are still some flaws with RDP session reconnect in general (especially relating to the fact that window Z order tends to get something akin to randomized when you reconnect, and sometimes windows mysteriously get shrunk to the minimum size requiring manual resizing to fix), but it’s mostly reliable in my experience (discounting this Srv08 weirdness, that is).