Archive for September, 2007

Heading to Blue Hat…

Wednesday, September 26th, 2007

Today, I’m heading out to Redmond for this coming Blue Hat. My first time going to Blue Hat, but it’ll hopefully be interesting, to say the least :)

I’m definitely looking forward to seeing some of the presentations, and meeting some interesting people along the way.

Common WinDbg problems and solutions

Monday, September 24th, 2007

When you’re debugging a program, the last thing you want to have to deal with is the debugger not working properly. It’s always frustrating to get sidetracked on secondary problems when you’re trying to focus on tracking down a bug, and especially so when problems with your debugger cause you to lose a repro or burn excessive amounts of time waiting around for the debugger to finish doing who knows what that is taking forever.

This is something that I get a fair number of questions about from time to time, and so I’ve compiled a short list of some common issues that one can easily get tripped up by (and how to avoid or solve them).

  1. I’m using ntsd and I can’t get symbols to load, or most of the debugger extension commands (!commands) don’t work. This usually means that you launched the ntsd that ships with the operating system (prior to Windows Vista), which is much older than the one shipping with the debugger package. Because it is in the system directory, it will be in your executable search path.

    To fix this problem, use the ntsd executable in the debugger installation directory.

  2. WinDbg takes a very long time to process module load events, and it is using max processor time (spinning) on one CPU. This typically happens if you have many unqualified breakpoints that track module load events (created via bu) saved in your workspace. This problem is especially noticeable when you are working with programs that have a very large number of decorated C++ symbols, such as debug builds of programs that make heavy use of the STL or other template classes. Unqualified breakpoints are expensive in general due to forcing immediate symbol loads of all modules, but moreover they also force the debugger to undecorate and perform pattern matches against every symbol in a module that is being loaded, for every unresolved breakpoint.

    If you allow a large number of unqualified breakpoints to become saved in a default workspace, this can make the debugger appear to be extremely slow no matter what program you are debugging.

    To avoid getting bitten by this problem, don’t use unqualified breakpoints (breakpoints without a modulename! prefix on their address expression) unless absolutely necessary. Also, it’s typically a good idea to clear all your breakpoints before you save your workspace if you don’t need them to be saved for your next debugging session with that debugger workspace (by default, bu breakpoints are persisted in the debugger workspace, unlike bp breakpoints which go away after every debugging session). If you are in the habit of saving the workspace every time you attach to a running process, and you often use bu breakpoints, this will tend to clutter up the user default workspace and can quickly lead to very poor debugger performance if you’re not careful.

    You can use the bc command to delete breakpoints (bc * to remove all breakpoints), although you will need to save the workspace to persist the changes. If the problem has gotten to the point where it’s not possible to even get past module loading in a reasonable amount of time so that you can use bc * to clear out saved breakpoints, you can remove the contents of the HKCU\Software\Microsoft\Windbg\Workspaces registry key and subkeys to return WinDbg to a pristine state. This will wipe out your saved debugger window positions and other saved debugger settings, so use it as a last resort.

  3. WinDbg takes a very long time to process module load events, but it is not consuming a lot of processor time. This typically means that your symbol path includes either a broken HTTP symbol store link or a broken UNC symbol store path. A non-responsive path in your symbol path will cause any operation that tries to load symbols for a module to take a long time to complete as a network timeout will be occurring over and over again.

    Use !sym noisy, followed by .reload /f to determine what part of your symbol path is not working correctly. Then, fix or remove the offending part of the symbol path.

    This problem can also occur when you are debugging a program that is in the packet path for packets destined to a location on the symbol path. In this case, the typical workaround I recommend is to set an empty symbol path, attach to the process in question, write a dump file, and then detach from the process. Then, restore the normal symbol path and open the dump file in the debugger, and issue a .reload /f command to force all symbols to be pre-cached ahead of time. After all symbols are pre-cached in the downstream store cache, change the symbol path to only reference the downstream store cache location and not any UNC or HTTP symbol server paths, and attach the debugger to the process in the packet path for symbol server access.

  4. WinDbg refuses to load symbols for a module that I know the symbol server has symbols for. This issue can occur if WinDbg has previously tried (and failed) to download symbols for a module. There appears to be a bug in dbghelp’s symbol server support which can sometimes result in partially downloaded PDB files being left in the downstream store cache. If this happens, future attempts to access symbols for the module will fail with an error saying that symbols for the module cannot be found.

    If you turn on noisy symbol loading (!sym noisy), a more descriptive error is typically given. If you see a complaint about E_PDB_CORRUPT, then you are probably falling victim to this issue. The debugger output that indicates this problem would look like something along the lines of this:

    DBGHELP: c:\symbols\ntdll.pdb\2744327E50A64B24A87BDDCFC7D435A02\ntdll.pdb - E_PDB_CORRUPT

    If you encounter this problem, simply delete the .pdb named in the error message and retry loading symbols via the .reload /f <modulename> command.

  5. WinDbg hangs and never comes back when I attach to a specific process, such as an svchost instance. If you’re sure that you aren’t experiencing a problem with a broken symbol path or unqualified module load tracking breakpoints being saved in your workspace, and the debugger never comes back when attaching to a certain process (or almost always hangs after the first command when attaching to the process in question), then the process you are debugging may be in a code path responsible for symbol loading.

    This problem is especially common if you are debugging an svchost instance, as there are a lot of important but unrelated pieces of code running in the various svchost instances, some of which are critical for network symbol server support to work. If you are debugging a process in the critical path for network symbol server support, and you have a symbol path with a network component set, then you may cause the debugger to deadlock (hang forever) the first time you try and load symbols.

    One example of a situation that can cause this is if you are debugging code in the same svchost instance as the DNS cache service. In this case, when you try to load symbols and you have an HTTP symbol server link in your symbol path, the debugger will deadlock because it will try and make an RPC call to the DNS cache service when it tries to resolve the hostname of the server referenced in your symbol path. Because the DNS cache service will never respond until the debugger resumes the process, and the debugger will never resume the process until it gets a response from the RPC request to the DNS cache service, your debugging session will hang indefinitely.

    Note that if you are simply debugging something in the packet path of a symbol server store, you will typically see the debugger become unresponsive for long periods of time but not hang completely. This is because the debugger can handle network timeouts (if somewhat slowly) and will eventually fail the request to the network symbol path. However, if the debugger tries to make an IPC request of some sort to the process being debugged, and the IPC request doesn’t have any built-in timeout (most local IPC mechanisms do not), then the debugger session will be lost for good.

    This problem can be worked around similarly to how I typically recommend users deal with slow module loading or failed symbol server accesses with a program in the packet path for a symbol server referenced in the symbol path. Specifically, it is possible to pre-cache all symbols for the process by creating a dump of the process from a debugger instance with an empty symbol path, and then detaching and opening the dump with the full symbol path and forcing a download of all symbols. Then, start a debugging session on the live process with a symbol path that references only the local downstream store into which the symbols were downloaded, in order to prevent any dangerous network accesses from happening.

    Another common way to get yourself into this sort of debugger deadlock problem is to use the clipboard to paste into WinDbg while you are debugging a program that has placed something into the clipboard. This results in a similar deadlock as WinDbg may get blocked on a DDE request to the clipboard owner, which will never respond by virtue of being debugged. In that case, the workaround is simply to be careful about copying or pasting text into or out of WinDbg.

  6. Remote debugging with -remote or .server is flaky or stops working properly after a while. This can happen if all debuggers in the session aren’t running the same debugger version.

    Make sure that all peers in the remote debugging scenario are using the (same) latest debugger version. If you mix and match debugger versions with -remote, things will often break in strange and hard to diagnose ways in my experience (there doesn’t seem to be a whole lot of graceful support for backwards or forwards compatibility with respect to the debugger remoting protocol).

    Also, several recent releases of the debugger package didn’t work at all in remote debugging mode on Windows 2000. This is, as far as I know, fixed in the latest release.
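The "local downstream store only" workaround from items 3 and 5 can be sketched in a few lines of Python. This is only an illustration, not anything that ships with the debugger: the function names are made up for this example, it assumes the usual srv*downstream*upstream and symsrv*dll*downstream*upstream element syntax, and it treats anything starting with \\, http:// or https:// as a network location. Other element types (such as cache*) are passed through untouched if they look local.

```python
def is_network_location(location):
    """Return True for UNC shares and HTTP(S) symbol servers."""
    return location.lower().startswith(("\\\\", "http://", "https://"))


def local_only_symbol_path(symbol_path):
    """Strip every network component out of a debugger symbol path.

    Hypothetical helper: keeps local downstream stores and plain local
    directories, drops UNC and HTTP(S) stores.
    """
    kept = []
    for element in symbol_path.split(";"):
        element = element.strip()
        if not element:
            continue
        parts = element.split("*")
        head = parts[0].lower()
        if head == "srv":
            prefix, stores = parts[:1], parts[1:]
        elif head == "symsrv":
            # parts[1] is the symbol server DLL name, not a store.
            prefix, stores = parts[:2], parts[2:]
        else:
            prefix, stores = None, None

        if stores is None:
            # Plain directory (or an element type this sketch does not parse).
            rewritten = "" if is_network_location(element) else element
        else:
            local = [s for s in stores if s and not is_network_location(s)]
            rewritten = "*".join(prefix + local) if local else ""

        if rewritten and rewritten not in kept:
            kept.append(rewritten)
    return ";".join(kept)
```

Feeding it something like srv*c:\symbols*http://msdl.microsoft.com/download/symbols yields srv*c:\symbols, which is the sort of path you would set before attaching to the live process once the symbols have been pre-cached from the dump.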

Most of these problems are simple to fix or avoid once you know what to look for (although they can certainly burn a lot of time if you’re caught unaware, having done that myself while learning about these “gotchas”).

If you’re experiencing a weird WinDbg problem, you should also not be shy about debugging the malfunctioning debugger instance itself. Often, taking a stack trace of all threads in the problematic debugger instance will be enough to give you an idea of what sort of problem is holding things up (remember that the Microsoft public symbol server has symbols for the debugger binaries as well as the OS binaries).
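For the corrupt PDB problem in item 4, if you have a saved debugger log with noisy symbol loading turned on, a short script can pull out the files that need deleting. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the pattern is modeled only on the single DBGHELP line quoted in item 4, so other debugger versions may word the message differently. It only lists the paths and leaves the actual deletion to you.

```python
import re

# Modeled on the one example line quoted in item 4; accepts either a plain
# hyphen or an en dash before the error name.
CORRUPT_LINE = re.compile(
    r"^\s*DBGHELP:\s*(?P<path>.+?\.pdb)\s*[-\u2013]\s*E_PDB_CORRUPT\s*$",
    re.IGNORECASE,
)


def corrupt_pdb_paths(log_text):
    """Return the PDB paths reported as E_PDB_CORRUPT, in first-seen order."""
    paths = []
    for line in log_text.splitlines():
        match = CORRUPT_LINE.match(line)
        if match and match.group("path") not in paths:
            paths.append(match.group("path"))
    return paths
```

After deleting the listed files, .reload /f <modulename> should fetch fresh copies from the symbol server.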

VMware Server 1.0.4 released (security fixes)

Friday, September 21st, 2007

A few days ago, VMware put out VMware Server 1.0.4. Although there appear to be some plain old bug fixes, the real news here is that a couple of very nasty security bugs have been fixed, including “VM-breakout” class bugs that could allow a malicious VM to compromise the host.

I’d imagine that it’s only a matter of time before VM monitor / hypervisor bugs become as big a deal as standard operating system bugs. For those that are already deploying virtualized infrastructure, that is probably already true even now to an extent.

What’s nasty about VM breakout bugs is that they can very easily lead to large numbers of machines being compromised in a fairly stealthy way. There’s been a whole lot of good discussion about hypervisor-based rootkits and malware recently. Although detecting an unexpected layering of hypervisor is one thing, telling the difference between a hypervisor that’s supposed to be running and a compromised hypervisor that’s supposed to be running, from the context of a guest, is an entirely different matter altogether. To a clever attacker, a hypervisor compromise is a pretty scary thing.

Now, I’m not trying to call fire and brimstone down on VMware or anything like that, and the bug in this case is reportedly not remotely exploitable without remote admin access to a guest. But all the same, hypervisor / VM monitor bugs are certainly nothing to be shaking a stick at.

Anyways, if you use VMware and you’ve got VMs that are either untrusted or allow non-administrative access, it’s time to (borrowing a friend’s term) ready the patch brigade if you have not already.

Useful debugger commands: .writemem and .readmem

Thursday, September 20th, 2007

From time to time, it can be useful to save a chunk of memory for whatever reason when you’re debugging a program. For instance, you might need to capture a long buffer argument to a function for later analysis (perhaps with a custom analysis tool outside the scope of the debugger).

There are a couple of options built in to the debugger to do this. For example, if you just want to save the contents of memory for later perusal, you could always write a complete minidump of the target. However, this has a few downsides; for one, unless you build dump file processing capability into your analysis program, dump files are typically going to be less than easily accessible to simple analysis tools. (Although one could write a program utilizing MiniDumpReadDumpStream, this is more work than necessary.)

Furthermore, complete dumps tend to be large, and in the case of a kernel debugger connection over serial port, it can take many hours to save a kernel memory dump just to gain access to a comparatively small region of memory.

Instead of writing a dump file, another option is to use one of the display memory commands to save the contents of memory to a debugger log file. For instance, one might use “db address len”, write it to a log file, and parse the output. This is much less time-consuming than a kernel memory dump over kd, and in some cases it might be desirable to have the formatted hex dump that db provides in plain text, but if one just wants the raw memory contents, that too is less than ideal.
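To give an idea of what the "parse the output" step involves, here is a rough Python sketch that turns db output from a log file back into raw bytes. It assumes the usual db layout (an address, then up to sixteen hex bytes with a dash between the eighth and ninth, then an ASCII column), and the function name is just something made up for this example. It does not check that the addresses are contiguous, and it gives up if the debugger printed ?? for unreadable memory.

```python
import re

# Address (optionally with the 64-bit backtick separator), then hex bytes
# separated by single spaces or a dash. The ASCII column is set off by a
# second space, which stops the byte match.
DB_LINE = re.compile(
    r"^\s*(?P<addr>[0-9a-fA-F`]+)\s+"
    r"(?P<bytes>(?:[0-9a-fA-F?]{2}(?:[ -]|$))+)"
)


def parse_db_output(log_text):
    """Rebuild raw bytes from the hex column of db output in a log."""
    data = bytearray()
    for line in log_text.splitlines():
        match = DB_LINE.match(line)
        if match is None:
            continue  # prompts, other command output, blank lines
        for token in re.findall(r"[0-9a-fA-F?]{2}", match.group("bytes")):
            if "?" in token:
                raise ValueError("unreadable memory at " + match.group("addr"))
            data.append(int(token, 16))
    return bytes(data)
```

It works, but it is a fair amount of fiddling just to get back to the bytes the debugger already had in hand.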

Fortunately, there’s a third option: the .writemem command, which as the name implies, writes an arbitrary memory range to a file in raw binary form. There are two arguments, a filename and a range. For instance, one usage might be:

.writemem C:\\Users\\User\\Stack.bin @rsp L1000

This command would write 0x1000 bytes of stack to the file. (Remember that address ranges may include a space-delimited component to specify the length.)

The command works on all targets, including when one is using the kernel debugger, making it the command of choice for writing out arbitrary chunks of memory.

There also exists a command to perform the inverse operation, .readmem, which takes the same arguments, but instead reads memory from the file given and writes it to the specified address range. This can be useful for anything from swapping out large arguments to a function at run-time to applying large patches that replace non-trivial sections of code as a whole.

Furthermore, because the memory image format used by both commands is just the raw bits from the target, it becomes easy to work with the written out data with a standard hex editor, or even a disassembler. (For instance, another common use case of .writemem, when dealing with self-modifying code, is to write the code out to a file after it has been finalized, and then load the resulting raw memory image up as raw opcodes in a more full-featured disassembler than the debugger.)
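Since the file is just raw bytes, a custom analysis tool needs very little code to consume it. As a small illustration, the Python sketch below walks a stack image like the one written in the earlier example as a series of 64-bit little-endian values. The function name is hypothetical, and the base address is something you have to note down yourself at the time of the capture (for example, the value of rsp), because the file itself carries no address information.

```python
import struct


def qwords(raw, base_address=0):
    """Pair each 64-bit little-endian value in a raw image with its address.

    Hypothetical helper; any trailing bytes that do not fill a whole
    8-byte slot are ignored.
    """
    usable = len(raw) - (len(raw) % 8)
    return [
        (base_address + index * 8, value)
        for index, (value,) in enumerate(struct.iter_unpack("<Q", raw[:usable]))
    ]
```

A typical use would be to read the file with open(path, "rb").read(), pass the result to qwords along with the captured stack pointer, and then filter for values that fall inside a module of interest.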

Never, ever, EVER wake a computer from suspend without user consent

Thursday, September 13th, 2007

I am not a happy camper.

Today, I got in to work and unpacked my laptop from my laptop bag and discovered that it had gone into hibernation due to a critically low battery event. That was fairly strange, because last night I had suspended my laptop (fully charged) and placed it into my laptop bag. Somehow, it managed to consume a full battery charge between my putting it into my bag last night, and my getting in to work.

This is obviously not good, because it meant that the laptop had to have been powered on while in my laptop bag for it to have possibly used that much battery power (a night in suspend is a drop in the bucket as far as battery life is concerned). Let me tell you a thing or two about laptop bags: they’re typically padded (in other words, also typically insulated to a degree) and generally don’t have a whole lot of ventilation potential designed into them, at least as far as laptop compartments go. Running a laptop in a laptop bag for a protracted period of time is a bad thing, that’s for certain.

Given this, I was not all that happy to discover that my laptop had indeed been resumed and had been running overnight in my laptop bag until the battery got low enough for it to go into emergency hibernate mode. (Fortunately, it appears to have sustained no permanent damage from this event. This time….)

So, I set out to find out the culprit. The event log showed the system clearly waking just seconds after the clock hit 3AM local time. Hmm. Well, that’s a rather interesting thing, because Windows Update is by default set to install updates at 3AM local time, and this Tuesday was Patch Tuesday. A quick examination of the Windows Update log (%SystemRoot%\WindowsUpdate.log) confirmed this fact:

2007-09-13 03:00:11:521 408 f4c AU The machine was woken up by Windows Update
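If you want to check whether this has been happening on your own machine, the log can be scanned for that message with a few lines of Python. This is only a sketch: the function name is made up, and both the field layout and the message text are assumed from the single log line quoted above, so other Windows Update versions may well log something different.

```python
import re
from datetime import datetime

# Layout assumed from the one quoted line:
# date, time with milliseconds, pid, tid, component, message.
WU_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2}):(?P<ms>\d{3})\s+"
    r"(?P<pid>\S+)\s+(?P<tid>\S+)\s+(?P<component>\S+)\s+"
    r"(?P<message>.*)$"
)
WAKE_TEXT = "woken up by Windows Update"


def find_update_wakes(log_text):
    """Return the timestamps of every 'woken up by Windows Update' entry."""
    wakes = []
    for line in log_text.splitlines():
        match = WU_LINE.match(line.strip())
        if match is None or WAKE_TEXT not in match.group("message"):
            continue
        stamp = datetime.strptime(
            match.group("date") + " " + match.group("time"),
            "%Y-%m-%d %H:%M:%S",
        )
        wakes.append(stamp.replace(microsecond=int(match.group("ms")) * 1000))
    return wakes
```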

Great. Well, according to the event logs, it ran for another 1 hour and 45 minutes or so before the battery got sufficiently low for it to go into hibernate. So, apparently, Windows Update woke my laptop, on battery, in my laptop bag to install updates. It didn’t even bother to suspend it after the fact (gee, thanks), leaving the system to keep running until either it ran out of battery or something broke due to running in a confined place with zero ventilation. Fortunately, I lucked out and the former happened this time.

But that’s not all. This is actually not the first time this has happened to me. In fact, on August 29 (last month), I got woken up at 3AM because Windows Update decided to install some updates in the middle of the night. That time, it apparently needed to reboot after installing updates, and I got woken up by the boot sound (thanks for that, Windows Update!). At the time, I wrote it off as “intended” behavior, as the system happened to be plugged in to wall power overnight and a friend pointed out to me that Windows Update states that while plugged in, it will resume a computer to install updates and put it back into suspend afterwards.

Well, that’s fine and all (aside from the waking me up at 3AM part, which sucked, but I suppose it was documented to be that way). Actually, I’ll take that back, my computer waking me up in the middle of the night to automatically install updates is far from fine, but that pales in comparison to what happened the second time around. Powering the system on while it was on battery power to install updates, however, is a completely different story indeed.

This is, in my opinion, a spectacular failure of the “left hand not knowing what the right is doing” sort at Microsoft. One of the really great things about Windows Vista was that it was taking back power management from all the uncooperative programs out there. Except for, I suppose, Windows Update.

Consider that for a portable (laptop/notebook) computer, it is often the case that it’s downright dangerous to just wake the computer at unexpected times. For example, what if I was on an airplane during takeoff and Windows Update decided that, in its vast and amazing knowledge of what’s best for the world, it would just power on my laptop and enable Bluetooth, 802.11, etc.? Or say my laptop was sitting in its laptop bag (somewhere I often leave it overnight to save myself the trouble of putting it there in the morning before I go to work), and it powers on to do intensive tasks like install updates and reboot with no ventilation to the system, and overheats (suffering irreparable physical damage as a result). Oh, wait, that is what happened… except that the laptop in question survived, this time around. I wonder if everyone else with a laptop running Windows Vista will be so lucky (or if I’ll be so lucky next time).

What if I was for some reason carrying my laptop in my laptop bag and, say, walking, going up a flight of stairs, running, whatnot, and Windows Update decided that it was so amazingly cool, that it would power on my computer without asking me and the hard drive crashed from being spun up while being moved under unsafe (for a hard drive) conditions?

In case whoever is responsible for this (amazingly negligent) piece of code in Windows Update ever reads this, let me spell it out for you. Read my lips (or words): It is unacceptable to wake my computer up without asking me. U-N-A-C-C-E-P-T-A-B-L-E, in case that was not clear. I don’t care what the circumstances are. It’s kind of like Fight Club: Rule no. 1 is that you do not do that, ever, no matter what the circumstances are. And I really dare anyone at Microsoft to say that by accepting the default settings for Windows Update, I was consenting to it running my laptop for prolonged periods of time in a laptop bag. Yeah, I thought so…

You can absolutely bet that I’ll be filing a support incident about this while I do my best to have this code, how shall I say, permanently evicted from Windows Update, on behalf of all laptop owners. I can see how someone might have thought that it would be a cool idea to install updates even if you suspend your computer at night (which being the default in Vista would happen often). However, it just completely, jaw-droppingly drops the ball with portable computers (laptops), in the absolute worst way possible. That is, one that can result in physical damage to the end user’s computer and/or permanent data loss. This is just so obvious to me that I literally could not believe what had happened when I woke up, that something that ships by default and is turned on by default with Vista would do something so completely stupid, so irresponsible, so negligent.

I was pretty happy with the power management improvements in Windows Vista up until now. Windows Update just completely spoiled the party.

The worst thing is that if the consequences of resuming laptops without asking weren’t already so blindingly obvious, this topic (forcing a system resume programmatically) comes up on the Microsoft development newsgroups from time to time, and it’s always shot down immediately because of the danger (yes, danger) of unexpected programmatic resumes with laptop computers.

(Sorry if this posting comes off as a bit of a rant. However, I don’t think anyone could argue that Windows Update automatically powering on the laptop in the scenarios I listed above could possibly be redeemable.)