Archive for November, 2007

The default invalid parameter behavior for the VC8 CRT doesn’t break into the debugger

Thursday, November 8th, 2007

One of the problems that confuses people from time to time here at work is that if you happen to hit a condition that trips the “invalid parameter” handler for VC8, and you’ve got a debugger attached to the process that fails, then the process mysteriously exits without giving the debugger a chance to inspect the condition of the program in question.

For those unfamiliar with the concept, the “invalid parameter” handler is a new addition to the Microsoft CRT, which kills the process if various invalid states are encountered. For example, dereferencing a bogus iterator in a release build might trip the invalid parameter handler if you’re lucky (if not, you might see random memory corruption, of course).

The reason why there is no debugger interaction here is that the default CRT invalid parameter handler (present in invarg.c if you’ve got the CRT source code handy) invokes UnhandledExceptionFilter in an attempt to (presumably) give the debugger a crack at the exception. Unfortunately, in reality, UnhandledExceptionFilter will just return immediately if a debugger is attached to the process, assuming that this will cause the standard SEH dispatcher logic to pass the event to the debugger. Because the default invalid parameter handler doesn’t really go through the SEH dispatcher but is in fact simply a direct call to UnhandledExceptionFilter, this results in no notification to the debugger whatsoever.

This counter-intuitive behavior can be more than a little bit confusing when you’re trying to debug a problem, since from the debugger, all you might see in a case like a bad iterator dereference would be this:

0:000:x86> g
ntdll!NtTerminateProcess+0xa:
00000000`7759053a c3              ret

If we pull up a stack trace, then things become a bit more informative:

0:000:x86> k
RetAddr           
ntdll32!ZwTerminateProcess+0x12
kernel32!TerminateProcess+0x20
MSVCR80!_invoke_watson+0xe6
MSVCR80!_invalid_parameter_noinfo+0xc
TestApp!wmain+0x10
TestApp!__tmainCRTStartup+0x10f
kernel32!BaseThreadInitThunk+0xe
ntdll32!_RtlUserThreadStart+0x23

However, while we can get a stack trace for the thread that tripped the invalid parameter event in a simple single-threaded case like this, adding multiple threads throws a wrench into the debuggability of the scenario. For example, consider a simple multithreaded test program along the lines of the sketch below (this example is being run as a 32-bit program under Vista x64, though the same principle should apply elsewhere).
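
A minimal sketch of such a test program might be something like the following (the worker thread and the particular way of tripping the invalid parameter handler are illustrative assumptions, not the exact test program):

#include <windows.h>
#include <process.h>
#include <vector>

// Worker thread that simply idles so that it outlives the main thread.
unsigned __stdcall IdleThread(void*)
{
    Sleep(INFINITE);
    return 0;
}

int wmain()
{
    // Make the process multithreaded.
    _beginthreadex(0, 0, IdleThread, 0, 0, 0);

    // Trip the invalid parameter handler with a bogus iterator dereference
    // (checked iterators are enabled by default in the VC8 release CRT).
    std::vector<int> v;
    return *v.begin();
}

Running this under the debugger and continuing past the initial process breakpoint, we might see the following: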

0:000:x86> g
ntdll!RtlUserThreadStart:
sub     rsp,48h
0:000> k
Call Site
ntdll!RtlUserThreadStart

What happened? Well, the last thread in the process here happened to be the newly created thread instead of the thread that called TerminateProcess. To make matters worse, the other thread (which was the one that caused the actual problem) is already gone, killed by TerminateProcess, and its stack has been blown away. This means that we can’t just figure out what’s happened by asking for a stack trace of all threads in the process:

0:000> ~*k

.  0  Id: 1888.1314 Suspend: -1 Unfrozen
Call Site
ntdll!RtlUserThreadStart

Unfortunately, this scenario is fairly common in practice, as most non-trivial programs use multiple threads for one reason or another. If nothing else, many OS-provided APIs internally create or make use of worker threads.

There is a way to extract useful information in a scenario like this, but it is unfortunately not easy to do after the fact, which means that you’ll need to have a debugger attached and at your disposal before the failure happens. The simplest way to catch the culprit red-handed here is to set a breakpoint on ntdll!NtTerminateProcess. (A conditional breakpoint could be employed to check for NtCurrentProcess, i.e. (HANDLE)-1, as the first parameter if the process frequently calls TerminateProcess, but this is typically not the case, and it is often sufficient to simply set a blind breakpoint on the routine.)

For example, in the case of the provided test program, we get much more useful results with the breakpoint in place:

0:000:x86> bp ntdll32!NtTerminateProcess
0:000:x86> g
Breakpoint 0 hit
ntdll32!ZwTerminateProcess:
mov     eax,29h
0:000:x86> k
RetAddr           
ntdll32!ZwTerminateProcess
kernel32!TerminateProcess+0x20
MSVCR80!_invoke_watson+0xe6
MSVCR80!_invalid_parameter_noinfo+0xc
TestApp!wmain+0x2e
TestApp!__tmainCRTStartup+0x10f
kernel32!BaseThreadInitThunk+0xe
ntdll32!_RtlUserThreadStart+0x23

That’s much more diagnosable than a stack trace for the completely wrong thread.

Note that from an error reporting perspective, it is possible to catch these errors by registering an invalid parameter handler (via _set_invalid_parameter_handler), which is roughly analogous to the mechanism one uses to register a custom handler for pure virtual function call failures.
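
For instance, a minimal sketch of installing such a handler might look like the following (the handler body is a hypothetical policy; note that in release builds the CRT passes null strings and a zero line number to the handler):

#include <cstdlib>
#include <windows.h>

// Hypothetical policy: break into an attached debugger rather than letting
// the default handler silently terminate the process via the Watson path.
void __cdecl MyInvalidParameterHandler(
    const wchar_t* expression,  // null in release builds
    const wchar_t* function,    // null in release builds
    const wchar_t* file,        // null in release builds
    unsigned int line,          // 0 in release builds
    uintptr_t reserved)
{
    if (IsDebuggerPresent())
        __debugbreak();

    // Otherwise, report the failure however the application prefers, then
    // terminate.
    TerminateProcess(GetCurrentProcess(), ERROR_INVALID_PARAMETER);
}

int wmain()
{
    _set_invalid_parameter_handler(MyInvalidParameterHandler);

    // ... application code ...
    return 0;
}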

Linux features that I’d like to see in Windows: iptables

Wednesday, November 7th, 2007

One of the things that I really miss about Linux-based boxen when I’m working with Windows is that the built-in Windows firewall capabilities are just downright anemic when compared to the power and flexibility of iptables.

Sure, there’s Windows Firewall, RRAS, Advanced TCP/IP Filtering (which is anything but advanced), and IPSec Policies that come with Windows and allow you to firewall things off. Unfortunately, while Windows Firewall and RRAS (with respect to “Basic Firewall” in Windows Server 2003) do a passable job as an inbound host firewall, there is really just nothing that comes with Windows that is reasonably good at managing a complicated network (e.g. multi-machine) firewall.

RRAS has built-in static packet filtering, but it’s ridiculously limited given that it is ostensibly oriented towards network administrators (who should, theoretically, know what they’re doing). You essentially have the option of creating either an allow list with a default deny, or a deny list with a default allow, and that’s it. (There’s also not really any support for stateful packet filtering in this mode of RRAS, as is available from Basic Firewall, although you can at least differentiate between established and non-established TCP packets. Barely.)

IPSec Policies are slightly less limited than RRAS static packet filtering, but they’re still nowhere near expressive enough for any sort of non-trivial network firewall configuration. You can at least mix and match allow and deny rules, but the ordering is only based on netmask and, as far as I know, is otherwise not user controllable.

Iptables and ip_conntrack, on the other hand, are highly expressive and allow one to comparatively easily create rules that are either downright impossible or extremely difficult to express in any manageable fashion with the built-in Windows firewall tools (e.g. requiring convoluted use of both RRAS static packet filtering and IPSec Policies). As an added bonus, they also have highly flexible NAT capabilities built in that easily integrate and cooperate with firewall rules.

Now, it’s not really a matter of there being anything that is technically wrong or deficient with the Windows networking stack that would prevent there being a reasonably high quality firewall, but more that just nobody has gone out and done it and shipped it with the platform. (No, I don’t count the “personal firewall” type things that ship with XYZ AV/”Home Security” product as anywhere in this category. I don’t trust those far enough to not create security holes, much less act competently as a firewall.)

There are a number of various third party firewall packages out there, but I tend to be fairly suspicious of installing third party code on my boxes in general, much less third party kernel level code that is facing the network outside of any firewall or packet filtering. Most of them don’t seem to have anywhere near the sort of capabilities that iptables provides, anyway.

Invasive DRM systems are dangerous from a security perspective

Tuesday, November 6th, 2007

In recent times, it seems to be an increasing trend for anti-copying software DRM systems to install invasive privileged software. For example, there’s the ever so infamous “Sony DRM Rootkit” that Mark Russinovich publicly exposed some time ago. Unfortunately, software like this is becoming commonplace nowadays.

Most DRM technologies tend to use unsupported and/or “fringe” techniques to make themselves difficult to understand and debug. However, more often than not, the DRM authors get little things wrong with their anti-debug/anti-hack implementations, and when you’re running in a privileged space, “little things wrong” can translate into a security vulnerability.

Yesterday, Microsoft published a security advisory concerning a privileged DRM system (Macrovision’s SafeDisc, secdrv.sys) that happened to have a security bug in it. Security bugs in privileged parts of DRM code are certainly nothing new in this day and age, but what makes this case interesting is that here, Microsoft has shipped the affected code with the operating system since Windows XP. To make matters worse, the code (a kernel driver) is running by default on all Windows XP x64 and Windows Server 2003 x64 systems. (The bug is reportedly not present in the Windows Vista implementation of the driver in question.) On x64 versions of Windows, the driver even runs if you have no programs running (or even installed) that use SafeDisc. Curiously, this does not appear to be the case for x86 versions of Windows XP and Windows Server 2003, for which secdrv is configured as a demand start driver.

I’ve always wondered how Macrovision managed to talk Microsoft into shipping a third party, software-only driver, which for x64 versions of Windows is always enabled (secdrv has, as previously mentioned, been on every Windows box since Windows XP, although it is not automatically started on x86 versions of Windows). Doing so always seemed rather… distasteful to me, but the situation becomes even more unfortunate when the code in question has a security bug (a kernel memory overwrite – local privilege escalation – bug, as far as I know). I’m sure all the Windows Terminal Server admins out there will really just love having to reboot their Windows Server 2003 x64 TS boxes because a buggy video game DRM system was shipped with the OS and is on by default, despite the fact that there’s almost zero chance that any software that uses Macrovision would ever end up on a server OS install.

In this respect, I think that DRM systems are going to be high-priority targets for security bugs in this increasingly digitally-restricted world. With more and more content and software being locked up by DRM, said DRM systems become increasingly widespread and attractive targets for attack. Furthermore, from observing the seemingly never-ending wars between Sony, Macrovision, and others against persons who reverse engineer the DRM systems these companies distribute, it would seem that there is in fact already a fair amount of incentive for “unsavory” individuals to be taking these DRM systems apart, security bugs notwithstanding.

To make matters worse, due to their extremely secretive nature, it’s highly questionable whether DRM systems get proper (and effective) code and design reviews. In fact, this point is even more worrisome (in my opinion) when one considers that software DRM systems are virtually by definition engineered to be difficult to understand (so as to be more resilient to attack through security by obscurity).

The worst thing about bugs in DRM systems is that “legitimate” customers who obey the “rules” and happily install DRM-ware on their boxes generally can’t mitigate the risk by simply disabling the affected component without breaking their DRM’d content or software (and good luck asking technical support how to get the protected content to work without the DRM, security bugs or not). Most ironically, “unscrupulous” individuals who have hacked around their DRM systems might end up more secure than paying customers, if they can access the protected content without involving the DRM system at all.

Now to be fair, in this case, I’d imagine most users could probably disable the secdrv driver if they wanted to (I doubt anyone running Windows Server 2003 would miss it at all, for one). Still, the fact remains that the vast majority of DRM systems are of amazingly poor quality in terms of robustness and well written code.

This is one of the reasons why I personally am extremely wary of playing games that require administrative privileges or install administrative “helper services” for non-administrative users, because games have a high incidence of including low quality anti-cheat/anti-hack/anti-copying systems nowadays. I simply don’t trust the people behind these systems to get their code right enough to be comfortable with it running with full privileges on my box. From a software security perspective, I tend to rank most DRM technologies right up there with anti-virus software and other dregs of the software security world.

Update: Changed to note that secdrv is configured as auto start only for Windows XP x64 and Windows Server 2003 x64. On x86 versions of these operating systems, it is a demand start driver. The reason for this discrepancy is not known, although one would tend to suspect that it is perhaps an oversight given the behavior on x86 machines.

Most data references in x64 are RIP-relative

Monday, November 5th, 2007

One of the larger (but often overlooked) changes to x64 with respect to x86 is that most instructions that previously only referenced data via absolute addressing can now reference data via RIP-relative addressing.

RIP-relative addressing is a mode where an address reference is provided as a (signed) 32-bit displacement from the current instruction pointer. While this was typically only used on x86 for control transfer instructions (call, jmp, and so forth), x64 expands the use of instruction pointer relative addressing to cover a much larger set of instructions.

What’s the advantage of using RIP-relative addressing? Well, the main benefit is that it becomes much easier to generate position independent code, or code that does not depend on where it is loaded in memory. This is especially useful in today’s world of (relatively) self-contained modules (such as DLLs or EXEs) that contain both data (global variables) and the code that goes along with it. With flat addressing on x86, references to global variables typically require hardcoding the absolute address of the global in question, on the assumption that the module loads at its preferred base address. If the module cannot be loaded at its preferred base address at runtime, the loader has to perform a set of base relocations that rewrite every instruction with an absolute address operand component to take into account the new address of the module.
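
As a rough illustration (the encodings in the comments are schematic, not exact instruction bytes):

int g_counter;  /* global data in the module's .data section */

int ReadCounter(void)
{
    /*
     * x86, flat addressing:
     *     mov eax, dword ptr [004A1000h]
     *   The absolute address of g_counter is baked into the instruction,
     *   so a base relocation (fixup) is required if the module is rebased.
     *
     * x64, RIP-relative addressing:
     *     mov eax, dword ptr [rip+<displacement to g_counter>]
     *   Only a displacement from the current instruction pointer is stored,
     *   so the reference stays correct wherever the image is loaded.
     */
    return g_counter;
}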

The loader cannot figure out on its own which instructions need to be rewritten in this fashion; instead, it requires assistance from the compiler and linker (in the form of the base relocation section of a PE image, on Windows), which provide it with a list of addresses corresponding to instruction operands that must be modified to reflect the new image base after an image has been relocated.

An instruction that uses RIP-relative addressing, however, typically does not require any base relocations (otherwise known as “fixups”) at load time if the module containing it is relocated. This is because, as long as portions of the module are not internally re-arranged in memory (something not supported by the PE format), any address reference that is both relative to the current instruction pointer and refers to a location within the confines of the current image will continue to refer to the correct location, no matter where the image is placed at load time.

As a result, many x64 images have a greatly reduced number of fixups, due to the fact that most operations can be performed in an RIP-relative fashion. For example, the base relocation information (not including alignment padding) on the 64-bit ntdll.dll (for Windows Vista) is a mere 560 bytes total, compared to 18092 bytes in the Wow64 (x86) version.

Fewer fixups also means better memory usage when a binary is relocated, as there is a higher probability that a particular page will not need to be modified by the base relocation process, and thus can still remain shared even if a particular process needs to relocate a particular DLL.

I tend to prefer debugging with release builds instead of debug builds.

Friday, November 2nd, 2007

One of the things that I find myself espousing both at work and outside of work from time to time is the value of debugging using release builds of programs (for Windows applications, anyways). This may seem contradictory to some at first glance, as one would tend to believe that the debug build is in fact better for debugging (it is named the “debug build”, after all).

However, I tend to disagree with this sentiment, on several grounds:

  1. Debugging on debug builds only is an unrealistic situation. Most of the “interesting” problems that crop up in real life tend to involve release builds on customer sites or in production environments. Much of the time, we do not have the luxury of being able to ship out a debug build to a customer or production environment.

    There is no doubt that debugging using the debug build can be easier, but I am of the opinion that it is disadvantageous to be unable to effectively debug release builds. Debugging with release builds all the time ensures that you can do this when you’ve really got no choice, or when it is not feasible to try and repro a problem using a debug build.

  2. Debug builds sometimes interfere with debugging. This is a highly counterintuitive concept initially, one that many people seem to be surprised at. To see what I mean, consider the scenario where one has a random memory corruption bug.

    This sort of problem is typically difficult and time consuming to track down, so one would want to use all available tools to help in this process. One of the most useful tools in the toolkit of any competent Windows debugger is page heap, which is a special mode of the RTL heap (which implements the Win32 heap as exposed by APIs such as HeapAlloc).

    Page heap places a guard page at the end (or before, depending on its configuration) of every allocation. This guard page is marked inaccessible, such that any write to an allocation that exceeds the bounds of the allocated memory region will immediately fault with an access violation, instead of leaving the corruption to cause random failures at a later time. In effect, page heap allows one to catch the guilty party red-handed in many classes of heap corruption scenarios.

    Unfortunately, the debug build greatly diminishes the ability of page heap to operate. This is because when the debug version of the C runtime is used, any memory allocations that go through the CRT (such as new, malloc, and so forth) have special check and fill patterns placed before and after the allocation. These fill patterns are intended to help detect memory corruption problems. When a memory block is returned using an API such as free, the CRT first checks the fill patterns to ensure that they are intact. If a discrepancy is found, the CRT will break into the debugger and notify the user that memory corruption has occurred.

    If one has been following along thus far, it should not be too difficult to see how this conflicts with page heap. The problem lies in the fact that, from the heap’s perspective, the debug CRT’s per-allocation metadata (including the check and fill patterns) is part of the user allocation, and so the special guard page is placed after (or before, if underrun protection is enabled) the fill patterns. This means that some classes of memory corruption bugs will overwrite the debug CRT metadata but won’t trip page heap, so the only indication of memory corruption comes when the allocation is released, instead of when the corruption actually occurred. (A small example illustrating this distinction follows after this list.)

  3. Local variable and source line stepping are unreliable in release builds. Again, as with the first point, it is dangerous to get into a pattern of relying on these conveniences as they simply do not work correctly (or in the expected fashion) in release builds, after the optimizer has had its way with the program. If you get used to always relying on local variable and source line support, when used in conjunction with debug builds, then you’re going to be in for a rude awakening when you have to debug a release build. More than once at work I’ve been pulled in to help somebody out after they had gone down a wrong path when debugging something because the local variable display showed the wrong contents for a variable in a release build.

    The moral of the story here is to not rely on this information from the debugger, as it is only reliable for debug builds. Even then, local variable display will not work correctly unless you are stepping in source line mode, as within a source line (while stepping in assembly mode), local variables may not be initialized in the way that the debugger expects given the debug information.
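
As a small illustration of the page heap point in item 2 above, consider a contrived one-byte heap overrun (a minimal sketch, assuming full page heap has been enabled for the process, e.g. via gflags):

#include <cstdlib>
#include <cstring>

int wmain()
{
    char* p = static_cast<char*>(malloc(16));

    if (p != 0)
    {
        // One-byte heap overrun. Under full page heap in a release build,
        // the stray write lands on the guard page and faults immediately at
        // the memset call. In a debug build, the same write typically just
        // clobbers the debug CRT's fill pattern, and the corruption is only
        // reported later, when free is called.
        memset(p, 0, 17);
        free(p);
    }

    return 0;
}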

Now, just to be clear, I’m not saying that anyone should abandon debug builds completely. There are a lot of valuable checks added by debug builds (assertions, the enhanced iterator validation in the VS2005 CRT, and stack variable corruption checks, just to name a few). However, it is important to be able to debug problems with release builds, and it seems to me that always relying on debug builds is detrimental to being able to do this. (Obviously, this can vary, but this is simply speaking on my personal experience.)

When I am debugging something, I typically only use assembly mode and line number information, if available (for manually matching up instructions with source code). Source code is still of course a useful time saver in many instances (if you have it), but I prefer not relying on the debugger to “get it right” with respect to such things, having been burned too many times in the past with incorrect results being returned in non-debug builds.

With a little bit of practice, you can get the same information that you would out of local variable display and the like with some basic reading of disassembly text and examination of the stack and register contents. As an added bonus, if you can do this in debug builds, you should by definition be able to do so in release builds as well, even when the debugger is unable to track locals correctly due to limitations in the debug information format.

How does one retrieve the 32-bit context of a Wow64 program from a 64-bit process on Windows Server 2003 x64?

Thursday, November 1st, 2007

Recently, Jimmy asked me what the recommended way to retrieve the 32-bit context of a Wow64 application on Windows XP x64 / Windows Server 2003 x64 was.

I originally responded that the best way to do this was to use Wow64GetThreadContext, but Jimmy mentioned that this doesn’t exist on Windows XP x64 / Windows Server 2003 x64. Sure enough, I checked and it’s really not there, which is rather a bummer if one is trying to implement a 64-bit debugger process capable of debugging 32-bit processes on pre-Vista operating systems.

I don’t typically recommend using undocumented implementation details in production code, but in this case, there seems to be little choice, as there’s no documented mechanism to perform this operation prior to Vista. Because Vista introduces a documented way to perform this task, going the undocumented route is at least slightly less questionable, as there’s an upper bound on which operating systems need to be supported, and major changes to the implementation of things on downlevel operating systems are rarer than with new operating system releases.

Clearly, this is not always the case; Windows XP Service Pack 2 changed an enormous number of things, for instance. However, as a general rule, service packs tend to be relatively conservative in this respect. That’s not to say that one has carte blanche with using undocumented implementation details on downlevel platforms, but perhaps one can sleep a bit easier at night knowing that things are less likely to break than in the next Windows release.

I had previously mentioned that the Wow64 layer takes a rather unexpected approach to how to implement GetThreadContext and SetThreadContext. While I mentioned at a high level what was going on, I didn’t really go into the details all that much.

The basic implementation of these routines is to determine whether the thread is running in 64-bit mode or not (determined by examining the SegCs value of the 64-bit context record for the thread as returned by NtGetContextThread). If the thread is running in 64-bit mode, and the thread is a Wow64 thread, then an assumption can be made that the thread is in the middle of a callout to the Wow64 layer (say, a system call).

In this case, the 32-bit context is saved at a well-known location by the code that transitions the thread from running in 32-bit mode to running in 64-bit mode for system calls and other voluntary, user-mode “32-bit break out” events. Specifically, the Wow64 layer repurposes the second TLS slot of each 64-bit thread (that is, Teb->TlsSlots[ 1 ]) to point to a structure of the following layout:

typedef struct _WOW64_THREAD_INFO
{
   ULONG UnknownPrefix;
   WOW64_CONTEXT Wow64Context;
   ULONG UnknownSuffix;
} WOW64_THREAD_INFO, * PWOW64_THREAD_INFO;

(The real structure name is not known.)

Normally, system components do not use the TLS array, but the Wow64 layer is an exception. Because there is not normally any third party 64-bit code running in a Wow64 process, the Wow64 layer is free to do what it wants with the TlsSlots array of the 64-bit TEB for a Wow64 thread. (Each Wow64 thread has its own, separate 32-bit TEB, so this does not interfere with the operation of TLS by the 32-bit program that is currently executing.)

In the case where the requested Wow64 thread is in a 64-bit Wow64 callout, all one needs to do is retrieve the base address of the 64-bit TEB of the thread in question, read the second entry in the TlsSlots array, and then read the WOW64_CONTEXT structure out of the memory block referred to by that TLS slot.
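
A rough sketch of this first case might look something like the following (names are made up for illustration; the sketch assumes NtQueryInformationThread with the ThreadBasicInformation class, an x64 TEB with TlsSlots at offset 0x1480, a WOW64_CONTEXT definition from the SDK headers, and a 64-bit caller; error handling is minimal):

#include <windows.h>
#include <winternl.h>

typedef struct _THREAD_BASIC_INFORMATION
{
    NTSTATUS  ExitStatus;
    PVOID     TebBaseAddress;
    struct
    {
        HANDLE UniqueProcess;
        HANDLE UniqueThread;
    } ClientId;
    ULONG_PTR AffinityMask;
    LONG      Priority;
    LONG      BasePriority;
} THREAD_BASIC_INFORMATION;

// Read the WOW64_CONTEXT saved by the Wow64 layer for a thread that is
// currently in a 64-bit Wow64 callout. (Link against ntdll.lib, or resolve
// NtQueryInformationThread via GetProcAddress.)
BOOL GetSavedWow64Context(HANDLE Process, HANDLE Thread, WOW64_CONTEXT* Wow64Context)
{
    THREAD_BASIC_INFORMATION tbi;
    PVOID                    Slot1;
    PBYTE                    SlotAddress;

    // ThreadBasicInformation is information class 0.
    if (NtQueryInformationThread(Thread, (THREADINFOCLASS)0, &tbi,
                                 sizeof(tbi), 0) < 0)
        return FALSE;

    // TlsSlots[ 1 ] of the 64-bit TEB points at the Wow64 per-thread context
    // save area (0x1480 is the assumed offset of TlsSlots in the x64 TEB).
    SlotAddress = (PBYTE)tbi.TebBaseAddress + 0x1480 + sizeof(PVOID);

    if (!ReadProcessMemory(Process, SlotAddress, &Slot1, sizeof(Slot1), 0))
        return FALSE;

    // The saved WOW64_CONTEXT follows a ULONG-sized prefix in that block.
    if (!ReadProcessMemory(Process, (PBYTE)Slot1 + sizeof(ULONG),
                           Wow64Context, sizeof(*Wow64Context), 0))
        return FALSE;

    return TRUE;
}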

The other significant case is where the Wow64 thread is running 32-bit code and is not in a Wow64 callout. In this case, because Wow64 runs x86 code natively, one simply needs to capture the 64-bit context of the desired thread and truncate all of the 64-bit registers to their 32-bit counterparts.
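
And a sketch of this second case (again with made-up names; only the integer and control registers with direct 32-bit counterparts are shown, and the code is assumed to be built as a 64-bit binary):

#include <windows.h>

// Capture the 64-bit context of a Wow64 thread that is currently executing
// 32-bit code natively and narrow the registers to their 32-bit forms.
BOOL CaptureWow64ContextFromNative(HANDLE Thread, WOW64_CONTEXT* Wow64Context)
{
    CONTEXT Context;

    ZeroMemory(&Context, sizeof(Context));
    Context.ContextFlags = CONTEXT_FULL;

    if (!GetThreadContext(Thread, &Context))
        return FALSE;

    Wow64Context->Eax    = (ULONG)Context.Rax;
    Wow64Context->Ebx    = (ULONG)Context.Rbx;
    Wow64Context->Ecx    = (ULONG)Context.Rcx;
    Wow64Context->Edx    = (ULONG)Context.Rdx;
    Wow64Context->Esi    = (ULONG)Context.Rsi;
    Wow64Context->Edi    = (ULONG)Context.Rdi;
    Wow64Context->Ebp    = (ULONG)Context.Rbp;
    Wow64Context->Esp    = (ULONG)Context.Rsp;
    Wow64Context->Eip    = (ULONG)Context.Rip;
    Wow64Context->EFlags = Context.EFlags;
    Wow64Context->SegCs  = Context.SegCs;
    Wow64Context->SegSs  = Context.SegSs;

    return TRUE;
}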

Setting the context of a Wow64 thread works exactly like retrieving the context of a Wow64 thread, except in reverse; one either modifies the 64-bit thread context if the thread is running 32-bit code, or one modifies the saved context record based off of the 64-bit TEB of the desired thread (which will be restored when the thread resumes execution).

I have posted a basic implementation of a version of Wow64GetThreadContext that operates on pre-Windows-Vista platforms. Note that this implementation is incomplete; it does not translate floating point registers, nor does it act only on the subset of registers requested by the caller in CONTEXT::ContextFlags. The provided code also does not implement Wow64SetThreadContext; implementing the “set” operation and extending the “get” operation to fully conform to GetThreadContext semantics are left as an exercise for the reader.

This code will operate on Vista x64 as well, although I would strongly recommend using the documented API on Vista and later platforms instead.

Note that the operation of Wow64 on IA64 platforms is completely different from that on x64. This information does not apply in any way to the IA64 version of Wow64.