Archive for May, 2007

Can I get that without the blinky lights?

Tuesday, May 15th, 2007

It seems that every sort of electronic device you get nowadays has to have some sort of ridiculously annoying blinky light.

The set of Bluetooth stereo headphones that I recently bought has an annoying blue light on the “M” button on each earpiece. While you’re playing stereo audio through them, the two lights slowly fade in and out. It’s really quite annoying to walk into a dark room with the headphones on and have a weird blue glow playing off the walls because of it. (It seems that there must be a rule out there somewhere that every Bluetooth device has to have a blue LED somewhere, or it’s not really Bluetooth(tm).)

Then there’s the V740 ExpressCard that I got recently for EVDO Rev.A Internet access. The improved upstream and latency on it are great, but I wish that someone had written on the box: “Warning, this product includes a blinky LED whose sole purpose is to annoy the operator.”

Specifically, there’s a bright LED on the card that blinks several times per second while in use. The LED doesn’t provide any sort of indication of data rate or what type of service you are using; just that you’re connected (in other words, nothing useful). Apparently, Novatel thought it important enough to remind you that you’re still online with an annoying bright light right by your laptop keyboard (where most ExpressCard ports are) that strobes a couple of times per second. Great idea, there, Novatel…

Then, there’s a PNY USB flash memory stick I bought recently that also blinks many times per second while it’s in use (bright red, this time). This, of course, really sucks if you use it for ReadyBoost, pretty much forcing you to plug it into the back of your computer (to the eternal annoyance of anyone sitting across from you at a table).

Even my laptop (an XPS M1710) is adorned with flashy LEDs, though at least these are programmable (and can be disabled if desired), making them at least somewhat redeemable.

Ironically, my cell phone is about the least intrusive gadget I use in terms of annoying blinky lights. I think that is saying something about the unfortunate state of affairs with modern electronic gadgets and the obsession with blinky lights nowadays…

Beware GetThreadContext on Wow64

Friday, May 11th, 2007

If you’re planning on running a 32-bit program of yours under Wow64, one of the things that you may need to watch out for is a subtle change in how GetThreadContext / SetThreadContext operate for Wow64 processes.

Specifically, these operations require additional access rights when operating on a Wow64 context (as is the case when a Wow64 process calls Get/SetThreadContext). This is hinted at in MSDN in the documentation for GetThreadContext, with the following note:

WOW64: The handle must also have THREAD_QUERY_INFORMATION access.

However, the reality of the situation is not completely documented by MSDN.

Under the hood, the Wow64 context (that is, the x86 context) for a thread on Windows x64 is actually stored at a relatively well-known location in the virtual address space of the thread’s process, in user mode. Specifically, one of the TLS slots in the thread’s TLS array is repurposed to point to a block of memory that contains the Wow64 context for the thread (used to read or write the context when the Wow64 “portion” of the thread is not currently executing). This is presumably done for performance reasons, as the Wow64 thunk layer needs to be able to quickly transition between x86 mode and x64 mode. By storing the x86 context in user mode, this transition can be managed without a kernel mode call. Instead, a simple far call is used to make the transition from x86 to x64 (and an iretq is used to transition from x64 to x86). For example, the following is what you might see when stepping into a Wow64 layer call with the 64-bit debugger while debugging a 32-bit process:

ntdll_777f0000!ZwWaitForMultipleObjects+0xe:
00000000`7783aebe 64ff15c0000000  call    dword ptr fs:[0C0h]
0:000:x86> t
wow64cpu!X86SwitchTo64BitMode:
00000000`759c31b0 ea27369c753300  jmp     0033:759C3627
0:012:x86> t
wow64cpu!CpupReturnFromSimulatedCode:
00000000`759c3627 67448b0424      mov     r8d,dword ptr [esp]

A side effect of the Wow64 register context being stored in user mode, however, is that it is not so easily accessible to a remote process. In order to access the Wow64 context, what needs to occur is that the TEB for a thread in question must be located, then read out of the thread’s process’s address space. From there, the TLS array is processed to locate the pointer to the Wow64 context structure, which is then read (or written) from the thread’s process’s address space.

If you’ve been following along so far, you might see the potential problem here. From the perspective of GetThreadContext, there is no handle to the process associated with the thread in question. In other words, we have a missing link here: In order to retrieve the Wow64 context of the thread whose handle we are given, we need to perform a VM read operation on its process. However, to do that, we need a process handle, but we’ve only got a thread handle.

The way that the Wow64 layer solves this problem is to query the process ID associated with the requested thread, open a new handle to the process with the required access rights, and then perform the necessary VM read / write operations.

Now, normally, this works fine; if you can get a handle to the thread of a process, then you should almost always be able to get a handle to the process itself.

However, the devil is in the details, here. There are situations where you might not have access to open a handle to a process, even though you have a handle to the thread (or even an existing process handle, but you can’t open a new one). These situations are relatively rare, but they do occur from time to time (in fact, we ran into one at work here recently).

The most likely scenario for this issue is when you are dealing with (e.g. creating) a process that is operating under a different security context than your own process. For example, if you are creating a process that runs as a different user, or are working with a process that has a higher integrity level than the current process, then you might not have access to open a new handle to the process (even if you already have an existing handle returned by a CreateProcess* family routine).

This turns into a rather frustrating problem, especially if you have a handle to both the process and a thread in the process that you’re modifying; in that case, you already have the handle that the Wow64 layer is trying to open, but you have no way to communicate it to Wow64 (and Wow64 will try and fail to open the handle on its own when you make the GetThreadContext / SetThreadContext call).

There are two effective solutions to this problem, neither of which are particularly pretty.

First, you could reverse engineer the Wow64 layer and figure out the exact specifics behind how it locates the PWOW64_CONTEXT, and implement that logic inline (using your already-existing process handle instead of creating a new one). This has the downside that you’re way into undocumented implementation details land, so there isn’t a guarantee that your code will continue to operate on future Windows versions.

The other option is to temporarily modify the security descriptor of the process to allow you to open a second handle to it for the duration of the GetThreadContext / SetThreadContext calls. Although this works, it’s definitely a pain to have to go muddle around with security descriptors just to get the Wow64 layer to work properly.
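
As a very rough illustration of the second approach, consider something along the following lines. This is a sketch rather than a hardened implementation: the function name and structure are mine, and it assumes that your existing process handle was opened with (at least) READ_CONTROL and WRITE_DAC so that the DACL can be read and modified.

#include <windows.h>
#include <aclapi.h>

DWORD
GrantSelfAccessToProcess(
 HANDLE Process,                // opened with READ_CONTROL | WRITE_DAC
 PACL *NewDacl,                 // free with LocalFree after restoring
 PSECURITY_DESCRIPTOR *OldSd    // holds the original DACL; free likewise
 )
{
 HANDLE          Token;
 BYTE            UserBuffer[sizeof(TOKEN_USER) + SECURITY_MAX_SID_SIZE];
 DWORD           ReturnedLength;
 PACL            OldDacl;
 EXPLICIT_ACCESS Entry;
 DWORD           Status;

 *NewDacl = 0;
 *OldSd   = 0;

 //
 // Find the user SID that the Wow64 layer's internal OpenProcess call
 // will be access-checked against (i.e. the current user).
 //

 if (!OpenProcessToken(GetCurrentProcess(), TOKEN_QUERY, &Token))
  return GetLastError();

 if (!GetTokenInformation(Token, TokenUser, UserBuffer,
     sizeof(UserBuffer), &ReturnedLength))
 {
  CloseHandle(Token);
  return GetLastError();
 }

 CloseHandle(Token);

 //
 // Read the current DACL from the process object, splice in an
 // access-allowed ACE for the current user, and write it back.
 //

 Status = GetSecurityInfo(Process, SE_KERNEL_OBJECT,
  DACL_SECURITY_INFORMATION, 0, 0, &OldDacl, 0, OldSd);

 if (Status != ERROR_SUCCESS)
  return Status;

 ZeroMemory(&Entry, sizeof(Entry));

 Entry.grfAccessPermissions = PROCESS_QUERY_INFORMATION |
                              PROCESS_VM_OPERATION      |
                              PROCESS_VM_READ           |
                              PROCESS_VM_WRITE;
 Entry.grfAccessMode        = GRANT_ACCESS;
 Entry.grfInheritance       = NO_INHERITANCE;
 Entry.Trustee.TrusteeForm  = TRUSTEE_IS_SID;
 Entry.Trustee.TrusteeType  = TRUSTEE_IS_USER;
 Entry.Trustee.ptstrName    = (LPTSTR)((PTOKEN_USER)UserBuffer)->User.Sid;

 Status = SetEntriesInAcl(1, &Entry, OldDacl, NewDacl);

 if (Status != ERROR_SUCCESS)
  return Status;

 return SetSecurityInfo(Process, SE_KERNEL_OBJECT,
  DACL_SECURITY_INFORMATION, 0, 0, *NewDacl, 0);
}

After the Get/SetThreadContext call completes, the original DACL (still reachable through the returned security descriptor via GetSecurityDescriptorDacl) should be written back with SetSecurityInfo, and both the security descriptor and the new ACL freed with LocalFree.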

Note that if you’re a native x64 process on x64 Windows, and you’re setting the 64-bit context of a 64-bit process, this problem does not apply. (Similarly, if you’re a 32-bit process on 32-bit Windows, things work “as expected” as well.)

So, in case you’ve been getting mysterious STATUS_ACCESS_DENIED errors out of GetThreadContext / SetThreadContext in a Wow64 program, now you know why.

Don’t perform complicated tasks in your unhandled exception filter

Thursday, May 10th, 2007

When it comes to crash reporting, the mechanism favored by many for globally catching “crashes” is the unhandled exception filter (as set by SetUnhandledExceptionFilter).

However, many people tend to go wrong with this mechanism with respect to what actions they take when the unhandled exception filter is called. To understand what I am talking about, it’s necessary to define the conditions under which an unhandled exception filter is executed.

The unhandled exception filter is, by definition, called when an unhandled exception occurs in any Win32 thread in a process. This sort of event is virtually always caused by some sort of corruption of the process state, such that something eventually touched a non-allocated page and caused an unhandled access violation (or some other similarly severe problem).

In other words, in the context of the unhandled exception filter, you don’t really know what led up to the current unhandled exception, and more importantly, you don’t know what you can rely on in the process. For example, if you get an AV that bubbles up to your UEF, it might have been caused by corruption in the process heap, which would mean that you probably can’t safely perform heap allocations without risking running into the same problem that caused the original crash in the first place. Or perhaps the problem was an unhandled allocation failure, in which case another attempt by your unhandled exception filter to allocate memory might just similarly fail.

Actually, the problem gets a bit worse, because you aren’t even guaranteed anything about what the other threads in the process are doing when the crash occurs (in fact, if there are any other threads in the process at the time of the crash, chances are that they’re still running when your unhandled exception filter is called – there is no magical logic to suspend all other activity in the process while your filter runs). This has a couple of other implications for you:

  1. You can’t really rely on the state of synchronization objects in the process. For all you know, the thread that crashed owned a lock that will cause a deadlock if you try to acquire a second lock, which might be owned by a thread that is waiting on the lock owned by the crashed thread.
  2. You can’t with 100% certainty assume that a secondary failure (precipitated by the “original” crash) won’t occur in a different thread, causing your exception filter to be entered by an additional thread at the same time as it is already processing an event from the “first” crash.

In fact, it would be safe to say that there is even less that you can safely do in an unhandled exception filter than under the infamous loader lock in DllMain (which is saying something indeed).

Given these rather harsh conditions, performing actions like heap allocations or writing minidumps from within the current process is likely to fail. Essentially, as whatever kind of recovery action you take from the UEF grows more complicated, it becomes far more likely to fail (possibly causing secondary failures that obscure the original problem) as a result of corruption that caused (or was caused by) the original crash. Even something as seemingly innocuous as creating a new process is potentially dangerous (are you sure that nothing in CreateProcess will ever touch the process heap? What about acquiring the loader lock? I’ll give you a hint – the latter is definitely not true, such as in the case where Software Restriction Policies are defined).

If you’ve ever taken a look at the kernel32 JIT debugger support in Windows XP, you may have noticed that it doesn’t even follow these rules – it calls CreateProcess, after all. This is part of the reason why sometimes you’ll have programs silently crash even with a JIT debugger registered. For programs where you want truly robust crash reporting, I would recommend putting as much of the crash reporting logic as possible into a separate process that is started before a crash occurs (e.g. during program initialization), instead of following the JIT launch-reporting-process approach. This “watchdog process” can then sit idle until it is signaled by the process it is watching over that a crash has occurred.

This signaling mechanism should, ideally, be pre-constructed during initialization so that the actual logic within the unhandled exception filter just signals the watchdog process that an event has occurred, with a pointer to the exception/context information. The filter should then wait for the watchdog process to signal that it is finished before exiting the process.

The mechanism that we use here at work with our programs to communicate between the “guarded” process and the watchdog is simply a file mapping that is mapped into both processes (for passing information between the two, such as the address of the exception record and context record for an active exception event) and a pair of events that are used to communicate “exception occurred” and “dump writing completed” between the two processes. With this configuration, all the exception filter needs to do is to store some data in the file mapping view (already mapped ahead of time) and call SetEvent to notify the watchdog process to wake up. It then waits for the watchdog process to signal completion before terminating the process. (This particular mechanism does not address the issue of multiple crashes occurring at the same time, which is something that I deemed acceptable in this case.) The watchdog process is responsible for all of the “heavy lifting” of the crash reporting process, namely, writing the actual dump with MiniDumpWriteDump.
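
To give a flavor of just how little the filter itself needs to do under this design, here’s a minimal sketch of the guarded-process side (all of the names are hypothetical, and the section view and event pair are assumed to have been created during program initialization):

#include <windows.h>

struct CRASH_INFO
{
 DWORD               ThreadId;
 PEXCEPTION_POINTERS ExceptionPointers; // address is valid in the guarded
                                        // process; the watchdog hands it to
                                        // MiniDumpWriteDump with
                                        // ClientPointers = TRUE
};

static CRASH_INFO *g_SharedView; // MapViewOfFile over the shared section
static HANDLE      g_CrashEvent; // "exception occurred"
static HANDLE      g_DumpDone;   // "dump writing completed"

//
// Installed during startup via SetUnhandledExceptionFilter(CrashFilter).
//

LONG WINAPI CrashFilter(PEXCEPTION_POINTERS ExceptionPointers)
{
 //
 // Nothing but plain stores and two system calls here: no heap
 // allocations, no loader lock, no CreateProcess.
 //

 g_SharedView->ThreadId          = GetCurrentThreadId();
 g_SharedView->ExceptionPointers = ExceptionPointers;

 SetEvent(g_CrashEvent);

 //
 // Wait for the watchdog to finish writing the dump, then terminate
 // without running any further code in this (possibly corrupt) process.
 //

 WaitForSingleObject(g_DumpDone, INFINITE);

 TerminateProcess(GetCurrentProcess(),
  ExceptionPointers->ExceptionRecord->ExceptionCode);

 return EXCEPTION_EXECUTE_HANDLER; // not reached
}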

An alternative to this approach is to have the watchdog process act as a debugger on the guarded process; however, I do not typically recommend this as acting as a debugger has a number of adverse side effects (notably, that a great many events cause the process to be suspended entirely while the debugger/watchdog process can inspect the state of the process for a particular debugger event). The watchdog process mechanism is more performant (if ever so slightly less robust), as there is virtually no run-time overhead in the guarded process unless an unhandled exception occurs.

So, the moral of the story is: keep it simple (at least with respect to your unhandled exception filters). I’ve dealt with mechanisms that try to do the error reporting logic in-process, and those that punt the hard work off to a watchdog process in a clean state, and the latter is significantly more reliable in real world cases. The last thing you want to be happening with your crash reporting mechanism is that it causes secondary problems that hide the original crash, so spend the bit of extra work to make your reporting logic that much more reliable and save yourself the headaches later on.

What is the “lpReserved” parameter to DllMain, really? (Or a crash course in the internals of user mode process initialization)

Wednesday, May 9th, 2007

One of the parameters to the DllMain function is the enigmatic lpReserved argument. According to MSDN, this parameter is used to convey whether a DLL is being loaded (or unloaded) as part of process startup or termination, or as part of a dynamic DLL load/unload operation (e.g. LoadLibrary/FreeLibrary).

Specifically, MSDN says that for static DLL operations, lpReserved contains a non-null value, whereas for dynamic DLL operations, it contains a null value.

There’s actually a little bit more to this parameter, though. While it’s true that in the case of DLL_PROCESS_DETACH operations, it is little more than a boolean value, it has some more significance in DLL_PROCESS_ATTACH operations.

To understand this, you need to know a little bit more about how process/thread initialization occurs. When a new user mode thread is started, the kernel queues a user mode APC to it, pointing to a function in ntdll called LdrInitializeThunk (ntdll is always mapped into the address space of a new process before any user mode code runs). The kernel also arranges for the thread to dispatch the user mode APC the first time it begins execution.

One of the arguments to the APC is a pointer to a CONTEXT structure describing the initial execution state of the new thread. (The CONTEXT structure itself resides on the stack of the new thread.)

When the thread is first resumed, the APC executes and control transfers to ntdll!LdrInitializeThunk. From there, depending on whether the process is already initialized or not, either the process initialization code is executed (loading DLLs statically linked to the process, and so forth), or per-thread initialization code is run (for instance, making DLL_THREAD_ATTACH callouts to loaded DLLs).

If a user mode thread always actually begins execution at ntdll!LdrInitializeThunk, then you might be wondering how it ever starts executing at the start address specified in a CreateThread call. The answer is that eventually, the code called by LdrInitializeThunk passes the context record argument supplied by the kernel to the NtContinue system call, which you can think of as simply taking that context and transferring control to it. Because the context record argument to the APC contained the information necessary for control to be transferred to the starting address supplied to CreateThread, the thread then begins executing at the expected thread starting address*.

(*: Actually, there is typically another layer here – usually, control would go to a kernel32 or ntdll function (depending on whether you are running on a downlevel platform or on Vista), which sets up a top level exception handler and then calls the start routine supplied to CreateThread. But, for the purposes of this discussion, you can consider it as just running the requested thread start routine.)

As all of this relates to DllMain, the value of the lpReserved parameter to DllMain (when process initialization is occurring and static-linked DLLs are being loaded and initialized) corresponds to the context record argument supplied to the LdrInitializeThunk APC, and is thus representative of the initial context that will be set for the thread after initialization completes. In fact, it’s actually the context that will be used after process initialization is complete, and not just a copy of it. This means that by treating the lpReserved argument as a PCONTEXT, the initial execution state for the first thread in the process can be examined (or even altered) from the DllMain of a static-linked DLL.
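
As a quick illustration, a minimal sketch of the idea follows (x64 only; this relies on undocumented behavior, so treat it as an experiment rather than something to ship). On x64, Rcx in the initial context carries the Win32 start address that ntdll!RtlUserThreadStart will eventually call, as the debugger session below demonstrates:

#include <windows.h>

BOOL
WINAPI
DllMain(HINSTANCE Instance, DWORD Reason, LPVOID Reserved)
{
 if (Reason == DLL_PROCESS_ATTACH && Reserved)
 {
  //
  // During process initialization, lpReserved is the live CONTEXT that
  // NtContinue will restore once initialization completes.
  //

  PCONTEXT InitialContext = (PCONTEXT)Reserved;
  PVOID    StartAddress   = (PVOID)InitialContext->Rcx;

  //
  // StartAddress can now be examined (or InitialContext modified; see
  // the hardware breakpoint example later in this post).
  //

  UNREFERENCED_PARAMETER(StartAddress);
 }

 return TRUE;
}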

This can be verified experimentally by using some trickery to step into a DllMain call made during process initialization (more on just how to do that with the user mode debugger in a future entry, as it turns out to be a bit more complicated than one might imagine):

1:001> g
Breakpoint 3 hit
ADVAPI32!DllInitialize:
000007fe`feb15580 48895c2408      mov     qword ptr [rsp+8],rbx ss:00000000`001defc0=0000000000000000
1:001> r
rax=0000000000000000 rbx=00000000002c4770 rcx=000007fefeaf0000
rdx=0000000000000001 rsi=000007fefeb15580 rdi=0000000000000003
rip=000007fefeb15580 rsp=00000000001defb8 rbp=0000000000000000
 r8=00000000001df530  r9=00000000001df060 r10=00000000002c1310
r11=0000000000000246 r12=0000000000000000 r13=00000000001df0f0
r14=0000000000000000 r15=000000000000000d
iopl=0         nv up ei pl zr na po nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246
ADVAPI32!DllInitialize:
000007fe`feb15580 48895c2408      mov     qword ptr [rsp+8],rbx ss:00000000`001defc0=0000000000000000
1:001> k
RetAddr           Call Site
00000000`77414664 ADVAPI32!DllInitialize
00000000`77417f29 ntdll!LdrpRunInitializeRoutines+0x257
00000000`7748e974 ntdll!LdrpInitializeProcess+0x16af
00000000`7742c4ee ntdll! ?? ::FNODOBFM::`string'+0x1d641
00000000`00000000 ntdll!LdrInitializeThunk+0xe
1:001> .cxr @r8
rax=0000000000000000 rbx=0000000000000000 rcx=00000000ffb3245c
rdx=000007fffffda000 rsi=0000000000000000 rdi=0000000000000000
rip=000000007742c6c0 rsp=00000000001dfa08 rbp=0000000000000000
 r8=0000000000000000  r9=0000000000000000 r10=0000000000000000
r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei pl nz na pe nc
cs=0033  ss=002b  ds=0000  es=0000  fs=0000  gs=0000             efl=00000200
ntdll!RtlUserThreadStart:
00000000`7742c6c0 4883ec48        sub     rsp,48h
1:001> u @rcx
tracert!mainCRTStartup:
00000000`ffb3245c 4883ec28        sub     rsp,28h

If you’ve been paying attention thus far, then you might then be able to explain why when you set a hardware breakpoint at the initial process breakpoint, the debugger warns you that it will not take effect. For example:

(16f0.1890): Break instruction exception - code 80000003 (first chance)
ntdll!DbgBreakPoint:
00000000`7742fdf0 cc              int     3
0:000> ba e1 kernel32!CreateThread
        ^ Unable to set breakpoint error
The system resets thread contexts after the process
breakpoint so hardware breakpoints cannot be set.
Go to the executable's entry point and set it then.
 'ba e1 kernel32!CreateThread'
0:000> k
RetAddr           Call Site
00000000`774974a8 ntdll!DbgBreakPoint
00000000`77458068 ntdll!LdrpDoDebuggerBreak+0x35
00000000`7748e974 ntdll!LdrpInitializeProcess+0x167d
00000000`7742c4ee ntdll! ?? ::FNODOBFM::`string'+0x1d641
00000000`00000000 ntdll!LdrInitializeThunk+0xe

Specifically, this message occurs because the debugger knows that after the process breakpoint occurs, the current thread context will be discarded at the call to NtContinue. As a result, hardware breakpoints (which rely on the debug register state) will be wiped out when NtContinue restores the expected initial context of the new thread.

If one is clever, it’s possible to apply the necessary modifications to the appropriate debug register values in the context record image that is given as an argument to LdrInitializeThunk, which will thus be realized when NTDLL initialization code for the thread runs.
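
For instance, a sketch of that idea might look like the following (x64; BreakAddress is a hypothetical address of interest, the routine would be called with lpReserved from a static-linked DLL’s DLL_PROCESS_ATTACH, and whether the added context flags are honored is, again, undocumented behavior):

#include <windows.h>

VOID
SetInitialHardwareBreakpoint(LPVOID Reserved, PVOID BreakAddress)
{
 PCONTEXT InitialContext = (PCONTEXT)Reserved;

 //
 // Ask for the debug register state to be restored as well, and program
 // Dr0/Dr7 for a one-byte execute breakpoint at BreakAddress.
 //

 InitialContext->ContextFlags |= CONTEXT_DEBUG_REGISTERS;
 InitialContext->Dr0           = (DWORD64)BreakAddress;
 InitialContext->Dr7          |= 0x00000001; // L0: enable Dr0; R/W0 = 00
                                             // (execute), LEN0 = 00 (1 byte)
}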

The fact that every user mode thread actually begins life at ntdll!LdrInitializeThunk also explains why, exactly, you can’t create a new thread in the current process from within DllMain and attempt to synchronize with it; by virtue of the fact that the current thread is executing DllMain, it must have the enigmatic loader lock acquired. Because the new thread begins execution at LdrInitializeThunk (even before the start routine you supply is called), in order to make DLL_THREAD_ATTACH callouts, it too will almost immediately block on the loader lock. This results in a classic deadlock if the thread already in DllMain tries to wait for the new thread.

Parting shots:

  • Win9x is, of course, completely dissimilar as far as DllMain goes. None of this information applies there.
  • The fact that lpReserved is a PCONTEXT is only very loosely documented by a couple of ancient DDK samples that used the PCONTEXT argument type, and some more recent SDK samples that name the “lpReserved” parameter “lpvContext”. As far as I know, it’s been around on all versions of NT (including Vista), but like other pseudo-documented things, it isn’t necessarily guaranteed to remain this way forever.
  • Oh, and in case you’re wondering why I used advapi32 instead of kernel32 in this example, it’s because of a rather interesting quirk in ntdll on recent versions of Windows: kernel32 is always dynamic-loaded for every Win32 process (regardless of whether or not the main process image is static linked to kernel32). To make things even more interesting, kernel32 is dynamic loaded before static-linked DLLs are loaded. As a result, I decided it would be best to steer clear of it for the purposes of making this posting simple; be suitably warned, then, about trying this out on kernel32 at home.

Process-level security is not a particularly great way to enforce DRM when users own their own hardware.

Tuesday, May 8th, 2007

Recently, I discussed the basics of the new “process-level security” mechanism introduced with Windows Vista (integrity levels; otherwise known as “mandatory integrity control”, or MIC for short).

Although when combined with more conventional user-level access control, there is the potential to improve security for users to an extent, MIC is ultimately not a mechanism to lock users out of their own computers.

As you might have guessed by this point, I am speaking of the rather less savory topic of DRM. MIC might appear to be attractive to developers that wish to deploy a DRM system, but it really doesn’t provide a particularly effective way to stop a computer owner (administrator) from, well, administering their system.

MIC (and process-level security), on the surface, may appear to be a good way to accomplish this goal. After all, the process-level security model does allow for securable objects (such as processes) to be guarded against other processes – even those running under the same user SID, which is typically the kind of restriction that a software-based DRM system will try to enforce (i.e. preventing you from debugging a program).

However, it is important to consider that the sort of restrictions imposed by process-level security mechanisms are designed to protect programs from other programs. They are not supposed to protect programs from the user that controls the computer on which they run (in other words, the computer administrator or whatever you wish to call it).

Windows Vista attempts to implement such a (DRM) protection scheme, loosely based on the principles of process-level security, in the form of something called “protected processes”.

If you look through the Vista SDK headers (specifically, winnt.h), you may come across a particularly telling comment that would seem to indicate that protected processes were originally intended to be implemented via the MIC scheme for process-level security in Vista:

#define SECURITY_MANDATORY_LABEL_AUTHORITY       {0,0,0,0,0,16}
#define SECURITY_MANDATORY_UNTRUSTED_RID         (0x00000000L)
#define SECURITY_MANDATORY_LOW_RID               (0x00001000L)
#define SECURITY_MANDATORY_MEDIUM_RID            (0x00002000L)
#define SECURITY_MANDATORY_HIGH_RID              (0x00003000L)
#define SECURITY_MANDATORY_SYSTEM_RID            (0x00004000L)
#define SECURITY_MANDATORY_PROTECTED_PROCESS_RID (0x00005000L)

//
// SECURITY_MANDATORY_MAXIMUM_USER_RID is the highest RID
// that can be set by a usermode caller.
//

#define SECURITY_MANDATORY_MAXIMUM_USER_RID \
   SECURITY_MANDATORY_SYSTEM_RID

As it turns out, protected processes are not actually implemented using the integrity level/MIC mechanism on Vista; instead, there is an alternate mechanism that marks protected processes as “untouchable” by “normal” processes. (The lack of flexibility in the integrity level ACE system, as far as specifying which access rights are permitted, is the likely reason. If you read the linked article and the paper it includes, there is a new set of access rights defined specially for dealing with protected processes, which are deemed “safe”. These access rights are requestable for such processes, unlike the standard access rights, and there isn’t a good way to convey this with the set of “allow/deny read/write/execute” options available with an integrity level ACE on Vista.)

The end result is, however, for the most part the same; “protected processes” are essentially to high integrity (or lower) processes as high (or medium) integrity processes are to low integrity processes; that is, they cannot be adversely affected by a lesser-trusted process.

This is where the system begins to break down, though. Process integrity is an interesting way to attempt to curtail malware and exploits because the human at the computer (presumably) does not wish such activity to occur. On the other hand, DRM attempts to prevent the human at their computer from performing an action that they (ostensibly) do in fact wish to perform, with their own computer.

This is a fundamental distinction. The difference is that the malware or exploit code that process level security is designed to defend against doesn’t have the benefit of a human with physical (or administrative) access to the computer in question. That little detail turns out to make a world of difference, as we humans aren’t necessarily constrained by the security system like a program would be. For instance, if some evil exploit code running as a low integrity process on a computer wants to gain administrative access to the box, it just can’t do so (excepting the possibility of local privilege escalation exploits or trying to social-engineer the user into giving the program said access – for the moment, ignore those attack vectors, though they are certainly real ones that must be dealt with at some point).

However, if I am a human sitting at my computer, and I am logged on as a “plain user” and wish to perform an administrative task, I am not so constrained. Instead, I simply either log out and log back in as an administrative user (using my administrative account password), or type my password into an elevation prompt. Problem solved!

Now, of course, the protected process mechanism in Vista isn’t quite that dumb. It does try to block administrators from gaining access to protected processes; direct attempts will return STATUS_ACCESS_DENIED. However, again, humans can be a bit more clever here. For one, a user (and by user, I mean a person with full control over their computer) that is intent on bypassing the protected process mechanism could simply load a driver designed to subvert the protected process mechanism.

The DRM system might then counter that attack by then requiring kernel mode code to be signed, on the theory that for wide-scale violations of the DRM system in such a manner, a “cracker” would need to obtain a code-signing cert that would make them more-easily identifiable and vulnerable to legal attack.

However, people are clever (and more specifically, people with physical / administrative access to a computer are not so necessarily constrained by the basic “rules” of the operating system). One could imagine somebody doing something like patching out the driver signing checks on disk, or any number of other approaches. The theoretical counters to attacks like that would be some sort of hardware support to verify the boot process and ensure that only trusted, signed (and thus unmodified by a “cracker”) code can boot the system. Even that is not necessarily foolproof, though; what’s to say that nobody has compromised the task-offload engine on the system’s NIC to run custom code with full physical memory access, outside the confines of the operating system entirely? Free reign over something capable of performing DMA to physical memory means that kernel code and data can be freely rewritten.

Now, where am I going with all of this? I suppose that I am just frustrated that certain people seem to want to continue to invest significant resources into systems that try to wrest control of a computer from the end user, which are simply doomed to fail by the very nature of the diverse and uncontrolled systems upon which that code will run (and which sometimes compromise the security of customer systems in the process). I don’t think the people behind the protected processes system at Microsoft are stupid, not by any means. However, I can’t help but feel that they know they’re fighting a losing battle, and that their knowledge and expertise would be better spent on more productive things (like working to improve the next release of Windows, or what-have-you).

Now, a couple of parting shots in an effort to quell several potential misconceptions before they begin:

  • I am not advocating that people bypass DRM. This is probably less than legal in many places. I am, however, trying to make a case for the fact that trying to use security models originally designed to protect users from malware as a DRM mechanism is at best a bad idea.
  • I’m also not trying to downplay the negative impact of theft of copyrighted materials, or anything of that sort. As a programmer myself, I’m well aware that if nobody will buy your product because it’s pirated all over the world, then it’s hard to eke out a living. However, I do believe that it is a fallacy to say that it’s impossible to make money out of software or content in the Internet age without layer after layer of customer-unfriendly DRM.
  • I’m not trying to knock the rest of the improvements in Vista (or the start of process-level security being deployed to joe end user, even though it’s probably not yet perfect). There’s a lot of good work that’s been done with Vista, and despite the (ill-conceived, some might say) DRM mechanisms, there is real value that has been added with this release.
  • I’m also not trying to say that Microsoft is devoting so much of its time to DRM that it isn’t paying any attention to adding real value to its products. However, in my view – most of the time spent on DRM is time that could be better spent adding that “real value” instead of doing the dance of security by obscurity (as with today’s systems, that is really all you can do, when it comes down to it) with some enigmatic idea of a “cracker” out there intent on stealing every piece of software or content they get their hands on and redistributing it to every person in the world for free.
  • I’m also not trying to state that the kernel mode code signing requirements for x64 Vista are entirely motivated by DRM (or that all it’s good for is an attempt to enforce DRM), but I doubt that anyone could truthfully say that DRM played no part in the decision to require signed drivers on x64 Vista either. Regardless, there remain other reasons for ostensibly requiring signed code besides trying to block (or at least hold accountable) attempts to bypass the protected process system.

Tricks for getting the most out of your minidumps: Including specific memory regions in a dump

Friday, May 4th, 2007

If you’ve ever worked on any sort of crash reporting mechanism, one of the constraints that you are probably familiar with is the size of the dump file created by your reporting mechanism. Obviously, as developers, we’d really love to write a full dump including the entire memory image of the process, full data about all threads and handles (and the like), but this is often less than possible in the real world (particularly if you are dealing with some sort of automated crash submission system, which needs to be as un-intrusive as possible, including not requiring the transfer of 50MB .dmp files).

One way you can improve the quality of the dumps your program creates without making the resulting .dmp unacceptably large is to just use a bit of intelligence as to what parts of memory you’re interested in. After all, while certainly potentially useful, chances are you probably won’t really need the entire address space of the process at the time of a crash to track down the issue. Often enough, simply a stack trace (+ listing of threads) is enough, which is more along the lines of what you see when you make a fairly minimalistic minidump.

However, there are lots of times when that little piece of state information that might explain how your program got into its crashed state isn’t on the stack, leaving you stuck without some additional information. An approach that can sometimes help is to include specific, “high-value” regions of memory in a memory dump. For example, something that can often be helpful (especially in the days of custom calling conventions that try to avoid using the stack wherever possible) is to include a small portion of memory around the value of each register in the faulting context.

The idea here is that when you’re going to write a dump, check each register in the faulting context to see if it points to a valid location in the address space of the crashed process. If so, you can include a bit of memory (say, +/- 128 bytes [or some other small amount] from the register’s value) in the dump. On x86, you can actually optimize this a bit further and typically leave out eip/esp/ebp (and any register that points into an executable image section, on the assumption that you’ll probably be able to grab any relevant images from the symbol server (you are using a symbol repository with your own binaries included, aren’t you?) and so don’t need to waste space on that code in the dump).

One class of problem that this can be rather helpful in debugging is a crash where you have some sort of structure or class that is getting used in some partially valid state and you need the contents of the struct/class to figure out just what happened. In many cases, you can probably infer the state of your mystery class/struct from what other threads in a program were doing, but sometimes this isn’t possible. In those cases, having access to the class/struct that was being ‘operated upon’ is a great help, and oftentimes you’ll find code where there is a `this’ pointer to an address on the heap that is tantalizingly present in the current register context. If you were using a typical minimalistic dump, then you would probably not have access to heap memory (due to size constraints) and might find yourself out of luck. If you included a bit of memory around each register when the crash occurred, however, that just might get you the extra data points needed to figure out the problem. (Determining which registers “look” like a pointer is something easily accomplished with several calls to VirtualQueryEx on the target, taking each crash-context register value as an address in the target process and checking to see if it refers to a committed region.)
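
A sketch of that check might look something like the following (the function name and the 128-byte window are illustrative):

#include <windows.h>

BOOL
GetRegionAroundRegister(
 HANDLE Process,
 ULONG_PTR RegisterValue, // a register from the crash context
 PULONG_PTR RegionBase,   // receives the base to include in the dump
 PULONG RegionLength      // receives the length to include in the dump
 )
{
 MEMORY_BASIC_INFORMATION Mbi;
 const ULONG_PTR          Span = 128; // bytes on either side
 ULONG_PTR                Start, End;

 if (!VirtualQueryEx(Process, (PVOID)RegisterValue, &Mbi, sizeof(Mbi)))
  return FALSE;

 if (Mbi.State != MEM_COMMIT)
  return FALSE;

 //
 // Skip executable image code; it can be recovered later from the
 // binary/symbol repository instead of bloating the dump.
 //

 if (Mbi.Type == MEM_IMAGE &&
     (Mbi.Protect & (PAGE_EXECUTE | PAGE_EXECUTE_READ |
                     PAGE_EXECUTE_READWRITE | PAGE_EXECUTE_WRITECOPY)))
  return FALSE;

 //
 // Clamp the +/- 128 byte window to the containing region.
 //

 Start = RegisterValue > Span ? RegisterValue - Span : 0;
 if (Start < (ULONG_PTR)Mbi.BaseAddress)
  Start = (ULONG_PTR)Mbi.BaseAddress;

 End = RegisterValue + Span;
 if (End > (ULONG_PTR)Mbi.BaseAddress + Mbi.RegionSize)
  End = (ULONG_PTR)Mbi.BaseAddress + Mbi.RegionSize;

 *RegionBase   = Start;
 *RegionLength = (ULONG)(End - Start);

 return *RegionLength != 0;
}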

Another good use case for this technique is to include information about your program state in the form of the contents of various key heap- (or global-) based objects that wouldn’t normally be included in the dump. In that case, you probably need to set up some mechanism to convey “interesting” addresses to the crash reporting mechanism before a crash occurs, so that it can simply include them in the dump without having to worry about trying to grovel around in the target’s memory trying to pick out interesting things after-the-fact (something that is generally not practical in an automated fashion, especially without symbols). For example, if you’ve got some kind of server application, you could include pointers to particularly useful per-session-state data (or portions thereof, size constraints considered). The need for this can be reduced somewhat by including useful verbose logging data, but you might not always want to have verbose logging on all the time (for various reasons), which might result in an initial repro of a problem being less than useful for uncovering a root cause.

Assuming that you are following the recommended approach of not writing dumps in-process, the easiest way to handle this sort of communication between the program and the (hopefully isolated in a different process) crash reporting mechanism is to use something like a file mapping that contains a list (or fixed-size array) of pointers and sizes to record in the dump. This can make adding or removing “interesting” pointers from the list to be included as simple as adding or removing an entry in a flat array.
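
For example, the shared list might be laid out something like this (purely illustrative; size the array to your own needs):

#include <windows.h>

struct INTERESTING_REGION
{
 ULONG64 Base;   // address in the guarded process
 ULONG   Length; // number of bytes to include in the dump
};

struct SHARED_REGION_LIST
{
 LONG               Count;       // maintained with Interlocked* updates
 INTERESTING_REGION Regions[64]; // a fixed-size array keeps it simple
};

Both processes map the same section, and the guarded process simply adds or clears entries as interesting objects come and go.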

As far as including additional memory regions in a minidump goes, this is accomplished by including a MiniDumpCallback function in your call to MiniDumpWriteDump (via the CallbackParam parameter). The minidump callback is essentially a way to perform advanced customizations on how your dump is processed, beyond a set of general behaviors supplied by the DumpType parameter. Specifically, the minidump callback lets you do things like include/exclude all sorts of things from the dump – threads, handle data, PEB/TEB data, memory locations, and more – in a programmatic fashion. The way it works is that as MiniDumpWriteDump is writing the dump, it will call the callback function you supply a number of times to query you for any data you want to add or subtract from the dump. There’s a huge amount of customization you can do with the minidump callback; too much for just this post, so I’ll simply describe how to use it to include specific memory regions.

As far as including memory regions go, you need to wait for the MemoryCallback event being passed to your minidump callback. The way the MemoryCallback event works is that you are called back repeatedly, until your callback returns FALSE. Each time you are called back (and return TRUE), you are expected to have updated the CallbackOutput->MemoryBase and CallbackOutput->MemorySize output parameter fields with the base address and length of a region that is to be included in the dump. When your callback finally returns FALSE, MiniDumpWriteDump assumes that you’re done specifying additional memory regions to include and continues on to the rest of the steps involved in writing the dump.

So, to provide a quick example, assuming you had a DumpWriter class containing an array of address / length pairs, you might use a minidump callback that looks something like this to include those addresses in the dump:

BOOL CALLBACK DumpWriter::MiniDumpCallback(
 PVOID CallbackParam,
 const PMINIDUMP_CALLBACK_INPUT CallbackInput,
 PMINIDUMP_CALLBACK_OUTPUT CallbackOutput
 )
{
 DumpWriter *Writer;
 BOOL        Status;

 Status = FALSE;

 Writer = reinterpret_cast<DumpWriter*>(CallbackParam);

 switch (CallbackInput->CallbackType)
 {

/*
 ... handle other events ...
 */

 case MemoryCallback:
  //
  // If we have some memory regions left to include then
  // store the next. Otherwise, indicate that we're finished.
  //

  if (Writer->Index == Writer->Count)
   Status = FALSE;
  else
  {
   CallbackOutput->MemoryBase =
     Writer->Addresses[ Writer->Index ].Base;
   CallbackOutput->MemorySize =
     Writer->Addresses[ Writer->Index ].Length;

   Writer->Index += 1;
   Status = TRUE;
  }
  break;

/*
 ... handle other events ...
 */
 }

 return Status;
}
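
For completeness, here’s a sketch of how the callback might be wired up when the dump is actually written. This assumes MiniDumpCallback is declared as a static member of the hypothetical DumpWriter class (the callback must be a plain function pointer), and that Index, Count, and Addresses are the members used above:

#include <windows.h>
#include <dbghelp.h>

#pragma comment(lib, "dbghelp.lib")

BOOL DumpWriter::WriteDump(HANDLE Process, DWORD ProcessId, HANDLE DumpFile)
{
 MINIDUMP_CALLBACK_INFORMATION CallbackInfo;

 CallbackInfo.CallbackRoutine = DumpWriter::MiniDumpCallback;
 CallbackInfo.CallbackParam   = this;

 Index = 0; // begin with the first registered region

 return MiniDumpWriteDump(Process,
                          ProcessId,
                          DumpFile,
                          MiniDumpNormal, // plus any other type flags desired
                          0,              // exception information, if any
                          0,              // no user streams
                          &CallbackInfo);
}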

In a future posting, I’ll likely revisit some of the other neat things that you can do with the minidump callback function (as well as other things you can do to make your minidumps more useful to work with). In the meantime, Oleg Starodumov also has some great documentation (beyond that in MSDN) about just what all the other minidump callback events do, so if you’re finding MSDN a little bit lacking in that department, I’d encourage you to check his article out.