Don’t perform complicated tasks in your unhandled exception filter

When it comes to crash reporting, the mechanism favored by many for globally catching “crashes” is the unhandled exception filter (as set by SetUnhandledExceptionFilter).

However, many people go wrong in what they do once the unhandled exception filter is called. To understand why, it’s necessary to define the conditions under which an unhandled exception filter is executed.

The unhandled exception filter is, by definition, called when an exception goes unhandled in any Win32 thread in a process. This sort of event is virtually always caused by corruption of the process state: at some point, something touched a non-allocated page and caused an unhandled access violation (or ran into some other similarly severe problem).

In other words, in the context of the unhandled exception filter (UEF), you don’t really know what led up to the current unhandled exception, and more importantly, you don’t know what you can rely on in the process. For example, if you get an access violation (AV) that bubbles up to your UEF, it might have been caused by corruption in the process heap, which would mean that you probably can’t safely perform heap allocations without risking the same problem that caused the original crash in the first place. Or perhaps the problem was an unhandled allocation failure, and another attempt by your unhandled exception filter to allocate memory might fail in just the same way.

Actually, the problem gets a bit worse, because you aren’t guaranteed anything about what the other threads in the process are doing when the crash occurs. In fact, if there are any other threads in the process at the time of the crash, chances are that they’re still running when your unhandled exception filter is called; there is no magical logic to suspend all other activity in the process while your filter runs. This has a couple of other implications for you:

  1. You can’t really rely on the state of synchronization objects in the process. For all you know, the thread that crashed owned a lock that will cause a deadlock if you try to acquire a second lock, which might be owned by a thread that is waiting on the lock owned by the crashed thread.
  2. You can’t with 100% certainty assume that a secondary failure (precipitated by the “original” crash) won’t occur in a different thread, causing your exception filter to be entered by an additional thread at the same time as it is already processing an event from the “first” crash.
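
One way to cope with the second point is to let only the first thread into the reporting path and park any latecomers. The following is a minimal sketch, not part of the mechanism described in this post; `g_crash_claimed` and `try_claim_crash_handling` are hypothetical names. It relies only on an atomic flag, so it takes no locks and performs no allocations.

```cpp
#include <atomic>

// Hypothetical flag: set by the first thread that enters the filter.
static std::atomic<bool> g_crash_claimed{false};

// Returns true for exactly one caller (the first); every later caller,
// on any thread, gets false.
bool try_claim_crash_handling() noexcept {
    // exchange() stores true and returns the previous value atomically.
    return !g_crash_claimed.exchange(true);
}
```

In a filter, a thread that gets `false` back could simply block forever (for example with `Sleep(INFINITE)`) so that the first thread’s report is neither overwritten nor cut short.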

In fact, it would be safe to say that there is even less that you can safely do in an unhandled exception filter than under the infamous loader lock in DllMain (which is saying something indeed).

Given these rather harsh conditions, actions like performing heap allocations or writing minidumps from within the current process are likely to fail. Essentially, as whatever recovery action you take from the UEF grows more complicated, it becomes far more likely to fail (possibly causing secondary failures that obscure the original problem) as a result of the corruption that caused (or was caused by) the original crash. Even something as seemingly innocuous as creating a new process is potentially dangerous. Are you sure that nothing in CreateProcess will ever touch the process heap? What about acquiring the loader lock? I’ll give you a hint: CreateProcess definitely can acquire the loader lock, such as in the case where Software Restriction Policies are defined.

If you’ve ever taken a look at the kernel32 JIT debugger support in Windows XP, you may have noticed that it doesn’t even follow these rules: it calls CreateProcess, after all. This is part of the reason why programs will sometimes crash silently even with a JIT debugger registered. For programs where you want truly robust crash reporting, I would recommend putting as much of the crash reporting logic as possible into a separate process that is started before a crash occurs (e.g. during program initialization), instead of following the JIT approach of launching a reporting process at crash time. This “watchdog process” can then sit idle until the process it is watching over signals that a crash has occurred.

This signaling mechanism should, ideally, be pre-constructed during initialization so that the actual logic within the unhandled exception filter just signals the watchdog process that a crash has occurred, passing along a pointer to the exception/context information. The filter should then wait for the watchdog process to signal that it is finished before terminating the process.

The mechanism that we use at work to communicate between the “guarded” process and the watchdog is simply a file mapping that is mapped into both processes (for passing information between the two, such as the address of the exception record and context record for an active exception event) and a pair of events that are used to communicate “exception occurred” and “dump writing completed” between the two processes. With this configuration, all the exception filter needs to do is store some data in the file mapping view (already mapped ahead of time) and call SetEvent to wake up the watchdog process. It then waits for the watchdog process to signal completion before terminating the process. (This particular mechanism does not address the issue of multiple crashes occurring at the same time, which is something that I deemed acceptable in this case.) The watchdog process is responsible for all of the “heavy lifting” of crash reporting, namely, writing the actual dump with MiniDumpWriteDump.
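
To make the shape of this concrete, here is a rough sketch of both halves. It is an illustration under assumptions, not the actual code described above: the names (`CrashInfo`, `publish_crash`, `InstallCrashFilter`, `WatchdogWaitAndDump`), the 30-second timeout, and the choice to pass the address of the `EXCEPTION_POINTERS` structure plus the thread ID (which is what `MiniDumpWriteDump` expects when `ClientPointers` is `TRUE`) are all made up for the example. How the mapping and event handles reach the watchdog (inheritance, named objects, and so on) is left out.

```cpp
#include <cstdint>

// Hypothetical layout of the shared view. Fixed-width fields keep the layout
// independent of compiler and pointer size.
struct CrashInfo {
    std::uint32_t thread_id;           // ID of the crashing thread
    std::uint64_t exception_pointers;  // address of EXCEPTION_POINTERS in the guarded process
};

// The only real work the filter does: two plain stores into memory that was
// mapped long before the crash. No allocation, no locks.
inline void publish_crash(CrashInfo& slot, std::uint32_t thread_id,
                          std::uint64_t exception_pointers) noexcept {
    slot.thread_id = thread_id;
    slot.exception_pointers = exception_pointers;
}

#ifdef _WIN32
#include <windows.h>
#include <dbghelp.h>  // MiniDumpWriteDump; link with dbghelp.lib

// ---- guarded process ----
static CrashInfo* g_shared = nullptr;   // view mapped at startup
static HANDLE g_crash_event = nullptr;  // "exception occurred"
static HANDLE g_done_event = nullptr;   // "dump writing completed"

static LONG WINAPI CrashFilter(EXCEPTION_POINTERS* info) {
    publish_crash(*g_shared, GetCurrentThreadId(),
                  reinterpret_cast<std::uintptr_t>(info));
    SetEvent(g_crash_event);
    WaitForSingleObject(g_done_event, 30000);  // arbitrary timeout (assumption)
    TerminateProcess(GetCurrentProcess(), info->ExceptionRecord->ExceptionCode);
    return EXCEPTION_EXECUTE_HANDLER;  // not reached in practice
}

// Call once during program initialization, before anything can crash.
bool InstallCrashFilter(HANDLE mapping, HANDLE crash_event, HANDLE done_event) {
    g_shared = static_cast<CrashInfo*>(
        MapViewOfFile(mapping, FILE_MAP_WRITE, 0, 0, sizeof(CrashInfo)));
    if (g_shared == nullptr) return false;
    g_crash_event = crash_event;
    g_done_event = done_event;
    SetUnhandledExceptionFilter(CrashFilter);
    return true;
}

// ---- watchdog process ----
// `process` needs SYNCHRONIZE, PROCESS_QUERY_INFORMATION and PROCESS_VM_READ
// access; `shared` is the watchdog's own view of the same mapping.
bool WatchdogWaitAndDump(HANDLE process, DWORD process_id,
                         const CrashInfo* shared, HANDLE crash_event,
                         HANDLE done_event, HANDLE dump_file) {
    HANDLE waits[2] = {crash_event, process};
    if (WaitForMultipleObjects(2, waits, FALSE, INFINITE) != WAIT_OBJECT_0)
        return false;  // guarded process exited without signaling a crash

    MINIDUMP_EXCEPTION_INFORMATION mei = {};
    mei.ThreadId = shared->thread_id;
    mei.ExceptionPointers = reinterpret_cast<PEXCEPTION_POINTERS>(
        static_cast<std::uintptr_t>(shared->exception_pointers));
    mei.ClientPointers = TRUE;  // the address is valid in the guarded process, not here

    BOOL ok = MiniDumpWriteDump(process, process_id, dump_file, MiniDumpNormal,
                                &mei, nullptr, nullptr);
    SetEvent(done_event);  // release the waiting filter either way
    return ok != FALSE;
}
#endif
```

Note that everything that can fail in an interesting way (mapping the view, opening the dump file, writing the dump) happens either at startup or in the watchdog. The filter itself is reduced to two stores, one SetEvent call and one wait.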

An alternative to this approach is to have the watchdog process act as a debugger on the guarded process; however, I do not typically recommend this, as acting as a debugger has a number of adverse side effects (notably, a great many events cause the process to be suspended entirely so that the debugger/watchdog process can inspect its state for a particular debugger event). The watchdog process mechanism is more performant (if ever so slightly less robust), as there is virtually no run-time overhead in the guarded process unless an unhandled exception occurs.

So, the moral of the story is: keep it simple (at least with respect to your unhandled exception filters). I’ve dealt with mechanisms that try to do the error reporting logic in-process, and with those that punt the hard work off to a watchdog process in a clean state, and the latter is significantly more reliable in real-world cases. The last thing you want is for your crash reporting mechanism to cause secondary problems that hide the original crash, so put in the bit of extra work to make your reporting logic that much more reliable and save yourself the headaches later on.

7 Responses to “Don’t perform complicated tasks in your unhandled exception filter”

  1. Cool post. You have exactly the same situation in Unix, being called from SEGV, BUS, or ILL signal handlers. Obvious rule of thumb: no malloc.

  2. Marc Sherman says:

    Hi Skywing,

    Your statement:

    “(This particular mechanism does not address the issue of multiple crashes occurring at the same time, which is something that I deemed acceptable in this case.)”

    hits home with me. My latest minidump showed 2 threads in my UEF. Luckily for me, my UEF only sets the “crashed” event (which means I need to search for the CONTEXT struct in my minidumps).

    But what would happen in your UEF? Since you’re writing 2 pointers to shared mem, I suppose you could end up with thread A’s exception record pointer but thread B’s context record pointer in the shared memory segment. I imagine this would confuse `analyze -v`.

    I know you said not to synchronize in a UEF, but I think it would be safe in this case since the synchronization object (say a critical section) would only be acquired/released in the UEF. This could be used to synchronize access to the shared memory segment. (NOTE: This is only for synchronizing threads in the crashed process, not the watchdog process).

    Something like this:

    UEF thread A:
    1. acquire critical section
    2. write pointers to shared mem
    3. set “crashed” event
    4. wait for “done writing minidump” event
    5. release critical section

    The fact that thread A holds the critical section for the whole time keeps UEF thread B from (possibly partially) writing between steps 3 and 4 (partially because MiniDumpWriteDump suspends threads in the crashed process).

    What do you think?

    thanks,
    Marc

  3. Marc Sherman says:

    Skywing,

    Is there a comment size limit? I submitted a not small comment but it doesn’t show up.

    Marc

  4. Skywing says:

    Akismet ate it as spam for some reason; recovered it from the spam queue.

    The handler I outlined does not support multiple simultaneous crashes particularly well. You can try your luck with a critical section, but be aware that you rely on a bit of state internal to NTDLL by doing that. If you were going to use a sync object, I’d go with a mutex instead, on the theory that kernel mode state is less likely to get broken (especially if you protect the handle from closure).

  5. […] 10th, 2007 in Links A good essay on not putting code in your Win32 exception handling, and more reasonable approaches for dealing with about-to-be-dead […]

  6. Alex Ionescu says:

    You should mention that Vista fixes part of this problem by having WerFault.exe out-of-process catch the exception. Gone is the “process disappearing without a crash dialog” bug.

  7. Yuhong Bao says:

    “Even something as seemingly innocuous as creating a new process is potentially dangerous (are you sure that nothing in CreateProcess will ever touch the process heap? What about acquire the loader lock? I’ll give you a hint – the latter is definitely not true, such as in the case where Software Restriction Policies are defined).”
    “If you’ve ever taken a look at the kernel32 JIT debugger support in Windows XP, you may have noticed that it doesn’t even follow these rules – it calls CreateProcess, after all.”
    Neither does Google Breakpad, and here is a deadlock caused by this:
    https://bugzilla.mozilla.org/show_bug.cgi?id=474254