Archive for the ‘NT Internals’ Category

What’s the difference between the Wow64 and native x86 versions of a DLL?

Friday, June 1st, 2007

Wow64 includes a complete set of 32-bit system DLLs implementing the Win32 API (for use by Wow64 programs). So, what’s the difference between the native 32-bit DLLs and their Wow64 versions?

On Windows x64, not much, actually. Most of the DLLs are actually the 32-bit copies from the 32-bit version of the operating system (not even recompiled). For example, the Wow64 ws2_32.dll on Vista x64 is the same file as the native 32-bit ws2_32.dll on Vista x86. If I compare the wow64 version of ws2_32.dll on Vista with the x86 native version, they are clearly identical:

32-bit Vista:

C:\\Windows\\System32>md5sum ws2_32.dll
d99a071c1018bb3d4abaad4b62048ac2 *ws2_32.dll

64-bit Vista (Wow64):

C:\\Windows\\SysWOW64>md5sum ws2_32.dll
d99a071c1018bb3d4abaad4b62048ac2 *ws2_32.dll

Some DLLs, however, are not identical. For instance, ntdll is very different across native x86 and Wow64 versions, due to the way that Wow64 system calls are translated to the 64-bit user mode portion of the Wow64 layer. Instead of calling kernel mode directly, Wow64 system services go through a translation layer in user mode (so that the kernel does not need to implement 32-bit and 64-bit versions of each system service, or stubs thereof in kernel mode). This difference can clearly be seen when comparing the Wow64 ntdll with the native x64 ntdll, with respect to a system service:

If we look at the native x86 ntdll, we see the expected call through the SystemCallStub pointers in SharedUserData:

0:000> u ntdll!NtClose
ntdll!ZwClose:
mov     eax,30h
mov     edx,offset SharedUserData!SystemCallStub
call    dword ptr [edx]
ret     4

However, an examination of the Wow64 ntdll shows something different; a call is made through a field at offset +C0 in the 32-bit TEB:

0:000> u ntdll!NtClose
ntdll!ZwClose:
mov     eax,0Ch
xor     ecx,ecx
lea     edx,[esp+4]
call    dword ptr fs:[0C0h]
ret     4

If we examine the 32-bit TEB, this field as marked as “WOW32Reserved” (kind of a misnomer, since it’s actually reserved for the 32-bit portion of Wow64):

0:000> dt ntdll!_TEB
   +0x000 NtTib            : _NT_TIB
[...]
   +0x0c0 WOW32Reserved    : Ptr32 Void

As I had previously mentioned, this all leads up to a transition to 64-bit mode in user mode in the Wow64 case. From there, the 64-bit portion of the Wow64 layer reformats the request into a 64-bit system call, makes the actual kernel transition, then repackages the results into what a 32-bit system call would return.

Thus, one case where a Wow64 DLL will necessarily differ is when that DLL contains a system call stub. The most common case of this is ntdll, of course, but there are a couple of other modules with inline system call stubs (user32.dll and gdi32.dll are the most common other modules, though winsrv.dll (part of CSRSS) also contains stubs, however it has no Wow64 version).

There are other cases, however, where Wow64 DLLs differ from their native x86 counterparts. Usually, a DLL that makes an LPC (“Local Procedure Call”) call to another component also has a differing version in the Wow64 layer. This is because most of the LPC interfaces in 64-bit Windows have been “widened” to 64-bits, where they included pointers (most notably, this includes the LSA LPC interface for use in implementing code that talks to LSA, such as LogonUser or calls to authentication packages). As a result, modules like advapi32 or secur32 differ on Wow64.

This has, in fact, been the source of some rather insidious bugs where mistakes have been made in the translation layer. For instance, if you call LsaGetLogonSessionData on Windows Server 2003 x64, from a Wow64 process, you may unhappily find out that if your program accesses any fields beyond “LogonTime” in the SECURITY_LOGON_SESSION_DATA structure returned by LsaGetLogonSessionData, it will crash. This is the result of a bug in the Wow64 version of Secur32.dll (on Windows Server 2003 x64 / Windows XP x64), where part of the structure was not properly converted to its 32-bit counterpart after the 64-bit version is read “off the wire” from an LSA LPC call. This particular problem has been fixed in Windows Vista x64 (and can be worked around with some clever hackery if you take a look at how the conversion layer in Secur32 works for Srv03 x64). Note that as far as I know from having spoken to PSS, Microsoft has no plans to fix the bug for pre-Vista platforms (and it is still broken in Srv03 x64 SP2), so if you are affected by that problem then you’ll have to live with implementing a hackish workaround if you can’t rebuild your app as native 64-bit (the problem only occurs for 32-bit processes on Srv03 x64, not 32-bit processes on Srv03 32-bit, or 64-bit processes on Srv03 x64).

There are a couple of other corner cases which result in necessary differences between Wow64 modules and their native 32-bit counterparts, though these are typically few and far between. Most notably absent in the list of things that require code changes for a Wow64 DLL are modules that directly talk to 64-bit drivers via IOCTLs. For example, even “low-level” modules such as mswsock.dll (the AFD Winsock2 Tcpip service provider which implements the user mode to kernel mode translation layer from Winsock2 service provider calls to AFD networking calls in kernel mode) that communicate heavily with drivers via IOCTLs are unchanged in Wow64. This is due to direct support for Wow64 processes in the kernel IOCTL interface (requiring changes to drivers to be aware of 32-bit callers via IoIs32bitProcess), such that 32-bit and 64-bit processes can share the same IOCTL codes even though such structures are typically “widened” to 64-bits for 64-bit processes.

Due to this approach to Wow64, most of the code that implements the 32-bit Win32 API on x64 computers is identical to their 32-bit counterparts (the same binary in the vast majority of cases). For DLLs that do require a rebuild, typically only small portions (system call stubs and LPC stubs in most cases) require any code changes. The major benefits to this design are savings with respect to QA and development time for the Wow64 layer, as most of the code implementing the Win32 API (at least in user mode) did not need to be re-engineered. That’s not to say that Microsoft hasn’t tested it, but it’s a far cry from having to modify every single module in a non-trivial way.

Additionally, this approach has the added benefit of preserving the vast majority of implementation details that have slowly sept into the Win32 programming model, as much as Microsoft has tried to prevent such from happening. Even programs that use undocumented system calls from NTDLL should generally work on the 64-bit version of that operating system via the Wow64 layer, for instance.

In a future posting later on down the line, I’ll be expanding on this high level overview of Wow64 and taking a more in depth look at the guts of Wow64 (of course, subject to change between OS releases without warning, as usual).

Beware GetThreadContext on Wow64

Friday, May 11th, 2007

If you’re planning on running a 32-bit program of yours under Wow64, one of the things that you may need to watch out for is a subtle change in how GetThreadContext / SetThreadContext operate for Wow64 processes.

Specifically, these operations require additional access rights when operating on a Wow64 context (as is the case of a Wow64 process calling Get/SetThreadContext). This is hinted at in MSDN in the documentation for GetThreadContext, with the following note:

WOW64: The handle must also have THREAD_QUERY_INFORMATION access.

However, the reality of the situation is not completely documented by MDSN.

Under the hood, the Wow64 context (e.g. x86 context) for a thread on Windows x64 is actually stored at a relatively well-known location in the virtual address space of the thread’s process, in user-mode. Specifically, one of the TLS slots in thread’s TLS array is repurposed to point to a block of memory that contains the Wow64 context for the thread (used to read or write the context when the Wow64 “portion” of the thread is not currently executing). This is presumably done for performance reasons, as the Wow64 thunk layer needs to be able to quickly transition to/from x86 mode and x64 mode. By storing the x86 context in user mode, this transition can be managed without a kernel mode call. Instead, a simple far call is used to make the transition from x86 to x64 (and an iretq is used to transition from x64 to x86). For example, the following is what you might see when stepping into a Wow64 layer call with the 64-bit debugger while debugging a 32-bit process:

ntdll_777f0000!ZwWaitForMultipleObjects+0xe:
00000000`7783aebe 64ff15c0000000  call    dword ptr fs:[0C0h]
0:000:x86> t
wow64cpu!X86SwitchTo64BitMode:
00000000`759c31b0 ea27369c753300  jmp     0033:759C3627
0:012:x86> t
wow64cpu!CpupReturnFromSimulatedCode:
00000000`759c3627 67448b0424      mov     r8d,dword ptr [esp]

A side effect of the Wow64 register context being stored in user mode, however, is that it is not so easily accessible to a remote process. In order to access the Wow64 context, what needs to occur is that the TEB for a thread in question must be located, then read out of the thread’s process’s address space. From there, the TLS array is processed to locate the pointer to the Wow64 context structure, which is then read (or written) from the thread’s process’s address space.

If you’ve been following along so far, you might see the potential problem here. From the perspective of GetThreadContext, there is no handle to the process associated with the thread in question. In other words, we have a missing link here: In order to retrieve the Wow64 context of the thread whose handle we are given, we need to perform a VM read operation on its process. However, to do that, we need a process handle, but we’ve only got a thread handle.

The way that the Wow64 layer solves this problem is to query the process ID associated with the requested thread, open a new handle to the process with the required access rights, and then performs the necessary VM read / write operations.

Now, normally, this works fine; if you can get a handle to the thread of a process, then you should almost always be able to get a handle to the process itself.

However, the devil is in the details, here. There are situations where you might not have access to open a handle to a process, even though you have a handle to the thread (or even an existing process handle, but you can’t open a new one). These situations are relatively rare, but they do occur from time to time (in fact, we ran into one at work here recently).

The most likely scenario for this issue is when you are dealing with (e.g. creating) a process that is operating under a different security context than your process. For example, if you are creating a process operating as a different user, or are working with a process that has a higher integrity level than the current process, then you might not have access to open a new handle to the process (even if you might already have an existing handle returned by a CreateProcess* family routine.

This turns into a rather frustrating problem, especially if you have a handle to both the process and a thread in the process that you’re modifying; in that case, you already have the handle that the Wow64 layer is trying to open, but you have no way to communicate it to Wow64 (and Wow64 will try and fail to open the handle on its own when you make the GetThreadContext / SetThreadContext call).

There are two effective solutions to this problem, neither of which are particularly pretty.

First, you could reverse engineer the Wow64 layer and figure out the exact specifics behind how it locates the PWOW64_CONTEXT, and implement that logic inline (using your already-existing process handle instead of creating a new one). This has the downside that you’re way into undocumented implementation details land, so there isn’t a guarantee that your code will continue to operate on future Windows versions.

The other option is to temporarily modify the security descriptor of the process to allow you to open a second handle to it for the duration of the GetThreadContext / SetThreadContext calls. Although this works, it’s definitely a pain to have to go muddle around with security descriptors just to get the Wow64 layer to work properly.

Note that if you’re a native x64 process on x64 Windows, and you’re setting the 64-bit context of a 64-bit process, this problem does not apply. (Similarly, if you’re a 32-bit process on 32-bit Windows, things work “as expected” as well.)

So, in case you’ve been getting mysterious STATUS_ACCESS_DENIED errors out of GetThreadContext / SetThreadContext in a Wow64 program, now you know why.

What is the “lpReserved” parameter to DllMain, really? (Or a crash course in the internals of user mode process initialization)

Wednesday, May 9th, 2007

One of the parameters to the DllMain function is the enigmatic lpReserved argument. According to MSDN, this parameter is used to convey whether a DLL is being loaded (or unloaded) as part of process startup or termination, or as part of a dynamic DLL load/unload operation (e.g. LoadLibrary/FreeLibrary).

Specifically, MSDN says that for static DLL operations, lpReserved contains a non-null value, whereas for dynamic DLL operations, it contains a null value.

There’s actually a little bit more to this parameter, though. While it’s true that in the case of DLL_PROCESS_DETACH operations, it is little more than a boolean value, it has some more significance in DLL_PROCESS_ATTACH operations.

To understand this, you need to know a little bit more about how process/thread initialization occurs. When a new user mode thread is started, the kernel queues a user mode APC to it, pointing to a function in ntdll called LdrInitializeThunk (ntdll is always mapped into the address space of a new process before any user mode code runs). The kernel also arranges for the thread to dispatch the user mode APC the first time it begins execution.

One of the arguments to the APC is a pointer to a CONTEXT structure describing the initial execution state of the new thread. (The actual contents of the CONTEXT structure are based at the stack of the thread.)

When the thread is first resumed, the APC executes and control transfers to ntdll!LdrInitializeThunk. From there, depending on whether the process is already initialized or not, either the process initialization code is executed (loading DLLs statically linked to the process, and soforth), or per-thread initialization code is run (for instance, making DLL_THREAD_ATTACH callouts to loaded DLLs).

If a user mode thread always actually begins execution at ntdll!LdrInitializeThunk, then you might be wondering how it ever starts executing at the start address specified in a CreateThread call. The answer is that eventually, the code called by LdrInitializeThunk passes the context record argument supplied by the kernel to the NtContinue system call, which you can think of as simply taking that context and transferring control to it. Because the context record argument to the APC contained the information necessary for control to be transferred to the starting address supplied to CreateThread, the thread then begins executing at expected thread starting address*.

(*: Actually, there is typically another layer here – usually, control would go to a kernel32 or ntdll function (depending on whether you are running on a downlevel platform or on Vista), which sets up a top level exception handler and then calls the start routine supplied to CreateThread. But, for the purposes of this discussion, you can consider it as just running the requested thread start routine.)

As all of this relates to DllMain, the value of the lpReserved parameter to DllMain (when process initialization is occuring and static linked DLLs are being loaded and initialized) corresponds to the context record argument supplied to the LdrInitializeThunk APC, and is thus representative of the initial context that will be set for the thread after initialization completes. In fact, it’s actually the context that will be used after process initialization is complete, and not just a copy of it. This means that by treating the lpReserved argument as a PCONTEXT, the initial execution state for the first thread in the process can be examined (or even altered) from the DllMain of a static-linked DLL.

This can be verified experimentally by using some trickery to step into process initialization DllMain (more on just how to do that with the user mode debugger in a future entry, as it turns out to be a bit more complicated than what one might imagine):

1:001> g
Breakpoint 3 hit
ADVAPI32!DllInitialize:
000007fe`feb15580 48895c2408      mov     qword ptr [rsp+8],
rbx ss:00000000`001defc0=0000000000000000
1:001> r
rax=0000000000000000 rbx=00000000002c4770 rcx=000007fefeaf0000
rdx=0000000000000001 rsi=000007fefeb15580 rdi=0000000000000003
rip=000007fefeb15580 rsp=00000000001defb8 rbp=0000000000000000
 r8=00000000001df530  r9=00000000001df060 r10=00000000002c1310
r11=0000000000000246 r12=0000000000000000 r13=00000000001df0f0
r14=0000000000000000 r15=000000000000000d
iopl=0         nv up ei pl zr na po nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b
             efl=00000246
ADVAPI32!DllInitialize:
000007fe`feb15580 48895c2408      mov     qword ptr [rsp+8],
rbx ss:00000000`001defc0=0000000000000000
1:001> k
RetAddr           Call Site
00000000`77414664 ADVAPI32!DllInitialize
00000000`77417f29 ntdll!LdrpRunInitializeRoutines+0x257
00000000`7748e974 ntdll!LdrpInitializeProcess+0x16af
00000000`7742c4ee ntdll! ?? ::FNODOBFM::`string'+0x1d641
00000000`00000000 ntdll!LdrInitializeThunk+0xe
1:001> .cxr @r8
rax=0000000000000000 rbx=0000000000000000 rcx=00000000ffb3245c
rdx=000007fffffda000 rsi=0000000000000000 rdi=0000000000000000
rip=000000007742c6c0 rsp=00000000001dfa08 rbp=0000000000000000
 r8=0000000000000000  r9=0000000000000000 r10=0000000000000000
r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei pl nz na pe nc
cs=0033  ss=002b  ds=0000  es=0000  fs=0000  gs=0000
             efl=00000200
ntdll!RtlUserThreadStart:
00000000`7742c6c0 4883ec48        sub     rsp,48h
1:001> u @rcx
tracert!mainCRTStartup:
00000000`ffb3245c 4883ec28        sub     rsp,28h

If you’ve been paying attention thus far, then you might then be able to explain why when you set a hardware breakpoint at the initial process breakpoint, the debugger warns you that it will not take effect. For example:

(16f0.1890): Break instruction exception
- code 80000003 (first chance)
ntdll!DbgBreakPoint:
00000000`7742fdf0 cc              int     3
0:000> ba e1 kernel32!CreateThread
        ^ Unable to set breakpoint error
The system resets thread contexts after the process
breakpoint so hardware breakpoints cannot be set.
Go to the executable's entry point and set it then.
 'ba e1 kernel32!CreateThread'
0:000> k
RetAddr           Call Site
00000000`774974a8 ntdll!DbgBreakPoint
00000000`77458068 ntdll!LdrpDoDebuggerBreak+0x35
00000000`7748e974 ntdll!LdrpInitializeProcess+0x167d
00000000`7742c4ee ntdll! ?? ::FNODOBFM::`string'+0x1d641
00000000`00000000 ntdll!LdrInitializeThunk+0xe

Specifically, this message occurs because the debugger knows that after the process breakpoint occurs, the current thread context will be discarded at the call to NtContinue. As a result, hardware breakpoints (which rely on the debug register state) will be wiped out when NtContinue restores the expected initial context of the new thread.

If one is clever, it’s possible to apply the necessary modifications to the appropriate debug register values in the context record image that is given as an argument to LdrInitializeThunk, which will thus be realized when NTDLL initialization code for the thread runs.

The fact that every user mode thread actually begins life at ntdll!LdrInitializeThunk also explains why, exactly, you can’t create a new thread in the current process from within DllMain and attempt to synchronize with it; by virtue of the fact that the current thread is executing DllMain, it must have the enigmatic loader lock acquired. Because the new thread begins execution at LdrInitializeThunk (even before the start routine you supply is called), for the purpose of making DLL_THREAD_ATTACH callouts, it too will become almost immediately blocked on the loader lock. This results in a classic deadlock if the thread already in DllMain tries to wait for the new thread.

Parting shots:

  • Win9x is, of course, completely dissimilar as far as DllMain goes. None of this information applies there.
  • The fact that lpReserved is a PCONTEXT is only very loosely documented by a couple of ancient DDK samples that used the PCONTEXT argument type, and some more recent SDK samples that name the “lpReserved” parameter “lpvContext”. As far as I know, it’s been around on all versions of NT (including Vista), but like other pseudo-documented things, it isn’t necessarily guaranteed to remain this way forever.
  • Oh, and in case you’re wondering why I used advapi32 instead of kernel32 in this example, it’s because due to a rather interesting quirk in ntdll on recent versions of Windows, kernel32 is always dynamic-loaded for every Win32 process (regardless of whether or not the main process image is static linked to kernel32). To make things even more interesting, kernel32 is dynamic loaded before static-linked DLLs are loaded. As a result, I decided it would be best to steer clear of it for the purposes of making this posting simple; be suitably warned, then, about trying this out on kernel32 at home.

Process-level security is not a particularly great way to enforce DRM when users own their own hardware.

Tuesday, May 8th, 2007

Recently, I discussed the basics of the new “process-level security” mechanism introduced with Windows Vista (integrity levels; otherwise known as “mandatory integrity control“, or MIC for short).

Although when combined with more conventional user-level access control, there is the potential to improve security for users to an extent, MIC is ultimately not a mechanism to lock users out of their own computers.

As you might have guessed by this point, I am speaking of the rather less savory topic of DRM. MIC might appear to be attractive to developers that wish to deploy a DRM system, but it really doesn’t provide a particularly effective way to stop a computer owner (administrator) from, well, administering their system.

MIC (and process-level security), on the surface, may appear to be a good way to accomplish this goal. After all, the process-level security model does allow for securable objects (such as processes) to be guarded against other objects – even of the same user sid, which is typically the kind of restriction that a software-based DRM system will try to enforce (i.e. preventing you from debugging a program).

However, it is important to consider that the sort of restrictions imposed by process-level security mechanisms are designed to protect programs from other programs. They are not supposed to protect programs from the user that controls the computer on which they run (in other words, the computer administrator or whatever you wish to call it).

Windows Vista attempts to implement such a (DRM) protection scheme, loosely based on the principals of process-level security, in the form of something called “protected processes“.

If you look through the Vista SDK headers (specifically, winnt.h), you may come across a particularly telling comment that would seem to indicate that protected processes were originally intended to be implemented via the MIC scheme for process-level security in Vista:

#define SECURITY_MANDATORY_LABEL_AUTHORITY       {0,0,0,0,0,16}
#define SECURITY_MANDATORY_UNTRUSTED_RID         (0x00000000L)
#define SECURITY_MANDATORY_LOW_RID               (0x00001000L)
#define SECURITY_MANDATORY_MEDIUM_RID            (0x00002000L)
#define SECURITY_MANDATORY_HIGH_RID              (0x00003000L)
#define SECURITY_MANDATORY_SYSTEM_RID            (0x00004000L)
#define SECURITY_MANDATORY_PROTECTED_PROCESS_RID (0x00005000L)

//
// SECURITY_MANDATORY_MAXIMUM_USER_RID is the highest RID
// that can be set by a usermode caller.
//

#define SECURITY_MANDATORY_MAXIMUM_USER_RID \\
   SECURITY_MANDATORY_SYSTEM_RID

As it would turn out, protected processes (as they are called) are not actually implemented using the integrity level/MIC mechanism on Vista; instead, there is another, alternate mechanism that provides a way to mark protected processes are “untouchable” by “normal” processes (the lack of flexibility in the integrity level ACE system, as far as specifying which access rights are permitted, is the likely reason. If you read the linked article and the paper it includes, there are a new set of access rights defined specially for dealing with protected processes, which are deemed “safe”. These access rights are requestable for such processes, unlike the standard access rights, and there isn’t a good way to convey this with the set of “allow/deny read/write/execute” options available with an integrity level ACE on Vista.)

The end result is however, for the most part, the same; “protected processes” are essentially to high integrity (or lower) processes as high (or medium) integrity processes are to low integrity processes; that is, they cannot be adversely affected by a lesser-trusted process.

This is where the system begins to break down, though. Process integrity is an interesting way to attempt to curtail malware and exploits because the human at the computer (presumably) does not wish such activity to occur. On the other hand, DRM attempts to prevent the human at their computer from performing an action that they (ostensibly) do in fact wish to perform, with their own computer.

This is a fundamental distinction. The difference is that the malware or exploit code that process level security is designed to defend against doesn’t have the benefit of a human with physical (or administrative) access to the computer in question. That little detail turns out to make a world of difference, as we humans aren’t necessarily constrained by the security system like a program would be. For instance, if some evil exploit code running as a low integrity process on a computer wants to gain administrative access to the box, it just can’t do so (excepting the possibility of local privilege escalation exploits or trying to social-engineer the user into giving the program said access – for the moment, ignore those attack vectors, though they are certainly real ones that must be dealt with at some point).

However, if I am a human sitting at my computer, and I am logged on as a “plain user” and wish to perform an administrative task, I am not so constrained. Instead, I simply either log out and log back in as an administrative user (using my administrative account password), or type my password into an elevation prompt. Problem solved!

Now, of course, the protected process mechanism in Vista isn’t quite that dumb. It does try to block administrators from gaining access to protected processes; direct attempts will return STATUS_ACCESS_DENIED. However, again, humans can be a bit more clever here. For one, a user (and by user, I mean a person with full control over their computer) that is intent on bypassing the protected process mechanism could simply load a driver designed to subvert the protected process mechanism.

The DRM system might then counter that attack by then requiring kernel mode code to be signed, on the theory that for wide-scale violations of the DRM system in such a manner, a “cracker” would need to obtain a code-signing cert that would make them more-easily identifiable and vulnerable to legal attack.

However, people are clever (and more specifically, people with physical / administrative access to a computer are not so necessarily constrained by the basic “rules” of the operating system). One could imagine somebody doing something like patching out the driver signing checks on disk, or any number of other approaches. The theoretical counters to attacks like that would be some sort of hardware support to verify the boot process and ensure that only trusted, signed (and thus unmodified by a “cracker”) code can boot the system. Even that is not necessarily foolproof, though; what’s to say that nobody has compromised the task-offload engine on the system’s NIC to run custom code with full physical memory access, outside the confines of the operating system entirely? Free reign over something capable of performing DMA to physical memory means that kernel code and data can be freely rewritten.

Now, where am I going with all of this? I suppose that I am just frustrated that certain people seem to want to continue to invest significant resources into systems that try to wrest control of a computer from an end user, which are simply doomed to fail by the very nature of the diverse and uncontrolled systems upon which that code will run (and which sometimes compromise the security of customer systems in the process). I don’t think the people behind the protected processes system at Microsoft are stupid, not by any means. However, I can’t help but feel that they have know they’re fighting a losing battle, and that their knowledge and expertise would be better spent on more productive things (like working to improve the next release of Windows, or what-have-you).

Now, a couple of parting shots in an effort to quell several potential misconceptions before they begin:

  • I am not advocating that people bypass DRM. This is probably less than legal in many places. I am, however, trying to make a case for the fact that trying to use security models originally designed to protect users from malware as a DRM mechanism is at best a bad idea.
  • I’m also not trying to downplay the negative impact of theft of copyrighted materials, or anything of that sort. As a programmer myself, I’m well aware that if nobody will buy your product because it’s pirated all over the world, then it’s hard to eke out a living. However, I do believe that it is a fallacy to say that it’s impossible to make money out of software or content in the Internet age without layer after layer of customer-unfriendly DRM.
  • I’m not trying to knock the rest of the improvements in Vista (or the start of process-level security being deployed to joe end user, even though it’s probably not yet perfect). There’s a lot of good work that’s been done with Vista, and despite the (ill-conceived, some might say) DRM mechanisms, there is real value that has been added with this release.
  • I’m also not trying to say that Microsoft is devoting so much of its time to DRM that it isn’t paying any attention to adding real value to its products. However, in my view – most of the time spent on DRM is time that could be better spent adding that “real value” instead of doing the dance of security by obscurity (as with today’s systems, that is really all you can do, when it comes down to it) with some enigmatic idea of a “cracker” out there intent on stealing every piece of software or content they get their hands on and redistributing it to every person in the world for free.
  • I’m also not trying to state that the kernel mode code signing requirements for x64 Vista are entirely motivated by DRM (or that all it’s good for is an attempt to enforce DRM), but I doubt that anyone could truthfully say that DRM played no part in the decision to require signed drivers on x64 Vista either. Regardless, there remain other reasons for ostensibly requiring signed code besides trying to block (or at least hold accountable) attempts to bypass the protected process system.

A brief discussion of Windows Vista’s IE Protected Mode (and user/process level security)

Wednesday, April 25th, 2007

I was discussing the recent QuickTime bug on Matasano Chargen, and the question of whether it would work in the presence IE7 + Vista and protected mode came up. I figured a more in depth explanation as to just what IE7’s Protected Mode actually does might be in order, hence this posting.

One of the new features introduced with Internet Explorer 7 on Windows Vista is something called “Protected Mode” (or “IE Protected Mode”). It’s an on-by-default security that is sold as something that greatly hardens Internet Explorer against meaningful exploitation, even if an exploitable hole in IE (or a component of IE, such as an ActiveX control) is found.

Let’s dig in a little bit deeper as to what IE Protected Mode is (and isn’t), and what it means for you.

First things first. Protected mode is not related to the “enhanced security configuration” that is introduced in Windows Server 2003 in any way. The “enhanced security configuration” IE included with Srv03 is, at its core, just a set of more restrictive (i.e. locked down) default settings with regard to things like scripting, downloading files, and soforth. Protected mode does not rely on locking down security zone settings to the point where you cannot download files or run any scripts by default, and is completely unrelated to the IE hardening work done in the default Srv03 configuration. I’d imagine that protected mode will probably be included in Longhon Server, but the underlying technologies are very different, and are designed to address different “market segments” (“enhanced security configuration” being just a set of more restrictive defaults, whereas protected mode is a fundamental rethink of how the browser interacts with the rest of the operating system).

Protected mode is a feature that is designed to make “surfing the web a safer experience” for end users. Unlike Srv03, where a locked down IE might fly because you are ostensibly not supposed to be doing lots of fancy web-browser-ish things from a server box, end users are clearly not going to take kindly towards not being permitted to download files, run JavaScript, and soforth in the default configuration.

The way protected mode takes a stab at making things better for the end users of the world is to build upon the new “integrity level” security mechanism that has been introduced into the Windows NT security model starting with Vista, with the goal of making the web browser an “untrusted” process that cannot perform “dangerous” things.

To understand what this means, it’s necessary to know what these new-fangled “integrity levels” in Vista are all about. Integrity levels are assigned to a token representing a user, and tokens are assigned to a process (and can be impersonated by a thread, typically something done by something like an IPC server process that needs to perform something on behalf of a lesser-privileged caller). What’s meaningful about integrity levels is that they allow you to partition what we know of as a “user” into something with multiple different “trust levels” (low, medium, high, with several other infrequently-used levels), such that a thread or a process running as a certain integrity level (or “trust level”) cannot “intefere” with something running at a higher integrity level.

The way this is implemented is by an additional level of security check that is performed when some kind of access rights check is performed. This additional check compares the integrity level of the caller (i.e. the thread or process token’s integrity level) with a new type of field in the security descriptor of the target object (called the “mandatory label“) that specifies what sorts of access a caller of a certain integrity level is allowed to request. The “mandatory label” allows an integrity level to be associated with an object for security checks, and allows three basic policies (lower integrity levels cannot request read access, low integrity levels cannot request write access, lower integrity levels cannot request execute access) to be set, comparing the integrity level of a caller against the integrity level specified with an object’s security descriptor. (Only these three generic access rights may be “guarded” by the integrity level in this way; there is no granularity to allow object specific access rights to be given specific minimum caller integrity levels).

The default settings in most places do not allow write access to be granted to processes of a lower integrity level, and the default minimum integrity level is usually “medium”. The new, label/integrity level access check is performed before conventional ACL-based checks.

In this respect, integrity levels are an attempt to inject something of a sort of process-level security into the NT security model.

If you’re at all familiar with how NT security works, this may be a bit new to you. NT is based upon user-level security, where processes (and threads, in the case of impersonation) run under the context of a user, and derive their security rights (i.e. what securable objects they have access to – files, directories, registry keys, and soforth) and privileges (i.e. the ability to shut down the system, the ability to load a driver, the ability to bypass ACL checks for backup/restore, and soforth) from the user context they run under. The thinking behind this sort of model is that each distinct user on a system will run as, well, a different user. Processes from one user cannot interfere with processes (or files, directories, and soforth) running as a different user, without being granted access to do so (i.e. via an ACL, or by special, administrator-level privileges). The “operating system” (i.e. the services and programs that support the system) conceptually runs as yet another “user”, and is thus ostensibly protected from adverse modifications by malicious users on the system. Each user thus exists in a sort of sandbox, unable to interfere with any other user. Conversely, any process running as a particular user can do anything to any other process (or file or directory) owned by that same user; there is no protection within a user security context.

Obviously, this is a gross oversimplification of the NT security model, but it gets the point across (or so I hope!): The security system in NT revolves around the user as the means to control access in a meaningful fashion. This does make sense in environments like large corporate settings, where many users share the same computer (or where computers are centrally managed), such that users cannot interfere with eachother, and ostensibly cannot attack their computers (i.e. the operating system) because they are running as “plain users” without administrator access and cannot perform “dangerous” tasks.

Unfortunately, in the era of the internet, exploitable software bugs, and computers with end users that run code they do not entirely trust, this model isn’t quite as good as we would like. Because the user is the security boundary, here, if an attacker can run code under your user account, they have full access to all of the processes, files, directories (and soforth) that are accessible to that user. And if that user account happened to be a computer administrator account, then things are even worse; now the attacker has free reign over the entire computer, and everything on it (including all other users present on the box).

Clearly, this isn’t such a great situation, especially given the reality that many users run untrusted code (or more generally, buggy or exploitable code) on a frequent basis. In this Internet-enabled age, user-level security as it has been traditionally implemented isn’t really enough.

There are still ways to make things “work” with user-level security; namely, to give each human several user accounts, specific to the task that they are doing. For example, I might have one user account that I use for browsing and games, and another user account that I use for accessing top secret corporate documents. If the user account that I use to browse the Internet with gets compromised somehow, such as by my running an exploitable program and getting “owned”, then my top secret corporate documents are still safe; the malicious code running under the Internet-browsing-and-games account doesn’t have access to do anything to my secret documents, since they are owned by a different account and the default ACL protects them from other users.

Of course, this is a tough pill to expect end users to swallow; having to switch user accounts as they switch between tasks of differing importance is at best inconvenient and at worst confusing and problematic (for example, if I want to download a file from the Internet for use with my top secret corporate documents, I have to go to (comparatively) a lot of trouble to give it to my other user, and doing so opens an implicit trust relationship between my secret-documents-user and my less-trusted-Internet-browsing user, that the program I just downloaded is 1) not inherently malicious, 2) not tampered with or compromised, and 3) not full of exploitable holes that would put my documents at risk anyway the moment my secret-documents-user runs it). Clearly, while you could theoretically still get by with user level access in today’s world, as a single user, doing so as it is implemented in Windows today is a major pain (and even with everyone’s best intentions, few people I have seen really follow through completely with the concept and do not share programs or files between their users in any way whatsoever).

(Note that I am not suggesting that things like running as nonadmin or breaking tasks up into different users are a lost cause, just that to get things truly right and secure, it is a much more difficult job than one might expect initially, so much so that most “joe users” will not stand a chance at doing it perfectly. I’m also not trying to knock on user-level security as just outright flawed, useless, or broken, but the fact remains there are problems in today’s world that merit additonal consideration.)

Whew, that’s a rather long segway into user-level security. Anyways, protected mode is Microsoft’s first real attempt to tackle this problem – the fact that user level security does not always provide fine enough granularity, in the fact of untrusted or buggy programs – in a consumer-level system, in such a way that is palatable to “joe users”. The way that it works is to leverage the integrity level mechanism to create an additonal security barrier between the user’s web browser (i.e. Internet Explorer in protected mode) and the rest of the user’s files and programs. This is done by assigning the IE process a low integrity level. Following with what we know of integrity levels above, this means that the IE process will be denied access (by the security manager in the kernel) to do things like overwrite your documents, place malicious programs in your “startup” Start Menu directory, overwrite executables in your system directory (of course, if you were already running as a plain user, it wouldn’t be able to do this anyway…), and soforth. This is clearly a good thing. In case the implications haven’t fully sunk in yet:

If an attacker compromises a low integrity process, they should not be able to destroy your data or install a trojan (or other malicious code) on your system*.

(*: This is, of course, barring implementation errors, design oversights, and local privilege escalation holes. The latter may prove to be an especially important sticking point, as many companies (Microsoft included) have often “looked down” upon local privilege escalation bugs as relatively not important to fix in a timely fashion. Perhaps the introduction of process-level security control will help add impetus to shatter the idea that leaving local privilege escalation holes sitting around is okay.)

Now, this is a very important departure from where we have been traditionally with user level access control. Combining per process access control with per user access control allows us to do a lot more to protect users from malicious code and buggy software (in other words, protecting users from themselves), in a fashion that is much easier to deal with from a user perspective.

However, I think it would be premature to say that we’re “all the way there” yet. Protected mode and low integrity level processes are definitely a great step in the right direction, but there still remain issues to be solved. For example, as I alluded to previously, the default configuration allows medium integrity objects to still be opened for read access by low integrity callers. This means that, for example, if an attacker compromises an IE process running in protected mode, they still do have a chance at doing some damage. For instance, even though an attacker might not be able to destroy your data, per-se, he or she can still read it (and thus steal it). So, to continue with our previous example of someone who works with top secret corporate documents, an attacker might not be able to wipe out the company financial records, or joe user’s credit card numbers, but he or she could still steal them (and post them on the Internet for anyone to see, or what have you). In other words, an attacker who compromises a low integrity process can’t destroy all your data (as would be the case if there were no process-level security and we were dealing with just one user account), but he or she can still read it and steal it.

There are other things to watch out for, too, with protected mode. Don’t get into the habit of clicking “OK” on that “are you sure you want this program to run outside of IE Protected Mode” dialog box, or you’re setting yourself up to be burned by clever malware. And certainly never click the “don’t ask me again” check box on the consent dialog, or you’re just begging for some piece of malware to abuse your implicit consent without you even realizing that something’s gone wrong. (In case you’re wondering, the mechanism in IE that allows processes to elevate to medium integrity involves an appcompat hook on CreateProcess that requests a medium integrity process (ieuser.exe) to prompt the user for consent, with the medium integrity process creating the process if the user agrees. So user interaction is still required there, though we know how much users love to click “OK” on all those pesky security warnings. Oh, and there is also some hardening that has been done in win32k.sys to prevent lower integrity processes from sending “dangerous” window messages to higher integrity processes (even WM_USER and friends are disabled by default across an integrity level boundary), so “shatter attacks” ought not to work against the consent dialog. Note that if you bypass the appcompat hook, the new process created is also low integrity, and won’t be able to write to anywhere outside of the “low integrity” sandbox writable files and directories.)

So, while running IE in protected mode does, in some respects limit the damage that can be done if you get compromised, I would still recommend not running untrusted programs under the same user account as your important documents (if you really care that much). Perhaps in a future release, we’ll see a solution that addresses the idea of not allowing untrusted programs to read arbitrary user data as well (such should be possible with proper use of the new integrity level mechanisms, although I suspect the true difficulty shall be in getting third party applications to play nicely as we continue to try and place control of the user’s documents more firmly in the control of the actual user instead of in any arbitrary application that runs on the box).

How I ended up in the kernel debugger while trying to get PHP and Cacti working…

Saturday, April 14th, 2007

Some days, nothing seems to work properly. This is the sad story of how something as innocent as trying to install a statistics graphing Web application culminated in my breaking out the kernel debugger in an attempt to get things working. (I don’t seem to have a lot of luck with web applications. So much for the way of the future being “easy to develop/deploy/use” web-based applications…)

Recently, I decided that to try installing Cacti in order to get some nice, pretty graphs describing resource utilization on several boxes at my apartment. Cacti is a PHP program that queries SNMP data and, with the help of a program called RRDTool, creates friendly historical graphs for you. It’s commonly used for monitoring things like network or processor usage over time.

In this particular instance, I was attempting to get Cacti working on a Windows Server 2003 x64 SP2 box. Running an amalgam of unix-ish programs on Windows is certainly “fun”, and doing it on native x64 is even more “interesting”. I didn’t expect to find myself in the kernel debugger while trying to get Cacti working, though…

To start out, the first thing I had to do was convert IIS6’s worker processes to 32-bit instead of 64-bit, as the standard PHP 5 distribution doesn’t support x64. (No, I don’t consider spending who knows how many hours to get PHP building on x64 natively a viable solution here, so I just decided to stick with the 32-bit release. I don’t particularly want to be in the habit of having to then maintain rebuild my own PHP distribution from a custom build environment each time security updates come out either…).

This wasn’t too bad (at least not at first); a bit of searching revealed this KB article that documented an IIS metabase flag that you can set to turn on 32-bit worker processes (with the help of the adsutil.vbs script included in the IIS Adminscripts directory).

One small snag here was that I happened to be running a symbol proxy in native x64 mode on this system already. Since the 32-bit vs 64-bit IIS worker process flag is an all-or-nothing option, I had to go install the 32-bit WinDbg distribution on this system and copy over the 32-bit symproxy.dll and symsrv.dll into %systemroot%\system32\inetsrv. Additionally, the registry settings used by the 64-bit symproxy weren’t directly accessible to the 32-bit version (due to a compatiblity feature in 64-bit versions of Windows known as Registry Reflection), so I had to manually copy over the registry settings describing which symbol paths the symbol proxy used to the Wow64 version of HKLM\Software. No big deal so far, just a minor annoyance.

The first unexpected problem that cropped up happened after I had configured the 32-bit symbol proxy ISAPI filter and installed PHP; after I enabled 32-bit worker processes, IIS started tossing HTTP 500 Internal Server Error statuses whenever I tried to browse any site on the box. Hmm, not good…

After determining that everything was still completely broken even after disabling the symbol proxy and PHP ISAPI modules, I discovered some rather unpleasant-looking event log messages:

ISAPI Filter ‘%SystemRoot%\Microsoft.NET\Framework64\v2.0.50727\aspnet_filter.dll’ could not be loaded due to a configuration problem. The current configuration only supports loading images built for a x86 processor architecture. The data field contains the error number. To learn more about this issue, including how to troubleshooting this kind of processor architecture mismatch error, see http://go.microsoft.com/fwlink/?LinkId=29349.

It seemed that the problem was the wrong version of ASP.NET being loaded (still the x64 version). The link in the event message wasn’t all that helpful, but a bit of searching located yet another knowledge base article – this time, about how to switch back and forth between 32-bit and 64-bit versions of ASP.NET. After running aspnet_regiis as described in that article, IIS was once again in a more or less working state. Another problem down, but the worst was yet to come…

With IIS working again, I turned towards configuring Cacti in IIS. Although, at first it appeared as though everything might actually go as planned (after configuring Cacti’s database settings, I soon found myself at its php-based initial configuration page), such things were not meant to be. The first sign of trouble appeared after I completed the initial configuration page and attempted to log on with the default username and password. Doing so resulted in my being thrown back to the log on page, without any error messages. A username and password combination not matching the defaults did result in a logon failure error message, so something besides a credential failure was up.

After some digging around in the Cacti sources, it appeared that the way that Cacti tracks whether a user is logged in or not is via setting some values in the standard PHP session mechanism. Since Cacti was apparently pushing me back to the log on page as soon as I logged on, I guessed that there was probably some sort of failure with PHP’s session state management.

Rewind a bit to back when I installed PHP. In the interest of expediency (hah!), I decided to try out the Win32 installer package (as opposed to just the zip distribution for a manual install) for PHP. Typically, I’ve just installed PHP for IIS the manual way, but I figured that if they had an installer nowadays, it might be worth giving it a shot and save myself some of the tedium.

Unfortunately, it appears that PHP’s installer is not all that intelligent. It turns out that in the IIS ISAPI mode, PHP configures the system-wide PHP settings to point the session state file directory to the user-specific temp directory (i.e. pointing to a location under %userprofile%). This, obviously, isn’t going to work; anonymous users logged on to IIS aren’t going to have access to the temp directory of the account I used to install PHP with.

After manually setting up a proper location for PHP’s session state with the right security permissions (and reconfiguring php.ini to match), I tried logging in to Cacti again. This time, I actually got to the main screen after changing the password (hooray, progress!).

From here, all that I had left to do was some minor reconfiguring of the Windows SNMP service in order to allow Cacti to query it, set up the Cacti poller task job (Cacti is designed to poll data from its configured data sources at regular intervals), and configure my graphs in Cacti.

Configuring SNMP wasn’t all that difficult (though something I hadn’t done before with the Windows SNMP service), and I soon had Cacti successfully querying data over SNMP. All that was left to do was graph it, and I was home free…

Unfortunately, getting Cacti to actually graph the data turned out to be rather troublesome. In fact, I still haven’t even got it working, though I’ve at least learned a bit more about just why it isn’t working…

When I attempted to create graphs in Cacti, everything would appear to work okay, but no RRDTool datafiles would ever appear. No amount of messing with filesystem permissions resolved the problem, and the Cacti log files were not particularly helpful (even on debug severity). Additionally, attempting to edit graph properties in Cacti would result in that HTTP session mysteriously hanging forever more (definitely not a good sign). After searching around (unsuccessfully) for any possible solutions, I decided to try and take a closer look at what exactly was going on when my requests to Cacti got stuck.

Checking the process list after repeating the sequence that caused a particular Cacti session to hang several times, I found that there appeared to be a pair of cmd.exe and rrdtool.exe instances corresponding to each hung session. Hmm, it would appear that something RRDTool was doing was freezing and PHP was waiting for it… (PHP uses cmd.exe to call RRDTool, so I guessed that PHP would be waiting for cmd.exe, which would be waiting for RRDTool).

At first, I attempted to attach to one of the cmd processes with WinDbg. (Incidentally, it would appear that there are currently no symbols for the Wow64 versions of the Srv03SP2 ntdll, kernel32, user32, and a large number of other core DLLs with Wow64 builds available on the Microsoft symbol server for some reason. If any Microsoft people are reading this, it would be greaaaat if you could fix the public symbol server for Srv03 SP2 x64 Wow64 DLLs …) However, symbols for cmd.exe were fortunately available, so it was relatively easy to figure out what it was up to, and prove my earlier hypothesis that it was simply waiting on an rrdtool instance:

0:001:x86> ~1k
ChildEBP RetAddr
0012fac4 7d4d8bf1 ntdll_7d600000!NtWaitForSingleObject+0x15
0012fad8 4ad018ea KERNEL32!WaitForSingleObject+0x12
0012faec 4ad02611 cmd!WaitProc+0x18
0012fc24 4ad01a2b cmd!ExecPgm+0x3e2
0012fc58 4ad019b3 cmd!ECWork+0x84
0012fc70 4ad03c58 cmd!ExtCom+0x40
0012fe9c 4ad01447 cmd!FindFixAndRun+0xa9
0012fee0 4ad06cf6 cmd!Dispatch+0x137
0012ff44 4ad07786 cmd!main+0x108
0012ffc0 7d4e7d2a cmd!mainCRTStartup+0x12f
0012fff0 00000000 KERNEL32!BaseProcessInitPostImport+0x8d
0:001:x86> !peb
[...]
CommandLine: 'cmd.exe /c c:/progra~2/rrdtool/rrdtool.exe -'
[...]

Given this, the next logical step to investigate would be the RRDTool.exe process. Unfortunately, something really weird seemed to be going on with all the RRDTool.exe processes (naturally). WinDbg would give me an access denied error for all of the RRDTool PIDs in the F6 process list, despite my being a local machine administrator.

Attempting to attach to these processes failed as well:

Microsoft (R) Windows Debugger Version 6.6.0007.5
Copyright (c) Microsoft Corporation. All rights reserved.

Cannot debug pid 4904, NTSTATUS 0xC000010A
“An attempt was made to duplicate an object handle into or out of an exiting process.”
Debuggee initialization failed, NTSTATUS 0xC000010A
“An attempt was made to duplicate an object handle into or out of an exiting process.”

This is not something that you want to be seeing on a server box. This particular error means that the process in question is in the middle of being terminated, which prevents a debugger from successfully attaching. However, processes typically terminate in timely fashion; in fact, it’s almost unheard of to actually see a process in the terminating state, since it happens so quickly. However, in this particular instances, the RRDTool processes were remaining in this half-dead state for what appeared to be an indefinite interval.

There are two things that commonly cause this kind of problem, and all of them are related to the kernel:

  1. The disk hardware is not properly responding to I/O requests and they are hanging indefinitely. This can block a process from exiting while the operating system waits for an I/O to finishing canceling or completing. Since this particular box was brand new (and with respectable, high-quality server hardware), I didn’t think that failing hardware was the cause here (or at least, I certainly hoped not!). Given that there were no errors in the event log about I/Os failing, and that I was still able to access files on my disks without issue, I decided to rule this possiblity out.
  2. A driver (or other kernel mode code in the I/O stack) is buggy and is not allowing I/O requests to be canceled or completed, or has deadlocked itself and is not able to complete an I/O request. (You might be familiar with the latter if you’ve tried to use the 1394 mass storage support in Windows for a non-trivial length of time.) Given that I had tentatively ruled out bad hardware, this would seem to be the most likely cause here.

Since the frozen process would be stuck in kernel mode, in either case, to proceed any further I would need to use the kernel debugger. I decided to start out with local kd, as that is sufficient for at least retrieving thread stacks and doing basic passive analysis of potential deadlock issues where the system is at least mostly still functional.

Sure enough, the stuck RRDTool process I had unsuccessfully tried to attach to was blocked in kernel mode:


lkd> !process 0n4904
Searching for Process with Cid == 1328
PROCESS fffffadfcc712040
SessionId: 0 Cid: 1328 Peb: 7efdf000 ParentCid: 1354
DirBase: 5ea6c000 ObjectTable: fffffa80041d19d0 HandleCount: 68.
Image: rrdtool.exe
[...]
THREAD fffffadfcca9a040 Cid 1328.1348 Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (Unknown) KernelMode Non-Alertable
fffffadfccf732d0 SynchronizationEvent
Impersonation token: fffffa80041db980 (Level Impersonation)
DeviceMap fffffa8001228140
Owning Process fffffadfcc712040 Image: rrdtool.exe
Wait Start TickCount 6545162 Ticks: 367515 (0:01:35:42.421)
Context Switch Count 445 LargeStack
UserTime 00:00:00.0000
KernelTime 00:00:00.0015
Win32 Start Address windbg!_imp_RegCreateKeyExW (0x0000000000401000)
Start Address 0x000000007d4d1510
Stack Init fffffadfc4a95e00 Current fffffadfc4a953b0
Base fffffadfc4a96000 Limit fffffadfc4a8f000 Call 0
Priority 10 BasePriority 8 PriorityDecrement 1
RetAddr Call Site
fffff800`01027752 nt!KiSwapContext+0x85
fffff800`0102835e nt!KiSwapThread+0x3c9
fffff800`013187ac nt!KeWaitForSingleObject+0x5a6
fffff800`012b2853 nt!IopAcquireFileObjectLock+0x6d
fffff800`01288dff nt!IopCloseFile+0xad
fffff800`01288f0e nt!ObpDecrementHandleCount+0x175
fffff800`0126ceb0 nt!ObpCloseHandleTableEntry+0x242
fffff800`0128d7a6 nt!ExSweepHandleTable+0xf1
fffff800`012899b6 nt!ObKillProcess+0x109
fffff800`01289d3b nt!PspExitThread+0xa3a
fffff800`0102e3fd nt!NtTerminateProcess+0x362
00000000`77ef0caa nt!KiSystemServiceCopyEnd+0x3
0202c9fc`0202c9fb ntdll!NtTerminateProcess+0xa

Hmm… not quite what I expected. If a buggy driver was involved, it should have at least been somewhere on the call stack, but in this particular instance all we have is ntoskrnl code, straight from the system call to the wait that isn’t coming back. Something was definitely wrong in kernel mode, but it wasn’t immediately clear what was causing it. It appeared as if the kernel was blocked on the file object lock (which, to my knowledge, is used to guard synchronous I/O’s that are issued for a particular file object), but, as the file object lock is built upon KEVENTs, the usual lock diagnostics extensions (like `!locks’) would not be particularly helpful. In this instance, what appeared to be happening was that the process rundown logic in the kernel was attempting to release all still-open handles in the exiting RRDTool process, and it was (for some reason) getting stuck while trying to close a handle to a particular file object.

I could at least figure out what file was “broken”, though, by poking around in the stack of IopCloseFile:

lkd> !fileobj fffffadf`ccf73250
\Temp\php\session\sess_bkcavai8fak8antv9coq46at95
LockOperation Set Device Object: 0xfffffadfce423370 \Driver\dmio
Vpb: 0xfffffadfce864840
Access: Read Write SharedRead SharedWrite
Flags: 0x40042
Synchronous IO
Cache Supported
Handle Created
File Object is currently busy and has 1 waiters.
FsContext: 0xfffffa800390e110 FsContext2: 0xfffffa8000106a10
CurrentByteOffset: 0
Cache Data:
Section Object Pointers: fffffadfcd601c20
Shared Cache Map: fffffadfccfdebb0 File Offset: 0 in VACB number 0
Vacb: fffffadfce97fb08
Your data is at: fffff98070e80000

From here, there are a couple of options:

  1. We could look for processes with an open handle to that file and check their stacks.
  2. We could look for an IRP associated with that file object and try and trace our way back from there.

Initially, I tried the first option, but this ended up not working particularly well. I attempted to use Process Explorer to locate all processes that had a handle to that file, but this ended up failing rather miserably as Process Explorer itself got deadlocked after it opened a handle to the file. This was actually rather curious; it turned out that processes could open a handle to this “broken” file just fine, but when they tried to close the handle, they would get blocked in kernel mode indefinitely.

That unsuccessful, I tried the second option, which is made easier by the use of `!irpfind’. Normally, this extension is very slow to operate (over a serial cable), but local kd makes it quite usable. This revealed something of value:


lkd> !irpfind -v 0 0 fileobject fffffadf`ccf73250
Looking for IRPs with file object == fffffadfccf73250
Scanning large pool allocation table for Tag: Irp? (fffffadfccdf6000 : fffffadfcce56000)
Searching NonPaged pool (fffffadfcac00000 : fffffae000000000) for Tag: Irp?
Irp [ Thread ] irpStack: (Mj,Mn) DevObj [Driver] MDL Process
fffffadfcc225380: Irp is active with 7 stacks 7 is current (= 0xfffffadfcc225600)
No Mdl: No System Buffer: Thread fffffadfccea27d0: Irp stack trace.
cmd flg cl Device File Completion-Context
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000
[...]
>[ 11, 1] 2 1 fffffadfce7b6040 fffffadfccf73250 00000000-00000000 pending
\FileSystem\Ntfs
Args: fffffadfcd70e0a0 00000000 00000000 00000000

There was an active IRP for this file object. Hopefully, it could be related to whatever is holding the file object lock for that file object. Digging a bit deeper, it’s possible to determine what thread is associated with the IRP (if it’s a thread IRP), and from there, we can grab a stack (which might just give us the smoking gun we’re looking for)…:

lkd> !irp fffffadfcc225380 1
Irp is active with 7 stacks 7 is current (= 0xfffffadfcc225600)
No Mdl: No System Buffer: Thread fffffadfccea27d0: Irp stack trace.
Flags = 00000000
ThreadListEntry.Flink = fffffadfcc2253a0
ThreadListEntry.Blink = fffffadfcc2253a0
[...]
CancelRoutine = fffff800010ba930 nt!FsRtlPrivateResetLowestLockOffset
[...]
lkd> !thread fffffadfccea27d0
THREAD fffffadfccea27d0 Cid 10f8.138c Teb: 00000000fffa1000 Win32Thread: fffffa80023cd860 WAIT: (Unknown) UserMode Non-Alertable
fffffadfccf732e8 NotificationEvent
Impersonation token: fffffa8002c62060 (Level Impersonation)
DeviceMap fffffa8002f3b7b0
Owning Process fffffadfcc202c20 Image: w3wp.exe
Wait Start TickCount 6966187 Ticks: 952 (0:00:00:14.875)
Context Switch Count 1401 LargeStack
UserTime 00:00:00.0000
KernelTime 00:00:00.0000
Win32 Start Address 0x00000000003d87d8
Start Address 0x000000007d4d1504
Stack Init fffffadfc4fbee00 Current fffffadfc4fbe860
Base fffffadfc4fbf000 Limit fffffadfc4fb8000 Call 0
Priority 10 BasePriority 8 PriorityDecrement 0
RetAddr : Call Site
fffff800`01027752 : nt!KiSwapContext+0x85
fffff800`0102835e : nt!KiSwapThread+0x3c9
fffff800`012afb38 : nt!KeWaitForSingleObject+0x5a6
fffff800`0102e3fd : nt!NtLockFile+0x634
00000000`77ef14da : nt!KiSystemServiceCopyEnd+0x3
00000000`00000000 : ntdll!NtLockFile+0xa

This might just be what we’re looking for. There’s a thread in w3wp.exe (the IIS worker process), which is blocking on a synchronous NtLockFile call for that same file object that is in the “broken” state. Since I’m running PHP in ISAPI mode, this does make sense – if PHP is doing something to that file (which it could certainly be, since it’s a PHP session state file as we saw above), then it should be in the context of w3wp.exe.

In order to get a better user mode stack trace as to what might be going on, I was able to attach a user mode debugger to w3wp.exe and get a better picture as to what the deal was:


0:006> .effmach x86
Effective machine: x86 compatible (x86)
0:006:x86> ~6s
ntdll_7d600000!ZwLockFile+0x12:
00000000`7d61d82e c22800 ret 28h
0:006:x86> k
ChildEBP RetAddr
WARNING: Stack unwind information not available. Following frames may be wrong.
014edf84 023915a2 ntdll_7d600000!ZwLockFile+0x12
014edfbc 0241d886 php5ts!flock+0x82
00000000 00000000 php5ts!zend_reflection_class_factory+0xb576

It looks like that thread is indeed related to PHP; PHP is trying to acquire a file lock on the session state file. With a bit of work, we can figure out just what kind of lock it was trying to acquire.

The prototype for NtLockFile is as so:

// NtLockFile locks a region of a file.
NTSYSAPI
NTSTATUS
NTAPI
NtLockFile(
IN HANDLE FileHandle,
IN HANDLE Event OPTIONAL,
IN PIO_APC_ROUTINE ApcRoutine OPTIONAL,
IN PVOID ApcContext OPTIONAL,
OUT PIO_STATUS_BLOCK IoStatusBlock,
IN PULARGE_INTEGER LockOffset,
IN PULARGE_INTEGER LockLength,
IN ULONG Key,
IN BOOLEAN FailImmediately,
IN BOOLEAN ExclusiveLock
);

Given this, we can easily deduce the arguments off a stack dump:

0:006:x86> dd @esp+4 l0n10
00000000`014edf48 000002b0 00000000 00000000 014edfac
00000000`014edf58 014edfac 014edf74 014edf7c 00000000
00000000`014edf68 00000000 00000001
0:006:x86> dq 014edf74 l1
00000000`014edf74 00000000`00000000
0:006:x86> dq 014edf7c l1
00000000`014edf7c 00000000`00000001

It seems that PHP is trying to acquire an exclusive lock for a range of 1 byte starting at offset 0 in this file, with NtLockFile configured to wait until it acquires the lock.

Putting this information together, it’s now possible to surmise what is going on here:

  1. The child processes created by php have a file handle to the session state file (probably there from process creation inheritance).
  2. PHP tries to acquire an exclusive lock on part of the session state file. This takes the file object lock for that file and waits for the file to become exclusively available.
  3. The child process exits. Now, it tries to acquire the file object lock so that it can close its file handle. However, the file object lock cannot be acquired until the child process releases its handle as the handle is blocking PHP’s NtLockFile from completing.
  4. Deadlock! Neither thread can continue, and PHP appears to hang instead of configuring my graphs properly.

In this particular instance, it was actually possible to “recover” from the deadlock without rebooting; the IIS worker process’s wait in NtLockFile is marked as a UserMode wait, so it is possible to terminate the w3wp.exe process, which releases the file object lock and ultimately allows all the frozen processes that are trying to close a handle to the PHP session state file to finish the close handle operation and exit.

This is actually a nasty little problem; it looks like it’s possible for one user mode process to indefinitely freeze another user mode process in kernel mode via a deadlock. Although you can break the deadlock by terminating the second user mode process, the fact that a user mode process can, at all, cause the kernel to deadlock during process exit (“breakable” or not) does not appear to be a good thing to me.

Meanwhile, knowing this right now doesn’t really solve my problem. Furthermore, I suspect that there’s probably a different problem here too, as the command line that was given to RRDTool (simply “-“) doesn’t look all that valid to me. I’ll see if I can come up with some way to work around this deadlock problem, but it definitely looks like an unpleasant one. If it really is a file handle being incorrectly inherited to a child process, then it might be possible to un-mark that handle for inheritance with some work. The fact that I am having to consider making something to patch PHP to work around this is definitely not a happy one, though…

Silly me for thinking that it would just take a kernel debugger to get a web application running…

Programming against the x64 exception handling support, part 7: Putting it all together, or building a stack walk routine

Monday, February 5th, 2007

This is the final post in the x64 exception handling series, comprised of the following articles:

  1. Programming against the x64 exception handling support, part 1: Definitions for x64 versions of exception handling support
  2. Programming against the x64 exception handling support, part 2: A description of the new unwind APIs
  3. Programming against the x64 exception handling support, part 3: Unwind internals (RtlUnwindEx interface)
  4. Programming against the x64 exception handling support, part 4: Unwind internals (RtlUnwindEx implementation)
  5. Programming against the x64 exception handling support, part 5: Collided unwinds
  6. Programming against the x64 exception handling support, part 6: Frame consolidation unwinds
  7. Programming against the x64 exception handling support, part 7: Putting it all together, or building a stack walk routine

Armed with all of the information in the previous articles in this series, we can now do some fairly interesting things. There are a whole lot of possible applications for the information described here (some common, some estoric); however, for simplicities sake, I am choosing to demonstrate a classical stack walk (albeit one that takes advantage of some of the newly available information in the unwind metadata).

Like unwinding on x86 where FPO is disabled, we are able to do simple tasks like determine frame return addresses and stack pointers throughout the call stack. However, we can expand on this a great deal on x64. Not only are our stack traces guaranteed to be accurate (due to the strict calling convention requirements on unwind metadata), but we can retrieve parts of the nonvolatile context of each caller with perfect reliability, without having to manually disassemble every function in the call stack. Furthermore, we can see (at a glance) which functions modify which non-volatile registers.

For the purpose of implementing a stack walk, it is best to use RtlVirtualUnwind instead of RtlUnwindEx, as the RtlUnwindEx will make irreversible changes to the current execution state (while RtlVirtualUnwind operates on a virtual copy of the execution state that we can modify to our heart’s content, without disturbing normal program flow). With RtlVirtualUnwind, it’s fairly trivial to implement an unwind (as we’ve previously seen based on the internal workings of RtlUnwindEx). The basic algorithm is simply to retrieve the current unwind metadata for the active function, and execute a virtual unwind. If no unwind metadata is present, then we can simply set the virtual Rip to the virtual *Rsp, and increment the virtual *Rsp by 8 (as opposed to invoking RlVirtualUnwind). Since RtlVirtualUnwind does most of the hard work for us, the only thing left after that is to interpret and display the output (or save it away, if we are logging stack traces).

With the above information in mind, I have written an example x64 stack walking routine that implements a basic x64 stack walk, and displays the nonvolatile context as viewed by each successive frame in the stack. The example routine also makes use of the little-used ContextPointers argument to RtlVirtualUnwind in order to detect functions that have used a particular non-volatile register (other than the stack pointer, which is immediately obvious). If a function in the call stack writes to a non-volatile register, the stack walk routine takes note of this and displays the modified register, its original value, and the backing store location on the stack where the original value is stored. The example stack walk routine should work “as-is” in both user mode and kernel mode on x64.

As an aside, there is a whole lot of information that is being captured and displayed by the stack walk routine. Much of this information could be used to do very interesting things (like augment disassembly and code analysis by defintiively identifying saved registers and parameter usage. Other possible uses are more estoric, such as skipping function calls at run-time in a safe fashion, or altering the non-volatile execution context of called functions via modification of the backing store pointers returned by RtlVirtualUnwind in ContextPointers. The stack walk use case, as such, really only begins to scratch the surface as it relates to some of the many very interesting things that x64 unwind metadata allows you to do.

Comparing the output of the example stack walk routine to the debugger’s built-in stack walk, we can see that it is accurate (and in fact, even includes more information; the debugger does not, presently, have support for displaying non-volatile context for frames using unwind metadata (Update: Pavel Lebedinsky points out that the debugger does, in fact, have this capability with the “.frame /r” command)):

StackTrace64: Executing stack trace...
FRAME 00: Rip=00000000010022E9 Rsp=000000000012FEA0
 Rbp=000000000012FEA0
r12=0000000000000000 r13=0000000000000000
 r14=0000000000000000
rdi=0000000000000000 rsi=0000000000130000
 rbx=0000000000130000
rbp=0000000000000000 rsp=000000000012FEA0
 -> Saved register 'Rbx' on stack at 000000000012FEB8
   (=> 0000000000130000)
 -> Saved register 'Rbp' on stack at 000000000012FE90
   (=> 0000000000000000)
 -> Saved register 'Rsi' on stack at 000000000012FE88
   (=> 0000000000130000)
 -> Saved register 'Rdi' on stack at 000000000012FE80
   (=> 0000000000000000)

FRAME 01: Rip=0000000001002357 Rsp=000000000012FED0
[...]

FRAME 02: Rip=0000000001002452 Rsp=000000000012FF00
[...]

FRAME 03: Rip=0000000001002990 Rsp=000000000012FF30
[...]

FRAME 04: Rip=00000000777DCDCD Rsp=000000000012FF60
[...]
 -> Saved register 'Rbx' on stack at 000000000012FF60
   (=> 0000000000000000)
 -> Saved register 'Rsi' on stack at 000000000012FF68
   (=> 0000000000000000)
 -> Saved register 'Rdi' on stack at 000000000012FF50
   (=> 0000000000000000)

FRAME 05: Rip=000000007792C6E1 Rsp=000000000012FF90
[...]


0:000> k
Child-SP          RetAddr           Call Site
00000000`0012f778 00000000`010022c6 DbgBreakPoint
00000000`0012f780 00000000`010022e9 StackTrace64+0x1d6
00000000`0012fea0 00000000`01002357 FaultingSubfunction3+0x9
00000000`0012fed0 00000000`01002452 FaultingFunction3+0x17
00000000`0012ff00 00000000`01002990 wmain+0x82
00000000`0012ff30 00000000`777dcdcd __tmainCRTStartup+0x120
00000000`0012ff60 00000000`7792c6e1 BaseThreadInitThunk+0xd
00000000`0012ff90 00000000`00000000 RtlUserThreadStart+0x1d

(Note that the stack walk routine doesn’t include DbgBreakPoint or the StackTrace64 frame itself. Also, for brevity, I have snipped the verbose, unchanging parts of the nonvolatile context from all but the first frame.)

Other interesting uses for the unwind metadata include logging periodic stack traces at choice locations in your program for later analysis when a debugger is active. This is even more powerful on x64, especially when you are dealing with third party code, as even without special compiler settings, you are guaranteed to get good data with no invasive use of symbols. And, of course, a good knowledge of the fundamentals of how the exception/unwind metadata works is helpful for debugging failure reporting code (such as custom unhandled exception filter) on x64.

Hopefully, you’ve found this series both interesting and enlightening. In a future series, I’m planning on running through how exception dispatching works (as opposed to unwind dispatching, which has been relatively thoroughly covered in this series). That’s a topic for another day, though.

Programming against the x64 exception handling support, part 6: Frame consolidation unwinds

Sunday, January 14th, 2007

In the last post in the programming x64 exception handling series, I described how collided unwinds were implemented, and just how they operate. That just about wraps up the guts of unwinding (finally), except for one last corner case: So-called frame consolidation unwinds.

Consolidation unwinds are a special form of unwind that is indicated to RtlUnwindEx with a special exception code, STATUS_UNWIND_CONSOLIDATE. This exception code changes RtlUnwindEx’s behavior slightly; it suppresses the behavior of substituting the TargetIp argument to RtlUnwindEx with the unwound context’s Rip value.

As far as RtlUnwindEx goes, that’s all that there is to consolidation unwinds. There’s a bit more that goes on with this special form of unwind, though. Specifically, as with longjump style unwinds, there is special logic contained within RtlRestoreContext (used by RtlUnwindEx to realize the final, unwound execution context) that detects the consolidation unwind case (by virtue of the ExceptionCode member of the ExceptionRecord argument), and enables a special code path. As it is currently implemented, some of the logic relating to unwind consolidation is also resident within the pseudofunction RcFrameConsolidation. This function is tightly coupled with RtlUnwindContext; it is only separated into another logical function for purposes of describing RtlRestoreContext and RcFrameConsolidation in the unwind metadata for ntdll (or ntoskrnl).

The gist of what RtlRestoreContext/RcFrameConsolidation does in this case is essentially the following:

  1. Make a local copy of the passed-in context record.
  2. Treating ExceptionRecord->ExceptionInformation[0] as a callback function, this callback function is called (given a single argument pointing to the ExceptionRecord provided to RtlRestoreContext).
  3. The return address of the callback function is treated as a new Rip value to place in the context that is about to be restored.
  4. The context is restored as normal after the Rip value is updated based on the callback’s decision.

The callback routine pointed to by ExceptionRecord->ExceptionInformation[0] has the following signature:

// Returns a new RIP value to be used in the restored context.
typedef ULONG64 (* PFRAME_CONSOLIDATION_ROUTINE)(
   __in PEXCEPTION_RECORD ExceptionRecord
   );

After the confusing interaction between multiple instances of the case function that is collided unwinds, frame consolidation unwinds may seem a little bit anti-climactic.

Frame consolidation unwinds are typically used in conjunction with C++ try/catch/throw support. Specifically, when a C++ exception is thrown, a consolidation unwind is executed within a function that contains a catch handler. The frame consolidation callback is used by the CRT to actually invoke the various catch filters and handlers (these do not necessarily directly correspond to standard C SEH scope levels). The C++ exception handling routines use the additional ExceptionInformation fields available in a consolidation unwind in order to pass information about the exception object itself; this usage of the ExceptionInformation fields does not have any special support in the OS-level unwind routines, however. Once the C++ exception is going to cross a function-level boundary, in my observations, it is converted into a normal exception for purposes of unwinding and exception handling. Then, consolidation unwinds are used again to invoke any catch handlers encountered within the next function (if applicable).

Essentially, consolidation unwinds can be thought as a normal unwind, with a conditionally assigned TargetIp whose value is not determined until after all unwind handlers have been called, and the specified context has been unwound. In most circumstances, this functionality is not particularly useful (or critical) when speaking in terms of programming the raw OS-level exception support directly. Nonetheless, knowing how it works is still useful, if only for debugging purposes.

I’m not covering the details of how all of the C++ exception handling framework is built on top of the unwind handling routines in this post series; there is no further OS-level “awareness” of C++ exceptions within the OS’s unwind related support routines, however.

Next up: Tying it all together, or using x64’s improved exception handling support to our gain in the real world.

Programming against the x64 exception handling support, part 5: Collided unwinds

Tuesday, January 9th, 2007

Previously, I discussed the internal workings of RtlUnwindEx. While that posting covered most of the inner details regarding unwind support, I didn’t fully cover some of the corner cases.

Specifically, I haven’t yet discussed just what a “collided unwind” really is, other than providing vague hints as to its existance. A collided unwind occurs when an unwind handler initiates a secondary unwind operation in the context of an unwind notification callback. In other words, a collided unwind is what occurs when, in the process of a stack unwind, one of the call frames changes the target of an unwind. This has several implications and requirements in order to operate as one might expect:

  1. Some unwind handlers that were on the original unwind path might no longer be called, depending on the new unwind target.
  2. The current unwind call stack leading into RtlUnwindEx will need to be interrupted.
  3. The new unwind operation should pick up where the old unwind operation left off. That is, the new unwind operation shouldn’t start unwinding the exception handler stack; instead, it must unwind the original stack, starting from the call frame after the unwind handler which initiated the new unwind operation.

Because of these conditions, the implementation of collided unwinds is a bit more complicated than one might expect. The main difficulty here is that the second unwind operation is initiated within the call stack of an existing unwind operation, but what the unwind handler “wants” to do is to unwind the stack that was already being unwound, except to a different target and with different parameters.

From an unwind handler’s perspective, all that needs to be done to accomplish this is to make a call to RtlUnwindEx in the context of an unwind handler callback for an unwind operation, and RtlUnwindEx magically takes care of all of the work necessary to make the collided unwind “just work”.

Allowing this sort of unwind operation to “just work” requires a bit of creative thinking from the perspective of RtlUnwindEx, however. The main difficult here is that RtlUnwindEx, when called from the unwind handler, somehow needs a way to recover the original context that was being unwound in order to “pick up” where the original call to RtlUnwindEx “left off” (when it called an unwind handler that initiated a collided unwind). Because there is no provision for passing a context record to RtlUnwindEx and indicating that RtlUnwindEx should use it as a starting point for an unwind operation (RtlUnwindEx always initiates the unwind in the current call stack), this poses a problem; how is RtlUnwindEx to recover the original unwind parameters from where it should initiate the “real” unwind?

The way that Microsoft decided to solve this problem is an elegant little hack of sorts. The solution all comes down to that mysterious exception handler around RtlpExecuteHandlerForUnwind: RtlpUnwindHandler. Recall from the previous article that RtlUnwindEx calls RtlpExecuteHandlerForUnwind in order to invoke an exception handler for unwind purposes, and that RtlpExecuteHandlerForUnwind sets up an exception handler (RtlpUnwindHandler) before calling the requested exception handler for unwind. At the time, these extra steps (the use of RtlpExecuteHandlerForUnwind, and its exception handler) probably looked a bit redundant, and in the process of a “conventional” unwind operation, the extra work that RtlUnwindEx goes through before calling an unwind handler doesn’t even come into play as adding any value.

That all changes when a collided unwind occurs, however. In the collided unwind case, RtlpExecuteHandlerForUnwind and RtlpUnwindHandler are critical to solving the problem of how to recover the original unwind parameters so that RtlUnwindEx can perform an unwind operation on the correct call stack. In order to understand just how RtlpUnwindHandler and friends come into play with a collided unwind, it’s necessary to take a closer look about just what RtlUnwindEx will do when it is called from the context of an unwind handler.

Since RtlUnwindEx always begins a call frame unwind from the currently active call stack, the second call to RtlUnwindEx will start unwinding the call stack of the unwind handler that called RtlUnwindEx. But wait, you might say – this isn’t what is supposed to happen! It turns out that unwinding the unwind handler’s call stack will actually lead up to the “right thing” happening, through a bit of clever use of how “conventional” unwind operations work. To better understand what I mean, it’s helpful to look at the stack of a secondary call to RtlUnwindEx (initiating a collided unwind operation). For this purpose, I’ve put together a small problem that initiates a collided unwind (more on how and why you might see a collided unwind in the “real world” later). I’ve set a breakpoint on RtlUnwindEx, and skipped forward until I encountered the nested call to RtlUnwindEx that was initiating a collided unwind operation:

0:000> k
Child-SP          Call Site
00000000`0012e058 ntdll!RtlUnwindEx
00000000`0012e060 ntdll!local_unwind+0x1c
00000000`0012e540 TestApp!`FaultingFunction2'::`1'::fin$2+0x34
00000000`0012e570 ntdll!_C_specific_handler+0x140
00000000`0012e5e0 ntdll!RtlpExecuteHandlerForUnwind+0xd
00000000`0012e610 ntdll!RtlUnwindEx+0x236
00000000`0012ec90 TestApp!UnwindExceptionHandler2+0xf8
00000000`0012f1b0 TestApp!`FaultingFunction2'::`1'::filt$1+0xe
[...]

At this point, given what we know about RtlUnwindEx, it will start unwinding the stack downward. Since the target of the collided unwind will by definition be lower in the stack than the unwind handler’s stack pointer itself, RtlUnwindEx will continue unwinding downward, calling unwind handlers (if any) for each successive frame. Taking a look at the call stack, we can determine that there are no frames with an exception handler marked for unwind (denotated by a [ U ] in the !fnseh output):

0:000> !fnseh ntdll!RtlUnwindEx
ntdll!RtlUnwindEx L295 22,0A [   ]  (none)
0:000> !fnseh ntdll!local_unwind
ntdll!local_unwind L24 07,02 [   ]  (none)
0:000> !fnseh 00000000`01001f04 
1001ed0 L3a 06,02 [   ]  (none)
0:000> !fnseh ntdll!_C_specific_handler+0x140
ntdll!_C_specific_handler L16a 20,0C [   ]  (none)

(Here, 00000000`01001f04 corresponds to TestApp!`FaultingFunction2′::`1′::fin$2+0x34).

Because none of these call frames have an exception handler marked for unwind callbacks, we can surmise that RtlUnwindEx will blissfully unwind past all of these call frames just as one might expect. At this point, RtlUnwindEx is still unwinding the “wrong” stack though; we’d like it to be unwinding the stack passed to the original call to RtlUnwindEx, and not the unwind/exception handler call stack.

Something that one might not immediately expect happens when RtlUnwindEx reaches the next frame, however. Remember that the current call frame is now _C_specific_handler – the C-language exception handler for the current function that was originally being unwound after an exception occured. This means that the next call frame will be the original RtlUnwindEx, or more precisely, RtlpExecuteHandlerForUnwind.

This is where RtlpExecuteHandlerForUnwind and RtlpUnwindHandler get to shine. If we take a look at the next call frame in the debugger, we see that it is indeed RtlpExecuteHandlerForUnwind, and that it also has (as expected) an exception handler marked for unwind support: RtlpUnwindHandler.

0:000> !fnseh ntdll!RtlpExecuteHandlerForUnwind+0xd
ntdll!RtlpExecuteHandlerForUnwind L13 04,01 [EU ]
  ntdll!RtlpUnwindHandler (assembler/unknown)

Because this call frame does have an exception handler that supports unwind callouts, it will be returned to RtlUnwindEx by RtlVirtualUnwind. This, in turn, will lead to RtlUnwindEx calling RtlpUnwindHandler, as registered by RtlpExecuteHandlerForUnwind in the original call stack (by RtlUnwindEx). We can verify this in the debugger:

0:000> bp ntdll!RtlpUnwindHandler
0:000> g
Breakpoint 1 hit
ntdll!RtlpUnwindHandler:
00000000`779507e0 488b4220  mov rax,qword ptr [rdx+20h]
0:000> k
Child-SP          Call Site
00000000`0012d9a8 ntdll!RtlpUnwindHandler
00000000`0012d9b0 ntdll!RtlpExecuteHandlerForUnwind+0xd
00000000`0012d9e0 ntdll!RtlUnwindEx+0x236
00000000`0012e060 ntdll!local_unwind+0x1c
00000000`0012e540 TestApp!`FaultingFunction2'::`1'::fin$2+0x34
00000000`0012e570 ntdll!_C_specific_handler+0x140
00000000`0012e5e0 ntdll!RtlpExecuteHandlerForUnwind+0xd
00000000`0012e610 ntdll!RtlUnwindEx+0x236
00000000`0012ec90 TestApp!UnwindExceptionHandler2+0xf8
00000000`0012f1b0 TestApp!`FaultingFunction2'::`1'::filt$1+0xe
[...]

This is where things start to get a little interesting. From the discussion in the previous article, we know that RtlpUnwindHandler essentially does the following:

  1. Retrieve the PDISPATCHER_CONTEXT argument that RtlpExecuteHandlerForUnwind (the original instance, from the original unwind operation initiated by the first call to RtlUnwindEx) saved on its stack. This is done via the use of the EstablisherFrame argument to RtlpUnwindHandler.
  2. Copy the contents of RtlpExecuteHandlerForUnwind’s DISPATCHER_CONTEXT over the DISPATCHER_CONTEXT of the current RtlUnwindEx instance, through the PDISPATCHER_CONTEXT argument provided to RtlpUnwindHandler. Note that the TargetIp member of the DISPATCHER_CONTEXT is not copied from RtlpExecuteHandlerForUnwind’s DISPATCHER_CONTEXT.
  3. Return the manifest ExceptionCollidedUnwind constant to the caller (RtlpExecuteHandlerForUnwind, which will in turn return this value to RtlUnwindEx).

After all this is done, control returns to RtlUnwindEx. Because RtlpExecuteHandlerForUnwind returned ExceptionCollidedUnwind, though, a previously unused code path is activated. This code path (as described previously) copies the contents of the DISPATCHER_CONTEXT structure whose address was passed to RtlpExecuteHandlerForUnwind back into the internal state of RtlUnwindEx (including the context record), and then attempts to re-start unwinding of the current stack frame.

If you’ve been paying attention so far, then you probably understand what is going to happen next.

Because of the fact that RtlpUnwindHandler copied the DISPATCHER_CONTEXT from the original call to RtlUnwindEx over the DISPATCHER_CONTEXT from the current (collided unwind) call to RtlUnwindEx, the current instance of RtlUnwindEx now has access to all of the state information that the original RtlUnwindEx instance had placed into the PDISPATCHER_CONTEXT passed to RtlpExecuteHandlerForUnwind. Most importantly, this includes access to the original context record descibing the call frame that the original instance of RtlUnwindEx was in the process of unwinding.

Since all of this information has now been copied over the current RtlUnwindEx instance’s internal state, in effect, the current instance of RtlUnwindEx will (for the next unwind iteration) start unwinding the stack where the original RtlUnwindEx instance stopped; in other words, the stack being unwound “jumps” from the currently active call stack to the exception (or other) call stack that was originally being unwound.

At this point, the second instance of RtlUnwindEx is all setup to unwind the call stack to the new unwind target frame (and target instruction pointer; remember that TargetIp was omitted from the copying performed on the PDISPATCHER_CONTEXT in RtlpUnwindHandler) like a “conventional” unwind. The rest is, as they say, history.

Now that we know how collided unwinds work, it is important to know when one would ever see such a thing (after all, interrupting an unwind in-progress is a fairly invasive and atypical operation).

It turns out that collided unwinds are not quite as far-fetched as they might seem; the easiest way to cause such an event is to do something sleazy like execute a return/goto/continue/break to transfer control out of a __finally block. This, in effect, requires that the compiler stop the current unwind operation and transfer control to the target location (which is usually within the function that contained the __finally that the programmer jumped out of). Nevertheless, the compiler still has to deal with the fact that it has been called in the context of an unwind operation, and as such it needs a way to “break out” of the unwind call stack. This is done by executing a “local unwind”, or an unwind to a location within the current function. In order to do this, the compiler calls a small, runtime-supplied helper function known as local_unwind. This function is described below, and is essentially an extremely thin wrapper around RtlUnwindEx that, in practice, adds no value other than providing some default argument values (and scratch space on the stack for RtlUnwindEx to use to store a CONTEXT structure):

0:000> uf ntdll!local_unwind
ntdll!local_unwind:
00000000`7796f580 4881ecd8040000 sub  rsp,4D8h
00000000`7796f587 4d33c0         xor  r8,r8
00000000`7796f58a 4d33c9         xor  r9,r9
00000000`7796f58d 4889642420     mov  qword ptr [rsp+20h],rsp
00000000`7796f592 4c89442428     mov  qword ptr [rsp+28h],r8
00000000`7796f597 e844b0fdff     call ntdll!RtlUnwindEx
00000000`7796f59c 4881c4d8040000 add  rsp,4D8h
00000000`7796f5a3 c3             ret

When the compiler calls local_unwind as a result of the programmer breaking out of a __finally block in some fashion, then execution will eventually end up in RtlUnwindEx. From there, RtlUnwindEx eventually detects the operation as a collided unwind, once it unwinds past the original call to the original unwind handler that started the new unwind operation via local_unwind.

As a result, breaking out of a __finally block instead of allowing it to run to completion (which may result in control being transferred out of the “current function”, from the programmer’s perspective, and “into” the next function in the call stack for unwind processing) is how every-day programs can end up causing a collided unwind.

Next time: More unwind estorica, including details on how RtlUnwindEx and RtlRestoreContext lay the groundwork used to build C++ exception handling support.

Programming against the x64 exception handling support, part 4: Unwind internals (RtlUnwindEx implementation)

Monday, January 8th, 2007

In the previous article in this series, I discussed the external interface exposed by RtlUnwindEx (and some of how unwinding works at a high level). This posting continues that discussion, and aims to provide insight into the internal workings of RtlUnwindEx (and as such, the inner details of all of the different aspects of unwind support on x64 Windows).

As previously described, the main behavior of RtlUnwindEx is to systematically unwind call frames (with the help of RtlVirtualUnwind) until a specific call frame, which is identified by the TargetFrame argument, is reached. RtlUnwindEx is also responsible for all interactions with language exception handlers for purposes of unwind operations. Additionally, RtlUnwindEx also imposes various validations and restrictions on execution contexts being unwound, and on the behavior of exception handlers being called for an unwind operation.

The first order of business within RtlUnwindEx is to capture the execution context at the time of the call to RtlUnwindEx (specifically, the execution context inside RtlUnwindEx, not of the caller of RtlUnwindEx). This is done with the aid of two helper functions, RtlpGetStackLimits (which retrieves the bounds of the stack for the current thread from the NT_TIB region of the current threads’ TEB), and RtlCaptureContext (which records the complete execution context of its caller within a standard CONTEXT structure). Additionally, if an unwind table is supplied, a special flag is set in it that optimizes the behavior of subsequent calls to RtlLookupFunctionTable for lookups that are unwind-driven (this is a behavior new to Windows Vista, and is a further attempt to improve the performance of unwind support on x64).

If the caller did not supply an EXCEPTION_RECORD argument, RtlUnwindEx will create the default STATUS_UNWIND exception record at this time and substitute it for what would have otherwise been a caller-supplied EXCEPTION_RECORD block. The exception record is initialized with an ExceptionAddress pointing to the Rip value captured previously by RtlCaptureContext, and with no parameters. Additionally, an initial ExceptionFlags value of EXCEPTION_UNWINDING is set, to later indicate to any exception handlers that might be called that an unwind operation is in progress (the EXCEPTION_RECORD pointer, either caller supplied or locally allocated by RtlUnwindEx in the absence of a caller-supplied value, corresponds exactly to the EXCEPTION_RECORD argument passed to any LanguageHandler that is called during unwind processing).

In the event that the caller of RtlUnwindEx did not supply a TargetFrame argument (indicating that the requested unwind operation is an exit unwind), then the EXCEPTION_EXIT_UNWIND flag is set within RtlUnwindEx’s internal ExceptionFlags value. An exit unwind is a special form of unwind where the “target” of the unwind is unknown; in other words, the caller does not have a valid target frame pointer to supply to RtlUnwindEx. Initiating a target unwind is normally dangerous unless the caller has special knowledge of an unwind handler in the call stack that will halt the unwind operation prematurely (either by initiating a secondary unwind, which leads to what is called a collided unwind, or by exiting the thread entirely). The reason for this restriction is that as RtlUnwindEx doesn’t have a clear “stopping point” to halt the unwind cycle at, it will happily unwind past the end of the stack (typically resulting in an access violation) unless an unwind handler along the way does something to halt the unwind. Most unwind operations are not exit unwinds.

At this point, RtlUnwindEx is set up to enter the main loop of the unwind algorithm, which essentially involves repeated calls to RtlVirtualUnwind, and then to unwind handlers (if present). This main loop involves multiple steps:

  1. The RUNTIME_FUNCTION entry for the current frame (given by the Rip member of the context record captured above, and later updated in this loop) is located via RtlLookupFunctionEntry. If no function entry is present, then RtlUnwindEx will load Context->Rip with a ULONG64 value located at Context->Rsp, and then increment Context->Rsp by 8. The behavior when there is no RUNTIME_FUNCTION entry present accounts for leaf functions, for which unwind metadata is optional. If the current frame is a leaf function, then control skips forward to step 8.
  2. Assuming that a RUNTIME_FUNCTION was found, RtlUnwindEx makes a copy of the current execution context that will be unwound – something I call the “unwind context”. After duplicating the context (via the RtlpCopyContext helper function, which only duplicates the non-volatile context), RtlVirtualUnwind is called (with the unwind context), and requested to return the address any associated language handler that is marked for unwind support. RtlVirtualUnwind thus returns several useful pieces of information; a language handler supporting unwind (if any), an updated context describing the caller of the requested call frame, a language-handler-specific (i.e. C scope table) data pointer associated with the requested call frame (if any), and the stack pointer of the call frame being unwound (the establisher frame). These pieces of information are used later in communication with a returned exception handler with unwind support, if one exists.
  3. After calling RtlVirtualUnwind to establish the context of the next location on the stack frame (now contained within the “unwind context” location), RtlUnwindEx performs some validation of the returned EstablisherFrame value. Specifically, the EstablisherFrame value is ensured to be 8-byte aligned and within the stack limits of the current thread (in kernel mode, there is also special support for handling the case of an unwind occcuring within the context of a DPC, which may operate under a secondary stack). If either of these conditions does not hold true, a STATUS_BAD_STACK exception is raised, indicating that the stack pointer in the requested call frame is damaged or corrupted. Additionally, if a TargetFrame value is specified (that is, the unwind operation is not an exit unwind), then the TargetFrame value is validated to be greater than or equal to the EstablisherFrame value returned by RtlVirtualUnwind. This is, in effect, a sanity check designed to ensure that the unwind target actually refers to a previous call frame and not that one that has already be unwound. If this check fails, then a STATUS_BAD_STACK exception is raised.
  4. If a language handler was returned by RtlVirtualUnwind, then RtlUnwindEx sets up for a call to the language handler. This involves the initial setup of a DISPATCHER_CONTEXT structure created on the stack of RtlUnwindEx. The DISPATCHER_CONTEXT structure describes some internal state that RtlUnwindEx shares with all participants in the unwind process, such as language handlers being called for unwind. It contains all of the state information necessary to coordinate operation between RtlUnwindEx and any language handler. Furthermore, it is also instrumental in the processing of collided unwinds; more on that later. The newly initialized DISPATCHER_CONTEXT contains two fields of significance, initially; the TargetIp field (which is simply a copy of the TargetIp argument to RtlUnwindEx), and the ScopeIndex field (which is zero initialized). Both of these fields are unused by RtlUnwindEx itself, and are simply available for the conveniene of language handlers being called for an unwind operation. If no language handler was present for the requested call frame, then control skips forward to step 8.
  5. At this point, RtlUnwindEx is ready to make a call to an unwind handler. This first involves a quick check to determine whether the end of the unwind chain has been reached, through comparing the current frame’s EstablisherFrame value with the TargetFrame argument to RtlUnwindEx. If the two frame pointers match exactly, then the ExceptionFlags value passed in to the unwind handler has an additional bit set, EXCEPTION_TARGET_UNWIND. This flag bit lets the unwind handler know that it is the “last stop” in the unwind process (in other words, that there will be no further frame unwinds after the unwind handler finishes processing). At this point, the ReturnValue argument passed to RtlUnwindEx is copied into the Rax register image in the active context for the current frame (not the unwound context, which refers to the previous frame). Then, the last remaining fields of the DISPATCHER_CONTEXT structure are initialized based on the internal state of RtlUnwindEx; the image base, handler data, instruction pointer (ControlPc), function entry, establisher frame, and language handler values previously returned by RtlLookupFunctionEntry and RtlVirtualUnwind are copied into the DISPATCHER_CONTEXT structure, along with a pointer to the context record describing the execution state at the current frame. After the ExceptionFlags member of RtlUnwindEx’s EXCEPTION_RECORD structure is set, the stack-based exception flags image (from which the copy in the EXCEPTION_RECORD was copied from) has the EXCEPTION_TARGET_UNWIND and EXCEPTION_COLLIDED_UNWIND flags cleared, to ensure that these flags are not inadvertently passed to an exception routine unexpectedly in a future loop iteration.
  6. After preparing the DISPATCHER_CONTEXT for the unwind handler call, RtlUnwindEx makes a call to a small helper function, RtlpExecuteHandlerForUnwind. RtlpExecuteHandlerForUnwind is an assembly-language routine whose prototype matches that of the language specific handler, given below:
    typedef EXCEPTION_DISPOSITION (*PEXCEPTION_ROUTINE) (
        IN PEXCEPTION_RECORD               ExceptionRecord,
        IN ULONG64                         EstablisherFrame,
        IN OUT PCONTEXT                    ContextRecord,
        IN OUT struct _DISPATCHER_CONTEXT* DispatcherContext
    );

    RtlpExecuteHandlerForUnwind is fairly straightforward. All it does is store the DispatcherContext argument on the stack, and then make a call to the LanguageHandler member in the DISPATCHER_CONTEXT structure. RtlpExecuteHandler then returns the return value of the LanguageHandler itself.

    While this may seem like a rather useless helper routine at first, RtlpExecuteHandlerForUnwind actually does add some value, although it might not be immediately apparent unless one looks closely. Specifically, RtlpExecuteHandlerForUnwind registers an exception/unwind handler for itself (RtlpUnwindHandler). RtlpUnwindHandler does not go through _C_specific_handler; in other words, it is a raw exception handler registration. Like RtlpExecuteHandlerForUnwind, RtlpUnwindHandler is a raw assembly language routine. It, too, is fairly simple (and as a language-level exception handler routine, RtlpUnwindHandler is compatible with the LanguageHandler prototype described above); RtlpUnwindHandler uses the EstablisherFrame argument given to a LanguageHandler routine to locate the saved pointer to the DISPATCHER_CONTEXT on the stack of RtlpExecuteHandlerForUnwind, and then copies most of the DISPATCHER_CONTEXT structure passed to RtlpExecuteHandlerForUnwind over the DISPATCHER_CONTEXT structure that was passed to RtlpUnwindHandler itself (conspicuously omitted from the copy is the TargetIp member of the DISPATCHER_CONTEXT structure, for reasons that will become clear later). After performing the copy of the DISPATCHER_CONTEXT structure, RtlpUnwindHandler returns the manifest ExceptionCollidedUnwind constant. Although one might naively assume that all of this just leads up to protecting against the case of an unwind handler throwing an exception, it actually has a much more common (and significant) use; more on that later.

  7. After RtlpExecuteHandlerForUnwind returns, RtlUnwindEx decides what course of action to persue based on the return value. There are two legal return values from an exception handler called for unwind, ExceptionContinueSearch (the general “success”) return, and ExceptionCollidedUnwind. If any other value is returned, then RtlUnwindEx raises a STATUS_INVALID_DISPOSITION exception, indicating that an unwind handler has returned an illegal value (this is typically rarely seen in practice, as most unwind handlers are compiler generated, and therefore always get the return value correct). If ExceptionContinueSearch is returned, and the current EstablisherFrame doesn’t match the TargetFrame argument, then the unwind context and the context for the “current frame” are swapped (this positions the current frame context as referring to the context of the next function in the call chain, which will then be duplicated and unwound in the next loop iteration). If ExceptionCollidedUnwind is returned, then the execution path is a little bit more complicated. In the collided unwind case, all of the internal state information that RtlUnwindEx had previously copied into the DISPATCHER_CONTEXT structure passed to RtlpExecuteHandler back out of the DISPATCHER_CONTEXT structure. RtlVirtualUnwind is then executed to determine the next lowest call frame using the context copied out of the DISPATCHER_CONTEXT structure, the EXCEPTION_COLLIDED_UNWIND flag is set, and control is transferred to step 5. This step may initially seem strange, but it will become clear after it is explained later.
  8. If control reaches this point, then a frame has been successfully unwound, and any applicable unwind handler has been notified of the unwind operation. The next step is a re-validation of the EstablisherFrame value (as it may have changed in the collided unwind case). Assuming that EstablisherFrame is valid, if its value does not match the TargetFrame argument, then control is transferred to step 1. Otherwise, if there is a match, then the loop terminates. (If the EstablisherFrame is not valid, and is not the expected TargetFrame value, then either the unwind exception record is raised as an exception, or a STATUS_BAD_FUNCTION_TABLE exception is raised.)

At this point, RtlUnwindEx has arrived at its target frame, and all intermediary unwind handlers have been called. It is now time to transfer control to the unwind point. The ReturnValue argument is again loaded into the current frame’s context (Rax register), and if the exception code supplied by the RtlUnwindEx caller via the ExceptionRecord argument does not match STATUS_UNWIND_CONSOLIDATE, the Rip value in the current frame’s context is replaced with the TargetIp argument.

The final task is to realize the finalized context; this is done by calling RtlRestoreContext, passing it the current frame’s context and the ExceptionRecord argument (or the default exception record constructed if no ExceptionRecord argument was supplied). RtlRestoreContext will in most cases simply copy the given context into the currently active register set, although in two special cases (if a STATUS_LONGJUMP or STATUS_UNWIND_CONSOLIDATE exception code is set in the optional ExceptionRecord argument), this behavior deviates from the norm. In the long jump case (as previously documented), the ExceptionRecord argument is assumed to contain a pointer to a jmp_buf, which contains a nonvolatile register set to restore on top of the unwound context supplied by RtlUnwindEx. The unwind consolidate case is rather more complicated, and will be discussed in a future posting.

For reference, I have posted some annotated, reverse engineered C and assembler code describing the internal operations of RtlUnwindEx and several of its helper functions (such as RtlpUnwindHander). This C code is based off of the Windows Vista implementation of RtlUnwindEx, and as such takes advantage of new Windows Vista-specific optimizations to unwind handling. Specifically, the “Unwind” flag in the UNWIND_HISTORY_TABLE structure is new in Windows Vista (although the size of the structure has not changed; there used to be empty alignment padding at that offset in previous Windows versions). This flag is used as a hint to RtlLookupFunctionEntry, in order to expedite lookup of function entries for some commonly referenced functions in the unwind path. Between the provided comments and the above description of the overall functionality of RtlUnwindEx, the inner workings of it should begin to come clear. There are some aspects (in particular, collided unwind) that are a bit more complicated than one might initially imagine; I’ll discuss collided unwinds (and more) in the next posting in this series.

It would be best to call the system version of RtlUnwindEx instead of reimplementing it for general purpose use (which I have done so here primarily to illustrate how unwind works on x64 Windows). There have been improvements made to RtlUnwindEx between Windows Server 2003 SP1 x64 and Windows Vista x64, so it would be unwise to assume that RtlUnwindEx will remain devoid of new performance or feature additions forever.

Next up: Collided unwinds, and other things that go “bump” in the dark when you use compiler exception handling and unwind support.