Archive for the ‘Debugging’ Category

Be careful about using the built-in “low privilege” service accounts…

Wednesday, July 25th, 2007

One of the security enhancements in the Windows XP and Windows Server 2003 timeframe was to move a number of the built-in services that ship with the OS to run under more restricted accounts than LocalSystem. Specifically, two new built-in accounts akin to LocalSystem were introduced exclusively for use with services: the LocalService and NetworkService accounts. These are essentially slightly more powerful than plain user accounts, but not so powerful that compromising one means the entire system is a write-off.

The intention here was to reduce the attack surface of the system as a whole, such that if a service that is running as LocalService or NetworkService is compromised, then it cannot be used to take over the system as a whole.

(For those curious, the difference between LocalService and NetworkService is only evident in domain scenarios. If the computer is joined to a domain, LocalService authenticates on the network with anonymous credentials (a null session), while NetworkService (like LocalSystem) authenticates as the computer account.)

Now, reducing the amount of code running as LocalSystem is a great thing pretty much all around, but there are some sticking points with the way the two built-in service accounts work that aren’t really covered in the documentation. Specifically, a whole lot of other services run as either LocalService or NetworkService nowadays, and because they all share the same security context, they can be compromised as one unit. In other words, if you compromise one LocalService process, you can attack all other LocalService processes, because they are running under the same security context.

Think about that for a minute. That effectively means that the attack surface of any LocalService process can in some sense be considered the sum of the attack surface of all LocalService processes on the same computer. Moreover, that means that as you offload more and more services to run as LocalService, the problem gets worse. (Although, it’s still better than the situation when everybody ran as LocalSystem, certainly.)

Windows Vista improves on this a little bit; in Vista, LocalService and NetworkService processes do have a little bit of protection from each other, in that each service instance is assigned a unique SID that is marked as the owner of the process object (even though the process is running as LocalService or NetworkService). Furthermore, the default DACL for processes running as LocalService or NetworkService only grants access to administrators and the service-unique SID. This means that in Vista, one compromised LocalService process can’t simply use OpenProcess and WriteProcessMemory (or the like) to take complete control over another service process.

You can easily see this in action in the kernel debugger. Here’s what things look like in Vista:

kd> !process fffffa80022e0c10
PROCESS fffffa80022e0c10
    Token   fffff88001e3e060
kd> !token fffff88001e3e060
_TOKEN fffff88001e3e060
TS Session ID: 0
User: S-1-5-19
 10 S-1-5-5-0-107490
    Attributes - Mandatory Default Enabled Owner LogonId 
kd> !object fffffa80022e0c10
Object: fffffa80022e0c10  Type: (fffffa8000654840) Process
    ObjectHeader: fffffa80022e0be0 (old version)
    HandleCount: 5  PointerCount: 96
kd> dt nt!_OBJECT_HEADER fffffa80022e0be0
   +0x028 SecurityDescriptor : 0xfffff880`01e14c26 
kd> !sd 0xfffff880`01e14c20
->Revision: 0x1
->Dacl    : ->Ace[0]: ->AceType: ACCESS_ALLOWED_ACE_TYPE
->Dacl    : ->Ace[0]: ->AceFlags: 0x0
->Dacl    : ->Ace[0]: ->AceSize: 0x1c
->Dacl    : ->Ace[0]: ->Mask : 0x001fffff
->Dacl    : ->Ace[0]: ->SID: S-1-5-5-0-107490

->Dacl    : ->Ace[1]: ->AceType: ACCESS_ALLOWED_ACE_TYPE
->Dacl    : ->Ace[1]: ->AceFlags: 0x0
->Dacl    : ->Ace[1]: ->AceSize: 0x18
->Dacl    : ->Ace[1]: ->Mask : 0x00001400
->Dacl    : ->Ace[1]: ->SID: S-1-5-32-544

Looking at winnt.h, we can see that S-1-5-5-X-Y corresponds to a logon session SID. In Vista, each LocalService/NetworkService service process gets its own logon session SID.

By making the process owned by a different user than it is running as, and not allowing access to the user that the service is running as (but instead the logon session), the service is provided some measure of protection against processes in the same user context. This may not provide complete protection, though, as in general, any securable objects such as files or registry keys that contain an ACE matching against LocalService or NetworkService will be at the mercy of all such processes. To Microsoft’s credit, however, the default DACL in the token for such LocalService/NetworkService services doesn’t grant GenericAll to the user account for the service, but rather the service SID (another concept that is unique to Vista and future systems).

Furthermore, it seems that many of the ACLs that previously referred to LocalService/NetworkService are being transitioned to use service SIDs instead, which may over time make LocalService/NetworkService viable once again. Of course, that depends on all the third-party software in the world that makes security decisions based on those two SIDs being updated (hmm…), and on the remaining ACLs that still refer to the old generalized SIDs being cleaned up. (Check out AccessEnum from Sysinternals to see where such ACLs have slipped through the cracks in Vista – there are at least a couple of places in WinSxS that grant write access to LocalService or NetworkService on my machine, and that isn’t even considering the registry or the more ephemeral kernel object namespace yet.)

In Windows Server 2003, things are pretty bleak with respect to isolation between LocalService/NetworkService services; service processes can gain full access to each other, as shown by their default security descriptors. The default security descriptor doesn’t grant LocalService direct access, but because the owner field matches LocalService, one process can simply rewrite the DACL to grant itself access:

lkd> !process fffffadff39895c0 1
PROCESS fffffadff3990c20
    Token                             fffffa800132b9e0
lkd> !token fffffa800132b9e0
_TOKEN fffffa800132b9e0
TS Session ID: 0
User: S-1-5-19
 07 S-1-5-5-0-44685
    Attributes - Mandatory Default Enabled LogonId
lkd> !object fffffadff3990c20
Object: fffffadff3990c20  Type: (fffffadff4310a00) Process
    ObjectHeader: fffffadff3990bf0 (old version)
    HandleCount: 3  PointerCount: 21
lkd> dt nt!_OBJECT_HEADER fffffadff3990bf0
   +0x028 SecurityDescriptor : 0xfffffa80`011441ab 
lkd> !sd 0xfffffa80`011441a0
->Revision: 0x1
->Sbz1    : 0x0
->Control : 0x8004
->Owner   : S-1-5-19
->Dacl    : ->Ace[0]: ->AceType: ACCESS_ALLOWED_ACE_TYPE
->Dacl    : ->Ace[0]: ->AceFlags: 0x0
->Dacl    : ->Ace[0]: ->AceSize: 0x1c
->Dacl    : ->Ace[0]: ->Mask : 0x001f0fff
->Dacl    : ->Ace[0]: ->SID: S-1-5-5-0-44685

->Dacl    : ->Ace[1]: ->AceType: ACCESS_ALLOWED_ACE_TYPE
->Dacl    : ->Ace[1]: ->AceFlags: 0x0
->Dacl    : ->Ace[1]: ->AceSize: 0x14
->Dacl    : ->Ace[1]: ->Mask : 0x00100201
->Dacl    : ->Ace[1]: ->SID: S-1-5-18

Again looking at winnt.h, we clearly see that S-1-5-19 is LocalService – and here it is named as the owner of the process object. So, there is effectively no protection at all from one compromised LocalService process attacking another, at least in Windows Server 2003.

Note that if you are marked as the owner of an object, you can rewrite the DACL freely by requesting a handle with WRITE_DAC access and then modifying the DACL field with a function like SetKernelObjectSecurity. From there, all you need to do is re-request a handle with the desired access, after modifying the security descriptor to grant yourself said access. This is easy to verify experimentally by writing a test service that runs as LocalService and requesting WRITE_DAC in an OpenProcess call for another LocalService service process.

To make matters worse, nowadays most services run in shared svchost processes, which means that if one service in an svchost is compromised, the whole process is a write-off.

I would recommend seriously considering dedicated, unique user accounts for your services in certain scenarios as a result of this unpleasant mess. If you have a security-sensitive service that doesn’t need high privileges (i.e. it doesn’t require LocalSystem), it is often a mistake to just stuff it in with the rest of the LocalService or NetworkService services, given the vastly increased attack surface compared to running under a completely isolated user account – even if setting up a unique user account is a pain to do programmatically.

Note that although Vista attempts to mitigate this problem by ensuring that LocalService/NetworkService services cannot directly interfere with each other in the most obvious sense – opening each other’s processes and writing code into each other’s address spaces – this is really only a small measure of protection, because one LocalService process’s data files remain at the mercy of every other LocalService process out there. I think it would be extremely unwise to stake your system security on there being no way to compromise one LocalService process from another in Vista, even with its mitigations; it may be slightly more difficult, but I’d hardly write it off as impossible.

Given all of this, I would steer clear of NetworkService and LocalService for sensitive but unprivileged processes (and yes, I would consider such a thing a real scenario, as you don’t need to be a computer administrator to store valuable data on a computer; you just need there to not be an untrusted (or compromised) computer administrator on the box).

One thing I am actually kind of curious about is what the SWI rationale is for even allowing the svchost paradigm by default, given how it tends to multiply the attack surface of all svchost’d services (negatively, from the perspective of system security). Using svchosts completely blows away the security improvements Vista makes to LocalService/NetworkService, as far as I can tell. Even though some services are partitioned off in their own svchosts, there’s still one giant svchost group in Vista that has something on the order of 20 services in it (ugh!). Not to mention that svchosts make debugging a nightmare, but that’s a topic for another posting.

Update: Andrew Rogers pointed out that I originally posted the security descriptor for a LocalSystem process in the Windows Server 2003 example, instead of for a LocalService process. Whoops! It actually turns out that contrary to what I originally wrote, the DACL on LocalService processes on Windows Server 2003 doesn’t explicitly allow access to LocalService, but LocalService is still named as the object owner, so it is trivial to gain that access anyway, as previously mentioned (at least for Windows Server 2003).

Debugger tricks: Break on a specific Win32 last error value in Windows Vista

Tuesday, July 24th, 2007

Oftentimes, one type of problem that you might want to track down in a debugger (aside from a crash) is a particular function failing in a certain way. In the case of most Win32 functions, you’ll often get some sort of (hopefully meaningful) last error code. Sometimes you might need to know why that error is returned, or where it originated from (in the case of a last error value that is propagated up through several functions).

One way you might approach this is with a conditional breakpoint, but the SetLastError path is typically hit very frequently, so this is often problematic in terms of performance, even in user mode debugging on the local computer.

On Windows Vista, there is an undocumented hook inside of NTDLL (which is now responsible for the bulk of the logic behind SetLastError) that allows you to configure a program to break into the debugger when a particular error code is being set as the last error. This is new to Vista, and as it is not documented (at least not anywhere that I can see), it might not be around indefinitely.

For the moment, however, you can set ntdll!g_dwLastErrorToBreakOn to a non-zero value (via the ed command in the debugger) to ask NTDLL to execute a breakpoint when it sees that last error value being set. Obviously, this won’t catch things that modify the field in the TEB directly, but anything using SetLastError or RtlSetLastWin32Error will be checked against this value (in the context of the debuggee).

For example, you might see something like this if you ask NTDLL to break on error 5 (ERROR_ACCESS_DENIED) and then try to open a file or directory that you don’t have access to:

0:002> ed ntdll!g_dwLastErrorToBreakOn 5
0:002> g

[...] Perform an operation to cause ERROR_ACCESS_DENIED

(1864.2774): Break instruction exception
  - code 80000003 (first chance)
00000000`76d6fdf0 cc              int     3
0:004> k
Call Site
ntdll! ?? ::FNODOBFM::`string'+0x377b

(The debugger is slightly confused about symbol names in NTDLL due to the binary being reorganized into function chunks, but “ntdll! ?? ::FNODOBFM::`string’+0x377b” is part of ntdll!RtlSetLastWin32Error.)

Sometimes, it can be useful to add “debugger knobs” like this to your program that can be used to enable special diagnostics behavior that might be useful while debugging something. Several other components provide options like this; for example, there’s a global variable in NTDLL named ntdll!ShowSnaps that you can set to 1 in order to enable a large volume of debug print spew about the import resolution process as the loader resolves imported modules and symbols.
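A session using ShowSnaps might look something like this (the output shown is purely illustrative; the variable just needs to be set before the imports you care about are resolved, such as at the initial process breakpoint):

0:000> ed ntdll!ShowSnaps 1
0:000> g

[...] loader debug prints about snapping imports follow as modules are loaded [...]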

(Incidentally, debugger-settable global variables like ntdll!ShowSnaps are a good example of a correct way of using debug prints in release builds, though there are certainly many other good ways to do so.)

Update: Andrew Rogers points out that g_dwLastErrorToBreakOn existed on Srv03 as well, though it was resident in kernel32 (kernel32!g_dwLastErrorToBreakOn) and not NTDLL in that timeframe. When the last error logic was finally moved entirely to NTDLL in the Vista timeframe, the last error breakpoint hook moved with it.

Update: Pavel Lebedinsky points out that I neglected to mention that because the internal BaseSetLastNTError routine in kernel32 on Srv03 doesn’t go through kernel32!SetLastError, the functionality available in Srv03 is generally much less useful (it only catches things external to kernel32) than in Vista. To be clear, that omission is my fault for not making the point known, not Andrew’s for getting it wrong.

An introduction to DbgPrintEx (and why it isn’t an excuse to leave DbgPrints on by default in release builds)

Saturday, July 21st, 2007

One of the things that was changed around the Windows XP era or so in the driver development world was the introduction of the DbgPrintEx routine. This routine was introduced to combat the problem of debug spew from all sorts of different drivers running together by allowing debug prints to be filtered by a “component id”, which is supposed to be unique per class of driver. By allowing a user to filter which debug prints are being displayed by driver, on the host-side instead of the debugger side, a better debugging experience can be provided when the system is fairly “crowded” as far as debug prints go. This is especially common on checked builds of the operating system.

Additionally, DbgPrintEx provides an additional mechanism to filter output host-side – a severity level. The way the filtering system works is that each component has an associated set of allowed severity levels, such that only a message with a severity level that is “allowed” will actually be transmitted to the debugger when the DbgPrintEx call is made. This allows debug prints to be filtered on a (component, severity) basis, such that especially verbose debug prints can be turned off at runtime without requiring a rebuild or patching the binary on the fly (which might be a problem if the binary in question is the kernel itself and you’re running a 64-bit build). In order to edit the allowed severity of a component at runtime, you typically modify one of the nt!Kd_<component>_Mask global variables in the kernel with the debugger, setting the global corresponding to the desired component to the desired severity mask.
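For example, to enable all four standard severities for the “IHV driver” component class from the kernel debugger (the error, warning, trace, and info levels correspond to mask bits 0x1 through 0x8, so 0xF enables all of them; substitute the component name applicable to your driver):

kd> ed nt!Kd_IHVDRIVER_Mask 0xF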

With respect to older drivers, the old DbgPrint call still works, but it is essentially repackaged into a DbgPrintEx call, with hardcoded (default) component and severity values. This means that you’ll still be able to get output from DbgPrint, but (obviously) you can’t take advantage of the extra host-side filtering. Host-side filtering is much preferable to .ofilter-style filtering (which occurs debugger-side), as debug prints that are transferred “over the wire” to the debugger incur a significant performance penalty – the system is essentially suspended while the transfer occurs, and if you have a large amount of debug print spew, this can quickly make the system unusably slow.

Windows Vista takes this whole mechanism one step further and turns off plain DbgPrint calls by default, by setting the default severity level for the DbgPrint-assigned component value such that DbgPrint calls are not transmitted to the debugger. This can be overridden at runtime by modifying Kd_DEFAULT_Mask, but it’s an extra step that must be taken (and one that may be confusing if you don’t know about the default behavior change in Vista, as your debug prints will seemingly just never work).
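For instance, to re-enable classic DbgPrint output on Vista from the kernel debugger (DbgPrint maps to the DEFAULT component, so raising its severity mask suffices; 0xF shown here to allow all four standard severities):

kd> ed nt!Kd_DEFAULT_Mask 0xF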

However, just because DbgPrintEx now provides a way to filter debug prints host-side doesn’t mean that you can just go and turn on all your debug prints by default in release builds. Among other things, it’s possible that someone else is using the same component id as your driver (remember, component ids are for classes of drivers, such as “network driver” or “video driver”). Furthermore, DbgPrintEx calls still incur some overhead even if they aren’t transmitted to the debugger on the other end (although as long as the debug print is masked off by the allowed-severities mask for your component, the overhead is fairly minimal).

Still, the problem of limited component ids is significant enough that you don’t want to leave debug prints on unconditionally; otherwise, anyone else with the same component id who wants to debug their driver will have all of your debug print spew mixed in with their own.

Also, there is an option to turn on all debug prints, which can sometimes come in handy – and if every driver in the system has debug prints on by default, turning it on often results in a lot of badness. This is accomplished by specifying a global severity mask via the nt!Kd_WIN2000_Mask global, which is checked before any component-specific masks. (If Kd_WIN2000_Mask allows the debug print, it is short-circuited to being allowed without considering component-specific masks. This makes things easier if you want to grab certain severities of debug print messages from many components at the same time, without having to go and manually poke around with severity levels on every component you’re interested in.)
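As an illustration, to see error-severity messages from every component regardless of the per-component masks, and to later restore the default behavior, something like the following should work:

kd> ed nt!Kd_WIN2000_Mask 0x1

[...] error-level debug prints from all components are now transmitted [...]

kd> ed nt!Kd_WIN2000_Mask 0x0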

Unfortunately, this is already a problem on Vista RTM (even free builds) – there are a couple of in-box Microsoft drivers that are guilty of this sort of thing, making Kd_WIN2000_Mask less than useful on Vista. Specifically, cdrom.sys likes to print debug messages like this every second:

Will not retry; Sense/ASC/ASCQ of 02/3a/00

That’s hardly the worst of it, though. Try starting a manifested program on Vista with all debug print severities turned on in Kd_WIN2000_Mask and you’ll get pages of debug spew (no, I’m not kidding – try it with iexplore.exe and you’ll see what I mean). In that respect, shame on the SxS team for setting a bad example with polluting the world with debug prints that are useless to most of us (and to a lesser extent, cdrom.sys also gets a “black lump of coal”, so to speak). Maybe these two components will be fixed for Srv08 RTM or Vista SP1 RTM, if we’re lucky.

So, take this opportunity to check and make sure that none of your drivers ship with DbgPrints turned on by default – even if you do happen to use DbgPrintEx. Somebody who has to debug a problem on a computer where your driver is installed will be all the happier as a result.

Debugger tricks: API call logging, the quick’n’dirty way (part 3)

Friday, July 20th, 2007

Previously, I introduced several ways to use the debugger to log API calls. Beyond what was described in that article, there are some other, more complicated examples that are worth reviewing. Additionally, there are certain limitations that should be considered when using the debugger instead of a dedicated API logging program.

Although logging breakpoints like the ones I’ve previously described (i.e. displaying function input parameters and return values) are certainly handy, you’ve probably already come up with a couple of scenarios where breakpoints in that style won’t give you what you need to track down a problem.

The most notable example of this is when you need to examine an out parameter that is filled in by a function call, after the function call is made. This poses a problem, as it’s generally not reliable to access the function parameters on the stack after the function has returned (in both stack- and register-based calling conventions in use on Windows, the called function is free to modify the parameter locations as it sees fit, and this is actually fairly common with optimizations enabled). As a result, what we really need is the ability to save some state across the function call, so that we can access some of the function’s arguments after the function returns.

Fortunately, this is doable within the debugger, albeit in a rather roundabout way. The key here is the usage of so-called user-defined pseudo-registers, which are conceptually extra platform-independent storage locations (accessed like regular registers in terms of the expression evaluator, hence the term pseudo-register). These pseudo-registers are essentially just variables in the conventional programming sense, although there are a limited number of them available (20 in the current release). As a result, there are some limitations on what can be accomplished using them, but for most circumstances, 20 is enough. If you find yourself needing to track more state than that, you should strongly consider writing a debugger extension in C instead of using the debugger script language.
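To illustrate the basic mechanics before getting to a full logging breakpoint, pseudo-registers are assigned with the r command and can then be used anywhere the expression evaluator accepts a register (the stack offset and output here are illustrative):

0:000> r @$t0 = poi(@esp+8)
0:000> ? @$t0

[...] value saved in @$t0 is displayed [...]

0:000> db @$t0 l10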

(As an aside, at Driver DevCon a couple of years ago, I remember sitting in on a WinDbg-oriented session in which the presenter at one point went over a large program written in the (then-relatively-new) expanded debugger scripting language, with its additional support for conditionals and error handling. I still can’t help but think of debugger-script programs as combining the ugliest parts of Perl with cmd.exe-style batch scripts (although, to be fair, the debugger expression evaluator is a bit more powerful than batch scripts, and it was never originally intended to be used for more than simple expressions). To be honest, I would still strongly recommend against writing highly complex debugger-script programs where possible; they are something of a maintenance nightmare, among other things. For such circumstances, writing a debugger extension (or a program to drive the debugger entirely) is a better choice. I digress, however; back to the subject of call logging.)

The debugger’s user-defined pseudo-register facility provides an effective (if perhaps slightly awkward) means of storing state, and this can be used to save parameter values across a function call. For example, we might want to log all calls to ReadFile, such that we get a dump of the file data being read in. To accomplish this task, we’ll need to dump the contents of the output buffer (and use the bytes-transferred count, another out parameter). This could be accomplished like so (in this case, for brevity, I am assuming that the program is using ReadFile in synchronous I/O mode):

0:000> bp kernel32!ReadFile "r @$t0 = poi(@esp+8) ; r @$t1 = poi(@esp+10) ; g @$ra ; .if (@eax != 0) { .printf \"Read %lu bytes: \\n\", dwo(@$t1) ; db @$t0 ldwo(@$t1) } .else { .echo Read failed! ; !gle } ; g "

The output of this command might be like so:

Read 22 bytes: 
0016ec3c  54 68 69 73 20 69 73 20-61 20 74 65 78 74 20 66
              This is a text f
0016ec4c  69 6c 65 2e 0d 0a

(Awkward wrapping done by me to avoid breaking the blog layout.)

This command is essentially a logical extension of yesterday’s example, with the addition of some state that is shared across the call. Specifically, the @$t0 and @$t1 user-defined pseudo-registers are used to save the lpBuffer ([esp+08h]) and lpNumberOfBytesRead ([esp+10h]) arguments to the ReadFile call across the function’s execution. When execution is stopped at the return address, the contents of the file data that were just read are dumped by dereferencing the values referred to by @$t0 and @$t1.

Although this sort of state-saving across execution can be useful, there are downsides. Firstly, this sort of breakpoint is fundamentally incompatible with multiple threads (at least insofar as multiple threads hit the breakpoint in question simultaneously). This is because the debugger provides no provision for “expression-local” or “thread-local” state – multiple threads hitting the breakpoint at the same time can step on each other’s toes, so to speak. (This problem can also occur with any sort of breakpoint that resumes execution until an implicit breakpoint created by a “g <address>” command, although it is arguably more severe with “stateful” breakpoints.)

This limitation in the debugger can be worked around in a limited fashion by making a breakpoint thread-specific (for example, by checking the current thread id in the breakpoint command), although this is typically hardly convenient to do. Many call logging programs account for multithreading natively and do not require any special work to accommodate multithreaded function calls. (Note that this problem is often not as severe as it might sound – in many cases, even in multithreaded programs, there is typically only one function that calls the function you’re interested in, or the likelihood of a thread collision is sufficiently small that the breakpoint works anyway the vast majority of the time. However, in some circumstances, this style of breakpoint just does not work well, namely when the function in question is called frequently from many threads and requires inspection of data after it returns.)
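One way to sketch such a thread-restricted breakpoint is to compare the @$tid pseudo-register (the current thread id) in the breakpoint command, so that only the thread of interest runs the stateful logging logic (the thread id value here is hypothetical; other threads simply resume via gc):

0:000> bp kernel32!ReadFile ".if (@$tid == 0x1a2c) { r @$t0 = poi(@esp+8) ; g @$ra ; db @$t0 ; g } .else { gc }"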

Another significant limitation of using the debugger to do call logging (as opposed to a dedicated program) is speed: the debugger is typically very slow compared to a dedicated logging program. The reason is that for every breakpoint event, essentially all threads in the program are frozen, various state information is copied from the program to the debugger, and then the breakpoint expression is evaluated debugger-side. Additionally, unlike with a dedicated program, the results of the logging breakpoint are displayed in real time, instead of (say) being stored in a binary log buffer somewhere for later formatting and display. This means that even more overhead is incurred as the debugger UI needs to be updated on every breakpoint. As a result, if you set a conditional breakpoint on a frequently hit function, you may notice the program slow down significantly, perhaps even to the point of being unusable. Dedicated logging programs can employ a variety of techniques to circumvent these limitations, which are primarily artifacts of the fact that the debugger is designed to be a debugger and not a high-speed API monitor.

This is even more noticeable in the kernel debugger case, as transitions to the debugger in KD mode are very slow – even several transitions per second is enough to make a system all but unusable in practical terms. As a result, one needs to be extra careful in picking locations to set conditional logging breakpoints in the kernel debugger (perhaps placing them in the middle of a function, on a specific interesting code path, rather than at the start where all calls will be caught).
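As a hypothetical kernel-mode example, rather than breaking on every call to a driver routine, one might disassemble the routine, locate the rarely-taken path of interest, and breakpoint only that (the module name and offset below are made up for illustration):

kd> u mydriver!MyDispatchRoutine mydriver!MyDispatchRoutine+0x60

[...] locate the branch leading to the interesting code path [...]

kd> bp mydriver!MyDispatchRoutine+0x4d ".printf \"rare path hit\\n\" ; g"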

Given these limitations, it is worth doing a bit of analysis on the problem to determine if the debugger or a dedicated logging program is the best choice. Both approaches have strengths and weaknesses, and although the debugger is extremely flexible (and often very convenient), it isn’t necessarily the best choice in every conceivable scenario. In other words, use the best tool for the job. However, there are some circumstances where the only option is to use the debugger, such as kernel-mode call logging, so I would recommend at least having some basic knowledge of how to accomplish logging tasks with the debugger, even if you would normally always use a dedicated logging program. (Although, in the case of kernel mode debugging, again, the slowness of debugger transitions makes it important to pick “low-traffic” locations to breakpoint on.)

Still, an important part of being effective at debugging and solving problems is knowing your options and when (and when not) to use them. Using the debugger to perform call logging should just be one of many such options in your “debugging toolkit”.

Debugger tricks: API call logging, the quick’n’dirty way (part 2)

Thursday, July 19th, 2007

Last time, I put forth the notion that WinDbg (and the other DTW debuggers) are a fairly decent choice for API call logging. This article expands on just how to do this sort of logging via the debugger, by starting out with a simple logging breakpoint and expanding on it to be more intelligent.

As previously mentioned, it is really not all that difficult to use the debugger to perform call logging. The basic idea involved is to just set a “conditional” breakpoint (e.g. via the bp command) at the start of a function you’re interested in. From there, the breakpoint can have commands to display input parameters. However, you can also get a bit more clever in some scenarios (e.g. displaying return values, values in output parameters, and the like), although there are some limitations to this that may or may not be a problem based on the characteristics of the program that you’re debugging.

To give a simple example of what I mean, there’s the classic “show all files opened via Win32 CreateFile as they are opened”. In order to do this, the way to go is to set a breakpoint on kernel32!CreateFileW. (Remember that most of the “A” Win32 APIs thunk to the “W” APIs, so you can often set a breakpoint on just the “W” version to get both. Of course, this is not always true (and some bizarre APIs like WinInet actually thunk “W” to “A”), but as a general rule of thumb, it’s more often the case than not.) The kernel32 breakpoint needs to be imbued with the knowledge of how to display the first argument based on the calling convention of the routine in question. Since CreateFile is __stdcall, that would be [esp+4] (for x86), or rcx (for x64).

At its most basic, the breakpoint command might look like so:

0:001> bp kernel32!CreateFileW "du poi(@esp+4) ; gc"

(Note that the gc command is similar to g, except that it is designed especially for use in conditional breakpoints. If you trace into a breakpoint that controls execution with gc, it will resume executing the same way the user was controlling the program instead of unconditionally resuming normally. The difference between a breakpoint using g and one using gc is that if you trace into a gc breakpoint, you’ll trace to the next instruction, whereas if you trace into a g breakpoint, control will resume full speed and you’ll lose your place.)

The debugger output for this breakpoint (when hit) lists the names passed to kernel32!CreateFileW, like so (if I were setting this breakpoint in cmd.exe, and then did “type C:\readme.txt”, this might come up in the debugger output):

00657ff0  "C:\\readme.txt"

(Note that as the breakpoint displays the string passed to the function, it will be a relative path if the program uses the relative path.)

Of course, we can do slightly more complicated things as well. For instance, it might be a good idea to display the returned handle and the last error code. This could be done by having the breakpoint go to the return point of the function after it dumps the first parameter, and then display the additional information. To do this, we might use the following breakpoint:

0:001> bp kernel32!CreateFileW "du poi(@esp+4) ; g @$ra ; !handle @eax f ; !gle ; g"

The gist of this breakpoint is to display the returned handle (and last error status) after the function returns. This is accomplished by directing the debugger to resume execution until the return address is hit, and then operate on the return value (!handle @eax f) and last error status (!gle). (The @$ra symbol is a pseudo-register that refers to the current function’s return address in a platform-independent fashion. Essentially, the g @$ra command runs the program until the return address is hit.)

The output from this breakpoint might be like so:

0016f0f0  "coffbase.txt"
Handle 60
  Type         	File
  Attributes   	0
  GrantedAccess	0x120089:
  HandleCount  	2
  PointerCount 	3
  No Object Specific Information available
LastErrorValue: (Win32) 0 (0) - The operation
  completed successfully.
LastStatusValue: (NTSTATUS) 0 - STATUS_WAIT_0

However, if we fail to open the file, the results are less than ideal:

00657ff0  "c:\\readme.txt"
Handle 4
  Type         	Directory
[...] enumeration of all handles follows [...]
21 Handles
Type           	Count
Event          	3
File           	4
Directory      	3
Mutant         	1
WindowStation  	1
Semaphore      	2
Key            	6
Thread         	1
LastErrorValue: (Win32) 0x2 (2) - The system
  cannot find the file specified.
LastStatusValue: (NTSTATUS) 0xc0000034 -
  Object Name not found.

What went wrong? Well, the !handle command expanded into essentially “!handle -1 f”, since CreateFile returned INVALID_HANDLE_VALUE (-1). This mode of the !handle extension enumerates all handles in the process, which isn’t what we want. However, with a bit of cleverness, we can improve upon this. A second take at the breakpoint might look like so:

0:001> bp kernel32!CreateFileW "du poi(@esp+4) ; g @$ra ; .if (@eax != -1) { .printf \"Opened handle %p\\n\", @eax ; !handle @eax f } .else { .echo Failed to open file, error: ; !gle } ; g"

Although that command might appear a bit intimidating at first, it’s actually fairly straightforward. Like the previous version of this breakpoint, the essence of what it accomplishes is to display the filename passed to kernel32!CreateFileW, and then resume execution until CreateFile returns. Then, depending on whether the function returned INVALID_HANDLE_VALUE (-1), either the handle or the last error status is displayed. The output from the improved breakpoint might be something like this (with an example of successfully opening a file and then failing to open a file):


0016f0f0  "coffbase.txt"
Opened handle 00000060
Handle 60
  Type         	File
  Attributes   	0
  GrantedAccess	0x120089:
  HandleCount  	2
  PointerCount 	3
  No Object Specific Information available


00657ff0  "C:\\readme.txt"
Failed to open file, error:
LastErrorValue: (Win32) 0x2 (2) - The system
  cannot find the file specified.
LastStatusValue: (NTSTATUS) 0xc0000034 -
  Object Name not found.

Much better. A bit of intelligence in the breakpoint allowed us to skip the undesirable behavior of dumping the entire process handle table in the failure case, and we could even skip displaying the last error code in the success case.
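If you want to keep the logged output around, the debugger’s built-in logging commands can capture everything the breakpoints display to a file. For instance (the log file path here is just an example):

```
0:001> .logopen c:\temp\createfile.log
0:001> bp kernel32!CreateFileW "du poi(@esp+4) ; gc"
0:001> g
...
0:001> .logclose
```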

As one can probably imagine, there’s a whole other range of possibilities here once one considers the flexibility offered by conditional breakpoints. However, there are some downsides to this approach that must be considered as well. More on a couple of other more advanced conditional breakpoints for logging purposes in a future posting (as well as a careful look at some of the limitations and disadvantages of using the debugger instead of a specialized program, and some “gotchas” you might run into with this sort of approach).

Debugger tricks: API call logging, the quick’n’dirty way (part 1)

Wednesday, July 18th, 2007

One common task that you may be faced with while debugging a problem is to log information about calls to a function or functions. While if you want to know about a function in your program that you have source code to, you could often just add some sort of debug print and rebuild the program, sometimes this isn’t practical. For example, you might not always be able to reproduce a problem, and so it might not be viable to have to restart with a debug-ified build because you might blow away your repro. Or, more importantly, you might need to log calls to functions that you don’t have source code to (or aren’t building as part of your program, or otherwise don’t want to modify).

For example, you might want to log calls to various Windows APIs in order to gain information about a problem that you are troubleshooting. Now, depending on what you’re doing, you might be able to do this by adding debug prints before and after every single call to the particular API. However, this is often less than convenient, and if you aren’t the immediate caller of the function you want to log, then you’re not going to be able to take that route anyway.

There are a number of API spy/API logging packages out there (and the Debugging Tools for Windows distribution even ships with one, called Logger, though it tends to be fairly fragile – personally, I’ve had it crash out on me more often than I’ve had it actually work). Although you might be able to use one of those, a big limitation of “shrink-wrapped” logging tools is that they won’t know how to properly log calls to custom functions, or functions that are otherwise not known to the logging tool. The better logging tools out there are user-extensible to a certain extent, in that they typically provide some sort of scripting or programming language that allows the user (i.e. you) to describe function parameters and calling conventions, so that they can be logged.

However, it can often be difficult (or even impossible) to describe many types of functions to these tools – such as functions that take pointers to structures that contain pointers to other structures, or other such non-trivial constructs. As a result, I tend to recommend against so-called “shrink-wrapped” API logging tools in most circumstances where I want to log calls to functions.

In the event that it’s not a feasible solution to implement debug prints in source code, though, it would appear on the surface that this leaves one without a usable solution for logging calls. Not so, in fact – it turns out that with some careful use of so-called “conditional breakpoints”, you can often use the debugger (e.g. WinDbg/ntsd/cdb/kd, which is what I shall be referring to for the rest of this article) to provide this sort of call logging. Using the debugger has many advantages; for instance, you can do this sort of API logging “on the fly”, and in situations where you can attach the debugger after the process has started, you don’t even need to start the program specially in order to log it. Even better, however, is that the debugger has extensive support for displaying data in meaningful forms to the user.

If you think about it, displaying data to the user is one of the principal functions of the debugger, in fact. It’s also one of the major reasons why the debugger is highly extensible via extensions, such that complicated data structures can be displayed and interpreted in a meaningful fashion. By using the debugger to perform your API logging, you can take advantage of the rich functionality for displaying data that is already baked into the debugger (and its extensions, and even any custom extensions of your own that you have written) to double as a call logging facility.

Even better, because the debugger can read and display many data types in a meaningful fashion based off of symbol files (if you have private symbols, such as for programs you compile or provide), for data types that don’t have specific debugger extensions for displaying them (like !handle, !error (for error codes), !devobj, and so forth), you can often utilize the debugger’s ability to format data based off of type information in symbols. This is typically done via the dt command, and often provides a workable display for most custom data types without having to do any sort of complicated “training” like you might have to do with a logging program. (Some data structures, such as trees and lists, may need more intelligence than what is provided in dt for displaying all parts of the data structure. This is typically true for “container” data types, although even for those types, you can still often use dt to display actual members within the container in a meaningful fashion.) Utilizing the information contained within symbol files (via the debugger) for API logging also frees you from having to ensure that your logging program’s definitions for all of your structures and other types are in sync with the program you are debugging, as the debugger automagically receives the correct definitions based on symbols (and if you are using a symbol server that includes indexed versions of your own internal symbols, the debugger will even be able to find the symbols on its own).
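For instance, a dt-based logging breakpoint might look like the following (myapp!ProcessRecord and myapp!_MY_RECORD are made-up names standing in for whatever function and structure type appear in your own symbols; x86 __stdcall is assumed, as in a typical Win32 program):

```
0:000> bp myapp!ProcessRecord "dt myapp!_MY_RECORD poi(@esp+4) ; gc"
```

Every time the function is called, this dumps the structure pointed to by the first argument, formatted from the type information in the symbol file, with no “training” of a logging tool required.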

Another plus to this approach is that, provided you are reasonably familiar with the debugger, you probably won’t have to learn a new description language like you might if you were using an API logging program. This is because you’re probably already familiar with many of the commands the debugger makes available for displaying data, from every-day debugger usage. (Even if you aren’t all that familiar with the debugger, there is extensive documentation that ships with the debugger by default which describes how to format and display data via various debugger commands. Additionally, there are many examples describing how to use most of the important or useful debugger commands out there on the Internet.)

Okay, enough about why you might want to consider using the debugger to perform call logging. Next time, a quick look and walkthrough describing how you can do this (it’s really quite simple, as alluded to previously), along with some caveats and gotchas that you might want to watch out for along the way.

Silly debugger tricks: Using KD to reset a forgotten administrator password

Wednesday, July 11th, 2007

One particularly annoying occurrence that’s happened to me on a couple of occasions is losing the password to a long-forgotten test VM that I need to thaw for some reason or another, months after I last used it. (If you follow good password practices and use differing passwords for accounts, you might find yourself in this position.)

Normally, you’re kind of sunk if you’re in this position, which is the whole idea – no administrator password is no administrative access to the box, right?

The officially supported solution in this case, assuming you don’t have a password reset disk (does anyone actually use those?) is to reformat. Oh, what fun that is, especially if you just need to grab something off of a test system and be done with it in a few minutes.

Well, with physical access (or the equivalent if the box is a VM), you can do a bit better with the kernel debugger. It’s a bit embarrassing having to “hack” (and I use that term very loosely) into your own VM because you don’t remember which throwaway password you used 6 months ago, but it beats waiting around for a reformat (and in the case of a throwaway test VM, it’s probably not worth the effort anyway compared to cloning a new one, unless there was something important on the drive).

(Note that as far as security models go, I don’t really think that this is much of a security risk. After all, to use the kernel debugger, you need physical access to the system, and if you have that much, you could always just use a boot CD, swap out hard drives, or a thousand other different things. This is just more convenient if you’ve got a serial cable and a second box with a serial port, say a laptop, and you just want to reset the password for an account on an existing install.)

This is, however, perhaps an instructive reminder in how much access the kernel debugger gives you over a system – namely, the ability to do whatever you want, like bypass password authentication.

The basic idea behind this trick is to use the debugger to disable the password check used at interactive logon inside LSA.

The first step is to locate the LSA process. The typical way to do this is to use the !process 0 0 command and look for a process name of LSASS.exe. The next step requires that we know the EPROCESS value for LSA, hence the enumeration. For instance:

kd> !process 0 0
PROCESS fffffa80006540d0
    SessionId: none  Cid: 0004    Peb: 00000000
      ParentCid: 0000
    DirBase: 00124000  ObjectTable: fffff88000000080
      HandleCount: 545.
    Image: System
PROCESS fffffa8001a893a0
    SessionId: 0  Cid: 025c    Peb: 7fffffda000
     ParentCid: 01ec
    DirBase: 0cf3e000  ObjectTable: fffff88001b99d90
     HandleCount: 822.
    Image: lsass.exe

Now that we’ve got the LSASS EPROCESS value, the next step is to switch to it as the active process. This is necessary as we’re going to need to set a conditional breakpoint in the context of LSA’s address space. For this task, we’ll use the .process /p /r eprocess-pointer command, which changes the debugger’s process context and reloads user mode symbols.

kd> .process /p /r fffffa8001a893a0
Implicit process is now fffffa80`01a893a0
.cache forcedecodeuser done
Loading User Symbols

Next, we set up a breakpoint on a particular internal LSA function that is used to determine whether a given password is accepted for a local account logon. The breakpoint changes the function to always return TRUE, such that all local account logons will succeed if they get to the point of a password check. After that, execution is resumed.

kd> ba e1 msv1_0!MsvpPasswordValidate
   "g @$ra ; r @al = 1 ; g"
kd> g

We can dissect this breakpoint to understand better just what it is doing:

  • Set a break on execute hardware breakpoint on msv1_0!MsvpPasswordValidate. Why did I use a hardware breakpoint? Well, they’re generally more reliable when doing user mode breakpoints from the kernel debugger, especially if what you’re setting a breakpoint on might be paged out. (Normal breakpoints require overwriting an instruction with an “int 3”, whereas a hardware breakpoint simply programs an address into the processor such that it’ll trap if that address is accessed for read/write/execute, depending on the breakpoint type.)
  • The breakpoint has a condition (or command) attached to it. Specifically, this command runs the target until it returns from the current function (“g @$ra” continues the target until the return address is hit. @$ra is a special platform-independent pseudo-register that refers to the return address of the current function.) Once the function has returned, the al register is set to 1 and execution is resumed. This function returns a BOOLEAN value (in other words, an 8-bit value), which is stored in al (the low 8 bits of the eax or rax register, depending on whether you’re on x86 or x64). IA64 targets don’t store return values in this fashion, and so the breakpoint is x86/x64-specific.

Now, log on to the console. Make sure to use a local account and not a domain account, so the authentication is processed by the Msv1_0 package. Also, non-console logons might not run through the Msv1_0 package, and may not be affected. (For example, Network Level Authentication (NLA) for RDP in Vista/Srv08 doesn’t seem to use Msv1_0, even for local accounts. The console will still allow you to log in, however.)

From there, you can simply reset the password for your account via the Computer Management console. Be warned that this will wipe out EFS keys and the like, however. To restore password checking to normal, either reboot the box without the kernel debugger, or use the bc* command to disable the breakpoint you set.
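For example, restoring normal password checking might look like so (bl lists the current breakpoints with their numbers; the 0 here assumes the password-check breakpoint was the only one set):

```
kd> bl
kd> bc 0
```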

(For the record, I can’t really take credit for coming up with this trick, but it’s certainly one I’ve found handy in a number of scenarios.)

Now, one thing that you might take away from this article, from a security standpoint, is that it is important to provide physical security for critical computers. To be honest, if someone really wants to access a box they have physical access to, this is probably not even the easiest way; it would be simpler to just pop in a bootable CD or floppy and load a different operating system. As a result, as previously mentioned, I wouldn’t exactly consider this a security hole, as it already requires you to have physical access in order to be effective. It is, however, a handy way to reset passwords for your own computers or VMs in a pinch if you happen to know a little bit about the debugger. Conversely, it’s not really a supported “solution” (more of a giant hack at best), so use it with care (and don’t expect PSS to bail you out if you break something by poking around in the kernel debugger). It may break without warning on future OS versions (and there are many cases that won’t be caught by this trick, such as domain accounts that use the Kerberos provider to process authentication).

Update: I forgot to mention the very important fact that you can turn on the kernel debugger from the “F8” boot menu when booting the system, even if you don’t have kernel debugging enabled in the boot configuration or boot.ini. This will enable kernel debugging on the highest numbered COM port, at 19200bps. (On Windows Vista, this also seems capable of auto-selecting 1394 if your machine had a 1394 port, if memory serves. I don’t know offhand whether that translates to downlevel platforms, though.)

The No-Execute hall of shame…

Friday, June 15th, 2007

One of the things that I do when I set up Windows on a new (modern) box that I am going to use for more than just temporary testing is to enable no-execute (“NX”) support by default. Preferably, I use NX in “always on” mode, but sometimes I use “opt-out” mode if I have to. It is possible to bypass NX in opt-out (or opt-in) mode with a “ret2libc”-style attack, which diminishes the security gain in an unfortunately non-trivial way in many cases. For those unclear on what “alwayson”, “optin”, “optout”, and “alwaysoff” mean in terms of Windows NX support, they describe how NX is applied across processes. Alwayson and alwaysoff are fairly straightforward; they mean that all processes either have NX forced or forced off, unconditionally. Opt-in mode means that only programs that mark themselves as NX-aware have NX enabled (or those that the administrator has configured in Control Panel), and opt-out mode means that all programs have NX applied unless the administrator has explicitly excluded them in Control Panel.

(Incidentally, the reason why the Windows implementation of NX was vulnerable to a ret2libc style attack in the first place was for support for extremely poorly written copy protection schemes, of all things. Specifically, there is code baked into the loader in NTDLL to detect SecuROM and SafeDisc modules being loaded on-the-fly, for purposes of automagically turning off NX for these security-challenged copy protection mechanisms. As a result of this, exploit code can effectively do the same thing that the user mode loader does in order to disable NX, before returning to the actual exploit code (see the Uninformed article for more details on how this works “under the hood”). I really just love how end user security is compromised for DRM/copy protection mechanisms as a result of that. Such is the price of backwards compatibility with shoddy code, and the myriad of games using such poorly written copy protection systems…)

As a result of this consequence of “opt-out” mode, I would recommend forcing NX to “always on” if possible (as previously mentioned). To do this, you typically need to edit boot.ini to enable “always on” mode (/noexecute=alwayson). On systems that use BCD boot databases, e.g. Vista or later, you can use this command to achieve the same effect as editing boot.ini on downlevel systems:

bcdedit /set {current} nx alwayson
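On a downlevel (boot.ini-based) system, the resulting OS entry would end up looking something like this (the ARC path and description shown are just a typical example; only the /noexecute switch matters here):

```
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003" /noexecute=alwayson /fastdetect
```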

However, I’ve found that you can’t always do that, especially on client systems, depending on what applications you run. There are still a lot of things out there which break with NX enabled, unfortunately, and “alwayson” mode means that you can’t exempt applications from NX. Unfortunately, even relatively new programs sometimes fall into this category. Here’s a list of some of the things I’ve run into that blow up with NX turned on by default; the “No-Execute Hall of Shame”, as I like to call it, since not supporting NX is definitely very lame from a security perspective – even more so when it requires you to switch from “Alwayson” to “Optout” (so you can use the NX exclusion list in Control Panel), which is in and of itself a security risk, even for unrelated programs that do have NX enabled by default. This is hardly an exclusive list, but just some things that I have had to work around in my own personal experience.

  1. Sun’s Java plug-in for Internet Explorer. (Note that to enable NX for IE if you are in “Optout” mode, you need to go grovel around in IE’s security options, as IE disables NX for itself by default if it can – it opts out, in other words). Even with the latest version of Java (“Java Platform 6 update 1”, which also seems to be known as 1.6.0), it’s “bombs away” for IE as soon as you try to go to a Java-enabled webpage if you have Sun’s Java plugin installed.

    Personally, I think that is pretty inexcusable; it’s not like Sun is new to JIT code generation or anything (or that Java is anything like the “new kid on the block”), and it’s been the recommended practice for a very long time to ensure that code you are going to execute is marked executable. It’s even been enforced by hardware for quite some time now on x86. Furthermore, IE (and Java) are absolutely “high priority” targets for exploits on end users (arguably, IE exploits on unpatched systems are one of the most common delivery mechanisms for malware), and preventing IE from running with NX is clearly not a good thing at all from a security standpoint. Here’s what you’ll see if you try to use Java for IE with NX enabled (which takes a bit of work, as previously noted, in the default configuration on client systems):

    (106c.1074): Access violation - code c0000005 (first chance)
    First chance exceptions are reported before any
    exception handling.
    This exception may be expected and handled.
    eax=6d4a4a79 ebx=00070c5c ecx=0358ea98
    edx=76f10f34 esi=0ab65b0c edi=04da3100
    eip=04da3100 esp=0358ead4 ebp=0358eb1c
    iopl=0         nv up ei pl nz na pe nc
    cs=001b  ss=0023  ds=0023  es=0023  fs=003b
    gs=0000             efl=00210206
    04da3100 c74424040c5bb60a mov dword ptr [esp+4],0AB65B0Ch
    0:005> k
    ChildEBP RetAddr  
    WARNING: Frame IP not in any known module. Following
    frames may be wrong.
    0358ead0 6d4a4abd 0x4da3100
    0358eb1c 76e31ae8 jpiexp+0x4abd
    0358eb94 76e31c03 USER32!UserCallWinProcCheckWow+0x14b
    0:005> !vprot @eip
    BaseAddress:       04da3000
    AllocationBase:    04c90000
    AllocationProtect: 00000004  PAGE_READWRITE
    RegionSize:        000ed000
    State:             00001000  MEM_COMMIT
    Protect:           00000004  PAGE_READWRITE
    Type:              00020000  MEM_PRIVATE
    0:005> .exr -1
    ExceptionAddress: 04da3100
       ExceptionCode: c0000005 (Access violation)
      ExceptionFlags: 00000000
    NumberParameters: 2
       Parameter[0]: 00000008
       Parameter[1]: 04da3100
    Attempt to execute non-executable address 04da3100
    0:005> lmvm jpiexp
    start    end        module name
        CompanyName:  JavaSoft / Sun Microsystems
        ProductName:  JavaSoft / Sun Microsystems
                      -- Java(TM) Plug-in
        InternalName: Java Plug-in für Internet Explorer

    (Yes, the internal name is in German. No, I didn’t install a German-localized build, either. Perhaps whoever makes the public Java for IE releases lives in Germany.)

  2. NVIDIA’s Vista 3D drivers. From looking around in a process doing Direct3D work in the debugger on an NVIDIA system, you can find sections of memory lying around that are PAGE_EXECUTE_READWRITE, or writable and executable, and appear to contain dynamically generated SSE code. Leaving memory around like this diminishes the value of NX, as it provides avenues of attack for exploits that need to store some code somewhere and then execute it. (Fortunately, technologies like ASLR may make it difficult for exploit code to land in such dynamically allocated executable and writable zones in one shot.) An example of this sort of code is as follows (observed in a World of Warcraft process):
    0:000> !vprot 15a21a42 
    BaseAddress:       0000000015a20000
    AllocationBase:    0000000015a20000
    AllocationProtect: 00000040  PAGE_EXECUTE_READWRITE
    RegionSize:        0000000000002000
    State:             00001000  MEM_COMMIT
    Protect:           00000040  PAGE_EXECUTE_READWRITE
    Type:              00020000  MEM_PRIVATE
    0:000> u 15a21a42
    15a21a42 8d6c9500       lea     ebp,[rbp+rdx*4]
    15a21a46 0f284d00       movaps  xmm1,xmmword ptr [rbp]
    15a21a4a 660f72f108     pslld   xmm1,8
    15a21a4f 660f72d118     psrld   xmm1,18h
    15a21a54 0f5bc9         cvtdq2ps xmm1,xmm1
    15a21a57 0f590dd0400b05 mulps   xmm1,xmmword ptr
               [nvd3dum!NvDiagUmdCommand+0xd6f30 (050b40d0)]
    15a21a5e 0f59c1         mulps   xmm0,xmm1
    15a21a61 0f58d0         addps   xmm2,xmm0
    0:000> lmvm nvd3dum
    start             end                 module name
    04de0000 0529f000   nvd3dum
        Loaded symbol image file: nvd3dum.dll
        CompanyName:      NVIDIA Corporation
        ProductName:      NVIDIA Windows Vista WDDM driver
        InternalName:     NVD3DUM
        OriginalFilename: NVD3DUM.DLL
        FileDescription:  NVIDIA Compatible Vista WDDM
              D3D Driver, Version 158.28 

    Ideally, this sort of JIT’d code would be reprotected to be readonly after it is generated. While not as severe as leaving the entire process without NX, any “NX holes” like this are to be avoided when possible. (Simply slapping a “PAGE_EXECUTE_READWRITE” on all of your allocations and calling it done is not the proper solution, and compromises security to a certain extent, albeit significantly less so than disabling NX entirely.)

  3. id Software’s Quake 3 crashes out immediately if you have NX enabled. I suppose it can be given a bit of slack given its age, but the SDK documentation has always said that you need to protect executable regions as executable.

Programs that have been “recently rescued” from the No-Execute hall of shame include:

  1. DosBox, the DOS emulator (great for running those old games on new 64-bit computers, or even on 32-bit computers where DosBox has superior support for hardware, such as sound cards, compared to NTVDM). The current release version (0.70) of DosBox doesn’t work with NX if you have the dynamic code generation core enabled. However, after speaking to one of the developers, it turns out that a version that properly protects executable memory was already in the works (in fact, the DosBox team was kind enough to give me a prerelease build with the fix in it, in lieu of my trying to get a DosBox build environment working). Now I’m free to indulge in those oldie-but-goodie classic DOS titles like “Master of Magic” from time to time without having to mess around with excluding DosBox.exe from the NX policy.
  2. Blizzard Entertainment’s World of Warcraft used to crash immediately on logging on to the game world if you had NX enabled. It seems as if this has been fixed for the most recent patch level, though.
  3. Another of Blizzard’s games, Starcraft, will crash with NX enabled unless you’ve got a recent patch level installed. The reason is that Starcraft’s rendering engine JITs various rendering operations into native code on the fly for increased performance. Until relatively recently in Starcraft’s lifetime, none of this code was NX-aware and blithely outputted non-executable rendering code which it then tried to run. In fact, Blizzard’s Warcraft II: Battle.net Edition has not been patched to repair this deficiency and is thus completely unusable with NX enabled as a result of the same underlying problem.

If you’re writing a program that generates code on the fly, please use the proper protection attributes for it – PAGE_READWRITE during generation, and PAGE_EXECUTE_READ after the code is ready to use. Windows Server already ships with NX enabled by default (although in “opt-out” mode), and it’s only a matter of time before client SKUs of Windows do the same.

What is the “lpReserved” parameter to DllMain, really? (Or a crash course in the internals of user mode process initialization)

Wednesday, May 9th, 2007

One of the parameters to the DllMain function is the enigmatic lpReserved argument. According to MSDN, this parameter is used to convey whether a DLL is being loaded (or unloaded) as part of process startup or termination, or as part of a dynamic DLL load/unload operation (e.g. LoadLibrary/FreeLibrary).

Specifically, MSDN says that for static DLL operations, lpReserved contains a non-null value, whereas for dynamic DLL operations, it contains a null value.

There’s actually a little bit more to this parameter, though. While it’s true that in the case of DLL_PROCESS_DETACH operations, it is little more than a boolean value, it has some more significance in DLL_PROCESS_ATTACH operations.

To understand this, you need to know a little bit more about how process/thread initialization occurs. When a new user mode thread is started, the kernel queues a user mode APC to it, pointing to a function in ntdll called LdrInitializeThunk (ntdll is always mapped into the address space of a new process before any user mode code runs). The kernel also arranges for the thread to dispatch the user mode APC the first time it begins execution.

One of the arguments to the APC is a pointer to a CONTEXT structure describing the initial execution state of the new thread. (The actual contents of the CONTEXT structure are based at the stack of the thread.)

When the thread is first resumed, the APC executes and control transfers to ntdll!LdrInitializeThunk. From there, depending on whether the process is already initialized or not, either the process initialization code is executed (loading DLLs statically linked to the process, and so forth), or per-thread initialization code is run (for instance, making DLL_THREAD_ATTACH callouts to loaded DLLs).

If a user mode thread always actually begins execution at ntdll!LdrInitializeThunk, then you might be wondering how it ever starts executing at the start address specified in a CreateThread call. The answer is that eventually, the code called by LdrInitializeThunk passes the context record argument supplied by the kernel to the NtContinue system call, which you can think of as simply taking that context and transferring control to it. Because the context record argument to the APC contained the information necessary for control to be transferred to the starting address supplied to CreateThread, the thread then begins executing at the expected thread starting address*.

(*: Actually, there is typically another layer here – usually, control would go to a kernel32 or ntdll function (depending on whether you are running on a downlevel platform or on Vista), which sets up a top level exception handler and then calls the start routine supplied to CreateThread. But, for the purposes of this discussion, you can consider it as just running the requested thread start routine.)

As all of this relates to DllMain, the value of the lpReserved parameter to DllMain (when process initialization is occurring and static-linked DLLs are being loaded and initialized) corresponds to the context record argument supplied to the LdrInitializeThunk APC, and is thus representative of the initial context that will be set for the thread after initialization completes. In fact, it’s actually the context that will be used after process initialization is complete, and not just a copy of it. This means that by treating the lpReserved argument as a PCONTEXT, the initial execution state for the first thread in the process can be examined (or even altered) from the DllMain of a static-linked DLL.
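To make that concrete, here is a hedged sketch (not production code) of a static-linked DLL peeking at the initial context from its DllMain. This relies on pseudo-documented behavior, and the x64 register interpretation in the comments is an assumption drawn from the debugger session below, not a documented contract:

```cpp
#include <windows.h>

// Sketch: a static-linked DLL examining the initial thread context from
// DllMain.  lpReserved is only meaningful as a PCONTEXT during process
// initialization, and the register layout noted below is an x64-specific
// assumption rather than guaranteed behavior.
BOOL WINAPI DllMain(HINSTANCE Instance, DWORD Reason, LPVOID Reserved)
{
    UNREFERENCED_PARAMETER(Instance);

    if (Reason == DLL_PROCESS_ATTACH && Reserved != 0)
    {
        PCONTEXT InitialContext = static_cast<PCONTEXT>(Reserved);

#ifdef _M_X64
        //
        // At this point, Rip typically refers to the thread start thunk in
        // ntdll/kernel32, and Rcx to the caller-supplied start address (as
        // seen with `u @rcx' in the debugger session below).  Altering these
        // values here would change where the first thread begins executing.
        //
        ULONG64 StartThunk   = InitialContext->Rip;
        ULONG64 StartAddress = InitialContext->Rcx;

        UNREFERENCED_PARAMETER(StartThunk);
        UNREFERENCED_PARAMETER(StartAddress);
#endif
    }

    return TRUE;
}
```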

This can be verified experimentally by using some trickery to step into a DllMain call made during process initialization (more on just how to do that with the user mode debugger in a future entry, as it turns out to be a bit more complicated than one might imagine):

1:001> g
Breakpoint 3 hit
000007fe`feb15580 48895c2408      mov     qword ptr [rsp+8],rbx ss:00000000`001defc0=0000000000000000
1:001> r
rax=0000000000000000 rbx=00000000002c4770 rcx=000007fefeaf0000
rdx=0000000000000001 rsi=000007fefeb15580 rdi=0000000000000003
rip=000007fefeb15580 rsp=00000000001defb8 rbp=0000000000000000
 r8=00000000001df530  r9=00000000001df060 r10=00000000002c1310
r11=0000000000000246 r12=0000000000000000 r13=00000000001df0f0
r14=0000000000000000 r15=000000000000000d
iopl=0         nv up ei pl zr na po nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b
000007fe`feb15580 48895c2408      mov     qword ptr [rsp+8],rbx ss:00000000`001defc0=0000000000000000
1:001> k
RetAddr           Call Site
00000000`77414664 ADVAPI32!DllInitialize
00000000`77417f29 ntdll!LdrpRunInitializeRoutines+0x257
00000000`7748e974 ntdll!LdrpInitializeProcess+0x16af
00000000`7742c4ee ntdll! ?? ::FNODOBFM::`string'+0x1d641
00000000`00000000 ntdll!LdrInitializeThunk+0xe
1:001> .cxr @r8
rax=0000000000000000 rbx=0000000000000000 rcx=00000000ffb3245c
rdx=000007fffffda000 rsi=0000000000000000 rdi=0000000000000000
rip=000000007742c6c0 rsp=00000000001dfa08 rbp=0000000000000000
 r8=0000000000000000  r9=0000000000000000 r10=0000000000000000
r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei pl nz na pe nc
cs=0033  ss=002b  ds=0000  es=0000  fs=0000  gs=0000
00000000`7742c6c0 4883ec48        sub     rsp,48h
1:001> u @rcx
00000000`ffb3245c 4883ec28        sub     rsp,28h

If you’ve been paying attention thus far, you might be able to explain why, when you set a hardware breakpoint at the initial process breakpoint, the debugger warns you that it will not take effect. For example:

(16f0.1890): Break instruction exception
- code 80000003 (first chance)
00000000`7742fdf0 cc              int     3
0:000> ba e1 kernel32!CreateThread
        ^ Unable to set breakpoint error
The system resets thread contexts after the process
breakpoint so hardware breakpoints cannot be set.
Go to the executable's entry point and set it then.
 'ba e1 kernel32!CreateThread'
0:000> k
RetAddr           Call Site
00000000`774974a8 ntdll!DbgBreakPoint
00000000`77458068 ntdll!LdrpDoDebuggerBreak+0x35
00000000`7748e974 ntdll!LdrpInitializeProcess+0x167d
00000000`7742c4ee ntdll! ?? ::FNODOBFM::`string'+0x1d641
00000000`00000000 ntdll!LdrInitializeThunk+0xe

Specifically, this message occurs because the debugger knows that after the process breakpoint occurs, the current thread context will be discarded at the call to NtContinue. As a result, hardware breakpoints (which rely on the debug register state) will be wiped out when NtContinue restores the expected initial context of the new thread.

If one is clever, it’s possible to apply the desired debug register values to the context record image that is given as an argument to LdrInitializeThunk, so that they take effect when the NTDLL initialization code for the thread restores that context.

The fact that every user mode thread actually begins life at ntdll!LdrInitializeThunk also explains why, exactly, you can’t create a new thread in the current process from within DllMain and then attempt to synchronize with it: by virtue of the fact that the current thread is executing DllMain, it must hold the enigmatic loader lock. Because the new thread begins execution at LdrInitializeThunk (before the start routine you supply is called), it too must acquire the loader lock in order to make its DLL_THREAD_ATTACH callouts, and so it blocks almost immediately. This results in a classic deadlock if the thread already in DllMain tries to wait for the new thread.

Parting shots:

  • Win9x is, of course, completely dissimilar as far as DllMain goes. None of this information applies there.
  • The fact that lpReserved is a PCONTEXT is only very loosely documented by a couple of ancient DDK samples that used the PCONTEXT argument type, and some more recent SDK samples that name the “lpReserved” parameter “lpvContext”. As far as I know, it’s been around on all versions of NT (including Vista), but like other pseudo-documented things, it isn’t necessarily guaranteed to remain this way forever.
  • Oh, and in case you’re wondering why I used advapi32 instead of kernel32 in this example: due to a rather interesting quirk in ntdll on recent versions of Windows, kernel32 is always dynamic-loaded for every Win32 process (regardless of whether or not the main process image is static-linked to kernel32). To make things even more interesting, kernel32 is dynamic-loaded before static-linked DLLs are loaded. As a result, I decided it would be best to steer clear of it to keep this posting simple; be suitably warned, then, about trying this out on kernel32 at home.

Tricks for getting the most out of your minidumps: Including specific memory regions in a dump

Friday, May 4th, 2007

If you’ve ever worked on any sort of crash reporting mechanism, one of the constraints that you are probably familiar with is the size of the dump file created by your reporting mechanism. Obviously, as developers, we’d really love to write a full dump including the entire memory image of the process, full data about all threads and handles (and the like), but this is rarely practical in the real world (particularly if you are dealing with some sort of automated crash submission system, which needs to be as unintrusive as possible, which includes not requiring the transfer of 50MB .dmp files).

One way you can improve the quality of the dumps your program creates, without making the resulting .dmp unacceptably large, is to apply a bit of intelligence as to which parts of memory you’re interested in. After all, while the entire address space could certainly be useful, chances are you won’t need all of it to track down the issue. Often enough, a stack trace (plus a listing of threads) is sufficient, which is more along the lines of what you get from a fairly minimalistic minidump.

However, there are lots of times when the little piece of state information that might explain how your program got into its crashed state isn’t on the stack, leaving you stuck without some additional information. An approach that can sometimes help is to include specific, “high-value” regions of memory in the dump. For example, something that can often be helpful (especially in these days of custom calling conventions that try to avoid using the stack wherever possible) is to include a small portion of memory around each register value.

The idea here is that when you’re going to write a dump, check each register in the faulting context to see if it points to a valid location in the address space of the crashed process. If so, you can include a bit of memory (say, +/- 128 bytes [or some other small amount] from the register’s value) in the dump. On x86, you can actually optimize this a bit further and typically leave out eip/esp/ebp (and any register that points into an executable section of an image section, on the assumption that you’ll probably be able to grab any relevant images from the symbol server (you are using a symbol repository with your own binaries included, aren’t you?) and don’t need to waste space with that code in the dump).

One class of problem that this can be rather helpful in debugging is a crash where you have some sort of structure or class that is getting used in a partially valid state, and you need the contents of the struct/class to figure out just what happened. In many cases, you can probably infer the state of your mystery class/struct from what other threads in the program were doing, but sometimes this isn’t possible. In those cases, having access to the class/struct that was being ‘operated upon’ is a great help, and oftentimes you’ll find code where there is a `this’ pointer to an address on the heap that is tantalizingly present in the current register context. If you were using a typical minimalistic dump, then you would probably not have access to heap memory (due to size constraints) and might find yourself out of luck. If you included a bit of memory around each register when the crash occurred, however, that just might get you the extra data points needed to figure out the problem. (Determining which registers “look” like a pointer is easily accomplished with a few calls to VirtualQueryEx on the target, taking each crash-context register value as an address in the target process and checking whether it refers to a committed region.)

Another good use case for this technique is to include information about your program state in the form of including the contents of various key heap- (or global-) based objects that wouldn’t normally be included in the dump. In that case, you probably need to set up some mechanism to convey “interesting” addresses to the crash reporting mechanism before a crash occurs, so that it can simply include them in the dump without having to worry about trying to grovel around in the target’s memory trying to pick out interesting things after-the-fact (something that is generally not practical in an automated fashion, especially without symbols). For example, if you’ve got some kind of server application, you could include pointers to particularly useful per-session-state data (or portions thereof, size constraints considered). The need for this can be reduced somewhat by including useful verbose logging data, but you might not always want to have verbose logging on all the time (for various reasons), which might result in an initial repro of a problem being less than useful for uncovering a root cause.

Assuming that you are following the recommended approach of not writing dumps in-process, the easiest way to handle this sort of communication between the program and the (hopefully isolated in a different process) crash reporting mechanism is to use something like a file mapping that contains a list (or fixed-size array) of pointers and sizes to record in the dump. This makes adding or removing “interesting” pointers as simple as updating an entry in a flat array.

As far as including additional memory regions in a minidump goes, this is accomplished by including a MiniDumpCallback function in your call to MiniDumpWriteDump (via the CallbackParam parameter). The minidump callback is essentially a way to perform advanced customizations on how your dump is processed, beyond a set of general behaviors supplied by the DumpType parameter. Specifically, the minidump callback lets you do things like include/exclude all sorts of things from the dump – threads, handle data, PEB/TEB data, memory locations, and more – in a programmatic fashion. The way it works is that as MiniDumpWriteDump is writing the dump, it will call the callback function you supply a number of times to query you for any data you want to add or subtract from the dump. There’s a huge amount of customization you can do with the minidump callback; too much for just this post, so I’ll simply describe how to use it to include specific memory regions.
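For completeness, the CallbackParam parameter of MiniDumpWriteDump actually takes a MINIDUMP_CALLBACK_INFORMATION structure, which carries both the callback routine and your context pointer. A minimal sketch of the wiring (DumpWriter is the hypothetical class from the example below; error handling elided):

```cpp
#include <windows.h>
#include <dbghelp.h>

#pragma comment(lib, "dbghelp.lib")

// Hypothetical DumpWriter class from the example below; only the pieces
// needed to register the callback are shown here.
class DumpWriter
{
public:
    static BOOL CALLBACK MiniDumpCallback(
        PVOID CallbackParam,
        const PMINIDUMP_CALLBACK_INPUT CallbackInput,
        PMINIDUMP_CALLBACK_OUTPUT CallbackOutput);

    // ... array of address/length pairs, Index, Count, etc. ...
};

// Sketch: wiring the minidump callback into MiniDumpWriteDump via the
// MINIDUMP_CALLBACK_INFORMATION structure.
BOOL WriteDumpWithCallback(
    HANDLE Process,
    DWORD ProcessId,
    HANDLE File,
    DumpWriter *Writer)
{
    MINIDUMP_CALLBACK_INFORMATION CallbackInfo;

    CallbackInfo.CallbackRoutine = DumpWriter::MiniDumpCallback;
    CallbackInfo.CallbackParam   = Writer;

    return MiniDumpWriteDump(
        Process,
        ProcessId,
        File,
        MiniDumpNormal,
        0,                // ExceptionParam
        0,                // UserStreamParam
        &CallbackInfo);
}
```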

As far as including memory regions goes, you need to watch for the MemoryCallback event being passed to your minidump callback. The way the MemoryCallback event works is that you are called back repeatedly until your callback returns FALSE. Each time you are called back (and return TRUE), you are expected to have updated the CallbackOutput->MemoryBase and CallbackOutput->MemorySize output fields with the base address and length of a region to be included in the dump. When your callback finally returns FALSE, MiniDumpWriteDump assumes that you’re done specifying additional memory regions and continues on to the rest of the steps involved in writing the dump.

So, to provide a quick example, assuming you had a DumpWriter class containing an array of address / length pairs, you might use a minidump callback that looks something like this to include those addresses in the dump:

BOOL CALLBACK DumpWriter::MiniDumpCallback(
 PVOID CallbackParam,
 const PMINIDUMP_CALLBACK_INPUT CallbackInput,
 PMINIDUMP_CALLBACK_OUTPUT CallbackOutput
 )
{
 DumpWriter *Writer;
 BOOL        Status;

 Status = FALSE;

 Writer = reinterpret_cast<DumpWriter*>(CallbackParam);

 switch (CallbackInput->CallbackType)
 {

 ... handle other events ...

 case MemoryCallback:
  //
  // If we have some memory regions left to include, then store the
  // next one.  Otherwise, return FALSE to indicate that we're done.
  //

  if (Writer->Index == Writer->Count)
  {
   Status = FALSE;
  }
  else
  {
   CallbackOutput->MemoryBase =
     Writer->Addresses[ Writer->Index ].Base;
   CallbackOutput->MemorySize =
     Writer->Addresses[ Writer->Index ].Length;

   Writer->Index += 1;
   Status = TRUE;
  }
  break;

 ... handle other events ...

 }

 return Status;
}

In a future posting, I’ll likely revisit some of the other neat things that you can do with the minidump callback function (as well as other things you can do to make your minidumps more useful to work with). In the meantime, Oleg Starodumov also has some great documentation (beyond that in MSDN) about just what all the other minidump callback events do, so if you’re finding MSDN a little bit lacking in that department, I’d encourage you to check his article out.