Archive for the ‘Windows’ Category

The No-Execute hall of shame…

Friday, June 15th, 2007

One of the things that I do when I set up Windows on a new (modern) box that I am going to use for more than just temporary testing is to enable no-execute (“NX”) support by default. Preferably, I use NX in “always on” mode, but sometimes I use “opt-out” mode if I have to. It is possible to bypass NX in opt-out (or opt-in) mode with a “ret2libc”-style attack, which diminishes the security gain in an unfortunately non-trivial way in many cases. For those unclear on what “alwayson”, “optin”, “optout”, and “alwaysoff” mean in terms of Windows NX support, they describe how NX is applied across processes. Alwayson and alwaysoff are fairly straightforward; they mean that all processes either have NX forced on or forced off, unconditionally. Opt-in mode means that only programs that mark themselves as NX-aware have NX enabled (or those that the administrator has configured in Control Panel), and opt-out mode means that all programs have NX applied unless the administrator has explicitly excluded them in Control Panel.

(Incidentally, the reason why the Windows implementation of NX was vulnerable to a ret2libc-style attack in the first place was to support extremely poorly written copy protection schemes, of all things. Specifically, there is code baked into the loader in NTDLL to detect SecuROM and SafeDisc modules being loaded on-the-fly, for purposes of automagically turning off NX for these security-challenged copy protection mechanisms. As a result, exploit code can effectively do the same thing that the user mode loader does in order to disable NX, before returning to the actual exploit code (see the Uninformed article for more details on how this works “under the hood”). I really just love how end user security is compromised for DRM/copy protection mechanisms as a result. Such is the price of backwards compatibility with shoddy code, and the myriad of games using such poorly written copy protection systems…)

As a result of this consequence of “opt-out” mode, I would recommend forcing NX to being always on, if possible (as previously mentioned). To do this, you typically need to edit boot.ini to enable always on mode (/noexecute=alwayson). On systems that use BCD boot databases, e.g. Vista or later, you can use this command to achieve the same effect as editing boot.ini on downlevel systems:

bcdedit /set {current} nx alwayson
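
For reference, the corresponding boot.ini entry on a downlevel system would look something like this (the ARC path and description here are placeholders; yours will differ):

multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003" /noexecute=alwayson /fastdetect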

However, I’ve found that you can’t always do that, especially on client systems, depending on what applications you run. There are still a lot of things out there that break with NX enabled, unfortunately, and “alwayson” mode means that you can’t exempt applications from NX. Unfortunately, even relatively new programs sometimes fall into this category. Here’s a list of some of the things I’ve run into that blow up with NX turned on by default; the “No-Execute Hall of Shame”, as I like to call it, since not supporting NX is definitely very lame from a security perspective – even more so since it requires you to switch from “alwayson” to “optout” (so you can use the NX exclusion list in Control Panel), which is in and of itself a security risk, even for unrelated programs that do have NX enabled by default. This is hardly an exhaustive list, just some things that I have had to work around in my own personal experience.

  1. Sun’s Java plug-in for Internet Explorer. (Note that to enable NX for IE if you are in “optout” mode, you need to go grovel around in IE’s security options, as IE disables NX for itself by default if it can – it opts out, in other words). Even with the latest version of Java (“Java Platform 6 update 1”, which also seems to be known as 1.6.0), it’s “bombs away” for IE as soon as you try to go to a Java-enabled webpage if you have Sun’s Java plugin installed.

    Personally, I think that is pretty inexcusable; it’s not like Sun is new to JIT code generation or anything (or that Java is anything like the “new kid on the block”), and it’s been the recommended practice for a very long time to ensure that code you are going to execute is marked executable. It’s even been enforced by hardware for quite some time now on x86. Furthermore, IE (and Java) are absolutely “high priority” targets for exploits on end users (arguably, IE exploits on unpatched systems are one of the most common delivery mechanisms for malware), and preventing IE from running with NX is clearly not a good thing at all from a security standpoint. Here’s what you’ll see if you try to use Java for IE with NX enabled (which takes a bit of work, as previously noted, in the default configuration on client systems):

    (106c.1074): Access violation - code c0000005 (first chance)
    First chance exceptions are reported before any
    exception handling.
    This exception may be expected and handled.
    eax=6d4a4a79 ebx=00070c5c ecx=0358ea98
    edx=76f10f34 esi=0ab65b0c edi=04da3100
    eip=04da3100 esp=0358ead4 ebp=0358eb1c
    iopl=0         nv up ei pl nz na pe nc
    cs=001b  ss=0023  ds=0023  es=0023  fs=003b
    gs=0000             efl=00210206
    04da3100 c74424040c5bb60a mov dword ptr [esp+4],0AB65B0Ch
    0:005> k
    ChildEBP RetAddr  
    WARNING: Frame IP not in any known module. Following
    frames may be wrong.
    0358ead0 6d4a4abd 0x4da3100
    0358eb1c 76e31ae8 jpiexp+0x4abd
    0358eb94 76e31c03 USER32!UserCallWinProcCheckWow+0x14b
    [...]
    0:005> !vprot @eip
    BaseAddress:       04da3000
    AllocationBase:    04c90000
    AllocationProtect: 00000004  PAGE_READWRITE
    RegionSize:        000ed000
    State:             00001000  MEM_COMMIT
    Protect:           00000004  PAGE_READWRITE
    Type:              00020000  MEM_PRIVATE
    0:005> .exr -1
    ExceptionAddress: 04da3100
       ExceptionCode: c0000005 (Access violation)
      ExceptionFlags: 00000000
    NumberParameters: 2
       Parameter[0]: 00000008
       Parameter[1]: 04da3100
    Attempt to execute non-executable address 04da3100
    0:005> lmvm jpiexp
    start    end        module name
    [...]
        CompanyName:  JavaSoft / Sun Microsystems
        ProductName:  JavaSoft / Sun Microsystems
                      -- Java(TM) Plug-in
        InternalName: Java Plug-in für Internet Explorer

    (Yes, the internal name is in German. No, I didn’t install a German-localized build, either. Perhaps whoever makes the public Java for IE releases lives in Germany.)

  2. NVIDIA’s Vista 3D drivers. From looking around in a process doing Direct3D work in the debugger on an NVIDIA system, you can find sections of memory lying around that are PAGE_EXECUTE_READWRITE, or writable and executable, and that appear to contain dynamically generated SSE code. Leaving memory around like this diminishes the value of NX, as it provides avenues of attack for exploits that need to store some code somewhere and then execute it. (Fortunately, technologies like ASLR may make it difficult for exploit code to land in such dynamically allocated executable and writable zones in one shot.) An example of this sort of code is as follows (observed in a World of Warcraft process):
    0:000> !vprot 15a21a42 
    BaseAddress:       0000000015a20000
    AllocationBase:    0000000015a20000
    AllocationProtect: 00000040  PAGE_EXECUTE_READWRITE
    RegionSize:        0000000000002000
    State:             00001000  MEM_COMMIT
    Protect:           00000040  PAGE_EXECUTE_READWRITE
    Type:              00020000  MEM_PRIVATE
    0:000> u 15a21a42
    15a21a42 8d6c9500       lea     ebp,[rbp+rdx*4]
    15a21a46 0f284d00       movaps  xmm1,xmmword ptr [rbp]
    15a21a4a 660f72f108     pslld   xmm1,8
    15a21a4f 660f72d118     psrld   xmm1,18h
    15a21a54 0f5bc9         cvtdq2ps xmm1,xmm1
    15a21a57 0f590dd0400b05 mulps   xmm1,xmmword ptr
               [nvd3dum!NvDiagUmdCommand+0xd6f30 (050b40d0)]
    15a21a5e 0f59c1         mulps   xmm0,xmm1
    15a21a61 0f58d0         addps   xmm2,xmm0
    0:000> lmvm nvd3dum
    start             end                 module name
    04de0000 0529f000   nvd3dum
        Loaded symbol image file: nvd3dum.dll
    [...]
        CompanyName:      NVIDIA Corporation
        ProductName:      NVIDIA Windows Vista WDDM driver
        InternalName:     NVD3DUM
        OriginalFilename: NVD3DUM.DLL
        ProductVersion:   7.15.11.5828
        FileVersion:      7.15.11.5828
        FileDescription:  NVIDIA Compatible Vista WDDM
              D3D Driver, Version 158.28 

    Ideally, this sort of JIT’d code would be reprotected to be non-writable (e.g. PAGE_EXECUTE_READ) after it is generated. While not as severe as leaving the entire process without NX, any “NX holes” like this are to be avoided when possible. (Simply slapping a “PAGE_EXECUTE_READWRITE” on all of your allocations and calling it done is not the proper solution, and compromises security to a certain extent, albeit significantly less so than disabling NX entirely.)

  3. id Software’s Quake 3 crashes out immediately if you have NX enabled. I suppose it can be given a bit of slack given its age, but the SDK documentation has always said that you need to protect executable regions as executable.

Programs that have been “recently rescued” from the No-Execute hall of shame include:

  1. DosBox, the DOS emulator (great for running those old games on new 64-bit computers, or even on 32-bit computers, where DosBox has superior support for hardware, such as sound cards, compared to NTVDM). The current release version (0.70) of DosBox doesn’t work with NX if you have the dynamic code generation core enabled. However, after speaking to one of the developers, it turns out that a version that properly protects executable memory was already in the works (in fact, the DosBox team was kind enough to give me a prerelease build with the fix in it, in lieu of my trying to get a DosBox build environment working). Now I’m free to indulge in those oldie-but-goodie classic DOS titles like “Master of Magic” from time to time without having to mess around with excluding DosBox.exe from the NX policy.
  2. Blizzard Entertainment’s World of Warcraft used to crash immediately on logging on to the game world if you had NX enabled. It seems as if this has been fixed for the most recent patch level, though.
  3. Another of Blizzard’s games, Starcraft, will crash with NX enabled unless you’ve got a recent patch level installed. The reason is that Starcraft’s rendering engine JITs various rendering operations into native code on the fly for increased performance. Until relatively recently in Starcraft’s lifetime, this code generation was not NX-aware; the game blithely emitted rendering code into non-executable memory and then tried to run it. In fact, Blizzard’s Warcraft II: Battle.net Edition has not been patched to repair this deficiency and is thus completely unusable with NX enabled, as a result of the same underlying problem.

If you’re writing a program that generates code on the fly, please use the proper protection attributes for it – PAGE_READWRITE during generation, and PAGE_EXECUTE_READ after the code is ready to use. Windows Server already ships with NX enabled by default (although in “opt-out” mode), and it’s only a matter of time before client SKUs of Windows do the same.
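
To illustrate, here is a minimal sketch of that flow using the documented VirtualAlloc / VirtualProtect / FlushInstructionCache APIs (error handling omitted for brevity; EmitCode is just a made-up name for illustration):

#include <windows.h>
#include <string.h>

// Copy dynamically generated code into a region that is writable during
// generation and executable (but no longer writable) afterward.
void *EmitCode(const void *CodeBytes, SIZE_T CodeSize)
{
 DWORD OldProtect;

 // Allocate as PAGE_READWRITE while the code is being generated.
 void *Region = VirtualAlloc(0, CodeSize, MEM_COMMIT | MEM_RESERVE,
  PAGE_READWRITE);

 memcpy(Region, CodeBytes, CodeSize);

 // Reprotect as execute/read once generation is complete.
 VirtualProtect(Region, CodeSize, PAGE_EXECUTE_READ, &OldProtect);

 // Make sure the processor doesn't execute stale instruction bytes.
 FlushInstructionCache(GetCurrentProcess(), Region, CodeSize);

 return Region;
}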

Much ado about signed drivers, and yet…

Wednesday, June 13th, 2007

Today, I was installing a Srv03 x64 VM for testing purposes. This is something I’ve done countless times before – install/reinstall/blow away Windows VMs for testing purposes with the standard Windows setup, nothing all that interesting. However, I ran into something extremely bizarre this time around:

Windows Server 2003 x64 Setup - Unsigned Driver Warning

The really weird thing here is that this VM was completely clean; brand new, empty hard disk which had only been just now connected to standard installation media (no third party OEM drivers slipstreamed onto the media or anything either).

Apparently, there’s an unsigned “in-box” driver lurking on the Srv03 x64 installation discs. Nice. The only thing that I can think of that was different about this VM from all the others I’ve made is that I created a serial port (redirected to a named pipe as usual) in the VM before setup, instead of after, and that somehow Windows might have thought that I had one of those old-style serial-port-controlled UPS batteries connected to the box. Still, I’ve been around enough Srv03 installs on “real metal” to know that it’s weird for this to happen; I’ve never seen it on another install, VM or physical hardware.

Given the fuss that Microsoft makes about signed drivers, seeing what appears to be an in-box driver that isn’t signed is, to say the least, amusing. (Note that there are two types of signing: catalog (.cat) file signing, to sign the installation package – this is what suppresses the unsigned driver popup – and driver binary signing, which allows a driver to load on an x64 system but still generates the unsigned driver popup if you install the driver via an INF. I can only assume that for whatever reason, this driver doesn’t have a valid catalog signature, or it wouldn’t load at all. I’ll see if I can confirm this when the VM finishes installing, though…)

Selectively suppress Wow64 filesystem redirection on Vista x64 with ‘Sysnative’

Friday, June 8th, 2007

One of the features implemented by the Wow64 layer on x64 editions of Windows for 32-bit programs is filesystem redirection, which operates similarly to registry redirection in that it provides a Wow64 “sandbox” for file I/O relating to the System32 directory.

The reasoning behind this is that many programs have the “System32” directory name hardcoded, and more importantly, unlike the transition between 16-bit Windows and Win32, virtually all of the DLLs implementing the Windows API retained their same name (and path) across the 32-bit to 64-bit transition.

Clearly, this creates a conflict, as there can’t be both 32-bit and 64-bit DLLs with the same name and path on the same system at the same time. The solution that Microsoft devised, filesystem redirection, is a layer that intercepts accesses to the system32 directory made by 32-bit applications, and re-points the accesses to the SysWOW64 directory, which functions as the Wow64 version of the system32 directory (where most of the core Wow64 system DLLs reside).

Normally, this works fairly well, but there are times when a program needs to access the real System32 directory. For this purpose, Microsoft has provided a set of APIs to enable and disable the filesystem redirection on a per-thread basis. This is all well and dandy, but sometimes you may find yourself needing to turn off Wow64 filesystem redirection in a place where you can’t modify a program easily.
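
For cases where you can modify the code, the per-thread APIs are straightforward; here’s a rough sketch of their use (Wow64DisableWow64FsRedirection and Wow64RevertWow64FsRedirection are the documented pair):

#include <windows.h>

// Access a file in the real (64-bit) system32 directory from a Wow64
// process by temporarily disabling redirection on the current thread.
void ReadNativeSystem32File(void)
{
 PVOID OldRedirection;

 if (Wow64DisableWow64FsRedirection(&OldRedirection))
 {
  HANDLE File = CreateFileW(L"C:\\Windows\\System32\\ntoskrnl.exe",
   GENERIC_READ, FILE_SHARE_READ, 0, OPEN_EXISTING, 0, 0);

  if (File != INVALID_HANDLE_VALUE)
   CloseHandle(File);

  // Always restore redirection promptly; code you call into later on
  // this thread (including system DLLs) may depend on it.
  Wow64RevertWow64FsRedirection(OldRedirection);
 }
}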

Until Vista, there was no easy way to do this. Fortunately, Vista adds a “backdoor” to Wow64 filesystem redirection, in the form of a pseudo-directory named “Sysnative” that is present under the Windows directory (e.g. C:\Windows\Sysnative). This pseudo-directory is not visible in directory listings, and does not exist for native 64-bit processes. For Wow64 processes, however, it can be used to access the 64-bit system32 directory.

Why would you need to do that in the first place? Well, there are lots of reasons, most of them dealing with cases where only a 64-bit version of a program that you need to launch exists (or where the 64-bit version is necessary, such as when a 32-bit process needs to communicate with a 64-bit process by launching another 64-bit process, or something of the sort). Prior to Vista, situations like this were a pain, as there was no built-in way to access the 64-bit system32 directory. The “sysnative” pseudo-directory allows the problem to be addressed in a general fashion without requiring code changes, as Wow64 programs can address the 64-bit system32 via the special path (of course, this does require the user supplying such a path, but there is little working around this given the fact that 64-bit and 32-bit system DLLs have identical names).
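
As a quick illustration, a 32-bit program could launch the 64-bit cmd.exe like so (a sketch; the hardcoded path assumes a standard Windows directory, and this only works from a Wow64 process on Vista or later):

#include <windows.h>

// From a Wow64 process, "Sysnative" bypasses filesystem redirection and
// reaches the real (64-bit) system32 directory.
void LaunchNative64BitCmd(void)
{
 STARTUPINFOW StartupInfo = { sizeof(StartupInfo) };
 PROCESS_INFORMATION ProcessInfo;
 WCHAR CommandLine[] = L"C:\\Windows\\Sysnative\\cmd.exe";

 if (CreateProcessW(0, CommandLine, 0, 0, FALSE, 0, 0, 0,
  &StartupInfo, &ProcessInfo))
 {
  CloseHandle(ProcessInfo.hThread);
  CloseHandle(ProcessInfo.hProcess);
 }
}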

Sysnative is not a universal solution, however. Some functionality does not work well with it, as sysnative does not appear in directory listings; for instance, the common controls open/save file dialogs won’t allow you to navigate through sysnative due to this issue.

Why doesn’t the publicly available kernrate work on Windows x64? (and how to fix it)

Monday, June 4th, 2007

Previously, I wrote up an introduction to kernrate (the Windows kernel profiler). In that post, I erroneously stated that the KrView distribution includes a version of kernrate that works on x64. Actually, it supports IA64 and x86, but not x64. A week or two ago, I had a problem on an x64 box of mine that I wanted to help track down using kernrate. Unfortunately, I couldn’t actually find a working kernrate for Windows x64 (Srv03 / Vista).

It turns out that nowhere is there a published kernrate that works on any production version of Windows x64. KrView ships with an IA64 version, and the Srv03 resource kit ships with an x86 version only. (While you can run the x86 version of kernrate in Wow64, you can’t use it to profile kernel mode; for that, you need a native x64 version.)

A bit of digging turned up a version of kernrate labeled ‘AMD64’ in the Srv03 SP0 DDK (the 3790.1830 / Srv03 SP1 DDK doesn’t include kernrate at all, only documentation for it, and the 6000 WDK omits even the documentation, which is really quite a shame). Unfortunately, that version of kernrate, while compiled as native x64 and theoretically capable of profiling the x64 kernel, doesn’t actually work. If you try to run it, you’ll get the following error:

NtQuerySystemInformation failed status c0000004
KERNRATE: Failed to get SYSTEM_BASIC_INFORMATION

So, no dice even with the Srv03 SP0 DDK version of kernrate, or so I thought. Ironically, the various flavors of x86 (including 3790.0) kernrate I could find all worked on Srv03 x64 SP1 (3790.1830) / Vista x64 SP0 (6000.0), but as I mentioned earlier, you can’t use the profiling APIs to profile kernel mode from a Wow64 program. So, I had a kernrate that worked but couldn’t profile kernel mode, and a kernrate that should have worked and been able to profile kernel mode except that it bombed out right away.

After running into that wall, I decided to try and get ahold of someone at Microsoft who I suspected might be able to help, Andrew Rogers from the Windows Serviceability group. After trading mails (and speculation) back and forth a bit, we eventually determined just what was going on with that version of kernrate.

To understand what the deal is with the 3790.0 DDK x64 version of kernrate, a little history lesson is in order. When the production (“RTM”) version of Windows Server 2003 was released, it was supported on two platforms: x86, and IA64 (“Windows Server 2003 64-bit”). This continued until the Srv03 Service Pack 1 timeframe, when support for another platform “went gold” – x64 (or AMD64). Now, this means that the production code base for Srv03 x64 RTM (3790.1830) is essentially comparable to Srv03 x86 SP1 (3790.1830). While normally there aren’t “breaking changes” from service pack to service pack (or at least, they are kept to a minimum), Srv03 x64 is kind of a special case.

You see, there was no production / RTM release of Srv03 x64 3790.0, or a “Service Pack 0” for the x64 platform. As a result, as a special case, it was acceptable for there to be breaking changes from 3790.0/x64 to 3790.1830/x64, as pre-3790.1830 builds could essentially be considered beta/RC builds and not full production releases (and indeed, they were not generally publicly available as in a normal production release).

If you’re following me this far, you might be thinking “wait a minute, didn’t he say that the 3790.0 x86 build of kernrate worked on Srv03 3790.1830?” – and in fact, I did imply that (it does work). The breaking change in this case only relates to 64-bit-specific parts, which for the most part excludes things visible to 32-bit programs (such as the 3790.0 x86 kernrate).

In this particular case, it turns out that part of the SYSTEM_BASIC_INFORMATION structure was changed from the 3790.0 timeframe to the 3790.1830 timeframe, with respect to x64 platforms.

The 3790.0 structure, as viewed from x64 builds, is approximately like this:

typedef struct _SYSTEM_BASIC_INFORMATION_3790 {
    ULONG Reserved;
    ULONG TimerResolution;
    ULONG PageSize;
    ULONG_PTR NumberOfPhysicalPages;
    ULONG_PTR LowestPhysicalPageNumber;
    ULONG_PTR HighestPhysicalPageNumber;
    ULONG AllocationGranularity;
    ULONG_PTR MinimumUserModeAddress;
    ULONG_PTR MaximumUserModeAddress;
    KAFFINITY ActiveProcessorsAffinityMask;
    CCHAR NumberOfProcessors;
} SYSTEM_BASIC_INFORMATION_3790, *PSYSTEM_BASIC_INFORMATION_3790;

However, by the time 3790.1830 (the “RTM” version of Srv03 x64 / XP x64) was released, a subtle change had been made:

typedef struct _SYSTEM_BASIC_INFORMATION {
	ULONG Reserved;
	ULONG TimerResolution;
	ULONG PageSize;
	ULONG NumberOfPhysicalPages;
	ULONG LowestPhysicalPageNumber;
	ULONG HighestPhysicalPageNumber;
	ULONG AllocationGranularity;
	ULONG_PTR MinimumUserModeAddress;
	ULONG_PTR MaximumUserModeAddress;
	KAFFINITY ActiveProcessorsAffinityMask;
	CCHAR NumberOfProcessors;
} SYSTEM_BASIC_INFORMATION, *PSYSTEM_BASIC_INFORMATION;

Essentially, three ULONG_PTR fields were “contracted” from 64-bits to 32-bits (by changing the type to a fixed-length type, such as ULONG). The cause for this relates again to the fact that the first 64-bit RTM build of Srv03 was for IA64, and not x64 (in other words, “64-bit Windows” originally just meant “Windows for Itanium”, something that is to this day still unfortunately propagated in certain MSDN documentation).

According to Andrew, the reasoning behind the difference in the two structure versions is that those three fields were originally defined as either a pointer-sized data type (such as SIZE_T or ULONG_PTR – for 32-bit builds), or a 32-bit data type (such as ULONG – for 64-bit builds), in terms of the _IA64_ preprocessor constant, which indicates an Itanium build. However, 64-bit builds for x64 do not use the _IA64_ constant, which resulted in the structure fields being erroneously expanded to 64-bits for the 3790.0/x64 builds. By the time Srv03 x64 RTM (3790.1830) had been released, the page counts had been fixed to be defined in terms of the _WIN64 preprocessor constant (indicating any 64-bit build, not just an Itanium build). As a result, the fields that were originally 64-bits long for the prerelease 3790.0/x64 build became only 32-bits long for the RTM 3790.1830/x64 production build. Note that because the page counts are expressed in terms of a count of physical pages, which are at least 4096 bytes for x86 and x64 (and significantly more for large physical pages on those platforms), keeping them as a 32-bit quantity is not a terribly limiting factor on total physical memory given today’s technology, and that of the foreseeable future, at least relating to systems that NT-based kernels will operate on.
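
In other words, the field definitions presumably went from something approximately like the first form below to the second (a reconstruction for illustration; the actual header is not public):

// Approximately the 3790.0-era definition: only an Itanium build
// (_IA64_) was treated as 64-bit, so x64 builds fell into the
// pointer-sized branch and the fields erroneously widened to 64 bits.
#if defined(_IA64_)
 ULONG     NumberOfPhysicalPages;   // 32 bits on IA64
#else
 ULONG_PTR NumberOfPhysicalPages;   // 32 bits on x86, 64 bits on x64
#endif

// Approximately the 3790.1830 fix: keyed off _WIN64 instead, so every
// 64-bit platform (IA64 and x64 alike) gets a fixed 32-bit page count.
#if defined(_WIN64)
 ULONG     NumberOfPhysicalPages;
#else
 ULONG_PTR NumberOfPhysicalPages;
#endif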

The end result of this change is that the SYSTEM_BASIC_INFORMATION structure that kernrate 3790.0/x64 tries to retrieve is incompatible with the 3790.1830/x64 (RTM) version of Srv03, hence the call failing with c0000004 (otherwise known as STATUS_INFO_LENGTH_MISMATCH). The culmination of all this is that the 3790.0/x64 version of kernrate aborts on production builds of Srv03, as the structure format differs from the version kernrate is expecting.

Normally, this would be pretty bad news; the program was compiled against a different layout of an OS-supplied structure. Nonetheless, I decided to crack open kernrate.exe with IDA and HIEW to see what I could find, in the hopes of fixing the binary to work on RTM builds of Srv03 x64. It turned out that I was in luck, and there was only one place that actually retrieved the SYSTEM_BASIC_INFORMATION structure, storing it in a global variable for future reference. Unfortunately, there were a great many references to that global; too many to be practical to fix to reflect the new structure layout.

However, since there was only one place where the SYSTEM_BASIC_INFORMATION structure was actually retrieved (and even more, it was in an isolated, dedicated function), there was another option: Patch in some code to retrieve the 3790.1830/x64 version of SYSTEM_BASIC_INFORMATION, and then convert it to appear to kernrate as if it were actually the 3790.0/x64 layout. Unfortunately, a change like this involves adding new code to an existing binary, which means that there needs to be a place to put it. Normally, the way you do this when patching a binary on-disk is to find some padding between functions, or at the end of a PE section that is marked as executable but otherwise unused, and place your code there. For large patches, this may involve “spreading” the patch code out among various different “slices” of padding, if there is no one contiguous block that is long enough to contain the entire patch.

In this case, however, because the routine to retrieve the SYSTEM_BASIC_INFORMATION structure was a dedicated, isolated routine, and because it had some error handling code for “unlikely” situations – such as failing a memory allocation, or failing the NtQuerySystemInformation(…SystemBasicInformation…) call (although it would appear the latter is not quite so “unlikely” in this case) – a different option presented itself: dispense with the error checking, most of which would rarely be used in a realistic situation, and use the extra space to write in some code to convert the structure layout to the version expected by kernrate 3790.0. Obviously, while not a completely “clean” solution per se, the idea does have its merits when you’re already into patch-the-binary-on-disk land (which pretty much rules out the idea of a “clean” solution altogether at that point).

The structure conversion is fairly straightforward code, and after a bit of poking around, I had a version to try out. Lo and behold, it actually worked; it turns out that the only thing that was preventing the prerelease 3790.0/x64 kernrate from doing basic kernel profiling on production x64 kernels was the layout of the SYSTEM_BASIC_INFORMATION structure.

For those so inclined, the patch I created effectively rewrites the original version of the function, shown below as translated to C:

PSYSTEM_BASIC_INFORMATION_3790
GetSystemBasicInformation(
 VOID
 )
{
 PSYSTEM_BASIC_INFORMATION_3790 Sbi;
 NTSTATUS                       Status;

 Sbi = (PSYSTEM_BASIC_INFORMATION_3790)malloc(
  sizeof(SYSTEM_BASIC_INFORMATION_3790));

 if (!Sbi)
 {
  fprintf(stderr, "Buffer allocation failed"
   " for SystemInformation in "
   "GetSystemBasicInformation\\n");
  exit(1);
 }

 if (!NT_SUCCESS((Status = NtQuerySystemInformation(
  SystemBasicInformation,
  Sbi,
  sizeof(SYSTEM_BASIC_INFORMATION_3790),
  0))))
 {
  fprintf(stderr, "NtQuerySystemInformation failed"
   " status %08lx\\n",
   Status);
  free(Sbi);

  Sbi = 0;
 }

 return Sbi;
}

Conceptually represented in C, the modified (patched) function would appear something like the following (note that the error checking code has been removed to make room for the structure conversion logic, as previously mentioned; the substantial new changes are the structure conversion assignments at the end of the function):

(Note that the patch code was not compiler-generated and is not truly a function written in C; below is simply how it would look if it were translated from assembler to C.)

PSYSTEM_BASIC_INFORMATION_3790
GetSystemBasicInformationFixed(
 VOID
 )
{
 PSYSTEM_BASIC_INFORMATION_3790 Sbi;
 PSYSTEM_BASIC_INFORMATION      SbiReal;

 Sbi     = (PSYSTEM_BASIC_INFORMATION_3790)malloc(
  sizeof(SYSTEM_BASIC_INFORMATION_3790));
 SbiReal = (PSYSTEM_BASIC_INFORMATION)Sbi;

 NtQuerySystemInformation(
  SystemBasicInformation,
  SbiReal,
  sizeof(SYSTEM_BASIC_INFORMATION),
  0);

 Sbi->NumberOfProcessors           =
  SbiReal->NumberOfProcessors;
 Sbi->ActiveProcessorsAffinityMask =
  SbiReal->ActiveProcessorsAffinityMask;
 Sbi->MaximumUserModeAddress       =
  SbiReal->MaximumUserModeAddress;
 Sbi->MinimumUserModeAddress       =
  SbiReal->MinimumUserModeAddress;
 Sbi->AllocationGranularity        =
  SbiReal->AllocationGranularity;
 Sbi->HighestPhysicalPageNumber    =
  SbiReal->HighestPhysicalPageNumber;
 Sbi->LowestPhysicalPageNumber     =
  SbiReal->LowestPhysicalPageNumber;
 Sbi->NumberOfPhysicalPages        =
  SbiReal->NumberOfPhysicalPages;

 return Sbi;
}

(The structure can be converted in-place due to how the two versions are laid out in memory; the “expected” version is larger than the “real” version (and more specifically, the offsets of all the fields we care about are greater in the “expected” version than the “real” version), so structure fields can safely be copied from the tail of the “real” structure to the tail of the “expected” structure.)

If you find yourself in a similar bind, and need a working kernrate for x64 (until if and when Microsoft puts out a new version that is compatible with production x64 kernels), I’ve posted the patch (assembler, with opcode diffs) that I made to the “amd64” release of kernrate.exe from the 3790.0 DDK. Any conventional hex editor should be sufficient to apply it (as far as I know, third parties aren’t authorized to redistribute kernrate in its entirety, so I’m not posting the entire binary). Note that DDKs after the 3790.0 DDK don’t include any kernrate for x64 (even the prerelease version, broken as it was), so you’ll also need the original 3790.0 DDK to use the patch. Hopefully, we may see an update to kernrate at some point, but for now, the patch suffices in a pinch if you really need to profile a Windows x64 system.

What’s the difference between the Wow64 and native x86 versions of a DLL?

Friday, June 1st, 2007

Wow64 includes a complete set of 32-bit system DLLs implementing the Win32 API (for use by Wow64 programs). So, what’s the difference between the native 32-bit DLLs and their Wow64 versions?

On Windows x64, not much, actually. Most of the DLLs are actually the 32-bit copies from the 32-bit version of the operating system (not even recompiled). For example, the Wow64 ws2_32.dll on Vista x64 is the same file as the native 32-bit ws2_32.dll on Vista x86. If I compare the wow64 version of ws2_32.dll on Vista with the x86 native version, they are clearly identical:

32-bit Vista:

C:\Windows\System32>md5sum ws2_32.dll
d99a071c1018bb3d4abaad4b62048ac2 *ws2_32.dll

64-bit Vista (Wow64):

C:\Windows\SysWOW64>md5sum ws2_32.dll
d99a071c1018bb3d4abaad4b62048ac2 *ws2_32.dll

Some DLLs, however, are not identical. For instance, ntdll is very different across native x86 and Wow64 versions, due to the way that Wow64 system calls are translated to the 64-bit user mode portion of the Wow64 layer. Instead of calling kernel mode directly, Wow64 system services go through a translation layer in user mode (so that the kernel does not need to implement 32-bit and 64-bit versions of each system service, or stubs thereof in kernel mode). This difference can clearly be seen when comparing the Wow64 ntdll with the native x64 ntdll, with respect to a system service:

If we look at the native x86 ntdll, we see the expected call through the SystemCallStub pointers in SharedUserData:

0:000> u ntdll!NtClose
ntdll!ZwClose:
mov     eax,30h
mov     edx,offset SharedUserData!SystemCallStub
call    dword ptr [edx]
ret     4

However, an examination of the Wow64 ntdll shows something different; a call is made through a field at offset +C0 in the 32-bit TEB:

0:000> u ntdll!NtClose
ntdll!ZwClose:
mov     eax,0Ch
xor     ecx,ecx
lea     edx,[esp+4]
call    dword ptr fs:[0C0h]
ret     4

If we examine the 32-bit TEB, this field is marked as “WOW32Reserved” (kind of a misnomer, since it’s actually reserved for the 32-bit portion of Wow64):

0:000> dt ntdll!_TEB
   +0x000 NtTib            : _NT_TIB
[...]
   +0x0c0 WOW32Reserved    : Ptr32 Void

As I had previously mentioned, this all leads up to a transition to 64-bit mode in user mode in the Wow64 case. From there, the 64-bit portion of the Wow64 layer reformats the request into a 64-bit system call, makes the actual kernel transition, then repackages the results into what a 32-bit system call would return.

Thus, one case where a Wow64 DLL will necessarily differ is when that DLL contains a system call stub. The most common case of this is ntdll, of course, but there are a couple of other modules with inline system call stubs (user32.dll and gdi32.dll are the most common other modules; winsrv.dll (part of CSRSS) also contains stubs, but it has no Wow64 version).

There are other cases, however, where Wow64 DLLs differ from their native x86 counterparts. Usually, a DLL that makes an LPC (“Local Procedure Call”) call to another component also has a differing version in the Wow64 layer. This is because most of the LPC interfaces in 64-bit Windows have been “widened” to 64-bits, where they included pointers (most notably, this includes the LSA LPC interface for use in implementing code that talks to LSA, such as LogonUser or calls to authentication packages). As a result, modules like advapi32 or secur32 differ on Wow64.

This has, in fact, been the source of some rather insidious bugs where mistakes have been made in the translation layer. For instance, if you call LsaGetLogonSessionData on Windows Server 2003 x64, from a Wow64 process, you may unhappily find out that if your program accesses any fields beyond “LogonTime” in the SECURITY_LOGON_SESSION_DATA structure returned by LsaGetLogonSessionData, it will crash. This is the result of a bug in the Wow64 version of Secur32.dll (on Windows Server 2003 x64 / Windows XP x64), where part of the structure was not properly converted to its 32-bit counterpart after the 64-bit version is read “off the wire” from an LSA LPC call. This particular problem has been fixed in Windows Vista x64 (and can be worked around with some clever hackery if you take a look at how the conversion layer in Secur32 works for Srv03 x64). Note that as far as I know from having spoken to PSS, Microsoft has no plans to fix the bug for pre-Vista platforms (and it is still broken in Srv03 x64 SP2), so if you are affected by that problem then you’ll have to live with implementing a hackish workaround if you can’t rebuild your app as native 64-bit (the problem only occurs for 32-bit processes on Srv03 x64, not 32-bit processes on Srv03 32-bit, or 64-bit processes on Srv03 x64).

There are a couple of other corner cases which result in necessary differences between Wow64 modules and their native 32-bit counterparts, though these are typically few and far between. Most notably absent in the list of things that require code changes for a Wow64 DLL are modules that directly talk to 64-bit drivers via IOCTLs. For example, even “low-level” modules such as mswsock.dll (the AFD Winsock2 Tcpip service provider which implements the user mode to kernel mode translation layer from Winsock2 service provider calls to AFD networking calls in kernel mode) that communicate heavily with drivers via IOCTLs are unchanged in Wow64. This is due to direct support for Wow64 processes in the kernel IOCTL interface (requiring changes to drivers to be aware of 32-bit callers via IoIs32bitProcess), such that 32-bit and 64-bit processes can share the same IOCTL codes even though such structures are typically “widened” to 64-bits for 64-bit processes.
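
For instance, a driver that passes a pointer-containing structure through an IOCTL might thunk for Wow64 callers along these lines (a hypothetical sketch – the structures and routine are made up for illustration, though IoIs32bitProcess is the real DDK routine):

#include <ntddk.h>

// Hypothetical 32-bit and 64-bit views of the same IOCTL input buffer.
typedef struct _MY_REQUEST32 {
 ULONG BufferPointer; // 32-bit pointer from a Wow64 caller
 ULONG BufferLength;
} MY_REQUEST32, *PMY_REQUEST32;

typedef struct _MY_REQUEST64 {
 ULONGLONG BufferPointer; // 64-bit pointer from a native caller
 ULONG     BufferLength;
} MY_REQUEST64, *PMY_REQUEST64;

NTSTATUS
MyHandleRequest(
 PIRP Irp,
 PVOID SystemBuffer
 )
{
 MY_REQUEST64 Request;

#if defined(_WIN64)
 if (IoIs32bitProcess(Irp))
 {
  // Wow64 caller: widen the 32-bit layout into the native layout.
  PMY_REQUEST32 Request32 = (PMY_REQUEST32)SystemBuffer;

  Request.BufferPointer = Request32->BufferPointer;
  Request.BufferLength  = Request32->BufferLength;
 }
 else
#endif
 {
  Request = *(PMY_REQUEST64)SystemBuffer;
 }

 // ... validate and act on Request ...
 return STATUS_SUCCESS;
}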

Due to this approach to Wow64, most of the modules that implement the 32-bit Win32 API on x64 computers are identical to their 32-bit counterparts (the same binary in the vast majority of cases). For DLLs that do require a rebuild, typically only small portions (system call stubs and LPC stubs in most cases) require any code changes. The major benefits to this design are savings with respect to QA and development time for the Wow64 layer, as most of the code implementing the Win32 API (at least in user mode) did not need to be re-engineered. That’s not to say that Microsoft hasn’t tested it, but it’s a far cry from having to modify every single module in a non-trivial way.

Additionally, this approach has the added benefit of preserving the vast majority of implementation details that have slowly crept into the Win32 programming model, as much as Microsoft has tried to prevent that from happening. Even programs that use undocumented system calls from NTDLL should generally work on the 64-bit version of the operating system via the Wow64 layer, for instance.

In a future posting later on down the line, I’ll be expanding on this high level overview of Wow64 and taking a more in depth look at the guts of Wow64 (of course, subject to change between OS releases without warning, as usual).

Beware GetThreadContext on Wow64

Friday, May 11th, 2007

If you’re planning on running a 32-bit program of yours under Wow64, one of the things that you may need to watch out for is a subtle change in how GetThreadContext / SetThreadContext operate for Wow64 processes.

Specifically, these operations require additional access rights when operating on a Wow64 context (as is the case of a Wow64 process calling Get/SetThreadContext). This is hinted at in MSDN in the documentation for GetThreadContext, with the following note:

WOW64: The handle must also have THREAD_QUERY_INFORMATION access.

However, the reality of the situation is not completely documented by MSDN.

Under the hood, the Wow64 context (e.g. x86 context) for a thread on Windows x64 is actually stored at a relatively well-known location in the virtual address space of the thread’s process, in user mode. Specifically, one of the TLS slots in the thread’s TLS array is repurposed to point to a block of memory that contains the Wow64 context for the thread (used to read or write the context when the Wow64 “portion” of the thread is not currently executing). This is presumably done for performance reasons, as the Wow64 thunk layer needs to be able to quickly transition to/from x86 mode and x64 mode. By storing the x86 context in user mode, this transition can be managed without a kernel mode call. Instead, a simple far call is used to make the transition from x86 to x64 (and an iretq is used to transition from x64 to x86). For example, the following is what you might see when stepping into a Wow64 layer call with the 64-bit debugger while debugging a 32-bit process:

ntdll_777f0000!ZwWaitForMultipleObjects+0xe:
00000000`7783aebe 64ff15c0000000  call    dword ptr fs:[0C0h]
0:000:x86> t
wow64cpu!X86SwitchTo64BitMode:
00000000`759c31b0 ea27369c753300  jmp     0033:759C3627
0:012:x86> t
wow64cpu!CpupReturnFromSimulatedCode:
00000000`759c3627 67448b0424      mov     r8d,dword ptr [esp]

A side effect of the Wow64 register context being stored in user mode, however, is that it is not so easily accessible to a remote process. In order to access the Wow64 context, what needs to occur is that the TEB for a thread in question must be located, then read out of the thread’s process’s address space. From there, the TLS array is processed to locate the pointer to the Wow64 context structure, which is then read (or written) from the thread’s process’s address space.

If you’ve been following along so far, you might see the potential problem here. From the perspective of GetThreadContext, there is no handle to the process associated with the thread in question. In other words, we have a missing link here: In order to retrieve the Wow64 context of the thread whose handle we are given, we need to perform a VM read operation on its process. However, to do that, we need a process handle, but we’ve only got a thread handle.

The way that the Wow64 layer solves this problem is to query the process ID associated with the requested thread, open a new handle to the process with the required access rights, and then perform the necessary VM read / write operations.

Now, normally, this works fine; if you can get a handle to the thread of a process, then you should almost always be able to get a handle to the process itself.

However, the devil is in the details, here. There are situations where you might not have access to open a handle to a process, even though you have a handle to the thread (or even an existing process handle, but you can’t open a new one). These situations are relatively rare, but they do occur from time to time (in fact, we ran into one at work here recently).

The most likely scenario for this issue is when you are dealing with (e.g. creating) a process that is operating under a different security context than your process. For example, if you are creating a process operating as a different user, or are working with a process that has a higher integrity level than the current process, then you might not have access to open a new handle to the process (even if you already have an existing handle returned by a CreateProcess* family routine).

This turns into a rather frustrating problem, especially if you have a handle to both the process and a thread in the process that you’re modifying; in that case, you already have the handle that the Wow64 layer is trying to open, but you have no way to communicate it to Wow64 (and Wow64 will try and fail to open the handle on its own when you make the GetThreadContext / SetThreadContext call).

There are two effective solutions to this problem, neither of which are particularly pretty.

First, you could reverse engineer the Wow64 layer and figure out the exact specifics behind how it locates the PWOW64_CONTEXT, and implement that logic inline (using your already-existing process handle instead of creating a new one). This has the downside that you’re way into undocumented implementation details land, so there isn’t a guarantee that your code will continue to operate on future Windows versions.

The other option is to temporarily modify the security descriptor of the process to allow you to open a second handle to it for the duration of the GetThreadContext / SetThreadContext calls. Although this works, it’s definitely a pain to have to go muddle around with security descriptors just to get the Wow64 layer to work properly.
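
A rough sketch of that second approach, using the documented security APIs to grant the current user enough access for the duration of the call (illustrative only – it assumes the process handle was opened with READ_CONTROL and WRITE_DAC, and most error handling is elided):

#include <windows.h>
#include <aclapi.h>

// Temporarily grant ourselves the access rights that the Wow64
// Get/SetThreadContext translation layer needs when it re-opens the
// target process internally, make the call, then restore the DACL.
BOOL GetWow64ThreadContextWorkaround(HANDLE Process, HANDLE Thread,
 PCONTEXT Context)
{
 PACL OldDacl = 0, NewDacl = 0;
 PSECURITY_DESCRIPTOR Sd = 0;
 EXPLICIT_ACCESSW Access = { 0 };
 BOOL Succeeded = FALSE;

 if (GetSecurityInfo(Process, SE_KERNEL_OBJECT,
  DACL_SECURITY_INFORMATION, 0, 0, &OldDacl, 0, &Sd) != ERROR_SUCCESS)
  return FALSE;

 Access.grfAccessPermissions = PROCESS_QUERY_INFORMATION |
  PROCESS_VM_READ | PROCESS_VM_WRITE | PROCESS_VM_OPERATION;
 Access.grfAccessMode       = GRANT_ACCESS;
 Access.grfInheritance      = NO_INHERITANCE;
 Access.Trustee.TrusteeForm = TRUSTEE_IS_NAME;
 Access.Trustee.ptstrName   = (LPWSTR)L"CURRENT_USER";

 if (SetEntriesInAclW(1, &Access, OldDacl, &NewDacl) == ERROR_SUCCESS &&
  SetSecurityInfo(Process, SE_KERNEL_OBJECT,
   DACL_SECURITY_INFORMATION, 0, 0, NewDacl, 0) == ERROR_SUCCESS)
 {
  // Caller is expected to have set Context->ContextFlags already.
  Succeeded = GetThreadContext(Thread, Context);

  // Put the original DACL back afterward.
  SetSecurityInfo(Process, SE_KERNEL_OBJECT,
   DACL_SECURITY_INFORMATION, 0, 0, OldDacl, 0);
 }

 if (NewDacl)
  LocalFree(NewDacl);
 if (Sd)
  LocalFree(Sd);

 return Succeeded;
}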

Note that if you’re a native x64 process on x64 Windows, and you’re setting the 64-bit context of a 64-bit process, this problem does not apply. (Similarly, if you’re a 32-bit process on 32-bit Windows, things work “as expected” as well.)

So, in case you’ve been getting mysterious STATUS_ACCESS_DENIED errors out of GetThreadContext / SetThreadContext in a Wow64 program, now you know why.

Don’t perform complicated tasks in your unhandled exception filter

Thursday, May 10th, 2007

When it comes to crash reporting, the mechanism favored by many for globally catching “crashes” is the unhandled exception filter (as set by SetUnhandledExceptionFilter).

However, many people tend to go wrong with this mechanism with respect to what actions they take when the unhandled exception filter is called. To understand what I am talking about, it’s necessary to define the conditions under which an unhandled exception filter is executed.

The unhandled exception filter is, by definition, called when an unhandled exception occurs in any Win32 thread in a process. This sort of event is virtually always caused by some sort of corruption of the process state somewhere, such that something eventually probably touched a non-allocated page somewhere and caused an unhandled access violation (or some other similarly severe problem).

In other words, in the context of the unhandled exception filter, you don’t really know what led up to the current unhandled exception, and more importantly, you don’t know what you can rely on in the process. For example, if you get an AV that bubbles up to your UEF, it might have been caused by corruption in the process heap, which would mean that you probably can’t safely perform heap allocations without risking running into the same problem that caused the original crash in the first place. Or perhaps the problem was an unhandled allocation failure, and another attempt by your unhandled exception filter to allocate memory might just similarly fail.

Actually, the problem gets a bit worse, because you aren’t even guaranteed anything about what the other threads in the process are doing when the crash occurs (in fact, if there are any other threads in the process at the time of the crash, chances are that they’re still running when your unhandled exception filter is called – there is no magical logic to suspend all other activity in the process while your filter is called). This has a couple of other implications for you:

  1. You can’t really rely on the state of synchronization objects in the process. For all you know, the thread that crashed owned a lock that will cause a deadlock if you try to acquire a second lock, which might be owned by a thread that is waiting on the lock owned by the crashed thread.
  2. You can’t with 100% certainty assume that a secondary failure (precipitated by the “original” crash) won’t occur in a different thread, causing your exception filter to be entered by an additional thread at the same time as it is already processing an event from the “first” crash.

In fact, it would be safe to say that there is even less that you can safely do in an unhandled exception filter than under the infamous loader lock in DllMain (which is saying something indeed).

Given these rather harsh conditions, performing actions like heap allocations or writing minidumps from within the current process is likely to fail. Essentially, as whatever kind of recovery action you take from the UEF grows more complicated, it becomes far more likely to fail (possibly causing secondary failures that obscure the original problem) as a result of corruption that caused (or is caused by) the original crash. Even something as seemingly innocuous as creating a new process is potentially dangerous (are you sure that nothing in CreateProcess will ever touch the process heap? What about acquiring the loader lock? I’ll give you a hint – the latter is definitely not true, such as in the case where Software Restriction Policies are defined).

If you’ve ever taken a look at the kernel32 JIT debugger support in Windows XP, you may have noticed that it doesn’t even follow these rules – it calls CreateProcess, after all. This is part of the reason why sometimes you’ll have programs silently crash even with a JIT debugger. For programs where you want truly robust crash reporting, I would recommend putting as much of the crash reporting logic as possible into a separate process that is started before a crash occurs (e.g. during program initialization), instead of following the JIT launch-reporting-process approach. This “watchdog process” can then sit idle until it is signaled by the process it is watching over that a crash has occurred.

This signaling mechanism should, ideally, be pre-constructed during initialization so that the actual logic within the unhandled exception filter just signals the watchdog process that an event has occurred, with a pointer to the exception/context information. The filter should then wait for the watchdog process to signal that it is finished before exiting the process.

The mechanism that we use here at work with our programs to communicate between the “guarded” process and the watchdog is simply a file mapping that is mapped into both processes (for passing information between the two, such as the address of the exception record and context record for an active exception event) and a pair of events that are used to communicate “exception occurred” and “dump writing completed” between the two processes. With this configuration, all the exception filter needs to do is to store some data in the file mapping view (already mapped ahead of time) and call SetEvent to notify the watchdog process to wake up. It then waits for the watchdog process to signal completion before terminating the process. (This particular mechanism does not address the issue of multiple crashes occurring at the same time, which is something that I deemed acceptable in this case.) The watchdog process is responsible for all of the “heavy lifting” of the crash reporting process, namely, writing the actual dump with MiniDumpWriteDump.
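
The guarded-process half of such a mechanism might look something like this (a simplified sketch – the shared structure layout and global names are made up for illustration, and the mapping and events are assumed to have been created during initialization, before any crash):

#include <windows.h>

// Shared state between the guarded process and the watchdog, stored in
// a file mapping that both processes map at startup (illustrative).
typedef struct _CRASH_INFO {
 DWORD     ProcessId;
 DWORD     ThreadId;
 ULONG_PTR ExceptionPointers; // EXCEPTION_POINTERS* in the guarded process
} CRASH_INFO;

static CRASH_INFO *g_CrashInfo; // mapped view, set up at startup
static HANDLE g_CrashEvent;     // "exception occurred" -> watchdog
static HANDLE g_DumpDoneEvent;  // "dump writing completed" <- watchdog

LONG WINAPI CrashFilter(EXCEPTION_POINTERS *ExceptionPointers)
{
 // Do as little as possible here: no heap use, no locks, no LoadLibrary.
 g_CrashInfo->ProcessId         = GetCurrentProcessId();
 g_CrashInfo->ThreadId          = GetCurrentThreadId();
 g_CrashInfo->ExceptionPointers = (ULONG_PTR)ExceptionPointers;

 // Wake the watchdog, then wait for it to finish writing the dump
 // (bounded, so a dead watchdog can't hang us forever).
 SetEvent(g_CrashEvent);
 WaitForSingleObject(g_DumpDoneEvent, 60 * 1000);

 TerminateProcess(GetCurrentProcess(),
  ExceptionPointers->ExceptionRecord->ExceptionCode);
 return EXCEPTION_CONTINUE_SEARCH; // not reached
}

The filter itself would be installed at startup with SetUnhandledExceptionFilter(CrashFilter). On the watchdog side, MiniDumpWriteDump can consume the captured pointer directly by filling out a MINIDUMP_EXCEPTION_INFORMATION structure with ClientPointers set to TRUE, so that the exception and context records are read out of the guarded process’s address space.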

An alternative to this approach is to have the watchdog process act as a debugger on the guarded process; however, I do not typically recommend this as acting as a debugger has a number of adverse side effects (notably, that a great many events cause the process to be suspended entirely while the debugger/watchdog process can inspect the state of the process for a particular debugger event). The watchdog process mechanism is more performant (if ever so slightly less robust), as there is virtually no run-time overhead in the guarded process unless an unhandled exception occurs.

So, the moral of the story is: keep it simple (at least with respect to your unhandled exception filters). I’ve dealt with mechanisms that try to do the error reporting logic in-process, and those that punt the hard work off to a watchdog process in a clean state, and the latter is significantly more reliable in real world cases. The last thing you want to be happening with your crash reporting mechanism is that it causes secondary problems that hide the original crash, so spend the bit of extra work to make your reporting logic that much more reliable and save yourself the headaches later on.

What is the “lpReserved” parameter to DllMain, really? (Or a crash course in the internals of user mode process initialization)

Wednesday, May 9th, 2007

One of the parameters to the DllMain function is the enigmatic lpReserved argument. According to MSDN, this parameter is used to convey whether a DLL is being loaded (or unloaded) as part of process startup or termination, or as part of a dynamic DLL load/unload operation (e.g. LoadLibrary/FreeLibrary).

Specifically, MSDN says that for static DLL operations, lpReserved contains a non-null value, whereas for dynamic DLL operations, it contains a null value.

There’s actually a little bit more to this parameter, though. While it’s true that in the case of DLL_PROCESS_DETACH operations, it is little more than a boolean value, it has some more significance in DLL_PROCESS_ATTACH operations.

To understand this, you need to know a little bit more about how process/thread initialization occurs. When a new user mode thread is started, the kernel queues a user mode APC to it, pointing to a function in ntdll called LdrInitializeThunk (ntdll is always mapped into the address space of a new process before any user mode code runs). The kernel also arranges for the thread to dispatch the user mode APC the first time it begins execution.

One of the arguments to the APC is a pointer to a CONTEXT structure describing the initial execution state of the new thread. (The CONTEXT structure itself is located on the stack of the new thread.)

When the thread is first resumed, the APC executes and control transfers to ntdll!LdrInitializeThunk. From there, depending on whether the process is already initialized or not, either the process initialization code is executed (loading DLLs statically linked to the process, and so forth), or per-thread initialization code is run (for instance, making DLL_THREAD_ATTACH callouts to loaded DLLs).

If a user mode thread always actually begins execution at ntdll!LdrInitializeThunk, then you might be wondering how it ever starts executing at the start address specified in a CreateThread call. The answer is that eventually, the code called by LdrInitializeThunk passes the context record argument supplied by the kernel to the NtContinue system call, which you can think of as simply taking that context and transferring control to it. Because the context record argument to the APC contained the information necessary for control to be transferred to the starting address supplied to CreateThread, the thread then begins executing at the expected thread starting address*.

(*: Actually, there is typically another layer here – usually, control would go to a kernel32 or ntdll function (depending on whether you are running on a downlevel platform or on Vista), which sets up a top level exception handler and then calls the start routine supplied to CreateThread. But, for the purposes of this discussion, you can consider it as just running the requested thread start routine.)

As all of this relates to DllMain, the value of the lpReserved parameter to DllMain (when process initialization is occurring and static-linked DLLs are being loaded and initialized) corresponds to the context record argument supplied to the LdrInitializeThunk APC, and is thus representative of the initial context that will be set for the thread after initialization completes. In fact, it’s actually the context that will be used after process initialization is complete, and not just a copy of it. This means that by treating the lpReserved argument as a PCONTEXT, the initial execution state for the first thread in the process can be examined (or even altered) from the DllMain of a static-linked DLL.
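
For example, a static-linked DLL could peek at (or tweak) the initial context like so (a sketch for x64; keep in mind that this behavior is only pseudo-documented and subject to change):

#include <windows.h>

BOOL
WINAPI
DllMain(
 HINSTANCE Instance,
 DWORD Reason,
 LPVOID Reserved
 )
{
 // During process initialization (static-linked load), lpReserved points
 // at the very CONTEXT that NtContinue will eventually realize as the
 // initial user mode context of the first thread.
 if (Reason == DLL_PROCESS_ATTACH && Reserved)
 {
  PCONTEXT InitialContext = (PCONTEXT)Reserved;

  // Hardware breakpoints set via this context *will* survive process
  // initialization, since these are the register values that NtContinue
  // restores (unlike debug registers set before the initial breakpoint).
  InitialContext->Dr0 = InitialContext->Rip;
  InitialContext->Dr7 |= 1; // enable local breakpoint 0 (execute)
 }

 return TRUE;
}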

This can be verified experimentally by using some trickery to step into process initialization DllMain (more on just how to do that with the user mode debugger in a future entry, as it turns out to be a bit more complicated than one might imagine):

1:001> g
Breakpoint 3 hit
ADVAPI32!DllInitialize:
000007fe`feb15580 48895c2408      mov     qword ptr [rsp+8],
rbx ss:00000000`001defc0=0000000000000000
1:001> r
rax=0000000000000000 rbx=00000000002c4770 rcx=000007fefeaf0000
rdx=0000000000000001 rsi=000007fefeb15580 rdi=0000000000000003
rip=000007fefeb15580 rsp=00000000001defb8 rbp=0000000000000000
 r8=00000000001df530  r9=00000000001df060 r10=00000000002c1310
r11=0000000000000246 r12=0000000000000000 r13=00000000001df0f0
r14=0000000000000000 r15=000000000000000d
iopl=0         nv up ei pl zr na po nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b
             efl=00000246
ADVAPI32!DllInitialize:
000007fe`feb15580 48895c2408      mov     qword ptr [rsp+8],
rbx ss:00000000`001defc0=0000000000000000
1:001> k
RetAddr           Call Site
00000000`77414664 ADVAPI32!DllInitialize
00000000`77417f29 ntdll!LdrpRunInitializeRoutines+0x257
00000000`7748e974 ntdll!LdrpInitializeProcess+0x16af
00000000`7742c4ee ntdll! ?? ::FNODOBFM::`string'+0x1d641
00000000`00000000 ntdll!LdrInitializeThunk+0xe
1:001> .cxr @r8
rax=0000000000000000 rbx=0000000000000000 rcx=00000000ffb3245c
rdx=000007fffffda000 rsi=0000000000000000 rdi=0000000000000000
rip=000000007742c6c0 rsp=00000000001dfa08 rbp=0000000000000000
 r8=0000000000000000  r9=0000000000000000 r10=0000000000000000
r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei pl nz na pe nc
cs=0033  ss=002b  ds=0000  es=0000  fs=0000  gs=0000
             efl=00000200
ntdll!RtlUserThreadStart:
00000000`7742c6c0 4883ec48        sub     rsp,48h
1:001> u @rcx
tracert!mainCRTStartup:
00000000`ffb3245c 4883ec28        sub     rsp,28h

If you’ve been paying attention thus far, you might be able to explain why, when you set a hardware breakpoint at the initial process breakpoint, the debugger warns you that it will not take effect. For example:

(16f0.1890): Break instruction exception
- code 80000003 (first chance)
ntdll!DbgBreakPoint:
00000000`7742fdf0 cc              int     3
0:000> ba e1 kernel32!CreateThread
        ^ Unable to set breakpoint error
The system resets thread contexts after the process
breakpoint so hardware breakpoints cannot be set.
Go to the executable's entry point and set it then.
 'ba e1 kernel32!CreateThread'
0:000> k
RetAddr           Call Site
00000000`774974a8 ntdll!DbgBreakPoint
00000000`77458068 ntdll!LdrpDoDebuggerBreak+0x35
00000000`7748e974 ntdll!LdrpInitializeProcess+0x167d
00000000`7742c4ee ntdll! ?? ::FNODOBFM::`string'+0x1d641
00000000`00000000 ntdll!LdrInitializeThunk+0xe

Specifically, this message occurs because the debugger knows that after the process breakpoint occurs, the current thread context will be discarded at the call to NtContinue. As a result, hardware breakpoints (which rely on the debug register state) will be wiped out when NtContinue restores the expected initial context of the new thread.

If one is clever, it’s possible to apply the necessary modifications to the appropriate debug register values in the context record image that is given as an argument to LdrInitializeThunk, which will thus be realized when NTDLL initialization code for the thread runs.

The fact that every user mode thread actually begins life at ntdll!LdrInitializeThunk also explains why, exactly, you can’t create a new thread in the current process from within DllMain and attempt to synchronize with it; by virtue of the fact that the current thread is executing DllMain, it must hold the enigmatic loader lock. Because the new thread begins execution at LdrInitializeThunk (before the start routine you supply is ever called) in order to make DLL_THREAD_ATTACH callouts, it too will almost immediately block on the loader lock. If the thread already in DllMain then waits for the new thread, the result is a classic deadlock, as the sketch below illustrates.
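For the sake of completeness, here’s a minimal sketch of the deadlock (don’t do this in real code, obviously):

#include <windows.h>

static
DWORD
WINAPI
ThreadStart(
 LPVOID Parameter
 )
{
 UNREFERENCED_PARAMETER(Parameter);

 //
 // Never reached; the new thread blocks on the loader lock before
 // this routine is ever called.
 //

 return 0;
}

BOOL
WINAPI
DllMain(
 HINSTANCE Instance,
 DWORD Reason,
 LPVOID Reserved
 )
{
 if (Reason == DLL_PROCESS_ATTACH)
 {
  HANDLE Thread = CreateThread(0, 0, ThreadStart, 0, 0, 0);

  if (Thread)
  {
   //
   // Deadlock: this thread holds the loader lock, and the new
   // thread needs it at LdrInitializeThunk (for DLL_THREAD_ATTACH
   // callouts) before ThreadStart can run.
   //

   WaitForSingleObject(Thread, INFINITE);
   CloseHandle(Thread);
  }
 }

 return TRUE;
}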

Parting shots:

  • Win9x is, of course, completely dissimilar as far as DllMain goes. None of this information applies there.
  • The fact that lpReserved is a PCONTEXT is only very loosely documented by a couple of ancient DDK samples that used the PCONTEXT argument type, and some more recent SDK samples that name the “lpReserved” parameter “lpvContext”. As far as I know, it’s been around on all versions of NT (including Vista), but like other pseudo-documented things, it isn’t necessarily guaranteed to remain this way forever.
  • Oh, and in case you’re wondering why I used advapi32 instead of kernel32 in this example: due to a rather interesting quirk in ntdll on recent versions of Windows, kernel32 is always dynamic-loaded for every Win32 process (regardless of whether or not the main process image is static linked to kernel32). To make things even more interesting, kernel32 is dynamic-loaded before static-linked DLLs are loaded. As a result, I decided it would be best to steer clear of it for the purposes of keeping this posting simple; be suitably warned, then, about trying this out on kernel32 at home.

Process-level security is not a particularly great way to enforce DRM when users own their own hardware.

Tuesday, May 8th, 2007

Recently, I discussed the basics of the new “process-level security” mechanism introduced with Windows Vista (integrity levels, otherwise known as “mandatory integrity control”, or MIC for short).

Although MIC, when combined with more conventional user-level access control, has the potential to improve security for users to an extent, it is ultimately not a mechanism for locking users out of their own computers.

As you might have guessed by this point, I am speaking of the rather less savory topic of DRM. MIC might appear to be attractive to developers that wish to deploy a DRM system, but it really doesn’t provide a particularly effective way to stop a computer owner (administrator) from, well, administering their system.

On the surface, MIC (and process-level security) may appear to be a good way to accomplish this goal. After all, the process-level security model does allow for securable objects (such as processes) to be guarded against other processes – even those running under the same user SID – which is typically the kind of restriction that a software-based DRM system will try to enforce (i.e. preventing you from debugging a program).

However, it is important to consider that the sort of restrictions imposed by process-level security mechanisms are designed to protect programs from other programs. They are not supposed to protect programs from the user that controls the computer on which they run (in other words, the computer administrator or whatever you wish to call it).

Windows Vista attempts to implement such a (DRM) protection scheme, loosely based on the principles of process-level security, in the form of something called “protected processes”.

If you look through the Vista SDK headers (specifically, winnt.h), you may come across a particularly telling comment that would seem to indicate that protected processes were originally intended to be implemented via the MIC scheme for process-level security in Vista:

#define SECURITY_MANDATORY_LABEL_AUTHORITY       {0,0,0,0,0,16}
#define SECURITY_MANDATORY_UNTRUSTED_RID         (0x00000000L)
#define SECURITY_MANDATORY_LOW_RID               (0x00001000L)
#define SECURITY_MANDATORY_MEDIUM_RID            (0x00002000L)
#define SECURITY_MANDATORY_HIGH_RID              (0x00003000L)
#define SECURITY_MANDATORY_SYSTEM_RID            (0x00004000L)
#define SECURITY_MANDATORY_PROTECTED_PROCESS_RID (0x00005000L)

//
// SECURITY_MANDATORY_MAXIMUM_USER_RID is the highest RID
// that can be set by a usermode caller.
//

#define SECURITY_MANDATORY_MAXIMUM_USER_RID \
   SECURITY_MANDATORY_SYSTEM_RID

As it turns out, protected processes are not actually implemented using the integrity level/MIC mechanism on Vista; instead, a separate mechanism marks protected processes as “untouchable” by “normal” processes. (The lack of flexibility in the integrity level ACE system, as far as specifying which access rights are permitted goes, is the likely reason. If you read the linked article and the paper it includes, there is a new set of access rights, deemed “safe”, defined specially for dealing with protected processes. These access rights are requestable for such processes, unlike the standard access rights, and there isn’t a good way to convey that distinction with the set of “allow/deny read/write/execute” options available with an integrity level ACE on Vista.)

The end result is, however, for the most part the same: “protected processes” are essentially to high integrity (or lower) processes as high (or medium) integrity processes are to low integrity processes; that is, they cannot be adversely affected by a lesser-trusted process.

This is where the system begins to break down, though. Process integrity is an interesting way to attempt to curtail malware and exploits because the human at the computer (presumably) does not wish such activity to occur. On the other hand, DRM attempts to prevent the human at their computer from performing an action that they (ostensibly) do in fact wish to perform, with their own computer.

This is a fundamental distinction. The difference is that the malware or exploit code that process level security is designed to defend against doesn’t have the benefit of a human with physical (or administrative) access to the computer in question. That little detail turns out to make a world of difference, as we humans aren’t necessarily constrained by the security system like a program would be. For instance, if some evil exploit code running as a low integrity process on a computer wants to gain administrative access to the box, it just can’t do so (excepting the possibility of local privilege escalation exploits or trying to social-engineer the user into giving the program said access – for the moment, ignore those attack vectors, though they are certainly real ones that must be dealt with at some point).

However, if I am a human sitting at my computer, and I am logged on as a “plain user” and wish to perform an administrative task, I am not so constrained. Instead, I simply either log out and log back in as an administrative user (using my administrative account password), or type my password into an elevation prompt. Problem solved!

Now, of course, the protected process mechanism in Vista isn’t quite that dumb. It does try to block administrators from gaining access to protected processes; direct attempts will return STATUS_ACCESS_DENIED. However, again, humans can be a bit more clever here. For one, a user (and by user, I mean a person with full control over their computer) who is intent on bypassing the protected process mechanism could simply load a driver designed to subvert it.

The DRM system might then counter that attack by requiring kernel mode code to be signed, on the theory that for wide-scale violations of the DRM system in such a manner, a “cracker” would need to obtain a code-signing cert that would make them more easily identifiable and vulnerable to legal attack.

However, people are clever (and, more specifically, people with physical or administrative access to a computer are not necessarily constrained by the basic “rules” of the operating system). One could imagine somebody patching out the driver signing checks on disk, or any number of other approaches. The theoretical counter to attacks like that would be some sort of hardware support to verify the boot process and ensure that only trusted, signed (and thus unmodified by a “cracker”) code can boot the system. Even that is not necessarily foolproof, though; what’s to say that nobody has compromised the task-offload engine on the system’s NIC to run custom code with full physical memory access, outside the confines of the operating system entirely? Free rein over something capable of performing DMA to physical memory means that kernel code and data can be freely rewritten.

Now, where am I going with all of this? I suppose that I am just frustrated that certain people seem intent on continuing to invest significant resources into systems that try to wrest control of a computer from the end user, systems that are simply doomed to fail by the very nature of the diverse and uncontrolled environments upon which that code will run (and which sometimes compromise the security of customer systems in the process). I don’t think the people behind the protected processes system at Microsoft are stupid, not by any means. However, I can’t help but feel that they know they’re fighting a losing battle, and that their knowledge and expertise would be better spent on more productive things (like working to improve the next release of Windows, or what-have-you).

Now, a couple of parting shots in an effort to quell several potential misconceptions before they begin:

  • I am not advocating that people bypass DRM. This is probably less than legal in many places. I am, however, trying to make a case for the fact that trying to use security models originally designed to protect users from malware as a DRM mechanism is at best a bad idea.
  • I’m also not trying to downplay the negative impact of theft of copyrighted materials, or anything of that sort. As a programmer myself, I’m well aware that if nobody will buy your product because it’s pirated all over the world, then it’s hard to eke out a living. However, I do believe that it is a fallacy to say that it’s impossible to make money out of software or content in the Internet age without layer after layer of customer-unfriendly DRM.
  • I’m not trying to knock the rest of the improvements in Vista (or the start of process-level security being deployed to joe end user, even though it’s probably not yet perfect). There’s a lot of good work that’s been done with Vista, and despite the (ill-conceived, some might say) DRM mechanisms, there is real value that has been added with this release.
  • I’m also not trying to say that Microsoft is devoting so much of its time to DRM that it isn’t paying any attention to adding real value to its products. However, in my view, most of the time spent on DRM is time that could be better spent adding that “real value” instead of doing the dance of security by obscurity (as with today’s systems, that is really all you can do, when it comes down to it) with some enigmatic idea of a “cracker” out there intent on stealing every piece of software or content they get their hands on and redistributing it to every person in the world for free.
  • I’m also not trying to state that the kernel mode code signing requirements for x64 Vista are entirely motivated by DRM (or that all it’s good for is an attempt to enforce DRM), but I doubt that anyone could truthfully say that DRM played no part in the decision to require signed drivers on x64 Vista either. Regardless, there remain other reasons for ostensibly requiring signed code besides trying to block (or at least hold accountable) attempts to bypass the protected process system.

Tricks for getting the most out of your minidumps: Including specific memory regions in a dump

Friday, May 4th, 2007

If you’ve ever worked on any sort of crash reporting mechanism, one of the constraints that you are probably familiar with is the size of the dump file created by your reporting mechanism. Obviously, as developers, we’d really love to write a full dump including the entire memory image of the process, full data about all threads and handles (and the like), but this is often impractical in the real world (particularly if you are dealing with some sort of automated crash submission system, which needs to be as unintrusive as possible, including not requiring the transfer of 50MB .dmp files).

One way you can improve the quality of the dumps your program creates, without making the resulting .dmp unacceptably large, is to apply a bit of intelligence as to what parts of memory you’re interested in. After all, while certainly potentially useful, chances are you won’t really need the entire address space of the process at the time of a crash to track down the issue. Often, a stack trace (plus a listing of threads) is all it takes, which is more along the lines of what you get from a fairly minimalistic minidump.

However, there are plenty of cases where the little piece of state information that might explain how your program got into its crashed state isn’t on the stack, leaving you stuck without additional information. An approach that can sometimes help is to include specific, “high-value” regions of memory in the dump. For example, something that can often be helpful (especially in these days of custom calling conventions that avoid using the stack wherever possible) is to include a small portion of memory around the location each register points to.

The idea here is that when you’re about to write a dump, you check each register in the faulting context to see if it points to a valid location in the address space of the crashed process. If so, you include a bit of memory (say, +/- 128 bytes [or some other small amount] from the register’s value) in the dump. On x86, you can optimize this a bit further and typically leave out eip/esp/ebp, as well as any register that points into an executable section of an image, on the assumption that you’ll probably be able to grab any relevant images from the symbol server (you are using a symbol repository with your own binaries included, aren’t you?) and don’t need to waste space on that code in the dump.

One class of problem that this can be rather helpful in debugging is a crash where you have some sort of structure or class that is getting used in a partially valid state, and you need the contents of the struct/class to figure out just what happened. In many cases, you can probably infer the state of your mystery class/struct from what other threads in the program were doing, but sometimes this isn’t possible. In those cases, having access to the class/struct that was being ‘operated upon’ is a great help, and often you’ll find code where there is a `this’ pointer to an address on the heap that is tantalizingly present in the current register context. If you were using a typical minimalistic dump, then you would probably not have access to heap memory (due to size constraints) and might find yourself out of luck. If you included a bit of memory around each register when the crash occurred, however, that just might give you the extra data points needed to figure out the problem. (Determining which registers “look” like a pointer is easily accomplished with several calls to VirtualQueryEx on the target, taking each crash-context register value as an address in the target process and checking whether it refers to a committed region.)
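For example, the pointer test might look something like the following sketch (the function name and the +/- 128 byte window are just illustrative choices):

#include <windows.h>

//
// Returns TRUE and fills Base/Length if RegisterValue looks like a
// pointer into committed memory in the crashed process.
//

BOOL
GetRegisterRegion(
 HANDLE CrashedProcess,
 ULONG_PTR RegisterValue,
 PULONG_PTR Base,
 PULONG Length
 )
{
 MEMORY_BASIC_INFORMATION Mbi;
 ULONG_PTR Start, End;

 if (!VirtualQueryEx(CrashedProcess, (PVOID)RegisterValue, &Mbi,
                     sizeof(Mbi)))
  return FALSE;

 if (Mbi.State != MEM_COMMIT)
  return FALSE;

 //
 // Take +/- 128 bytes around the register value, clipped to the
 // containing region (guarding against underflow).
 //

 Start = RegisterValue >= 128 ? RegisterValue - 128 : 0;

 if (Start < (ULONG_PTR)Mbi.BaseAddress)
  Start = (ULONG_PTR)Mbi.BaseAddress;

 End = RegisterValue + 128;

 if (End > (ULONG_PTR)Mbi.BaseAddress + Mbi.RegionSize)
  End = (ULONG_PTR)Mbi.BaseAddress + Mbi.RegionSize;

 *Base   = Start;
 *Length = (ULONG)(End - Start);

 return TRUE;
}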

Another good use case for this technique is to capture additional program state by including the contents of various key heap- (or global-) based objects that wouldn’t normally be included in the dump. In that case, you’ll probably need to set up some mechanism to convey “interesting” addresses to the crash reporting mechanism before a crash occurs, so that it can simply include them in the dump without having to grovel around in the target’s memory picking out interesting things after the fact (something that is generally not practical in an automated fashion, especially without symbols). For example, if you’ve got some kind of server application, you could include pointers to particularly useful per-session-state data (or portions thereof, size constraints considered). The need for this can be reduced somewhat by including useful verbose logging data, but you might not always want verbose logging on all the time (for various reasons), which might result in an initial repro of a problem being less than useful for uncovering a root cause.

Assuming that you are following the recommended approach of not writing dumps in-process, the easiest way to handle this sort of communication between the program and the (hopefully isolated in a different process) crash reporting mechanism is to use something like a file mapping that contains a list (or fixed-size array) of pointers and sizes to record in the dump. This makes adding or removing “interesting” pointers from the list as simple as adding or removing an entry in a flat array.
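One possible layout for such a mapping (purely illustrative; the names and the fixed array size are made up for the example):

#include <windows.h>

#define MAX_DUMP_REGIONS 64

typedef struct _DUMP_REGION
{
 ULONG64 Base;   // Address in the monitored process.
 ULONG   Length; // Byte count to include in the dump.
} DUMP_REGION;

typedef struct _DUMP_REGION_TABLE
{
 volatile LONG Count; // Entries currently in use.
 DUMP_REGION   Regions[MAX_DUMP_REGIONS];
} DUMP_REGION_TABLE;

The monitored program would create a named section of sizeof(DUMP_REGION_TABLE) bytes at startup and register regions as interesting state is allocated; the crash reporting process then maps the same section and walks Regions[0..Count-1] when writing the dump.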

As far as including additional memory regions in a minidump goes, this is accomplished by including a MiniDumpCallback function in your call to MiniDumpWriteDump (via the CallbackParam parameter). The minidump callback is essentially a way to perform advanced customizations on how your dump is processed, beyond a set of general behaviors supplied by the DumpType parameter. Specifically, the minidump callback lets you do things like include/exclude all sorts of things from the dump – threads, handle data, PEB/TEB data, memory locations, and more – in a programmatic fashion. The way it works is that as MiniDumpWriteDump is writing the dump, it will call the callback function you supply a number of times to query you for any data you want to add or subtract from the dump. There’s a huge amount of customization you can do with the minidump callback; too much for just this post, so I’ll simply describe how to use it to include specific memory regions.

As far as including memory regions goes, you need to wait for the MemoryCallback event to be passed to your minidump callback. The way the MemoryCallback event works is that you are called back repeatedly, until your callback returns FALSE. Each time you are called back (and return TRUE), you are expected to have updated the CallbackOutput->MemoryBase and CallbackOutput->MemorySize output fields with the base address and length of a region that is to be included in the dump. When your callback finally returns FALSE, MiniDumpWriteDump assumes that you’re done specifying additional memory regions and continues on with the rest of the steps involved in writing the dump.

So, to provide a quick example, assuming you had a DumpWriter class containing an array of address / length pairs, you might use a minidump callback that looks something like this to include those addresses in the dump:

BOOL CALLBACK DumpWriter::MiniDumpCallback(
 PVOID CallbackParam,
 const PMINIDUMP_CALLBACK_INPUT CallbackInput,
 PMINIDUMP_CALLBACK_OUTPUT CallbackOutput
 )
{
 DumpWriter *Writer;
 BOOL        Status;

 Status = FALSE;

 Writer = reinterpret_cast<DumpWriter*>(CallbackParam);

 switch (CallbackInput->CallbackType)
 {

/*
 ... handle other events ...
 */

 case MemoryCallback:
  //
  // If we have some memory regions left to include then
  // store the next. Otherwise, indicate that we're finished.
  //

  if (Writer->Index == Writer->Count)
   Status = FALSE;
  else
  {
   CallbackOutput->MemoryBase =
     Writer->Addresses[ Writer->Index ].Base;
   CallbackOutput->MemorySize =
     Writer->Addresses[ Writer->Index ].Length;

   Writer->Index += 1;
   Status = TRUE;
  }
  break;

/*
 ... handle other events ...
 */
 }

 return Status;
}

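Hooking the callback up is then just a matter of filling out a MINIDUMP_CALLBACK_INFORMATION structure when calling MiniDumpWriteDump. A sketch follows; the handles and the Writer instance are assumed to come from your crash reporting mechanism:

#include <dbghelp.h> // link against dbghelp.lib

MINIDUMP_CALLBACK_INFORMATION CallbackInfo;

CallbackInfo.CallbackRoutine = DumpWriter::MiniDumpCallback;
CallbackInfo.CallbackParam   = Writer; // DumpWriter*

MiniDumpWriteDump(
 ProcessHandle,   // handle to the crashed process
 ProcessId,
 DumpFileHandle,  // opened with GENERIC_WRITE
 MiniDumpNormal,  // base dump type; the callback adds extra regions
 ExceptionParam,  // PMINIDUMP_EXCEPTION_INFORMATION, or NULL
 NULL,            // no user streams
 &CallbackInfo);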
In a future posting, I’ll likely revisit some of the other neat things that you can do with the minidump callback function (as well as other ways to make your minidumps more useful to work with). In the meantime, Oleg Starodumov also has some great documentation (beyond that in MSDN) about just what all the other minidump callback events do, so if you’re finding MSDN a little bit lacking in that department, I’d encourage you to check his article out.