Archive for the ‘Programming’ Category

Programming against the x64 exception handling support, part 1: Definitions for x64 versions of exception handling support

Wednesday, December 13th, 2006

This is a series dealing with how to use the new x64 exception handling support from a programmatic perspective (that is, how to write programs that take advantage of the new support), as opposed to how to understand it while reverse engineering or disassembling something. Those topics have already been covered on this site in the past.

To get started with programming against the new x64 EH support, you’ll need to have the structure and prototype definitions for the standard x64 EH related functions and structures. One’s first instinct here is to go to MSDN. Be warned that if you are dealing with the low-level SEH routines (such as RtlUnwindEx), the documentation on MSDN is still missing or wrong for x64. For the most part, excepting RtlVirtualUnwind (which is actually correctly documented now), the exception handler support is only properly documented for IA64 (so don’t be surprised if things don’t work out how you would hope when calling RtlUnwindEx with the MSDN prototype).

For a recent project, I had to do some in-depth work with the inner workings of exception handling support on x64. So, if you have ever had to deal with the low-level EH internals on x64 and have been frustrated by documentation on MSDN that is either incomplete or just plain wrong, here are some of the things I have run into along the way that are either missing or incorrect on MSDN with respect to x64 EH support:

  1. When processing an UNWIND_INFO structure, if the UNW_FLAG_CHAININFO flag is set, then there is an additional undocumented possibility for how unwind information can be chained. Specifically, if the low bit is set in the UnwindInfoAddress of the IMAGE_RUNTIME_FUNCTION_ENTRY structure referred to by the parent UNWIND_INFO structure, UnwindInfoAddress is actually the RVA of another IMAGE_RUNTIME_FUNCTION_ENTRY structure after zeroing the low bit (instead of the RVA of an UNWIND_INFO structure). This is used to chain exception data across a binary more efficiently, with minimal wasted space (credits go to skape for telling me about this).
  2. The prototype on MSDN for RtlUnwindEx is only for IA64 and does not apply to x64. The correct prototype is something more along the lines of this:
    VOID
    NTAPI
    RtlUnwindEx(
       __in_opt ULONG64               TargetFrame,
       __in_opt ULONG64               TargetIp,
       __in_opt PEXCEPTION_RECORD     ExceptionRecord,
       __in     PVOID                 ReturnValue,
       __out    PCONTEXT              OriginalContext,
       __in_opt PUNWIND_HISTORY_TABLE HistoryTable
       );
  3. MSDN’s definition of DISPATCHER_CONTEXT (a structure that is passed to the language specific handler) is incomplete. There are some additional fields beyond HandlerData, which is the last field documented in MSDN. You can see this if you disassemble _C_specific_handler, which uses the undocumented ScopeIndex field. Additional credits go to Alex Ionescu for information on a couple of the undocumented DISPATCHER_CONTEXT fields. Here’s the correct definition of this structure for x64:
    typedef struct _DISPATCHER_CONTEXT {
        ULONG64               ControlPc;
        ULONG64               ImageBase;
        PRUNTIME_FUNCTION     FunctionEntry;
        ULONG64               EstablisherFrame;
        ULONG64               TargetIp;
        PCONTEXT              ContextRecord;
        PEXCEPTION_ROUTINE    LanguageHandler;
        PVOID                 HandlerData;
        PUNWIND_HISTORY_TABLE HistoryTable;
        ULONG                 ScopeIndex;
        ULONG                 Fill0;
    } DISPATCHER_CONTEXT, *PDISPATCHER_CONTEXT;
  4. Not all of the flags passed to an exception handler (primarily relating to unwinding) are properly documented on MSDN. These additional flags are included in winnt.h, however, and are actually the same for both x86 and x64. Here’s a listing of the missing flags that apply to the ExceptionFlags member of the EXCEPTION_RECORD structure (only the EXCEPTION_NONCONTINUABLE flag value is documented on MSDN):
    #define EXCEPTION_NONCONTINUABLE   0x0001
    #define EXCEPTION_UNWINDING        0x0002
    #define EXCEPTION_EXIT_UNWIND      0x0004
    #define EXCEPTION_STACK_INVALID    0x0008
    #define EXCEPTION_NESTED_CALL      0x0010
    #define EXCEPTION_TARGET_UNWIND    0x0020
    #define EXCEPTION_COLLIDED_UNWIND  0x0040
    #define EXCEPTION_UNWIND           0x0066

    In particular, EXCEPTION_UNWIND is a bitmask that combines all of the flags used to signify an unwind operation (EXCEPTION_UNWINDING, EXCEPTION_EXIT_UNWIND, EXCEPTION_TARGET_UNWIND, and EXCEPTION_COLLIDED_UNWIND). This is probably the most interesting bitmask/flag to you, as you’ll need it if you are distinguishing between an exception and an unwind operation from the perspective of an exception handler.
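As a minimal sketch of how a handler might use this mask, the following portable C fragment repeats the flag values from the listing above and tests an ExceptionFlags value against EXCEPTION_UNWIND. The helper name is_unwind_call is made up for illustration and is not part of any Windows header; real code would include the SDK headers rather than redefining the flags.

```c
#include <stdint.h>

/* Flag values repeated from the listing above (they come from winnt.h). */
#define EXCEPTION_NONCONTINUABLE   0x0001
#define EXCEPTION_UNWINDING        0x0002
#define EXCEPTION_EXIT_UNWIND      0x0004
#define EXCEPTION_STACK_INVALID    0x0008
#define EXCEPTION_NESTED_CALL      0x0010
#define EXCEPTION_TARGET_UNWIND    0x0020
#define EXCEPTION_COLLIDED_UNWIND  0x0040
#define EXCEPTION_UNWIND           0x0066

/*
 * Hypothetical helper: returns nonzero if the ExceptionFlags value from an
 * EXCEPTION_RECORD says the handler is being called for an unwind rather
 * than for the initial exception dispatch.
 */
static int is_unwind_call(uint32_t exception_flags)
{
    return (exception_flags & EXCEPTION_UNWIND) != 0;
}
```

A language specific handler would typically branch on this test first, running cleanup logic when it is true and filter logic otherwise.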

  5. The definition for the C scope-table information emitted by CL for __try/__except/__finally and implicit exception handlers is not documented. Here’s the definition of the scope table used for C exception handling support:
    typedef struct _SCOPE_TABLE {
        ULONG Count;
        struct
        {
            ULONG BeginAddress;
            ULONG EndAddress;
            ULONG HandlerAddress;
            ULONG JumpTarget;
        } ScopeRecord[ 1 ];
    } SCOPE_TABLE, *PSCOPE_TABLE;

    This structure was briefly documented in a beta release of the WDK, although it has since disappeared from the RTM build. The ScopeRecord field describes a variable-sized array whose length is given by the Count field.
    You’ll need this structure definition if you are interacting with _C_specific_handler, or implementing assembler routines that are intended to use _C_specific_handler as their language specific handler.
    All of the above addresses are RVAs. BeginAddress and EndAddress are the RVAs that bound the region of code for which the current scope record is in effect. HandlerAddress is the RVA of a C-specific exception handler (more on that below) that implements the __except filter routine in C exception support, or the hardcoded value 0x1 to indicate that the __except filter unconditionally accepts the exception (this is also set to 0x1 for a __finally block). The JumpTarget member is the RVA to which control is transferred if the C exception handler elects to handle the exception; that is, the address of the body of an __except block (or a __finally block).
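To make the role of the scope records a little more concrete, here is a rough sketch of how a lookup over them might work. It uses fixed-width stand-ins for the ULONG fields so that it compiles anywhere. The names SCOPE_RECORD, find_scope_record, and filter_is_unconditional are invented for this example, and treating the range as half-open (BeginAddress inclusive, EndAddress exclusive) is an assumption on my part.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for one entry of SCOPE_TABLE.ScopeRecord (all fields are RVAs). */
typedef struct {
    uint32_t BeginAddress;
    uint32_t EndAddress;
    uint32_t HandlerAddress;
    uint32_t JumpTarget;
} SCOPE_RECORD; /* hypothetical name for the anonymous struct above */

/*
 * Returns the first record whose range covers control_rva, or NULL if none
 * does. Assumption: the range is [BeginAddress, EndAddress).
 */
static const SCOPE_RECORD *find_scope_record(const SCOPE_RECORD *records,
                                             uint32_t count,
                                             uint32_t control_rva)
{
    uint32_t i;

    for (i = 0; i < count; i++) {
        if (control_rva >= records[i].BeginAddress &&
            control_rva < records[i].EndAddress) {
            return &records[i];
        }
    }
    return NULL;
}

/* The hardcoded HandlerAddress value 0x1 means "no filter routine to call". */
static int filter_is_unconditional(const SCOPE_RECORD *record)
{
    return record->HandlerAddress == 0x1;
}
```

The control RVA passed in would be the faulting (or return) address minus the image base, since every field in the table is image-relative.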

  6. The C exception handler routine whose RVA is given by the HandlerAddress of the C scope table for a code block is defined as follows:
    typedef
    LONG
    (NTAPI * PC_LANGUAGE_EXCEPTION_HANDLER)(
       __in    PEXCEPTION_POINTERS    ExceptionPointers,
       __in    ULONG64                EstablisherFrame
       );

    The ExceptionPointers argument is the familiar EXCEPTION_POINTERS structure that the GetExceptionInformation macro returns. The EstablisherFrame argument contains the stack pointer value for the routine associated with the C exception handler in question at the point at which the exception occurred. (If the exception occurred in a subfunction called by the function at which the exception is now being inspected, then the stack pointer should be relative to the point just after the call to the faulting function was made.) The EstablisherFrame argument is typically used to allow transparent access to the local variables of the current function from within the exception filter, even though technically the exception filter is not part of the current function but actually a completely different function itself. This is the mechanism by which you can access local variables within an __except expression.
    The function definition deserves a bit more explanation than just the parameter value meanings, however, as it is really dual-purpose. There are two modes for this routine, exception handling mode and unwind handling mode. If the low byte of the ExceptionPointers argument is set to the hardcoded value 0x1, then the handler is being called for an unwind operation. In this case, the rest of the ExceptionPointers argument is meaningless, and only the EstablisherFrame argument holds a meaningful value. In addition, when operating in unwind mode, the return value of the exception handler routine is ignored (the compiler often doesn’t even initialize it for that code path). In exception handling mode (where the ExceptionPointers argument’s low byte is not equal to the hardcoded value 0x1), both arguments are significant, and the return value is also used. In this case, the return value is one of the familiar EXCEPTION_EXECUTE_HANDLER, EXCEPTION_CONTINUE_SEARCH, and EXCEPTION_CONTINUE_EXECUTION constants that are returned by an __except filter expression. If EXCEPTION_EXECUTE_HANDLER is returned, then control will eventually be transferred to the JumpTarget member of the current scope table entry.
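The mode check described above can be sketched in a few lines of portable C. The enum and function names here are hypothetical, chosen only for this illustration; the only fact taken from the text is that a low byte of 0x1 in the ExceptionPointers argument signals unwind mode.

```c
#include <stdint.h>

/* Hypothetical names for the two modes described above. */
typedef enum {
    HANDLER_MODE_EXCEPTION, /* both arguments valid, return value is used */
    HANDLER_MODE_UNWIND     /* only EstablisherFrame is meaningful        */
} handler_mode;

/*
 * Classifies a call to the C exception handler routine by looking at the
 * low byte of the ExceptionPointers argument: 0x1 means unwind mode.
 */
static handler_mode classify_handler_call(const void *exception_pointers)
{
    uintptr_t value = (uintptr_t)exception_pointers;

    return ((value & 0xFF) == 0x1) ? HANDLER_MODE_UNWIND
                                   : HANDLER_MODE_EXCEPTION;
}
```

A hand-written handler would make this check before touching anything behind ExceptionPointers, since in unwind mode the argument is not a valid pointer.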

  7. The definition of the UNWIND_HISTORY_TABLE structure (and associated substructures) for x64 is as follows (this structure is used as a cache to speed up repeated exception handling lookups, and is typically optional as far as usage with RtlUnwindEx goes – though certainly recommended from a performance perspective):
    #define UNWIND_HISTORY_TABLE_SIZE 12
    
    typedef struct _UNWIND_HISTORY_TABLE_ENTRY {
            ULONG64           ImageBase;
            PRUNTIME_FUNCTION FunctionEntry;
    } UNWIND_HISTORY_TABLE_ENTRY,
    *PUNWIND_HISTORY_TABLE_ENTRY;
    
    #define UNWIND_HISTORY_TABLE_NONE 0
    #define UNWIND_HISTORY_TABLE_GLOBAL 1
    #define UNWIND_HISTORY_TABLE_LOCAL 2
    
    typedef struct _UNWIND_HISTORY_TABLE {
            ULONG                      Count;
            UCHAR                      Search;
            ULONG64                    LowAddress;
            ULONG64                    HighAddress;
            UNWIND_HISTORY_TABLE_ENTRY
               Entry[ UNWIND_HISTORY_TABLE_SIZE ];
    } UNWIND_HISTORY_TABLE, *PUNWIND_HISTORY_TABLE;
  8. There are inconsistencies regarding the usage of RUNTIME_FUNCTION and IMAGE_RUNTIME_FUNCTION in various places in the documentation. These two structures are synonymous for x64 and may be used interchangeably.
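Returning to the chaining behavior from item 1, the low-bit rule can be sketched as follows. The type and function names are made up for this example; the only behavior modeled is the one described in item 1 (a set low bit means the value, with that bit cleared, is the RVA of another function entry rather than of an UNWIND_INFO structure).

```c
#include <stdint.h>

/* Hypothetical names describing what UnwindInfoAddress points at. */
typedef enum {
    CHAIN_TO_UNWIND_INFO,      /* RVA of an UNWIND_INFO structure           */
    CHAIN_TO_RUNTIME_FUNCTION  /* RVA of another IMAGE_RUNTIME_FUNCTION_ENTRY */
} chain_target_kind;

typedef struct {
    chain_target_kind kind;
    uint32_t          rva;     /* RVA with the low tag bit removed */
} chain_target;

/* Decodes the UnwindInfoAddress field of a chained function entry. */
static chain_target decode_unwind_info_address(uint32_t unwind_info_address)
{
    chain_target target;

    if (unwind_info_address & 1u) {
        target.kind = CHAIN_TO_RUNTIME_FUNCTION;
        target.rva  = unwind_info_address & ~1u;
    } else {
        target.kind = CHAIN_TO_UNWIND_INFO;
        target.rva  = unwind_info_address;
    }
    return target;
}
```

Code that walks unwind data by hand would apply this decoding each time it follows a chained entry, adding the image base to the resulting RVA before dereferencing it.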

Most of the other x64 exception handling information on the latest version of MSDN is correct (specifically, the parts dealing with function tables, such as RtlLookupFunctionTableEntry). Remember that the MSDN documentation also includes IA64 definitions on the same page, though (and the IA64 definition is typically the one presented at the top with all of the arguments explained, where you would expect it). You’ll typically need to scroll through the remarks section to find information on the x64 versions of these routines. Be wary of using your locally installed Platform SDK help with the functions that are correctly documented on MSDN, though, as to my knowledge only the very latest SDK version (e.g. the Vista SDK) actually has correct information for any of the x64 exception handling routines; older versions, such as the Platform SDK that shipped with Visual Studio 2005, only include IA64 information for routines like RtlVirtualUnwind or RtlLookupFunctionTableEntry. In general, anywhere you see a reference to a FRAME_POINTERS or Gp structure or value in the documentation, this is a good hint that the documentation is talking exclusively about IA64 and does not directly apply to x64.

That’s all for this installment. More on how to use this information from a programmatic perspective next time…

An introduction to kernrate (the Windows kernel profiler)

Thursday, December 7th, 2006

One useful utility for tracking down performance problems that you might not have heard of until now is kernrate, the Windows kernel profiler. This utility currently ships with the Windows Server 2003 Resource Kit Tools package (though you can use kernrate on Windows XP as well) and is freely downloadable. Currently, you’ll have to match the version of kernrate you want to use with your processor architecture, so if you are running an x64 Windows edition, then you’ll have to dig up an x64 version of kernrate (the one that ships with the Srv03 resource kit tools is x86); KrView (see below) ships with an x64 compatible version of kernrate.

Kernrate requires that you have the SeProfilePrivilege assigned (which is typically only granted to administrators), so in most cases you will need to be a local administrator on your system in order to use it. This privilege allows access to the (undocumented) profile object system services. These APIs allow programmatic access to sample the instruction pointer at certain intervals (typically, a profiler program selects the timer interrupt for use with instruction pointer sampling). This allows you to get a feel for what the system is doing over time, which is in turn useful for identifying the cause of performance issues where a particular operation appears to be processor bound and taking longer than you would like.

There are a multitude of options that you can give kernrate (and you are probably best served by experimenting with them a bit on your own), so I’ll just cover the common ones that you’ll need to get started (use “kernrate -?” to get a list of all supported options).

Kernrate can be used to profile both user mode and kernel mode performance issues. By default, it operates only on kernel mode code, but you can override this via the -a (and -av) options, which cause kernrate to include user mode code in its profiling operations in addition to kernel mode code. Additionally, by default, kernrate operates over the entire system at once; to get meaningful results with profiling user mode code, you’ll want to specify a process (or group of processes) to profile, with the “-p pid” and/or “-n process-name” arguments. (The process name is the first 8 characters of a process’s main executable filename.)

To terminate collection of profiling data, use Ctrl-C. (On pre-Windows-Vista systems where you might be running kernrate.exe via runas, remember that Ctrl-C does not work on console processes started via runas.) Additionally, you can use the “-s seconds” argument to specify that profiling should be automagically stopped after a given number of seconds has elapsed.

If you run kernrate on kernel mode code only, or just specify a process (or group of processes) as described above, you’ll notice that you get a whole lot of general system-wide output (information about interrupt counts, global processor time usage, context switch counts, I/O operation counts) in addition to output about which modules used a noteworthy amount of processor time. Here’s an example output of running kernrate on just the kernel on my system, as described above (including just the module totals):

D:\Programs\Utilities>kernrate
Kernrate User-Specified Command Line:
kernrate


Kernel Profile (PID = 0): Source= Time,
Using Kernrate Default Rate of 25000 events/hit
Starting to collect profile data

***> Press ctrl-c to finish collecting profile data
===> Finished Collecting Data, Starting to Process Results

------------Overall Summary:--------------

[...]

OutputResults: KernelModuleCount = 153
Percentage in the following table is based on
the Total Hits for the Kernel

Time   197 hits, 25000 events per hit --------
 Module    Hits   msec  %Total  Events/Sec
intelppm     67        980    34 %     1709183
ntkrnlpa     52        981    26 %     1325178
win32k       35        981    17 %      891946
hal          19        981     9 %      484199
dxgkrnl       6        980     3 %      153061
nvlddmkm      6        980     3 %      153061
fanio         3        981     1 %       76452
bcm4sbxp      2        981     1 %       50968
portcls       2        980     1 %       51020
STAC97        2        980     1 %       51020
bthport       1        981     0 %       25484
BTHUSB        1        981     0 %       25484
Ntfs          1        980     0 %       25510

Using kernrate in this fashion is a good first step towards profiling a performance problem (especially if you are working with someone else’s program), as it quickly allows you to narrow down a processor hog to a particular module. While this is useful as a first step, however, it doesn’t really give you a whole lot of information about what specific code in a particular module is taking a lot of processor time.
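If you are wondering how the columns in the module table relate to each other, the following sketch reproduces them with integer arithmetic. The formulas are inferred from the sample output above rather than taken from any kernrate documentation, so treat them as an assumption; the function names are made up for this example.

```c
#include <stdint.h>

/*
 * Events/Sec column: hits * events-per-hit, scaled by the elapsed time in
 * milliseconds. The result is truncated, which matches the sample output.
 */
static uint64_t events_per_second(uint64_t hits,
                                  uint64_t events_per_hit,
                                  uint64_t elapsed_msec)
{
    if (elapsed_msec == 0) {
        return 0;
    }
    return hits * events_per_hit * 1000u / elapsed_msec;
}

/* %Total column: this module's hits as a truncated percentage of all hits. */
static unsigned percent_of_total(uint64_t hits, uint64_t total_hits)
{
    if (total_hits == 0) {
        return 0;
    }
    return (unsigned)(hits * 100u / total_hits);
}
```

For the intelppm row above (67 hits out of 197, over 980 msec at 25000 events per hit), these give 1709183 events per second and 34 percent, which agrees with the table.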

To dig in deeper as to the cause of the problem (beyond just tracing it to a particular module), you’ll need to use the “-z module-name” option. This option tells kernrate to “zoom in” on a particular module; that is, for the given module, kernrate will track instruction pointer locations within the module to individual functions. This level of granularity is often what you’ll need for tracking down a performance issue (at least as far as profiling is concerned). You can repeat the “-z” option multiple times to “zoom in” to multiple modules (useful if the problem you are tracking down involves high processor usage across multiple DLLs or binaries).

Because kernrate is resolving instruction pointer sampling down to a more granular level than modules (with the “-z” option), you’ll need to tell it how to load symbols for all affected modules (otherwise, the granularity for profiler output will typically be very poor, often restricted to just exported functions). There are two ways to do this. First, you can use the “-j symbol-path” command line option; this option tells kernrate to pass a particular symbol path to DbgHelp for use with loading symbols. I recommend the second option, however, which is to configure your _NT_SYMBOL_PATH before-hand so that it points to a valid DbgHelp symbol path. This relieves you of having to manually tell kernrate a symbol path every time you execute it.

Continuing with the example I gave above, we might be interested in just what the “win32k” (the Win32 kernel mode support driver for USER/GDI) module is doing that was taking up 17% of the processor time spent in kernel mode on my system (for the interval that I was profiling). To do that, we can use the following command line (the output has been truncated to only include information that we are interested in):

D:\Programs\Utilities>kernrate -z win32k

Kernrate User-Specified Command Line:
kernrate -z win32k


Kernel Profile (PID = 0): Source= Time,
Using Kernrate Default Rate of 25000 events/hit
CallBack: Finished Attempt to Load symbols for
90a00000 \SystemRoot\System32\win32k.sys

Starting to collect profile data

***> Press ctrl-c to finish collecting profile data
===> Finished Collecting Data, Starting to Process Results

------------Overall Summary:--------------

[...]

OutputResults: KernelModuleCount = 153
Percentage in the following table is based on the
Total Hits for the Kernel

Time   2465 hits, 25000 events per hit --------
 Module      Hits   msec  %Total  Events/Sec
ntkrnlpa     1273      14799    51 %     2150483
win32k        388      14799    15 %      655449
intelppm      263      14799    10 %      444286
hal           236      14799     9 %      398675
bcm4sbxp       66      14799     2 %      111494
spsys          55      14799     2 %       92911
nvlddmkm       48      14799     1 %       81086
STAC97         31      14799     1 %       52368

[...]


===> Processing Zoomed Module win32k.sys...


----- Zoomed module win32k.sys (Bucket size = 16 bytes,
Rounding Down) --------
Percentage in the following table is based on the
Total Hits for this Zoom Module

Time   388 hits, 25000 events per hit --------
 Module                  Hits   msec  %Total  Events/Sec
xxxInternalDoPaint         44      14799    10 %       74329
XDCOBJ::bSaveAttributes    20      14799     4 %       33786
DelayedDestroyCacheDC      20      14799     4 %       33786
HANDLELOCK::vLockHandle    15      14799     3 %       25339
mmxAlphaPerPixelOnly       15      14799     3 %       25339
XDCOBJ::RestoreAttributes  13      14799     2 %       21960
DoTimer                    12      14799     2 %       20271
_SEH_prolog4               11      14799     2 %       18582
memmove                     9      14799     2 %       15203
_GetDCEx                    6      14799     1 %       10135
HmgLockEx                   6      14799     1 %       10135
XDCOBJ::bCleanDC            5      14799     1 %        8446
XEPALOBJ::ulIndexToRGB      5      14799     1 %        8446
HmgShareCheckLock           4      14799     0 %        6757
RGNOBJ::bMerge              4      14799     0 %        6757

[...]

This should give you a feel for the kind of information that you’ll get from kernrate. Although the examples I gave were profiling kernel mode code, the whole process works the same way for user mode if you use the “-p” or “-n” options as I mentioned earlier. In conjunction with a debugger, the information that kernrate gives you can often be a great help in narrowing down CPU usage performance problems (or at the very least point you in the general direction as to where you’ll need to do further research).
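The zoomed output above mentions a bucket size of 16 bytes with rounding down. As a loose sketch of what that kind of bucketing means for a sampling profiler in general (this is not kernrate's actual implementation, which I have not seen), each sampled instruction pointer can be mapped to a fixed-size bucket within the module. The function names are invented for this example.

```c
#include <stdint.h>

/*
 * Maps a sampled instruction pointer to a bucket index within a module.
 * Integer division rounds down, so every address inside a bucket maps to
 * the same index. Assumes ip >= module_base and bucket_size > 0.
 */
static uint64_t bucket_index(uint64_t ip, uint64_t module_base,
                             uint64_t bucket_size)
{
    return (ip - module_base) / bucket_size;
}

/* Returns the first address covered by a given bucket. */
static uint64_t bucket_start(uint64_t index, uint64_t module_base,
                             uint64_t bucket_size)
{
    return module_base + index * bucket_size;
}
```

With symbols loaded, the start address of each bucket that collected hits can then be resolved to the nearest function name, which is how per-function totals like the ones above can be produced.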

There are also a variety of other options that are available in kernrate, such as features for gathering information about “hot” locks that have a high degree of contention, and support for launching new processes under the profiler. There is also support for outputting the raw sampled profile data, which can be used to graph the output (such as you might see used with tools like KrView).

Although kernrate doesn’t have all the “bells and whistles” of some of the high-end profiling tools (like Intel’s vTune), it’s often enough to get the job done, and it’s also available to you at no extra cost (and can be quickly and easily deployed to help find the source of a problem). I’d highly recommend giving it a shot if you are trying to analyze a performance problem and don’t already have a profiling solution that you are using.

Frame pointer omission (FPO) optimization and consequences when debugging, part 1

Tuesday, November 21st, 2006

During the course of debugging programs, you’ve probably run into the term “FPO” once or twice. FPO refers to a specific class of compiler optimizations that, on x86, deal with how the compiler accesses local variables and stack-based arguments.

With a function that uses local variables (and/or stack-based arguments), the compiler needs a mechanism to reference these values on the stack. Typically, this is done in one of two ways:

  • Access local variables directly from the stack pointer (esp). This is the behavior if FPO optimization is enabled. While this does not require a separate register to track the location of locals and arguments, as is needed if FPO optimization is disabled, it makes the generated code slightly more complicated. In particular, the displacement from esp of locals and arguments actually changes as the function is executed, due to things like function calls or other instructions that modify the stack. As a result, the compiler must keep track of the actual displacement from the current esp value at each location in a function where a stack-based value is referenced. This is typically not a big deal for a compiler to do, but in hand written assembler, this can get a bit tricky.
  • Dedicate a register to point to a fixed location on the stack relative to local variables and stack-based arguments, and use this register to access locals and arguments. This is the behavior if FPO optimization is disabled. The convention is to use the ebp register to access locals and stack arguments. Ebp is typically set up such that the first stack argument can be found at [ebp+08], with local variables typically at a negative displacement from ebp.
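To illustrate the difference between the two approaches, here is a toy model in portable C (not real register access). It tracks simulated esp and ebp values and shows that the displacement of a local from esp shifts when something is pushed, while the displacement from ebp stays put. The regs structure and the sim_ and disp_ helpers are all made up for this illustration.

```c
#include <stdint.h>

/* Toy model of the two registers involved; values are simulated addresses. */
typedef struct {
    uint32_t esp;
    uint32_t ebp;
} regs;

/* Simulates a 4-byte push: the stack grows down, so esp decreases. */
static void sim_push(regs *r)
{
    r->esp -= 4;
}

/* Displacement the compiler would need to encode for [esp+disp]. */
static int32_t disp_from_esp(const regs *r, uint32_t address)
{
    return (int32_t)(address - r->esp);
}

/* Displacement the compiler would need to encode for [ebp+disp]. */
static int32_t disp_from_ebp(const regs *r, uint32_t address)
{
    return (int32_t)(address - r->ebp);
}
```

For example, with esp at 0x1000, ebp at 0x1010, and a local at 0x100c, the local starts out at [esp+12] and [ebp-4]. After one push it is at [esp+16], but still at [ebp-4].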

A typical prologue for a function with FPO optimization disabled might look like this:

push   ebp               ; save away old ebp (nonvolatile)
mov    ebp, esp          ; load ebp with the stack pointer
sub    esp, sizeoflocals ; reserve space for locals
...                      ; rest of function

The main concept is that when FPO optimization is disabled, a function will immediately save away ebp (as the first operation touching the stack), and then load ebp with the current stack pointer. This sets up a stack layout like so (relative to ebp):

[ebp-01]   Last byte of the last local variable
[ebp+00]   Old ebp value
[ebp+04]   Return address
[ebp+08]   First argument...

Thereafter, the function will always use ebp to access locals and stack based arguments. (The prologue of the function may vary a bit, especially with functions using a variation of __SEH_prolog to set up an initial SEH frame, but the end result is always the same with respect to the stack layout relative to ebp.)
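One consequence of this layout is that the saved ebp values form a linked list of frames, with a return address sitting next to each link. The sketch below walks such a chain over a simulated stack (an array of 32-bit words addressed by byte offset), so it runs anywhere and touches no real registers. The function names, and the convention that a saved ebp of 0 ends the chain, are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Reads the 32-bit value at byte offset addr in the simulated stack. */
static uint32_t read32(const uint32_t *stack, uint32_t addr)
{
    return stack[addr / 4];
}

/*
 * Walks the saved-ebp chain starting at ebp. For each frame, [ebp+4] is the
 * return address and [ebp+0] is the caller's ebp. Stores up to max return
 * addresses in out and returns how many were found. Stops on a saved ebp of
 * 0 (assumed end marker) or on an address outside the simulated stack.
 */
static size_t walk_frames(const uint32_t *stack, size_t stack_words,
                          uint32_t ebp, uint32_t *out, size_t max)
{
    size_t found = 0;

    while (ebp != 0 && found < max) {
        if (ebp % 4 != 0 || (size_t)(ebp / 4) + 1 >= stack_words) {
            break; /* misaligned or out of range: give up */
        }
        out[found++] = read32(stack, ebp + 4); /* return address */
        ebp = read32(stack, ebp);              /* caller's frame */
    }
    return found;
}
```

A walk like this only works for functions that actually maintain ebp as a frame pointer, which is exactly what FPO optimization takes away.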

This does (as previously stated) make it so that the ebp register is not available for other uses to the register allocator. However, this performance hit is usually not enough to be a large concern relative to a function compiled with FPO optimization turned on. Furthermore, there are a number of conditions that require a function to use a frame pointer which you may hit anyway:

  • Any function using SEH must use a frame pointer, as when an exception occurs, there is no way to know the displacement of local variables from the esp value (stack pointer) at exception dispatching (the exception could have happened anywhere, and operations like making function calls or setting up stack arguments for a function call modify the value of esp).
  • Any function using automatic C++ objects with destructors must use SEH for compiler unwind support. This means that most C++ functions end up with FPO optimization disabled. (It is possible to change the compiler assumptions about SEH exceptions and C++ unwinding, but the default [and recommended setting] is to unwind objects when an SEH exception occurs.)
  • Any function using _alloca to dynamically allocate memory on the stack must use a frame pointer (and thus have FPO optimization disabled), as the displacement from esp for local variables and arguments can change at runtime and is not known to the compiler at compile time when code is being generated.

Because of these restrictions, many functions you may be writing will already have FPO optimization disabled, without you having explicitly turned it off. However, it is still likely that many of your functions that do not meet the above criteria have FPO optimization enabled, and thus do not use ebp to reference locals and stack arguments.

Now that you have a general idea of just what FPO optimization does, I’ll cover why it is to your advantage to turn off FPO optimization globally when debugging certain classes of problems in the second half of this series. (It is actually the case that most shipping Microsoft system code turns off FPO as well, so you can rest assured that a real cost-benefit analysis has been done between FPO and non-FPO optimized code, and it is overall better to disable FPO optimization in the general case.)

Update: Pavel Lebedinsky points out that the C++ support for SEH exceptions is disabled by default for new projects in VS2005 (and that it is no longer the recommended setting). For most programs built prior to VS2005 and using the defaults at that time, though, the above statement about C++ destructors causing SEH to be used for a function (and thus requiring the use of a frame pointer) still applies.

Win32 calling conventions review

Friday, November 10th, 2006

Recently, I’ve posted about the Win32 calling conventions. Here’s a table of contents of the various different posts I’ve made.

  1. Win32 calling conventions: Concepts
  2. Win32 calling conventions: Usage cases
  3. Win32 calling conventions: __cdecl in assembler
  4. Win32 calling conventions: __stdcall in assembler
  5. Win32 calling conventions: __fastcall in assembler
  6. Win32 calling conventions: __thiscall in assembler

Remember that when picking a calling convention to use, there are a number of factors to consider. There is no one calling convention that fits all cases (however, __stdcall is a good default if you are not sure).

Hopefully, you’ll have found this series to be enlightening, useful, and practically applicable.

The Windows Vista SDK is RTM

Wednesday, November 8th, 2006

The release version of the Windows Vista SDK has been released to the world.

You can download it for free from Microsoft in the full installer image or web installer formats.

The full installer image is almost 1.2GB, so be prepared to have a large chunk of hard drive space to burn on the new SDK.

Win32 calling conventions: __fastcall in assembler

Monday, October 30th, 2006

The __fastcall calling convention is the last major C-supported Win32 (x86) calling convention that I have not covered yet. (There still exists __thiscall, which I’ll discuss later.)

__fastcall is, as you might guess from the name, a calling convention that is designed for speed. In this spirit, it attempts to borrow from many RISC calling conventions in that it tries to be register-based instead of stack based. Unfortunately, for all but the smallest or simplest functions, __fastcall typically does not end up being a particularly stellar thing performance-wise, for x86, primarily due to the (comparatively) extremely limited register set that x86 sports.

This calling convention has a great deal in common with the x64 calling convention that Win64 uses. In fact, aside from the x64-specific parts of the x64 calling convention, you can think of the x64 calling convention as a logical extension of __fastcall that is designed to take advantage of the expanded register set available with x64 processors.

What this boils down to is that __fastcall will try to pass the first two pointer-sized arguments in the ecx and edx registers. Any additional arguments are passed on the stack as per __stdcall.

In practice, the key things to look out for with a __fastcall function are thus:

  • The callee assumes a meaningful value in the ecx (or edx and ecx) registers. This is a tell-tale sign of __fastcall (although, you may sometimes see __thiscall make use of ecx for the this pointer).
  • No arguments are cleaned off the stack by the caller. (Only __cdecl functions have the caller clean up stack arguments.)
  • The callee ends in a retn (args-2)*4 instruction. In general, this is the pattern that you will see with __fastcall functions that use the stack. For __fastcall functions where no stack parameters are used, the function typically ends in a ret instruction with no stack displacement argument.
  • The callee is a short function with very few arguments. These are the most likely cases where a smart programmer will use __fastcall, as otherwise, __fastcall does not tend to buy you very much over __stdcall.
  • Functions that interface directly with assembler. Having access to ecx and edx can be a handy shortcut for a C function that is being called by something that is obviously written in assembler.
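The argument placement and the retn (args-2)*4 rule can be summarized in a small sketch. It assumes every argument is pointer-sized (4 bytes on x86) or smaller, and the enum and function names are invented for this example.

```c
#include <stdint.h>

/* Where a given __fastcall argument lives (hypothetical names). */
typedef enum {
    ARG_IN_ECX,
    ARG_IN_EDX,
    ARG_ON_STACK
} arg_location;

/* Location of the argument at zero-based position index. */
static arg_location fastcall_arg_location(unsigned index)
{
    if (index == 0) {
        return ARG_IN_ECX;
    }
    if (index == 1) {
        return ARG_IN_EDX;
    }
    return ARG_ON_STACK;
}

/*
 * Number of bytes the callee removes from the stack on return, that is, the
 * operand of the retn instruction (0 means a plain ret).
 */
static unsigned fastcall_retn_bytes(unsigned arg_count)
{
    return (arg_count > 2) ? (arg_count - 2) * 4 : 0;
}
```

For a three-argument function, this gives ecx, edx, and one stack slot, with the callee ending in retn 4.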

Taking these into account, let’s take a look at the same sample function and function call that we have been previously dealing with in our earlier examples, this time in __fastcall.

The function that we are going to call is declared as so:

__declspec(noinline)
int __fastcall FastcallFunction1(int a, int b, int c)
{
	return (a + b) * c;
}

This is consistent with our previous examples, save that it is declared __fastcall.

The function call that we shall make is as so:

FastcallFunction1(1, 2, 3);

With this code, we can expect the function call to look something like so in assembler:

push    3                 ; push 'c' onto the stack
push    2                 ; place a constant 2 on the stack
xor     ecx, ecx          ; move 0 into 'a' (ecx)
pop     edx               ; pop 2 off the stack and into edx.
inc     ecx               ; set 'a' -- ecx to 1 (0+1)
call    FastcallFunction1 ; make the call (a=1, b=2, c=3)

This is actually a bit different than we might expect. Here, the compiler has been a bit clever and used some basic optimizations with setting up constants in registers. These optimizations are extremely common and something that you should get used to seeing as simply constant assignments to registers, given how frequently they show up. In a future series, I’ll go into some more details as to common compiler optimizations like these, but that’s a tale for a different time.

Continuing with __fastcall, here’s what the implementation of FastcallFunction1 looks like in assembler:

FastcallFunction1 proc near

c= dword ptr  4

lea     eax, [ecx+edx] ; eax = a + b
imul    eax, [esp+4]   ; eax = (eax * c)
retn    4              ; return eax;
FastcallFunction1 endp

As you can see, in this particular instance, __fastcall turns out to be a big saver as far as instructions executed (and thus size, and to a lesser degree, speed) of the callee. This kind of benefit is usually restricted to extremely simple functions, however.

The main things, then, to consider if you are trying to identify if a function is __fastcall or not are thus:

  • Usage of the ecx (or ecx and edx) registers in the function without loading them with explicit values beforehand. This typically indicates that they are being used as argument registers, as with __fastcall.
  • The caller does not clean any arguments off the stack (no add esp instruction to clean the stack after the call). With __fastcall, the callee always cleans the arguments (if any).
  • A ret instruction (with no stack displacement argument) terminating the function, if there are two or fewer arguments that are pointer-sized or smaller. In this case, __fastcall has no stack arguments.
  • A retn (args-2)*4 instruction terminating the function, if there are three or more arguments to the function. In this case, there are stack arguments that must be cleaned off the stack via the retn instruction.
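The last two rules can be sketched as a small helper. This is only an illustration: the name fastcall_ret_bytes is hypothetical, and it assumes that every argument is pointer-sized (32 bits) or smaller, so that the first two travel in ecx and edx and each remaining one occupies four bytes of stack.

```c
/*
 * Hypothetical helper: the number of bytes a __fastcall callee is expected
 * to pop with its terminating retn, assuming every argument is 32 bits or
 * smaller. The first two arguments are passed in ecx and edx; the rest
 * are on the stack at four bytes each.
 */
unsigned fastcall_ret_bytes(unsigned arg_count)
{
	if (arg_count <= 2)
		return 0; /* plain ret: there are no stack arguments to clean */

	return (arg_count - 2) * 4;
}
```

For FastcallFunction1 (three arguments), this gives 4, which matches the retn 4 in the listing.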

That’s all for __fastcall. More on other calling conventions next time…

Things to watch out for if you hook functions on Windows Vista

Friday, October 27th, 2006

There are a couple of things that I have run into that you should keep in mind if you are hooking functions and are planning to run under Windows Vista.

First, watch out for things being moved around in memory. For example, in Windows Vista, the VirtualProtect function in kernel32 and the CreateProcessA function in kernel32 are now on the same page, for the x86 build [NOTE: this is subject to rapid change with hotfixes, and may not still be the case on RTM]. If you have some code that works conceptually like so:

DWORD  OldProt;
PVOID  MyCreateProcessA;
PUCHAR _CreateProcessA;
static ULONG MyHook;

MyHook = (ULONG)&MyCreateProcessA;

VirtualProtect(_CreateProcessA, 6,
	PAGE_READWRITE, &OldProt);

//
// [...] Disassembly and stub saving
//       code goes here...
//

//
// jmp dword ptr [MyHook]
//

_CreateProcessA[0] = 0xFF;
_CreateProcessA[1] = 0x25;
*(PULONG)(&_CreateProcessA[2]) = (ULONG)&MyHook;

VirtualProtect(_CreateProcessA, 6,
	OldProt, &OldProt);

… you’ll run into some strange crashes in Vista, because you might end up making the pages backing VirtualProtect’s implementation non-executable by accident. (Remember that memory protections only have page granularity.)

The solution? Use PAGE_EXECUTE_READWRITE for your “intermediate” states when hooking things.
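To make the page granularity point concrete, here is a minimal sketch of a check for whether a patch range lands on the same page as some other address. The function name and the hard-coded 4 KB page size are assumptions for illustration (real code would presumably ask GetSystemInfo for the page size), and length is assumed to be non-zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for illustration; x86 Windows uses 4 KB pages. */
#define PAGE_SIZE_BYTES 4096u

/*
 * Hypothetical helper: returns nonzero if the byte range
 * [start, start + length) touches the page that contains 'other'.
 * Changing the protection of the range also changes the protection
 * of everything else on those pages.
 */
int range_shares_page(uintptr_t start, size_t length, uintptr_t other)
{
	uintptr_t first_page = start / PAGE_SIZE_BYTES;
	uintptr_t last_page  = (start + length - 1) / PAGE_SIZE_BYTES;
	uintptr_t other_page = other / PAGE_SIZE_BYTES;

	return other_page >= first_page && other_page <= last_page;
}
```

If a six byte patch at the hooked function shares a page with VirtualProtect itself, a PAGE_READWRITE intermediate state strips execute access from the very function you need to call to restore it.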

Secondly, watch out for AcLayers.dll and ShimEng.dll. These two DLLs are the core of Microsoft’s Application Compatibility Layer, which is the engine used to apply compatibility fixes at runtime to broken programs that would otherwise fail to work on Windows Vista. (This engine is also used if you select a particular compatibility layer in the property sheet for a shortcut to an executable or an executable.)

The thing to watch out for here is that AcLayers likes to do import table hooking on various kernel32 APIs. In particular, AcLayers tends to hook GetProcAddress and then occasionally redirect returned function pointers to point into AcLayers.dll and not kernel32.dll. If you have a program that assumes that any pointer that it retrieves from kernel32.dll via GetProcAddress will remain at the same address for any other process in the same session, this can result in some unpleasant surprises.

For instance, consider the classic case of wanting to inject some code to run before the main process entrypoint of a child process. You might do something like inject some code that calls kernel32!LoadLibraryA on some DLL your application supplies, and then kernel32!GetProcAddress to get the address of a function in that DLL. Then the patch code might invoke a function in your DLL and return to the initial program entrypoint of the child process. This is actually a fairly common paradigm if you need to modify some sort of behavior of a child process. Unfortunately, it can easily break if the parent process is under the influence of the dreaded application compatibility layer.

The main problem here is that when you, say, find the address of LoadLibraryA or GetProcAddress in kernel32, AcLayers.dll steps in and actually hands you the address of a stub function inside AcLayers.dll which filters requests to load DLLs or get function pointers. This is all well and fine with the parent process; AcLayers.dll is there and can do whatever its work is whenever you call GetProcAddress or LoadLibraryA.

The catch is what happens when you try to make a child process call LoadLibraryA on a DLL before it runs the main program entrypoint. In this case, instead of passing a pointer into kernel32 (which is guaranteed to be present and at the same base address in every Win32 process in the same session), you are passing a pointer into AcLayers.dll to the child process. The problem case is when AcLayers.dll is not loaded immediately into the child process. Here, your patch code in the newly created child process might try to call LoadLibraryA to get your custom DLL loaded. However, it actually tries to call an internal AcLayers.dll function – but AcLayers.dll isn’t actually loaded into the address space of the child process (or might have even been rebased), so your child process mysteriously crashes instantly. This typically manifests itself as nothing happening when you try to launch a child program, depending on computer configuration.

There is unfortunately no particularly elegant way to work around this particular problem that I have found. The best advice I have to offer here is to make sure that any function pointer (in kernel32.dll) that you pass to another process cannot have been intercepted by AcLayers.dll. Perhaps the most fool-proof way to do this is to manually walk the export table of kernel32.dll and locate the address of the export that you are interested in, although this is not a particularly easy task.

The kernel object namespace and Win32, part 1

Thursday, October 26th, 2006

The kernel object namespace is partially exposed by various Win32 APIs. Everything that allows you to create a named object that returns a kernel handle is interacting with the kernel object namespace in some form or another, and many Win32 APIs internally use the object namespace under the hood.

The kernel object namespace is fairly similar to a filesystem; there are object directories, which contain named objects. Objects can be of various different types, such as a Device object (created by a kernel driver) or an Event object, a Semaphore object, and so forth. Additionally, there are symbolic link objects, which (like filesystem links on a UNIX-based system) allow you to create one name that simply refers to another named object in the system.

Until the introduction of Windows 2000, the part of the kernel object namespace that Win32 exposed was a fairly limited and simple subset of the full object namespace available to drivers and programs using the native system call interfaces.

First, file-related APIs interact with the \DosDevices object directory (otherwise known as \??). This is the object directory that holds anything that you might open with CreateFile() and related calls, such as drive letter links (say, C:), serial ports (COM1), other standard DOS devices, and custom devices created by kernel drivers. This is why, if you are a driver, you need to explicitly specify \DosDevices\DeviceName instead of that being automatically assumed (as it is in Win32, if you call CreateFile). Otherwise, the created object name will not be easily accessible to Win32.

Secondly, there is the \BaseNamedObjects object directory. This object directory is where named Event, Mutex, Semaphore, and Section (file mapping) objects are placed when created with the Win32 API.

\BaseNamedObjects is managed and created by the Base API server dll (basesrv.dll) running in the context of CSRSS at boot time. This means that, in particular, boot start drivers cannot rely on \BaseNamedObjects as being present early in the boot process (which can be a problem if you want to share a named event object with a user mode program, from a boot start driver). \DosDevices, however, is created by the kernel itself at boot time and is generally always accessible.

In general, that is the limit to how much of the kernel namespace is directly exposed to (and used to support) Win32 prior to Windows 2000. (This is technically not quite true. There is a little used pair of kernel32 APIs called DefineDosDevice and QueryDosDevice that allow limited manipulation of symbolic links based within the \DosDevices object directory. Using these APIs, you can discover the native target names of many of the internal symbolic links (for example, C: -> \Device\HarddiskVolume2). You can also create symbolic links based in \DosDevices that point to other parts of the NT object namespace with the DDD_RAW_TARGET_PATH flag using DefineDosDevice.)

Next time I’ll go into a bit more detail as to how some of the changes to the object manager namespace work with Windows 2000, and then Windows XP, which both introduce some significant changes to how Win32 interacts with object names (first with improved multi-session support for Terminal Server and Fast User Switching, and then with how mapped drive letters work with LSA logon sessions).

Win32 calling conventions: __stdcall in assembler

Friday, October 20th, 2006

It’s been a while since my last post, unfortunately, primarily due to my being a bit swamped with work and a couple of other things as of late. With that said, I’m going to start by picking up where I had previously left off with the Win32 calling conventions series. Without further ado, here’s the stuff on __stdcall as you’ll see it in assembler…

Like __cdecl, __stdcall is completely stack-based.  The semantics of __stdcall are very similar to __cdecl, except that the arguments are cleaned off the stack by the callee instead of the caller.  Because the number of arguments removed from the stack is burned into the target function at compile time, there is no support for variadic functions (functions that take a variable number of arguments, such as printf) that use the __stdcall calling convention.  The rules for register usage and return values are otherwise identical to __cdecl.

In practice, this typically means that an __stdcall function call will look much like a __cdecl function call until you examine the ret instruction that returns control to the caller at the end of the __stdcall function in question.  (Alternatively, you can look to see if it appears as if stack arguments are cleaned after the function call.  However, the compiler/optimizer sometimes likes to be tricky with __cdecl functions, and defer argument removal until several function calls later, so this method is less reliable.)

Because the callee cleans the arguments off the stack in an __stdcall function, you will always[1] see a ret instruction with a byte count operand terminating a __stdcall function that takes arguments.  For most functions, this count is four times the number of arguments to the function, but this can vary if arguments that are larger than 32 bits are passed.  On Win32, this byte count is virtually always[2] a multiple of four, as the compiler will always generate code that aligns the stack to at least four bytes for x86 targets.

Given this information, it is usually fairly easy to distinguish an __stdcall function from a __cdecl function, as a __cdecl function will never use an argument to ret.  Note that this does imply, however, that it is generally not possible to distinguish between an __stdcall function and a __cdecl function in the case that both take zero arguments (without any other outside information other than disassembly); in this special case, the calling conventions have the same semantics.  This also means that if you have a function that does not clean any bytes off the stack with ret, you’ll technically have to examine any callers of the function to see if any pass more than zero arguments (or the actual function implementation itself, to see if it ever expects more than zero arguments) in order to be absolutely sure if the function is __cdecl or __stdcall.

Here’s an example of a simple __stdcall function call for the following C function:
 

__declspec(noinline)
int __stdcall StdcallFunction1(int a, int b, int c)
{
 return (a + b) * c;
}

If we call the function like this:

StdcallFunction1(1, 2, 3);

… we can expect to see something like so, for the call:

push    3
push    2
push    1
call    StdcallFunction1

(There will be no add esp instruction after the call.)

This is quite similar to a __cdecl declared function with the same implementation.  The only difference is the lack of an add esp instruction following the call.

Looking at the function implementation, we can see that unlike the __cdecl version of this function, StdcallFunction1 removes the arguments from the stack:

StdcallFunction1 proc near

a= dword ptr  4
b= dword ptr  8
c= dword ptr  0Ch

mov     eax, [esp+8] ; eax = b
mov     ecx, [esp+4] ; ecx = a
add     eax, ecx     ; eax = eax + ecx
imul    eax, [esp+c] ; eax = eax * c
retn    0Ch          ; (return value = eax)
StdcallFunction1 endp

As expected, the only difference here is that the __stdcall version of the function cleans the three arguments off the stack.  The function is otherwise identical to the __cdecl version, with the return value stored in eax.

With all of this information, you should be able to rather reliably identify most __stdcall functions.  The key things to look out for are:

  • All arguments are on the stack.
  • The ret instruction terminating the function has a non-zero argument count if the number of arguments for the function is non-zero.
  • The ret instruction terminating the function has an argument count that is at least four times the number of arguments for the function.  (If the count is less than four times the number of arguments, then the function might be a __fastcall function with three or more arguments.  The __fastcall calling convention passes the first two 32-bit or smaller arguments in registers.)
  • The function does not depend on the state of the ecx and edx volatile registers.  (If the function expects these registers to have a meaningful value initially, then the function is probably a __fastcall or __thiscall function, as those calling conventions pass arguments in the ecx and edx registers.)
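As a rough sketch, the checklist above can be written as a tiny classifier. Everything named here (the enum, the function, and its two inputs) is hypothetical; it assumes you have already read the ret operand out of the disassembly and decided by inspection whether the function reads ecx or edx before writing them. It is a heuristic, not a definitive test.

```c
/* Hypothetical result type for the heuristic below. */
enum convention_guess {
	GUESS_AMBIGUOUS,     /* __cdecl, or an __stdcall function with no arguments */
	GUESS_STDCALL,       /* callee pops stack arguments, no register arguments */
	GUESS_REGISTER_ARGS  /* probably __fastcall or __thiscall */
};

/*
 * ret_bytes:              operand of the terminating ret (0 for a plain ret)
 * reads_ecx_edx_on_entry: nonzero if ecx or edx is read before being written
 */
enum convention_guess guess_convention(unsigned ret_bytes,
                                       int reads_ecx_edx_on_entry)
{
	if (reads_ecx_edx_on_entry)
		return GUESS_REGISTER_ARGS;

	if (ret_bytes != 0)
		return GUESS_STDCALL;

	/* A plain ret: only the callers can tell __cdecl from zero-argument __stdcall. */
	return GUESS_AMBIGUOUS;
}
```

StdcallFunction1 ends in retn 0Ch and loads ecx from the stack before using it, so it falls into the GUESS_STDCALL case.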

In the next post in this series, I’ll cover the __fastcall calling convention (and hopefully it won’t be such a long wait this time).  Stay tuned…

 

[1]: For functions declared as __declspec(noreturn) or that otherwise never normally return execution control directly to the caller (i.e. a function that always throws an exception), the ret instruction is typically omitted.  There are a couple of other rare cases where you may see no terminating ret, such as if there are two functions, where one function calls the second, and both have very similar prototypes (such as argument ordering or an additional defaulted argument).  In this case, the compiler may combine two functions by having one perform minor adjustments to the stack and then “falling through” directly to the second function.

[2]: If you see a function with a ret instruction that does not take a multiple of four as its argument, then the function was most likely hand-written in assembler.  The Microsoft compiler will never, to my knowledge, generate code like this (and neither should any sane Win32 compiler).

DxWnd 1.034 released

Sunday, September 24th, 2006

I’ve released a new version of DxWnd (requires the VC8SP0 CRT) – version 1.034. This is a minor release that, besides fixing a couple of bugs and doing some internal code cleanup and reorganization to build under VC8, adds a new feature: video output rescaling.

I recently got a nice 20.1″ LCD to use as a second monitor for my main laptop at home. Unfortunately, I discovered that a lot of my old favorite classic games tended to do not-so-great things to your desktop color depth when you run them natively (in fullscreen mode), which while you might normally not care about, turns out to be a real bummer if you have something like an IM client or whatnot up on a second monitor.

So, I turned to a program I had written a couple of years ago – DxWnd. DxWnd is a program that lets you run DirectDraw 7 (or below) programs that only support fullscreen mode in a window. It accomplishes this by hooking various DirectDraw APIs and tricking the program into thinking that it is running at 640×480 (or whatever resolution it wants) fullscreen, when it is in fact running in a plain window at that resolution. Unfortunately, while DxWnd solves the color depth issue, running games at 640×480 on a 1920×1200 desktop is not really the best experience. Thus, I set out to make a couple of minor modifications to DxWnd to support rescaling the output. These are fairly simple in principle:

  • Use StretchBlt instead of BitBlt to copy data from the DirectDraw surface that the program writes to into the GDI device context associated with the actual window I am displaying on screen. The reason why I perform this extra buffering step in the first place is that GDI provides nice automatic palette conversions from DirectDraw surface DCs to plain desktop window DCs. Changing the BitBlt to a StretchBlt simply rescales the current video image to a new resolution as it is copied for display purposes.
  • For programs that call ScreenToClient / ClientToScreen / MapWindowPoints (or deal with mouse cursor coordinates), but do not correctly handle the fact that their program’s client area may not be centered at (0, 0) (after all, the program was written to only run in fullscreen mode, so normally this shortcut can be taken), DxWnd needs to alter the lie it tells in these functions. Previously, DxWnd would “fix up” the coordinates that get returned to a program (or that a program gives to Windows) so that the program only sees things centered at (0, 0). Now, in addition to that, DxWnd needs to scale these coordinates either from the real output resolution to the resolution that the program appears to be running at, or vice versa, depending on whether the coordinates are going “into” or “out of” the program. This does have one unfortunate side effect, which is that relative to a program that natively supports a given resolution, there is a perceived loss of precision when you move the mouse pointer in the rescaled video output window. This is because mouse cursor coordinates must be rescaled to values that are relative to the resolution that the program is expecting to be running at. For example, if you are running at twice the program’s native resolution, and the program draws a custom mouse cursor, then the cursor may only appear to move every two pixels that you move it instead of every one pixel (like you might expect).
  • For programs that use DirectInput for mouse coordinates, these coordinates also need to be scaled so that they are relative to the virtual screen at (0, 0) that the program expects all coordinates to be relative to.
  • Since we are scaling the output of a program, DxWnd can now allow the user to resize, maximize, or restore the window it creates to contain the video data from the program being hooked. For programs where the user has asked DxWnd to capture the mouse to the client area of the video output window, the mouse cursor capture needs to be recalculated if the window size changes (otherwise, you could not move the mouse cursor outside of the original window size).
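The coordinate rescaling described in the list above can be sketched roughly as follows. This is not DxWnd’s actual code; the point structure and both function names are made up for illustration, and the sketch assumes non-negative coordinates that are already relative to the client area.

```c
/* Hypothetical point type; DxWnd's real structures are not shown here. */
struct point {
	int x;
	int y;
};

/* Scale a coordinate going "into" the program: real window size -> virtual resolution. */
struct point real_to_virtual(struct point p, int real_w, int real_h,
                             int virt_w, int virt_h)
{
	struct point out;

	out.x = p.x * virt_w / real_w; /* integer division truncates */
	out.y = p.y * virt_h / real_h;
	return out;
}

/* Scale a coordinate coming "out of" the program: virtual resolution -> real window size. */
struct point virtual_to_real(struct point p, int real_w, int real_h,
                             int virt_w, int virt_h)
{
	struct point out;

	out.x = p.x * real_w / virt_w;
	out.y = p.y * real_h / virt_h;
	return out;
}
```

With a 1280×960 window and a program that thinks it is running at 640×480, the real positions x = 0 and x = 1 both map to virtual x = 0, which is the loss of mouse precision mentioned above.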

With the new DxWnd, I can play some old classics like Master of Orion 2 or Privateer 2 rescaled to my desktop resolution on one monitor while still using a second monitor for things like e-mail or IM – and, without the color depth on my auxiliary display being reduced to 8-bit (or worse). There is some more information about DxWnd on the corresponding topic on the Valhalla Legends forum, if you are interested.