Archive for the ‘Reverse Engineering’ Category

Resolution to CreateIpForwardEntry failing on Vista

Friday, November 3rd, 2006

Previously, I posted about a compatibility problem with Windows Vista that affects programs using CreateIpForwardEntry to manage the IP routing table. In particular, if you call this routine on Vista to create a new route in the IP routing table, you may get an inexplicable ERROR_BAD_ARGUMENTS error code returned.

There is an officially supported workaround, though it is not very well documented and, to my knowledge, was only recently added to the Platform SDK documentation.

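The official line is that, on Vista, you must call the GetIpInterfaceEntry function to find the metric of the interface you are adding the route on, and supply a route metric (dwForwardMetric1) no lower than that value, if you wish to continue to be able to add routes.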

(Yes, this does suck. It is a total breaking change for anyone who manipulated routes on OSes prior to Vista, until you patch the copies of your programs already deployed in the field. If this is unacceptable to you, I would encourage you to provide feedback to Microsoft about how this issue impacts customer experiences and your ability to deploy and use your product on Vista.)

I would encourage you to use this documented API instead of the undocumented solution that I posted about earlier, simply because it will (ostensibly) continue to work on future OS versions. (Although, given that the only reason you have to call it is that Vista itself broke compatibility with an existing API, I am not sure that argument really holds here…)

Win32 calling conventions: __thiscall in assembler

Thursday, November 2nd, 2006

The final calling convention that I haven’t gone over in depth is __thiscall.

Unlike the other calling conventions that I have previously discussed, __thiscall is not typically explicitly decorated on functions (and in most cases, you cannot decorate it explicitly).

As the name might imply, __thiscall is used exclusively for functions that have a this pointer – that is, non-static C++ class member functions. In a non-static C++ class member function, the this pointer is passed as a hidden argument. In Microsoft C++, the hidden argument is the first actual argument to the routine.

When a function is __thiscall, you will typically see the this pointer passed in the ecx register. In this respect, __thiscall is rather similar to __fastcall, as the first argument (this) is passed via register in ecx. Unlike __fastcall, however, all remaining arguments are passed via the stack; edx is not supported as an additional argument register. Like __fastcall and __stdcall, when using __thiscall, the callee cleans the stack (and not the caller).
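One way to picture the hidden argument is as a plain function that takes the object pointer as an explicit first parameter. The sketch below is a source-level model only, not what CL actually emits; the constexpr variant of class C and the names ThiscallFunction1Explicit and BothFormsAgree are made up for illustration.

```cpp
// Simplified, constexpr variant of the document's class C.
struct C
{
    int c;

    // 'this' is a hidden first argument (passed in ecx under __thiscall on x86).
    constexpr int ThiscallFunction1(int a, int b) const
    {
        return (a + b) * c;  // 'c' here means this->c
    }
};

// Hypothetical free function that makes the hidden argument explicit.
constexpr int ThiscallFunction1Explicit(const C* self, int a, int b)
{
    return (a + b) * self->c;
}

// True if the member-function form and the explicit form agree.
constexpr bool BothFormsAgree(int cValue, int a, int b)
{
    C obj{cValue};
    return obj.ThiscallFunction1(a, b) == ThiscallFunction1Explicit(&obj, a, b);
}
```

In machine terms, the member form and the explicit form differ only in where the object pointer travels: in ecx for __thiscall, and as an ordinary first argument otherwise.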

In some circumstances, with non-exported, internal-only-to-a-module functions, CL may use ebx instead of ecx for this. For any exported __thiscall function (or a function whose address escapes a module), the compiler must use ecx for this, however.

Continuing with the previous examples, consider a function implementation like so:

class C
{
public:
	int c;

	__declspec(noinline)
	int ThiscallFunction1(int a, int b)
	{
		return (a + b) * c;
	}
};

This function operates the same as the other example functions that we have used, with the exception that ‘c’ is a member variable and not a parameter.

The implementation of this function looks like so in assembler:

C::ThiscallFunction1 proc near

a= dword ptr  4
b= dword ptr  8

mov     eax, [esp+8]       ; eax=b
mov     edx, [esp+4]       ; edx=a
add     eax, edx           ; eax=a+b
imul    eax, [ecx]         ; eax=eax*this->c
retn    8                  ; return eax;
C::ThiscallFunction1 endp

Note that [ecx] (that is, [ecx+0]) refers to the member variable ‘c’, which lives at offset 0 from this. This function is similar to the __stdcall version, except that instead of being passed as an explicit argument, ‘c’ is implicitly passed as part of the class object (this) and is then referenced off of the this pointer.

Consider a call to this function like this in C++:

C* c = new C;
c->c = 3;
c->ThiscallFunction1(1, 2);

This is actually a bit more complicated than the other examples, because we also have a call to operator new to allocate memory for the class object. In this instance, operator new is a __cdecl function that takes a single argument, which is the count in bytes to allocate. Here, sizeof(class C) is 4 bytes.

In assembler, we can thus expect to see something like this:

push    4                    ; sizeof(class C)
call    operator new         ; allocate a class C object
add     esp, 4               ; clean stack from new call
push    2                    ; 'b'
push    1                    ; 'a'
mov     ecx, eax             ; (class C* c) 'this'
mov     dword ptr [eax], 3   ; c->c = 3
call    C::ThiscallFunction1 ; Make the call

Ignoring the call to operator new for the most part, this is roughly what we would expect, given the C++ code: ecx is used to pass this, and this->c is set to 3 before the call to ThiscallFunction1.
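Both the callee's bare [ecx] operand and the push 4 before operator new follow from the layout of class C. A quick compile-time check of those two facts, assuming a standard-layout class with a single int member (on Win32, int is 4 bytes, so sizeof works out to 4):

```cpp
#include <cstddef>  // offsetof, std::size_t

// Same data layout as the document's class C: a single int member.
class C
{
public:
    int c;
};

// 'c' is at offset 0 from 'this', so the callee can use a bare [ecx] operand.
constexpr std::size_t kOffsetOfC = offsetof(C, c);

// The byte count passed to operator new for 'new C' (4 where int is 4 bytes).
constexpr std::size_t kSizeOfC = sizeof(C);
```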

With all of this information, you should have all you need to recognize and identify __thiscall functions. The main takeaways are:

  • ecx is used as an argument register, along with the stack, but not edx. This allows you to differentiate between a __fastcall and __thiscall function.
  • Arguments passed on the stack are cleaned by the callee and not the caller, like __stdcall.
  • For virtual function calls, look for a vtable pointer as the first class member (at offset 0) from this. (For multiple inheritance, things are a bit more complex; I am ignoring this case right now.) Loading ecx with this, then reading the vtable to fetch a function pointer and calling through it, is a tell-tale sign of a __thiscall virtual function call.
  • For functions whose visibility scope is confined to one module, the compiler sometimes uses ebx instead of ecx to pass this.

Note that if you explicitly specify a calling convention on a class member function, the function ceases to be __thiscall and takes on the characteristics of the specified calling convention, passing this as the first argument according to the conventions of the requested calling convention.

That’s all for __thiscall. Next up in this series is a brief review and table of contents of what we have covered so far with common Win32 x86 calling conventions.

Win32 calling conventions: __fastcall in assembler

Monday, October 30th, 2006

The __fastcall calling convention is the last major C-supported Win32 (x86) calling convention that I have not covered yet. (There is still __thiscall, which I’ll discuss later.)

__fastcall is, as you might guess from the name, a calling convention that is designed for speed. In this spirit, it borrows from many RISC calling conventions in that it tries to be register-based instead of stack-based. Unfortunately, for all but the smallest or simplest functions, __fastcall typically does not end up being much of a performance win on x86, primarily due to the (comparatively) extremely limited register set that x86 sports.

This calling convention has a great deal in common with the x64 calling convention that Win64 uses. In fact, aside from the x64-specific parts of the x64 calling convention, you can think of the x64 calling convention as a logical extension of __fastcall that is designed to take advantage of the expanded register set available with x64 processors.

What this boils down to is that __fastcall will try to pass the first two pointer-sized (or smaller) arguments in the ecx and edx registers. Any additional arguments are passed on the stack, as per __stdcall.
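The placement rule can be sketched as a small function. This is a simplified model assuming Microsoft's documented rule that the first two arguments of 4 bytes or smaller, scanning left to right, go in ecx and then edx, with everything else on the stack in 4-byte-rounded slots; it ignores floating-point and struct arguments, and PlaceFastcallArgs is a hypothetical name.

```cpp
#include <cstddef>

// Where one argument ends up under the simplified model.
enum class Location { Ecx, Edx, Stack };

template <std::size_t N>
struct Placement
{
    Location where[N];    // location of each argument, in declaration order
    unsigned stackBytes;  // bytes the callee must pop with "retn n"
};

// Simplified __fastcall placement: scan arguments left to right; the first two
// that are 4 bytes or smaller go to ecx, then edx. Everything else goes on the
// stack, each slot rounded up to a multiple of 4 bytes.
template <std::size_t N>
constexpr Placement<N> PlaceFastcallArgs(const unsigned (&sizes)[N])
{
    Placement<N> p{};
    unsigned registersUsed = 0;
    for (std::size_t i = 0; i < N; ++i)
    {
        if (sizes[i] <= 4 && registersUsed < 2)
        {
            p.where[i] = (registersUsed == 0) ? Location::Ecx : Location::Edx;
            ++registersUsed;
        }
        else
        {
            p.where[i] = Location::Stack;
            p.stackBytes += (sizes[i] + 3) / 4 * 4;
        }
    }
    return p;
}

constexpr unsigned kFastcallFunction1Args[] = {4, 4, 4};  // int a, int b, int c
constexpr unsigned kInt64ThenTwoInts[]      = {8, 4, 4};  // __int64, int, int
```

For a three-int function, the model puts a and b in ecx and edx and leaves 4 bytes on the stack (a retn 4); a leading 64-bit argument goes to the stack without consuming a register.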

In practice, the key things to look out for with a __fastcall function are thus:

  • The callee assumes a meaningful value in the ecx (or edx and ecx) registers. This is a tell-tale sign of __fastcall (although, you may sometimes see __thiscall make use of ecx for the this pointer).
  • No arguments are cleaned off the stack by the caller. (Of the conventions covered here, only __cdecl has the caller clean the stack.)
  • The callee ends in a retn (args-2)*4 instruction. In general, this is the pattern that you will see with __fastcall functions that use the stack. For __fastcall functions where no stack parameters are used, the function typically ends in a ret instruction with no stack displacement argument.
  • The callee is a short function with very few arguments. These are the most likely cases where a smart programmer will use __fastcall, as otherwise, __fastcall does not tend to buy you very much over __stdcall.
  • Functions that interface directly with assembler. Having access to ecx and edx can be a handy shortcut for a C function that is being called by something that is obviously written in assembler.

Taking these into account, let’s take a look at the same sample function and function call that we have been previously dealing with in our earlier examples, this time in __fastcall.

The function that we are going to call is declared as so:

__declspec(noinline)
int __fastcall FastcallFunction1(int a, int b, int c)
{
	return (a + b) * c;
}

This is consistent with our previous examples, save that it is declared __fastcall.

The function call that we shall make is as so:

FastcallFunction1(1, 2, 3);

With this code, we can expect the function call to look something like so in assembler:

push    3                 ; push 'c' onto the stack
push    2                 ; place a constant 2 on the stack
xor     ecx, ecx          ; move 0 into 'a' (ecx)
pop     edx               ; pop 2 off the stack and into edx.
inc     ecx               ; set 'a' -- ecx to 1 (0+1)
call    FastcallFunction1 ; make the call (a=1, b=2, c=3)

This is actually a bit different than we might expect. Here, the compiler has been a bit clever and used some basic optimizations when loading constants into registers. These optimizations are extremely common; given how frequently they show up, you should get used to reading them as simple constant assignments to registers. In a future series, I’ll go into more detail about common compiler optimizations like these, but that’s a tale for a different time.
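One plausible motivation is code size. The compiler's actual reasoning isn't visible in the listing, so treat this as an illustration, but counting the 32-bit encodings shows the clever sequences are 4 bytes shorter than two mov reg, imm32 instructions:

```cpp
#include <cstddef>

// Sequences the compiler chose in the call above (32-bit encodings).
constexpr unsigned char kPush2PopEdx[] = {
    0x6A, 0x02,  // push 2        (push imm8)
    0x5A         // pop  edx
};
constexpr unsigned char kXorIncEcx[] = {
    0x33, 0xC9,  // xor  ecx, ecx (31 C9 is an equally short encoding)
    0x41         // inc  ecx      (one-byte form, valid in 32-bit code only)
};

// Straightforward alternatives.
constexpr unsigned char kMovEdx2[] = {0xBA, 0x02, 0x00, 0x00, 0x00};  // mov edx, 2
constexpr unsigned char kMovEcx1[] = {0xB9, 0x01, 0x00, 0x00, 0x00};  // mov ecx, 1

constexpr std::size_t kBytesSaved =
    (sizeof(kMovEdx2) + sizeof(kMovEcx1)) - (sizeof(kPush2PopEdx) + sizeof(kXorIncEcx));
```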

Continuing with __fastcall, here’s what the implementation of FastcallFunction1 looks like in assembler:

FastcallFunction1 proc near

c= dword ptr  4

lea     eax, [ecx+edx] ; eax = a + b
imul    eax, [esp+4]   ; eax = (eax * c)
retn    4              ; return eax;
FastcallFunction1 endp

As you can see, in this particular instance, __fastcall turns out to be a big saver in terms of instructions executed by the callee (and thus code size and, to a lesser degree, speed). This kind of benefit is usually restricted to extremely simple functions, however.

The main things to consider, then, when trying to identify whether a function is __fastcall are:

  • Usage of the ecx (or ecx and edx) registers in the function without loading them with explicit values before-hand. This typically indicates that they are being used as argument registers, as with __fastcall.
  • The caller does not clean any arguments off the stack (no add esp instruction to clean the stack after the call). With __fastcall, the callee always cleans the arguments (if any).
  • A ret instruction (with no stack displacement argument) terminating the function, if there are two or fewer arguments that are pointer-sized or smaller. In this case, __fastcall has no stack arguments.
  • A retn (args-2)*4 instruction terminating the function, if there are three or more arguments to the function. In this case, there are stack arguments that must be cleaned off the stack via the retn instruction.
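The checklist can be condensed into a rough decision procedure. This is a heuristic sketch, not a reliable classifier: optimizations such as deferred stack cleanup, or ebx being used for this, will fool it, and the Observations fields are simply whatever you noted by hand from the disassembly (all names here are hypothetical).

```cpp
// Facts noted by hand from a disassembly listing.
struct Observations
{
    bool     readsEcxFirst;  // callee uses ecx before ever loading it
    bool     readsEdxFirst;  // callee uses edx before ever loading it
    unsigned retOperand;     // n in the callee's "ret n" (0 for a plain ret)
    bool     callerAddsEsp;  // caller runs "add esp, n" right after the call
};

enum class Guess { Cdecl, Stdcall, Fastcall, FastcallOrThiscall, CdeclOrStdcall };

// Rough heuristic following the checklists in this series.
constexpr Guess GuessCallingConvention(const Observations& o)
{
    if (o.readsEdxFirst)
        return Guess::Fastcall;            // only __fastcall passes arguments in edx
    if (o.readsEcxFirst)
        return Guess::FastcallOrThiscall;  // ecx alone: could be either
    if (o.retOperand != 0)
        return Guess::Stdcall;             // callee cleans, no register arguments
    if (o.callerAddsEsp)
        return Guess::Cdecl;               // caller cleans
    return Guess::CdeclOrStdcall;          // e.g. zero arguments: indistinguishable
}
```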

That’s all for __fastcall. More on other calling conventions next time…

Debugging (or reverse engineering…) a real life Windows Vista compatibility problem: CreateIpForwardEntry in iphlpapi

Tuesday, October 24th, 2006

Since I’m at the Microsoft Vista compatibility lab, it only makes sense that I’ve fixed a few Vista compatibility bugs in our product today.

Some of these were genuine bugs, but one in particular was especially infuriating: a completely undocumented, seemingly arbitrary restriction placed on a publicly documented API that has been around since Windows 98.

In this particular case, one of our products was unable to add routes on Vista. This worked fine on the prior platforms we supported, so I started looking into it as a compatibility problem. First things first, I narrowed the problem down to the particular API that was failing.

We have a function that wraps the various details of creating routes. The function in question went approximately like so:

//
// Add a route through the desired gateway.
//

DWORD
AddRoute(
	__in unsigned long Network,
	__in unsigned long Mask,
	__in unsigned long Gateway
	)
{
	MIB_IPFORWARDROW Row;
	DWORD            Status, ForwardType;
	unsigned long    InterfaceIp, InterfaceIndex;

[...]	// (Code to determine the local
	// interface to add the route on)

	//
	// Setup the IP forward row.
	//

	ZeroMemory(&Row,
		sizeof(Row));

	Row.dwForwardDest    = Network;
	Row.dwForwardMask    = Mask;
	Row.dwForwardPolicy  = 0;
	Row.dwForwardNextHop = Gateway;
	Row.dwForwardIfIndex = InterfaceIndex;
	Row.dwForwardType    = ForwardType;
	Row.dwForwardProto   = PROTO_IP_NETMGMT;
	Row.dwForwardAge     = INFINITE;
	Row.dwForwardMetric1 = 0;

	//
	// Create the route.
	//

	if ((Status = CreateIpForwardEntry(&Row))
		!= NO_ERROR)
	{
		wprintf(L"Creation failed, %lu.\n",
			Status);
		return Status;
	}

[...]	// (More unrelated boilerplate code)

	return Status;
}

Essentially, the problem here was that CreateIpForwardEntry was failing. The logs showed that the error code returned was 0xA0.

Using the handy Microsoft error code lookup utility (err.exe), it was easy to determine what this error code means:

C:\>err a0
# for hex 0xa0 / decimal 160 :
  INTERNAL_POWER_ERROR                            bugcodes.h
  LLC_STATUS_BIND_ERROR                           dlcapi.h
  SQL_160_severity_15                             sql_err
# Rule does not contain a variable.
  ERROR_BAD_ARGUMENTS                             winerror.h
# One or more arguments are not correct.
  SCW_E_TOOMUCHDATAIN                             wpscoserr.mc
# Too much incoming data%0
# 5 matches found for "a0"

The only error that makes sense in this context is ERROR_BAD_ARGUMENTS. Unfortunately, that is not really all that helpful. Checking the latest MSDN documentation for CreateIpForwardEntry, there is, of course, no mention of this error code whatsoever.

Additionally, nothing in the Microsoft documentation immediately suggested what the problem might be.

Although the Microsoft people here for the Vista lab did offer to put me in touch with someone on the product team who might have an explanation for this behavior, I decided to take a crack at digging into the internals of CreateIpForwardEntry in the meantime, to see if I could come up with a fix sooner. After searching around a bit on Google without finding any good explanation, I decided to step into iphlpapi!CreateIpForwardEntry in the debugger and see just what was going wrong first-hand.

0:000> bu iphlpapi!CreateIpForwardEntry
breakpoint 0 redefined
0:000> g
Breakpoint 0 hit
eax=0012fd6c ebx=00000004 ecx=00000000 edx=00000000
esi=01040a0a edi=00000003
eip=751bdfc1 esp=0012fd58 ebp=0012fdb0 iopl=0
nv up ei pl nz ac pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000
efl=00000216
iphlpapi!CreateIpForwardEntry:
751bdfc1 8bff            mov     edi,edi

Looking at the disassembly of CreateIpForwardEntry, it’s clear that this function is now just a stub that forwards the call onto another function that performs the real work:

0:000> u @eip
iphlpapi!CreateIpForwardEntry:
751bdfc1 8bff       mov     edi,edi
751bdfc3 55         push    ebp
751bdfc4 8bec       mov     ebp,esp
751bdfc6 6a01       push    1
751bdfc8 ff7508     push    dword ptr [ebp+8]
751bdfcb e820ffffff call    CreateOrSetIpForwardEntry
751bdfd0 5d         pop     ebp
751bdfd1 c20400     ret     4

So, I pressed onward, stepping into iphlpapi!CreateOrSetIpForwardEntry:

0:000> tc
iphlpapi!CreateIpForwardEntry+0xa:
751bdfcb e820ffffff call    CreateOrSetIpForwardEntry
0:000> t
eax=0012fd6c ebx=00000004 ecx=00000000 edx=00000000
esi=01040a0a edi=00000003
eip=751bdef0 esp=0012fd48 ebp=0012fd54 iopl=0
nv up ei pl nz ac pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000
efl=00000216
iphlpapi!CreateOrSetIpForwardEntry:
751bdef0 8bff            mov     edi,edi

Looking at the disassembly, there appears to be only one place where the error code ERROR_BAD_ARGUMENTS is returned (disassembly truncated for better viewing):

0:000> uf @eip
iphlpapi!CreateOrSetIpForwardEntry:
751bdef0 8bff            mov     edi,edi
751bdef2 55              push    ebp
751bdef3 8bec            mov     ebp,esp
751bdef5 83ec48          sub     esp,48h
751bdef8 8365b800        and     dword ptr [ebp-48h],0
751bdefc 56              push    esi
751bdefd 6a2c            push    2Ch
751bdeff 8d45bc          lea     eax,[ebp-44h]
751bdf02 6a00            push    0
751bdf04 50              push    eax
751bdf05 e8f053ffff      call    memset
751bdf0a 8b7508          mov     esi,dword ptr [ebp+8]

[...]

;
; Convert the interface index we passed in with
; the pRoute structure (pRoute->dwForwardIfIndex)
; into an interface LUID, stored at [ebp-30h].
;

751bdf36 8d45d0          lea     eax,[ebp-30h]
751bdf39 50              push    eax
751bdf3a ff7610          push    dword ptr [esi+10h]
751bdf3d e86590ffff      call    ConvertInterfaceIndexToLuid
751bdf42 85c0            test    eax,eax
751bdf44 7571            jne     751bdfb7


;
; Get the interface metric for the requested interface,
; and store it at [ebp+8].  We pass in the address of
; the LUID of the requested interface in order to make
; the check.
;

iphlpapi!CreateOrSetIpForwardEntry+0x56:
751bdf46 8d4508          lea     eax,[ebp+8]
751bdf49 50              push    eax
751bdf4a 8d45d0          lea     eax,[ebp-30h]
751bdf4d 50              push    eax
751bdf4e e802f4ffff      call    GetInterfaceMetric

[...]

;
; Load esi with pRoute->dwForwardMetric1
;

751bdf6c 8b7624          mov     esi,dword ptr [esi+24h]
751bdf6f 6a06            push    6
751bdf71 8945e0          mov     dword ptr [ebp-20h],eax
751bdf74 83c8ff          or      eax,0FFFFFFFFh
751bdf77 3b7508          cmp     esi,dword ptr [ebp+8]
751bdf7a 59              pop     ecx
751bdf7b 8d7de8          lea     edi,[ebp-18h]
751bdf7e f3ab            rep stos dword ptr es:[edi]
751bdf80 8945ec          mov     dword ptr [ebp-14h],eax
751bdf83 8945f0          mov     dword ptr [ebp-10h],eax
751bdf86 5f              pop     edi

;
; Check that esi is not less than [ebp+8]
; ... in other words, verify that
; pRoute->dwForwardMetric1 >= InterfaceMetric,
; where InterfaceMetric is set by GetInterfaceMetric()
;

751bdf87 7229            jb      751bdfb2 ; failure

iphlpapi!CreateOrSetIpForwardEntry+0x99:
751bdf89 2b7508          sub     esi,dword ptr [ebp+8]
751bdf8c 6a18            push    18h
751bdf8e 8d45e8          lea     eax,[ebp-18h]
751bdf91 50              push    eax
751bdf92 6a30            push    30h
751bdf94 8d45b8          lea     eax,[ebp-48h]
751bdf97 50              push    eax
751bdf98 6a10            push    10h
751bdf9a 6864331b75      push    751b3364
751bdf9f ff750c          push    dword ptr [ebp+0Ch]
751bdfa2 8975f4          mov     dword ptr [ebp-0Ch],esi
751bdfa5 6a01            push    1
751bdfa7 c645ff01        mov     byte ptr [ebp-1],1

;
; Call the NsiSetAllParameters internal API to create the
; route, and return its return value to the caller.
;

751bdfab e86857ffff      call    NsiSetAllParameters
751bdfb0 eb05            jmp     751bdfb7
[...]

iphlpapi!CreateOrSetIpForwardEntry+0xc2:
;
; Return ERROR_BAD_ARGUMENTS
;
751bdfb2 b8a0000000      mov     eax,0A0h

iphlpapi!CreateOrSetIpForwardEntry+0xc7:
751bdfb7 5e              pop     esi
751bdfb8 c9              leave
751bdfb9 c20800          ret     8

From this annotated disassembly, we can conclude that there are only two possibilities that might result in this behavior. The first is that GetInterfaceMetric(&InterfaceLuid, &InterfaceMetric) is returning an InterfaceMetric greater than the metric we are supplying. The second is that NsiSetAllParameters is returning ERROR_BAD_ARGUMENTS.
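The first possibility boils down to a single unsigned comparison. Modeled in C++ (the field and function names are hypothetical; 0xA0 is ERROR_BAD_ARGUMENTS, and a passing check says nothing about whether NsiSetAllParameters later succeeds):

```cpp
#include <cstdint>

constexpr std::uint32_t kNoError           = 0;     // NO_ERROR
constexpr std::uint32_t kErrorBadArguments = 0xA0;  // ERROR_BAD_ARGUMENTS (160)

struct MetricCheckResult
{
    std::uint32_t status;        // kErrorBadArguments if the check fails
    std::uint32_t storedMetric;  // value kept for NsiSetAllParameters on success
};

// Model of the code at 751bdf77-751bdf89:
//   cmp esi, [ebp+8]  ; dwForwardMetric1 vs. interface metric
//   jb  751bdfb2      ; unsigned "below": return ERROR_BAD_ARGUMENTS
//   sub esi, [ebp+8]  ; otherwise keep only the difference
constexpr MetricCheckResult CheckRouteMetric(std::uint32_t forwardMetric1,
                                             std::uint32_t interfaceMetric)
{
    if (forwardMetric1 < interfaceMetric)
        return {kErrorBadArguments, 0};
    return {kNoError, forwardMetric1 - interfaceMetric};
}
```

Note the sub esi, [ebp+8]: on success, only the difference between the route metric and the interface metric is carried forward.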

To test this theory, we need to examine the comparison at 751bdf87 to determine if that is taking the failure branch, and we need to check the return value of NsiSetAllParameters. This is fairly easy to do with a couple of breakpoints:

0:000> bu 751bdf87 
0:000> bu 751bdfb0 
0:000> g
Breakpoint 1 hit
eax=ffffffff ebx=00000004 ecx=00000000 edx=7707e524
esi=00000000 edi=00000003
eip=751bdf87 esp=0012fcf8 ebp=0012fd44 iopl=0
nv up ei ng nz ac pe cy
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000
efl=00000297
iphlpapi!CreateOrSetIpForwardEntry+0x97:
751bdf87 7229            jb      751bdfb2 [br=1]

Our first breakpoint, the one on the comparison with the “Interface Metric” and the route metric we supplied in pRoute->dwForwardMetric1, was the one that hit first (as expected). Looking at the register context supplied by WinDbg, though, we can clearly see that the program is going to take the branch and head down the code path that returns ERROR_BAD_ARGUMENTS. Problem identified!

There still remains the issue of solving the problem, though. Looking at [ebp+8], it appears that the undocumented iphlpapi!GetInterfaceMetric returned 10:

0:000> ? dwo(@ebp+8)
Evaluate expression: 10 = 0000000a

This makes sense. We supplied a metric of 0, which is obviously less than 10. Unfortunately, now we need a good way to determine whether we should use a zero metric (for previous OS versions) or a different metric (for Vista), assuming we want our route to take the highest precedence for a particular network/mask value.
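Given the comparison above, one option is to pick dwForwardMetric1 based on the OS version. The sketch below assumes you already know whether you are on Vista and have obtained the interface metric somehow (for example, by replicating GetInterfaceMetric, or from the documented routine mentioned in the update at the end of this post); ChooseForwardMetric1 and routeMetricOffset are hypothetical names.

```cpp
#include <cstdint>
#include <limits>

// Chooses dwForwardMetric1 for a new route. routeMetricOffset is the extra cost
// wanted on top of the interface metric (0 = most preferred route).
constexpr std::uint32_t ChooseForwardMetric1(bool vistaOrLater,
                                             std::uint32_t interfaceMetric,
                                             std::uint32_t routeMetricOffset)
{
    if (!vistaOrLater)
        return routeMetricOffset;  // pre-Vista: a metric of 0 was accepted

    // Vista: dwForwardMetric1 must not be below the interface metric.
    constexpr std::uint32_t kMax = std::numeric_limits<std::uint32_t>::max();
    if (routeMetricOffset > kMax - interfaceMetric)
        return kMax;               // saturate rather than wrap around
    return interfaceMetric + routeMetricOffset;
}
```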

Unfortunately, MSDN doesn’t turn up any hits on GetInterfaceMetric, and neither does Google. Well, that sucks. It looks like, on Vista, unless I want to hardcode 10, I’ll have to go off into undocumented land just to use a publicly documented API. There seems to be something a bit ironic about that, but, nonetheless, the problem remains to be solved.

Update: There is a (minimally) documented solution that was very recently made available. See the bottom of the post for details.

So, all that we need to do is reverse engineer the parameters to this undocumented GetInterfaceMetric function and call it, right?

Well, no, not exactly; things actually get worse. It turns out that GetInterfaceMetric is not even exported from iphlpapi.dll. It’s a purely internal function!

The only other option at this point, aside from hardcoding 10 as a minimum metric, is to reimplement all of the functionality of GetInterfaceMetric ourselves. Taking a look at GetInterfaceMetric, things look unfortunately rather complicated:

0:000> uf iphlpapi!GetInterfaceMetric
iphlpapi!GetInterfaceMetric:
751bd355 8bff            mov     edi,edi
751bd357 55              push    ebp
751bd358 8bec            mov     ebp,esp
751bd35a 6a1c            push    1Ch
751bd35c 6a04            push    4
751bd35e ff750c          push    dword ptr [ebp+0Ch]
751bd361 6a00            push    0
751bd363 6a08            push    8
751bd365 ff7508          push    dword ptr [ebp+8]
751bd368 6a07            push    7
751bd36a 6864331b75      push    NPI_MS_IPV4_MODULEID
751bd36f 6a01            push    1
751bd371 e88f5fffff      call    NsiGetParameter
751bd376 5d              pop     ebp
751bd377 c20800          ret     8

NPI_MS_IPV4_MODULEID is a global variable of some sort in iphlpapi:

0:000> db iphlpapi!NPI_MS_IPV4_MODULEID l8
751b3364  18 00 00 00 01 00 00 00  ........

Using the x command with the /a option (sort by ascending address), we can make an educated guess as to the size of this global by enumerating all symbols in iphlpapi in address order:

0:000> x /a iphlpapi!*
[...]
751b3364 iphlpapi!NPI_MS_IPV4_MODULEID = <no type information>
751b3381 iphlpapi!NsiAllocateAndGetTable = <no type information>
[...]

So, we know that NPI_MS_IPV4_MODULEID must be no more than 0x1d bytes long. Looking at the memory around NPI_MS_IPV4_MODULEID, we see that starting 0x18 bytes in, there appears to be code (nop instructions), making it likely that the global is 0x18 bytes long.

0:000> db 751b3364 
751b3364  18 00 00 00 01 00 00 00-00 4a 00 eb 1a 9b d4 11
751b3374  91 23 00 50 04 77 59 bc-90 90 90 90 90 ff 25 94

(The repeated 90 90 90 90 bytes are a typical sign of code. 90 is the opcode for the nop instruction on x86, which the compiler typically uses for padding out function start offsets for alignment.)
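The size reasoning is simple address arithmetic. The check below confirms it and also reads the first DWORD of the dump as little-endian: it happens to be 0x18 as well, though whether that is a length field is only a guess.

```cpp
#include <cstdint>

constexpr std::uint32_t kModuleIdAddress   = 0x751b3364;  // iphlpapi!NPI_MS_IPV4_MODULEID
constexpr std::uint32_t kNextSymbolAddress = 0x751b3381;  // iphlpapi!NsiAllocateAndGetTable
constexpr std::uint32_t kFirstNopAddress   = 0x751b337c;  // first 0x90 byte in the dump

// Upper bound on the global's size: distance to the next symbol.
constexpr std::uint32_t kMaxSize = kNextSymbolAddress - kModuleIdAddress;

// Likely size: distance to where the nop padding begins.
constexpr std::uint32_t kLikelySize = kFirstNopAddress - kModuleIdAddress;

// First eight bytes from "db iphlpapi!NPI_MS_IPV4_MODULEID l8".
constexpr unsigned char kDump[] = {0x18, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00};

// Reads a little-endian 32-bit value, as x86 stores it.
constexpr std::uint32_t ReadLe32(const unsigned char* p)
{
    return std::uint32_t(p[0]) | (std::uint32_t(p[1]) << 8) |
           (std::uint32_t(p[2]) << 16) | (std::uint32_t(p[3]) << 24);
}
```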

Given this, we should be able to replicate the behavior of GetInterfaceMetric, as the only function it calls, NsiGetParameter, is exported by nsi.dll (of course, it isn’t documented…). From the above disassembly, we can see that NsiGetParameter takes the following arguments, in order:

  • a ulong-sized argument (constant 0x1),
  • a pointer argument (address of NPI_MS_IPV4_MODULEID),
  • a ulong-sized argument (constant 0x7),
  • a pointer that is the address of the interface LUID (argument 1 of GetInterfaceMetric, which we saw earlier),
  • a ulong-sized argument (constant 0x8),
  • a ulong or pointer-sized argument (constant 0x0),
  • a pointer-sized argument (address of a ULONG that receives the “interface metric”),
  • a ulong-sized argument (constant 0x4), and (finally!)
  • a ulong-sized argument (constant 0x1c).

I would surmise that the 0x8 and 0x4 constants are the sizes of the LUID and output buffer, though I haven’t bothered to confirm that at this point.

From our knowledge of __stdcall, we can quickly identify NsiGetParameter as __stdcall by looking at the disassembly of GetInterfaceMetric and noticing what happens after the call: the caller does not remove the arguments from the stack, leaving that task to the callee (NsiGetParameter).
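Reading arguments off a push sequence is mechanical. Under __stdcall (as with __cdecl), the caller pushes right to left, so the last push before the call is argument 1. A small sketch that reverses GetInterfaceMetric's pushes (the string labels are just annotations taken from the listing; PushesToArguments is a hypothetical helper):

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Reverses a push sequence into declaration order: with right-to-left pushing
// (__stdcall and __cdecl), the last value pushed is argument 1.
template <typename T, std::size_t N>
constexpr std::array<T, N> PushesToArguments(const std::array<T, N>& pushes)
{
    std::array<T, N> args{};
    for (std::size_t i = 0; i < N; ++i)
        args[i] = pushes[N - 1 - i];
    return args;
}

// The pushes in iphlpapi!GetInterfaceMetric, in execution order.
constexpr std::array<std::string_view, 9> kGetInterfaceMetricPushes = {
    "0x1c", "0x4", "[ebp+0Ch] (metric pointer)", "0x0", "0x8",
    "[ebp+8] (LUID pointer)", "0x7", "&NPI_MS_IPV4_MODULEID", "0x1"};
```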

Given all of this, we can make our own function that implements GetInterfaceMetric. Now, just to be clear, I would not recommend actually using this, unless Microsoft fails to provide a documented mechanism to determine the minimum metric permitted for CreateIpForwardEntry (or removes the restriction) prior to Vista RTM. I am going to try to do whatever I can to find out what ISVs are supposed to do about this particular problem (and whether it can be fixed before RTM) before this week is up, but in the event that I don’t get anywhere, I’ll have a backup plan (as ugly and hackish as it may be), which is better than not being able to manipulate the route table, period, on Vista.

Anyway, the basic idea is that we call ConvertInterfaceIndexToLuid on the InterfaceIndex that we already have from iphlpapi, to convert it into a NET_LUID structure (new to Vista). Conveniently, ConvertInterfaceIndexToLuid is a documented API, so that part is easy.

Then, we simply replicate the call that we saw in GetInterfaceMetric inside iphlpapi.dll. For brevity, I am not posting the entire source code for my implementation of GetInterfaceMetric inline; you can, however, download it. With this reverse engineered implementation, all that is left is to call it to get the minimum metric for the interface we are about to add a route on, and place that metric in the MIB_IPFORWARDROW that we pass to CreateIpForwardEntry.
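The downloadable implementation isn't reproduced here. As an independent sketch of the same idea, assuming the argument order reconstructed above, a replica can take NsiGetParameter as a function pointer so it can be exercised without Windows. Every parameter name below is a guess, the typed metric pointer stands in for what is presumably an untyped buffer, the module ID must be supplied by the caller (NPI_MS_IPV4_MODULEID is iphlpapi's private global), and FakeNsiGetParameter is a stand-in that reports a made-up metric of 10. On Win32, the real export would be __stdcall and could be bound with GetProcAddress on nsi.dll.

```cpp
#include <cstdint>

// Hypothetical signature reconstructed from GetInterfaceMetric's pushes.
// All parameter names are guesses; the real export is __stdcall on Win32.
using NsiGetParameterFn = std::uint32_t (*)(
    std::uint32_t  store,            // constant 1
    const void*    moduleId,         // &NPI_MS_IPV4_MODULEID
    std::uint32_t  objectIndex,      // constant 7
    const void*    key,              // &interface LUID
    std::uint32_t  keySize,          // constant 8 (perhaps sizeof(NET_LUID))
    std::uint32_t  reserved,         // constant 0 (ulong or pointer)
    std::uint32_t* parameter,        // receives the interface metric
    std::uint32_t  parameterSize,    // constant 4 (perhaps sizeof(ULONG))
    std::uint32_t  parameterOffset); // constant 0x1c

// Replica of the single call made by iphlpapi!GetInterfaceMetric.
constexpr std::uint32_t GetInterfaceMetricReplica(NsiGetParameterFn nsiGetParameter,
                                                  const void* moduleId,
                                                  const std::uint64_t* interfaceLuid,
                                                  std::uint32_t* metric)
{
    return nsiGetParameter(1, moduleId, 7, interfaceLuid, 8, 0, metric, 4, 0x1c);
}

// Stand-in for testing only: checks the constants, reports a made-up metric of 10.
constexpr std::uint32_t FakeNsiGetParameter(std::uint32_t store, const void* moduleId,
                                            std::uint32_t objectIndex, const void* key,
                                            std::uint32_t keySize, std::uint32_t reserved,
                                            std::uint32_t* parameter,
                                            std::uint32_t parameterSize,
                                            std::uint32_t parameterOffset)
{
    if (store != 1 || moduleId == nullptr || objectIndex != 7 || key == nullptr ||
        keySize != 8 || reserved != 0 || parameterSize != 4 || parameterOffset != 0x1c)
        return 0xA0;  // ERROR_BAD_ARGUMENTS
    *parameter = 10;
    return 0;         // NO_ERROR
}

// Runs the replica against the stand-in with placeholder inputs.
constexpr std::uint32_t DemoInterfaceMetric()
{
    const unsigned char moduleId[0x18] = {};  // placeholder for NPI_MS_IPV4_MODULEID
    const std::uint64_t luid = 0;             // placeholder for a NET_LUID
    std::uint32_t metric = 0;
    if (GetInterfaceMetricReplica(&FakeNsiGetParameter, moduleId, &luid, &metric) != 0)
        return 0xFFFFFFFF;
    return metric;
}
```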

I’ll post back when I hear from Microsoft about the official way to handle this situation. At this point, I fully expect that there will be a documented API (or that the restriction will go away) before RTM, given that this is a rather bad compatibility bug: it breaks a long-existing documented API in a way that forces you into undocumented hackery to keep using it (especially since there is no other good way that I know of to replicate the functionality of the API in question).

Update: You can use the GetIpInterfaceEntry routine (new to Vista, in iphlpapi) to find the minimum metric for an interface. Note that you will very likely need to search MSDN to find information on this function, as, to my knowledge, it has not been included in recent SDKs.

(Note: Some of the debugger output was slightly modified or truncated by me to keep the formatting sane.)

Win32 calling conventions: __stdcall in assembler

Friday, October 20th, 2006

It’s been a while since my last post, unfortunately, primarily because I have been a bit swamped with work and a couple of other things lately. With that said, I’m going to start by picking up where I had previously left off with the Win32 calling conventions series. Without further ado, here’s the stuff on __stdcall as you’ll see it in assembler…

Like __cdecl, __stdcall is completely stack-based.  The semantics of __stdcall are very similar to __cdecl, except that the arguments are cleaned off the stack by the callee instead of the caller.  Because the number of arguments removed from the stack is burned into the target function at compile time, there is no support for variadic functions (functions that take a variable number of arguments, such as printf) that use the __stdcall calling convention.  The rules for register usage and return values are otherwise identical to __cdecl.

In practice, this typically means that an __stdcall function call will look much like a __cdecl function call until you examine the ret instruction that returns control to the caller at the end of the __stdcall function in question.  (Alternatively, you can check whether stack arguments appear to be cleaned up after the function call.  However, the compiler/optimizer sometimes likes to be tricky with __cdecl functions and defers argument removal until several function calls later, so this method is less reliable.)

Because the callee cleans the arguments off the stack in an __stdcall function, you will always[1] see a ret instruction with a byte-count operand (the number of bytes of arguments to pop) terminating an __stdcall function that takes arguments.  For most functions, this count is four times the number of arguments to the function, but this can vary if arguments larger than 32 bits are passed.  On Win32, this byte count is virtually always[2] a multiple of four, as the compiler will always generate code that aligns the stack to at least four bytes for x86 targets.
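As a worked example, assuming each argument occupies a stack slot rounded up to a multiple of four bytes (x86 Win32; by-value structs with unusual alignment are ignored), the ret operand can be computed like this. StdcallRetBytes is a hypothetical helper:

```cpp
#include <cstddef>

// Bytes an __stdcall callee pops with "ret n", assuming each argument takes a
// stack slot rounded up to a multiple of 4 bytes.
template <std::size_t N>
constexpr unsigned StdcallRetBytes(const unsigned (&argSizes)[N])
{
    unsigned total = 0;
    for (std::size_t i = 0; i < N; ++i)
        total += (argSizes[i] + 3) / 4 * 4;
    return total;
}

constexpr unsigned kStdcallFunction1Args[] = {4, 4, 4};  // int a, int b, int c
constexpr unsigned kMixedArgs[]            = {1, 8, 2};  // char, __int64, short
```

For three int arguments this gives 0Ch, and a char, __int64, short signature gives 16 bytes, since the char and short each still occupy a full 4-byte slot.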

Given this information, it is usually fairly easy to distinguish an __stdcall function from a __cdecl function, as a __cdecl function will never use an operand to ret.  Note that this does imply, however, that from the disassembly alone it is generally not possible to distinguish between an __stdcall function and a __cdecl function when both take zero arguments; in this special case, the two calling conventions have the same semantics.  It also means that if a function does not clean any bytes off the stack with ret, you’ll technically have to examine its callers to see whether any pass more than zero arguments (or examine the function implementation itself, to see whether it ever expects more than zero arguments) in order to be absolutely sure whether the function is __cdecl or __stdcall.

Here’s an example of a simple __stdcall function call for the following C function:
 

__declspec(noinline)
int __stdcall StdcallFunction1(int a, int b, int c)
{
 return (a + b) * c;
}

If we call the function like this:

StdcallFunction1(1, 2, 3);

… we can expect to see something like so, for the call:

push    3
push    2
push    1
call    StdcallFunction1

(There will be no add esp instruction after the call.)

This is quite similar to a __cdecl declared function with the same implementation.  The only difference is the lack of an add esp instruction following the call.

Looking at the function implementation, we can see that unlike the __cdecl version of this function, StdcallFunction1 removes the arguments from the stack:

StdcallFunction1 proc near

a= dword ptr  4
b= dword ptr  8
c= dword ptr  0Ch

mov     eax, [esp+8]   ; eax = b
mov     ecx, [esp+4]   ; ecx = a
add     eax, ecx       ; eax = eax + ecx
imul    eax, [esp+0Ch] ; eax = eax * c
retn    0Ch            ; (return value = eax)
StdcallFunction1 endp

As expected, the only difference here is that the __stdcall version of the function cleans the three arguments off the stack.  The function is otherwise identical to the __cdecl version, with the return value stored in eax.
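The stack bookkeeping in the two listings above can be modeled in portable C. The following sketch (all names hypothetical; it does not generate real __stdcall code) keeps a toy 32-bit stack, performs the same pushes as the call sequence, and then mimics StdcallFunction1's esp-relative loads and its retn 0Ch, showing that esp ends up back where it started without any add esp in the caller.

```c
#include <stdint.h>

#define TOY_STACK_WORDS 64

/* Toy 32-bit stack: esp is a byte offset into mem[], and the stack grows
   toward lower offsets, as on x86. */
typedef struct {
    uint32_t mem[TOY_STACK_WORDS];
    uint32_t esp;
} toy_stack;

static void push(toy_stack *s, uint32_t v)
{
    s->esp -= 4;
    s->mem[s->esp / 4] = v;
}

static uint32_t load(const toy_stack *s, uint32_t offset)   /* [esp+offset] */
{
    return s->mem[(s->esp + offset) / 4];
}

/* Mirrors the StdcallFunction1 listing, including "retn 0Ch". */
static int32_t toy_stdcall_function1(toy_stack *s)
{
    int32_t eax = (int32_t)load(s, 8);      /* mov eax, [esp+8]    ; b */
    int32_t ecx = (int32_t)load(s, 4);      /* mov ecx, [esp+4]    ; a */
    eax += ecx;                             /* add eax, ecx            */
    eax *= (int32_t)load(s, 0x0C);          /* imul eax, [esp+0Ch] ; c */
    s->esp += 4 + 0x0C;                     /* retn 0Ch: return address plus 12 bytes */
    return eax;
}

/* push 3 / push 2 / push 1 / call, with no "add esp" afterwards. Reports
   whether esp was restored by the callee alone. */
static int32_t toy_call_stdcall_function1(int *esp_restored)
{
    toy_stack s = { {0}, TOY_STACK_WORDS * 4 };
    uint32_t esp_before = s.esp;
    push(&s, 3);
    push(&s, 2);
    push(&s, 1);
    push(&s, 0x12345678u);                  /* "call" pushes a (fake) return address */
    int32_t result = toy_stdcall_function1(&s);
    *esp_restored = (s.esp == esp_before);
    return result;
}
```

The call returns (1 + 2) * 3 = 9 and reports that esp was restored, which is the whole point of callee cleanup.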

With all of this information, you should be able to rather reliably identify most __stdcall functions.  The key things to look out for are:

  • All arguments are on the stack.
  • The ret instruction terminating the function has a non-zero argument count if the number of arguments for the function is non-zero.
  • The ret instruction terminating the function has an argument count that is at least four times the number of arguments for the function.  (If the count is less than that, then the function might be a __fastcall function with three or more arguments, as the __fastcall calling convention passes the first two 32-bit or smaller arguments in registers.)
  • The function does not depend on the state of the ecx and edx volatile registers.  (If the function expects these registers to have a meaningful value initially, then the function is probably a __fastcall or __thiscall function, as __fastcall passes arguments in ecx and edx, and __thiscall passes the this pointer in ecx.) 
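The checklist above can be encoded as a rough predicate. This is only a sketch: the inputs are things you would have to establish yourself from the disassembly (the ret operand, your best estimate of the argument count, and whether the function reads ecx or edx before writing them), and the function name is hypothetical.

```c
#include <stdbool.h>

/* Hypothetical predicate mirroring the checklist above.
   ret_bytes: N from "ret N" (0 for a plain ret)
   nargs:     estimated number of 32-bit arguments
   reads_ecx_or_edx_on_entry: true if ecx/edx are read before being written */
static bool looks_like_stdcall(unsigned ret_bytes, unsigned nargs,
                               bool reads_ecx_or_edx_on_entry)
{
    if (reads_ecx_or_edx_on_entry)
        return false;           /* more likely __fastcall or __thiscall */
    if (nargs == 0)
        return ret_bytes == 0;  /* consistent, but indistinguishable from __cdecl */
    /* Non-zero, at least 4 bytes per argument, and a multiple of four
       (see footnote [2]). */
    return ret_bytes != 0 && ret_bytes >= 4u * nargs && ret_bytes % 4u == 0;
}
```

A plain ret with three arguments points toward __cdecl, while ret 4 with three arguments points toward __fastcall, matching the parenthetical notes in the list.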

In the next post in this series, I’ll cover the __fastcall calling convention (and hopefully it won’t be such a long wait this time).  Stay tuned…

 

[1]: For functions declared as __declspec(noreturn), or that otherwise never return execution control directly to the caller (e.g. a function that always throws an exception), the ret instruction is typically omitted.  There are a couple of other rare cases where you may see no terminating ret, such as when one function calls a second function with a very similar prototype (differing only in argument ordering or by an additional defaulted argument, for instance).  In this case, the compiler may combine the two functions by having the first perform minor adjustments to the stack and then “fall through” directly into the second.

[2]: If you see a function with a ret instruction that does not take a multiple of four as its argument, then the function was most likely hand-written in assembler.  The Microsoft compiler will never, to my knowledge, generate code like this (and neither should any sane Win32 compiler).

The system call dispatcher on x86

Wednesday, August 23rd, 2006

The system call dispatcher on x86 NT has undergone several revisions over the years.

Until recently, the primary method used to make system calls was the int 2e instruction (software interrupt, vector 0x2e). This is a fairly quick way to enter CPL 0 (kernel mode), and it is backwards compatible with all 32-bit capable x86 processors.

With Windows XP, the mainstream mechanism used to make system calls changed: from this point forward, the operating system selects a more optimized kernel transition mechanism based on your processor type. Pentium II and later processors will instead use the sysenter instruction, which is a more efficient mechanism for switching to CPL 0 (kernel mode), as it dispenses with some needless (in this case) overhead of usual interrupt dispatching.

How is this switch accomplished? Well, starting with Windows XP, the system service call stubs no longer hardcode a particular instruction (say, int 2e). Instead, they indirect through a field in the KUSER_SHARED_DATA block (“SystemCall”). The meaning of this field changed in Windows XP SP2 and Windows Server 2003 SP1. In prior versions, the SystemCall field held the actual code used to make the system call (filled in at runtime with the proper instructions). In XP SP2 and Srv03 SP1, in the interests of reducing system attack surface, the KUSER_SHARED_DATA region was marked non-executable, and SystemCall became a pointer to a stub residing in NTDLL (with the pointer value set at runtime, based on the processor type, to refer to an appropriate system call stub).
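The indirection described above can be sketched with an ordinary C function pointer. This is a toy model under obvious assumptions: nothing here enters the kernel, the stub functions merely return a marker identifying which path ran, and all of the names are hypothetical stand-ins for the SystemCall field, KiFastSystemCall, KiIntSystemCall, and a service stub like ZwClose.

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t (*syscall_stub_fn)(uint32_t service_number);

/* Stand-ins for the two NTDLL dispatch stubs; the high bits of the return
   value just identify which one ran (0x2E for int 2e, 0x0F34 for sysenter). */
static uint32_t toy_int2e_stub(uint32_t n)    { return 0x2E000000u | n; }
static uint32_t toy_sysenter_stub(uint32_t n) { return 0x0F340000u | n; }

/* Stand-in for KUSER_SHARED_DATA.SystemCall: a pointer chosen once at startup. */
static syscall_stub_fn toy_shared_system_call;

static void toy_init_system_call(bool cpu_supports_sysenter)
{
    toy_shared_system_call = cpu_supports_sysenter ? toy_sysenter_stub
                                                   : toy_int2e_stub;
}

/* Stand-in for ntdll!ZwClose: "mov eax, 0x1b" then "call dword ptr [edx]". */
static uint32_t toy_zw_close(void)
{
    return toy_shared_system_call(0x1b);
}
```

The service stub never changes; only the pointer selected at startup does, which is what lets a single NTDLL serve both old and new processors.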

What this means for you today is that on modern systems, you can expect to see a sequence like so for system calls:

0:001> u ntdll!NtClose
ntdll!ZwClose:
7c821138 b81b000000       mov     eax,0x1b
7c82113d ba0003fe7f       mov     edx,0x7ffe0300
7c821142 ff12             call    dword ptr [edx]
7c821144 c20400           ret     0x4
7c821147 90               nop

0x7ffe0300 is +0x300 bytes into KUSER_SHARED_DATA. Looking at the structure definition, we can see that this is “SystemCall”:

0:001> dt ntdll!_KUSER_SHARED_DATA
   +0x000 TickCountLowDeprecated : Uint4B
   +0x004 TickCountMultiplier : Uint4B
   +0x008 InterruptTime    : _KSYSTEM_TIME
   [...]
   +0x300 SystemCall       : Uint4B
   +0x304 SystemCallReturn : Uint4B
   +0x308 SystemCallPad    : [3] Uint8B
   [...]

Since my system is Srv03 SP1, SystemCall is a pointer to a stub in NTDLL.

0:001> u poi(0x7ffe0300)
ntdll!KiFastSystemCall:
7c82ed50 8bd4             mov     edx,esp
7c82ed52 0f34             sysenter
ntdll!KiFastSystemCallRet:
7c82ed54 c3               ret

On my system, the system call dispatcher is using sysenter. You can look at the old int 2e dispatcher if you wish, as it is still supported for compatibility with older processors:

0:001> u ntdll!KiIntsystemCall
ntdll!KiIntSystemCall:
7c82ed60 8d542408         lea     edx,[esp+0x8]
7c82ed64 cd2e             int     2e
7c82ed66 c3               ret

The actual calling convention used by the system call dispatcher is thus:

  • eax contains the system call ordinal.
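  • edx points to either the argument array of the system call on the stack (for int 2e), or the user-mode stack pointer at the time of the sysenter (for sysenter). In the latter case, the argument array begins 8 bytes above edx, past the return address into the system call stub and the return address into the stub’s caller; compare the lea edx,[esp+0x8] in KiIntSystemCall.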
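The offsets involved can be checked against the listings above. In the following C sketch, the stack slots are read off the ZwClose and KiIntSystemCall disassembly; the assumption that the receiving side adds 8 to edx on the sysenter path is an inference from that layout, not something taken from the kernel’s code, and all names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* 4-byte slots on the user-mode stack when KiFastSystemCall or
   KiIntSystemCall begins executing, for a call made through ZwClose. */
enum {
    SLOT_RET_TO_STUB   = 0,  /* [esp+0]: pushed by "call dword ptr [edx]" */
    SLOT_RET_TO_CALLER = 1,  /* [esp+4]: pushed when the caller called ZwClose */
    SLOT_FIRST_ARG     = 2   /* [esp+8]: first system call argument */
};

/* The value each dispatcher stub places in edx, given esp. */
static uint32_t edx_for_int2e(uint32_t esp)    { return esp + 4u * SLOT_FIRST_ARG; } /* lea edx,[esp+0x8] */
static uint32_t edx_for_sysenter(uint32_t esp) { return esp; }                       /* mov edx,esp */

/* Where the argument array starts, as the receiving side would compute it. */
static uint32_t args_from_edx(uint32_t edx, bool via_sysenter)
{
    return via_sysenter ? edx + 4u * SLOT_FIRST_ARG : edx;
}
```

Both paths arrive at the same argument address; they simply split the work of skipping the two return addresses differently.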

Most of the time, though, you probably won’t be dealing directly with the system call dispatching mechanism itself. If you are, however, now you know how it works.

You might be using unhandled exception filters without even knowing it.

Friday, August 18th, 2006

In a previous posting, I discussed some of the pitfalls of unhandled exception filters (and how they can become a security problem for your application). I mentioned some guidelines you can use to help work around these problems and minimize the risk, but, as I alluded to earlier, the problem is actually worse than it might appear on the surface.

The real gotcha about unhandled exception filters is that you have probably used them before in programs or DLLs without even knowing it, which makes it very hard to avoid using them in dangerous situations. How can this be, you might ask? Well, it turns out that the Microsoft C runtime library uses an unhandled exception filter to catch unhandled C++ exceptions and call the terminate handler registered by set_terminate.

This unhandled exception filter is set up by the internal CRT function _cinit (via _initterm_e). If you have the CRT source handy, this lives in crt0dat.c. The call looks like this:

/*
* do initializations
*/
initret = _initterm_e( __xi_a, __xi_z );

Here, “__xi_a” and “__xi_z” define the bounds of an array of function pointers to initializers called during the CRT’s initialization. This array includes a pointer to a function (_CxxSetUnhandledExceptionFilter) that sets up the unhandled exception filter for C++ exceptions. Unfortunately, the source code for the function that sets up _CxxUnhandledExceptionFilter is not included, but you can find it by looking at the CRT in a disassembler.

push    offset CxxUnhandledExceptionFilter
call    SetUnhandledExceptionFilter
mov     lpTopLevelExceptionFilter, eax
xor     eax, eax
retn

This is pretty standard; it is just saving away the old exception filter and registering its new exception filter. The unhandled exception filter itself checks for a C++ exception – if found, it calls terminate, otherwise it tries to verify that the previous exception filter points to executable code, and if so, it will call it.

push    esi
mov     esi, [esp+arg_0]
mov     eax, [esi]
cmp     dword ptr [eax], 0E06D7363h
jnz     short not_cpp_except
cmp     dword ptr [eax+10h], 3
jnz     short not_cpp_except
mov     eax, [eax+14h]
cmp     eax, 19930520h
jz      short is_cpp_except
cmp     eax, 19930521h
jnz     short not_cpp_except 

is_cpp_except:
call    terminate

not_cpp_except:
mov     eax, lpTopLevelExceptionFilter
test    eax, eax
jz      short old_filter_unloaded
push    eax             ; lpfn
call    _ValidateExecute
test    eax, eax
pop     ecx
jz      short old_filter_unloaded
push    esi
call    lpTopLevelExceptionFilter
jmp     short done

old_filter_unloaded:
xor     eax, eax

done:
pop     esi
retn    4
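For readability, here is a rough C rendering of the filter listing above. The structures are simplified stand-ins (toy_*) for EXCEPTION_POINTERS and EXCEPTION_RECORD that model only the fields the listing touches, the flags standing in for terminate and _ValidateExecute are hypothetical, and the field offsets noted in the comments apply to 32-bit x86 only.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t  ExceptionCode;            /* [eax]     on x86 */
    uint32_t  ExceptionFlags;
    void     *ExceptionRecord;
    void     *ExceptionAddress;
    uint32_t  NumberParameters;         /* [eax+10h] on x86 */
    uintptr_t ExceptionInformation[15]; /* [eax+14h] on x86 */
} toy_exception_record;

typedef struct { toy_exception_record *ExceptionRecord; } toy_exception_pointers;
typedef long (*toy_filter_fn)(toy_exception_pointers *);

#define CPP_EXCEPTION_CODE 0xE06D7363u  /* 0xE0 followed by 'm', 's', 'c' */

static toy_filter_fn previous_filter;       /* lpTopLevelExceptionFilter */
static int previous_looks_executable = 1;   /* what _ValidateExecute would report */
static int terminate_called;                /* stands in for calling terminate() */

static long toy_cxx_unhandled_exception_filter(toy_exception_pointers *ep)
{
    const toy_exception_record *er = ep->ExceptionRecord;   /* mov eax, [esi] */
    if (er->ExceptionCode == CPP_EXCEPTION_CODE &&
        er->NumberParameters == 3 &&
        (er->ExceptionInformation[0] == 0x19930520u ||
         er->ExceptionInformation[0] == 0x19930521u)) {
        terminate_called = 1;   /* the real code calls terminate(), which does not return */
        return 0;
    }
    /* Not a C++ exception: chain to the saved filter if it still appears to
       point at executable code. */
    if (previous_filter != NULL && previous_looks_executable)
        return previous_filter(ep);
    return 0;                   /* EXCEPTION_CONTINUE_SEARCH */
}

/* A stand-in for whatever filter was registered before the CRT's. */
static long sample_previous_filter(toy_exception_pointers *ep)
{
    (void)ep;
    return 1;                   /* EXCEPTION_EXECUTE_HANDLER */
}
```

The weak spot is the previous_looks_executable check: it can only say that the address is executable, not that it still belongs to the DLL that registered it.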

The problem with the latter validation is that there is no way to tell whether the code is part of a legitimate DLL, or part of the heap or some other allocation that now occupies the address range of a previously unloaded DLL, which is where the security risk is introduced.

So, we have established that the CRT potentially does bad things by installing an unhandled exception filter – so what? Well, if you link to the DLL version of the CRT, you are probably fine. The CRT DLL is unlikely to be unloaded during the process lifetime and will only be initialized once.

The kicker is if you linked to the static (non-DLL) version of the CRT. This is where things start to get dicey. Each image linked to the static version of the CRT has its own copy of _cinit, its own copy of _CxxSetUnhandledExceptionFilter, its own copy of _CxxUnhandledExceptionFilter, and so forth. What this boils down to is that every image linked to the static version of the Microsoft C runtime installs an unhandled exception filter.

So, if you have a DLL (say, one that hosts an ActiveX object) which links to the static CRT (which is pretty attractive, as for plugin-type DLLs you don’t want to have to write a separate installer to ensure that end users have that cumbersome msvcr80.dll), then you’re in trouble. Since this is an especially common scenario, you have probably ended up using an unhandled exception filter without knowing it (and probably without realizing the implications of doing so), simply by making an ActiveX control usable by Internet Explorer, for example.

This really turns into a worst-case scenario when it comes to DLLs that host ActiveX objects. These are DLLs that are going to be frequently loaded and unloaded, are controllable by untrusted script, and are very likely to link to the static CRT to avoid the headache of managing installation of the DLL CRT. If you put all of these things together and throw in any kind of crash bug, you’ve got a recipe for remote code execution. What is even worse is that this can’t be quick-fixed with a patch to the CRT, as the vulnerable CRT code is compiled into your binaries rather than living in its own hotfixable standalone DLL.
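The chaining hazard with multiple static-CRT images can be modeled in a few lines of C. In this toy sketch (all names hypothetical), each loaded image saves the current top-level filter and installs its own, as the _CxxSetUnhandledExceptionFilter listing does, and nothing restores the previous filter when an image is unloaded. Filters are represented by the index of the image whose code they live in, and the model assumes each image is loaded at most once.

```c
#include <stdbool.h>

#define MAX_IMAGES 8
#define SYSTEM_DEFAULT_FILTER (-1)

typedef struct {
    bool loaded;
    int  saved_previous;   /* this image's own copy of lpTopLevelExceptionFilter */
} toy_image;

static toy_image images[MAX_IMAGES];
static int top_level_filter = SYSTEM_DEFAULT_FILTER;

/* Like the CRT initializer: old = SetUnhandledExceptionFilter(mine). */
static void load_image(int i)
{
    images[i].loaded = true;
    images[i].saved_previous = top_level_filter;
    top_level_filter = i;
}

/* Unloading does not touch the filter chain at all. */
static void unload_image(int i)
{
    images[i].loaded = false;
}

/* True if any filter reachable from the top of the chain lives in an
   image that is no longer loaded. */
static bool chain_has_dangling_filter(void)
{
    for (int f = top_level_filter; f != SYSTEM_DEFAULT_FILTER; f = images[f].saved_previous)
        if (!images[f].loaded)
            return true;
    return false;
}
```

Loading plugin 0 and then plugin 1, and then unloading plugin 0, leaves plugin 1’s saved filter pointing into plugin 0’s old address range, which is exactly the situation the validation in the filter cannot detect.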

So, in order to be truly safe from the dangers of unhandled exception filters, you also need to rid your programs of the static CRT. Yes, it does make setup more of a pain, but the DLL CRT is superior in many ways (not to mention that it doesn’t suffer from this security problem!).

Win32 calling conventions: __cdecl in assembler

Thursday, August 17th, 2006

Continuing on the series about Win32 calling conventions, the next topic of discussion is how the various calling conventions look from an assembler level.

This is useful to know for a variety of reasons; if you are reverse engineering (or debugging) something, one of the first steps is figuring out the calling convention for a function you are working with, so that you know how to find its arguments, how it deals with the stack, and so forth.

For this post, I’ll concentrate primarily on __cdecl.  Future posts will cover the other major calling conventions.

As I have previously described, __cdecl is an entirely stack based calling convention, in which arguments are cleaned off the stack by the caller.  Given this, you can expect to see all of the arguments for a function placed onto the stack before a function call is made.  If you are using CL, then this is almost always done by using the “push” instruction to place arguments on the stack.

Consider the following simple example function:

__declspec(noinline)
int __cdecl CdeclFunction1(int a, int b, int c)
{
 return (a + b) * c;
}

First, we’ll take a look at what calls to a __cdecl function look like. For example, if we look at a call to the function described above like so:

CdeclFunction1(1, 2, 3);

… we’ll see something like this:

; 119  : 	int v = CdeclFunction1(1, 2, 3);

  00000	6a 03		 push	 3
  00002	6a 02		 push	 2
  00004	6a 01		 push	 1
  00006	e8 00 00 00 00	 call	 CdeclFunction1
  0000b	83 c4 0c	 add	 esp, 12

There are basically three different things going on here.

  1. Setting up arguments for the target function. This is what the three different “push” instructions do. Note that the arguments are pushed in reverse order – you’ll always see them in reverse order if they are placed on the stack via push.
  2. Making the actual function call itself. After all the arguments are in place, the “call” instruction is used to transfer execution to the target. Remember that on x86, the call instruction implicitly pushes the return address on the stack. After the function call returns, the return value of the function is stored in the eax (or edx:eax) registers, typically.
  3. Cleaning arguments off the stack after the function returns. This is the purpose of the “add esp, 0xc” instruction following the “call” instruction. Since the target function does not adjust the stack to remove arguments after the call, this is up to the caller. Sometimes, you may see several __cdecl function calls made in rapid succession, with the compiler cleaning arguments from the stack only after all of the function calls have been made (turning many different “add esp” instructions into just one).
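Step 3, including the deferred form, amounts to simple arithmetic on esp. The following C sketch (hypothetical names; it tracks only the value of esp) models two back-to-back calls to a three-argument __cdecl function followed by a single combined add esp, 24.

```c
#include <stdint.h>

typedef struct { uint32_t esp; } toy_cpu;

static void push_args(toy_cpu *c, unsigned nargs) { c->esp -= 4u * nargs; }

/* call + plain retn: the return address is pushed and popped again, so the
   net effect on esp is zero and the arguments stay on the stack. */
static void call_cdecl(toy_cpu *c)
{
    c->esp -= 4;   /* call pushes the return address */
    c->esp += 4;   /* retn pops it, and nothing else */
}

static void add_esp(toy_cpu *c, uint32_t n) { c->esp += n; }

/* Two calls, one deferred cleanup. Returns the final esp and reports esp
   just before the cleanup instruction. */
static uint32_t deferred_cleanup_demo(uint32_t start_esp, uint32_t *esp_before_cleanup)
{
    toy_cpu c = { start_esp };
    push_args(&c, 3); call_cdecl(&c);
    push_args(&c, 3); call_cdecl(&c);
    *esp_before_cleanup = c.esp;   /* 24 bytes of arguments still on the stack */
    add_esp(&c, 24);               /* one "add esp, 24" for both calls */
    return c.esp;
}
```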

It is also worth looking at the implementation of the function to see what it does with the arguments passed in and how it sets up a return value. The assembler for CdeclFunction1 is as follows:

CdeclFunction1 proc near

a= dword ptr  4
b= dword ptr  8
c= dword ptr  0Ch

mov     eax, [esp+8]    ; eax = b
mov     ecx, [esp+4]    ; ecx = a
add     eax, ecx        ; eax = eax + ecx
imul    eax, [esp+0Ch]  ; eax = eax * c
retn                    ; (return value = eax)
CdeclFunction1 endp

This function is fairly straightforward. Since __cdecl is stack based for argument passing, all of the parameters are on the stack. Recall that the “call” instruction pushes the return address onto the stack, so the stack will begin with the return address at [esp+0] and have the first argument at [esp+4]. A graphical view of the stack layout (relative to “esp”) of this function is thus:

+00000000  r              db 4 dup(?)      ; (Return address)
+00000004 a               dd ?
+00000008 b               dd ?
+0000000C c               dd ?
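The same layout can be written down as a C structure, which is sometimes a handy way to annotate a frame while reverse engineering. This is a sketch with a hypothetical type name; it assumes 32-bit slots and no padding, which holds for these all-4-byte fields.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of the stack at entry to CdeclFunction1, relative to esp. */
struct cdecl_function1_frame {
    uint32_t return_address;  /* [esp+0]   pushed by "call" */
    int32_t  a;               /* [esp+4]   */
    int32_t  b;               /* [esp+8]   */
    int32_t  c;               /* [esp+0Ch] */
};

/* The listing's computation, expressed against the frame. */
static int32_t cdecl_function1_from_frame(const struct cdecl_function1_frame *f)
{
    return (f->b + f->a) * f->c;   /* add eax, ecx / imul eax, [esp+0Ch] */
}
```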

In this case, there is no frame pointer in use, so the function accesses all of the arguments directly relative to “esp”. The steps taken are:

  1. The function fills eax with the value of the second argument (b), located at [esp+8] according to our stack layout.
  2. Next, the function loads ecx with the value of the first argument (a), which is located at [esp+4].
  3. Next, the function adds to eax the value of the first argument (a), now stored in the ecx register.
  4. Finally, the function multiplies eax by the value of the third argument (c), located at [esp+0Ch].
  5. After finishing all of the computations needed to implement the function, it simply returns with a “retn” instruction. Since the caller cleans the stack, the “retn <displacement>” form (with a stack adjustment) is not used here; __cdecl functions never use “retn <displacement>”, only “retn”. Additionally, because the result of the “imul” instruction happens to be stored in the eax register here, no extra instructions are needed to set up the return value, as it is already in the return value register (eax) at the end of the function.

Most __cdecl function calls are very similar to the one discussed above, although there will typically be much more code to the actual function and the function call (if there are many arguments), and the compiler may play some optimization tricks (such as deferring cleaning the stack across several function calls). The basic things to look for with a __cdecl function are:

  • All arguments are on the stack.
  • The return instruction is “retn” and not “retn <displacement>”, even when there is a non-zero number of arguments.
  • Shortly after the function call returns, the caller cleans the stack of arguments pushed. This may be deferred later, depending on how the compiler assembled the caller.

If you have been paying attention, you’ve probably noticed that, given the above criteria, a __cdecl function with zero arguments will look identical to an __stdcall function with zero arguments. If you don’t have symbols or decorated function names, there is no way to tell the two apart when there are no arguments, as the semantics are the same in that special case.

That’s all for a basic overview of __cdecl from an assembler perspective. Next time: more on the other calling conventions at an assembly level.

Win32 calling conventions: Usage cases

Tuesday, August 15th, 2006

Last time, I talked about some of the general concepts behind the varying calling conventions in use on Win32 (x86).  This posting focuses on the implications and usage cases behind each of the calling conventions, in an effort to provide a better understanding as to when you’ll see them used.

When looking at the different calling conventions, we can see that there are a number of differences between them: stack usage vs register parameters, caller vs callee cleans stack, member function calls vs “plain C” function calls, and so forth.  These differences make each calling convention best suited to specific cases.

To begin with, consider the __cdecl calling convention.  We know that it is a stack based calling convention where the caller cleans the stack.  Furthermore, we know that it is the default calling convention for CL (big hint as to when you’ll see this being used!).  These attributes make it well suited for a couple of cases:

  • Variadic functions, or functions with an ellipsis (…) terminating the argument list.  These functions take a variable number of arguments, which is not known when the callee is compiled.  __cdecl is useful for these functions because a callee-cleans convention would require the callee to know how many bytes of arguments to remove from the stack when it returns, and it cannot know how many arguments there are.  By leaving argument disposal up to the caller, which does know the number of arguments at compile time, the compiler doesn’t need special help from the programmer to correctly adjust the stack when the variadic function returns; it “just works”.  __cdecl is the only calling convention on Win32 x86 that supports variadic functions when used with CL.
  • Old-style C functions without prototypes.  For compatibility with legacy C code, the C compiler needs to support making function calls to unprototyped functions.  These must be treated as if they were variadic functions, because the compiler doesn’t know whether the function takes a fixed number of arguments or not (because there is no prototyped argument list).
  • Any other case where the programmer does not explicitly override the calling convention.  The default Visual Studio build environment will use the compiler’s default calling convention if you do not explicitly tell it otherwise, which is __cdecl.  Some build environments (the DDK/build.exe platform in particular) default to different calling conventions, but Visual Studio built programs will always default to __cdecl if you are using CL.
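A minimal portable C example of the variadic case: the callee below learns how many arguments it received only from its own count parameter, so it has no compile-time knowledge of how many bytes its caller pushed. The function is a hypothetical example, not part of any library.

```c
#include <stdarg.h>

/* Sums "count" int arguments. Only the caller knows at compile time how
   many arguments it actually passed. */
static int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;
    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);
    return total;
}
```

A call such as sum_ints(3, 1, 2, 3) pushes four arguments and sum_ints(0) pushes one; with caller cleanup, each call site removes exactly what it pushed.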

Next, we’ll take a look at __stdcall.  This calling convention is the standard for Win32 APIs; virtually all system APIs are __stdcall (typically decorated as “WINAPI”, “NTAPI”, or “CALLBACK” in the headers, which are macros that expand to __stdcall).  Here are the typical usage cases for __stdcall (and the “why” behind them):

  • Library functions.  Excepting the C runtime libraries, virtually all Microsoft-shipped Win32 libraries use __stdcall.  The main reason for this (if you discount “that’s the way it has always been”) is that you save some instruction code space by using __stdcall and not __cdecl for library functions.  The reason for this is that for __cdecl functions, the caller typically needs to adjust the stack pointer after every “call” instruction to a __cdecl function (which takes up instruction code space, typically an “add esp, imm8” opcode).  For __stdcall functions, you only pay this penalty once, in the “retn imm16” opcode at the end of the function (as opposed to once for every caller).  For frequently called functions (say, ReadFile), this begins to add up.  You also theoretically save a bit of processor time and cache space, as there is one less instruction to be executed per “call”.
  • COM functions.  COM uses __stdcall with the “this” pointer being the first argument, which is a required part of the COM API contract for publicly accessible functions.
  • Functions that need to be called from a language other than C/C++.  This also ties back into the COM and library function purposes, but of all of the calling conventions discussed here, only __stdcall has practically universal support among non-Microsoft or non-C/C++ compilers for x86 Win32 (such as Visual Basic, or Delphi).  As a result, it is advantageous to use __stdcall if you are expecting to be called from other languages.
  • Microsoft-built programs.  Microsoft defaults its programs to __stdcall and not __cdecl virtually everywhere, even in images that don’t export functions, or in internal, non-exported functions within a system library.  This also applies to Microsoft kernel mode code, such as the HAL and the kernel itself.
  • Programs built with the DDK.  The DDK defaults to __stdcall and not __cdecl.
  • NT kernel drivers.  These are always (or at least should always be!) built with the DDK, which again, defaults to __stdcall.
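The code-size argument in the “Library functions” bullet can be roughed out with instruction encodings that appear in this series’ listings: add esp, 12 assembles to 83 C4 0C (3 bytes) at every call site, while ret 4 assembles to C2 04 00 (3 bytes) instead of a 1-byte C3, once per function. The sketch below ignores deferred cleanup and argument totals above 127 bytes (which would need a longer encoding of add esp); the function names are hypothetical.

```c
/* Extra instruction bytes spent on argument cleanup for one function with
   the given number of call sites, under the assumptions stated above. */
static unsigned cdecl_cleanup_bytes(unsigned call_sites)
{
    return 3u * call_sites;     /* one 3-byte "add esp, imm8" per call site */
}

static unsigned stdcall_cleanup_bytes(unsigned call_sites)
{
    (void)call_sites;           /* paid once, regardless of caller count */
    return 3u - 1u;             /* "ret imm16" (3 bytes) instead of "ret" (1 byte) */
}
```

With 100 call sites, that works out to roughly 300 bytes for __cdecl versus 2 bytes for __stdcall.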

There yet remains __fastcall to discuss.  This calling convention is not used as extensively as the other two (no Microsoft build environment that I am aware of defaults to it), so most of the cases for it being used are the result of a programmer explicitly requesting it.

  • Functions that do not call other functions (“leaf functions”).  These are good candidates for __fastcall because the register arguments are passed in volatile registers.  A __fastcall function that calls subfunctions and needs its arguments across those calls pays a penalty, as it must save its arguments somewhere nonvolatile (i.e. the stack), which defeats the whole purpose of __fastcall.
  • Functions that do not use their arguments after the first subfunction call.  These can still benefit from __fastcall without the penalty mentioned above relating to preserving arguments across function calls.
  • Short functions that call other functions and then return.  If you can make both functions __fastcall, then sometimes the compiler can be clever and not need to re-load the argument registers when a __fastcall function calls a __fastcall subfunction.  This can be useful for “wrapper” functions in some cases.
  • Functions that interface with assembly code.  Sometimes it can be more convenient to make a C function called by assembler code __fastcall, because this can save you the work of manually tracking stack displacements.  It can also sometimes be more convenient to make assembler functions called by C code __fastcall as well, for similar reasons.

In general, __fastcall is relatively rare.  There are a couple of kernel functions that use it (KfRaiseIrql, for instance).  A couple of software vendors (such as Blizzard Entertainment) seem to like to ship things compiled with __fastcall as the default calling convention, but this is not the common case, and usually not a good idea.

Finally, there is __thiscall.  This calling convention is used only for member functions that use the default calling convention.  Note that for member functions that are not accessible cross-program (e.g. not exported in some way), the compiler will sometimes use ebx instead of ecx for the “this” pointer as a custom calling convention, depending on your optimization settings.

That’s all for this installment.  In the next posting, I’ll discuss what the various calling conventions look like at a low level (assembly), and what this means to you.

Win32 calling conventions: Concepts

Monday, August 14th, 2006

If you have done debugging work on Win32 (x86) for any length of time, you probably know that there are many different calling conventions.  While I have already covered the Win64 (x64) calling convention, I haven’t yet gone into details about how to work with the various calling conventions present on x86 Windows.  This miniseries is going to go into some detail relating to how the calling conventions work from the perspective of debugging and reverse engineering.

There are three major calling conventions in use in modern Win32 (on x86, anyway): __stdcall, __cdecl, and __fastcall.  All three have different characteristics that are important if you are debugging or reverse engineering something, and it is important to be able to recognize and work with functions written against all three calling conventions, even if you don’t have access to symbols.  To start off, it’s probably best to get a feel for what all of the different calling conventions do, how they work, and so forth.

Most of the calling conventions have some things in common.  All of the calling conventions have the same set of volatile registers, which are not required to be preserved across a call site: eax, ecx, and edx.  Additionally, all three calling conventions use eax for 32-bit return values, and edx:eax for 64-bit return values (high 32 bits in edx).  For large return values (e.g. functions that return a structure, and not a pointer to a structure), the caller passes a hidden argument pointing to a hidden local variable in its own stack frame that represents the large return value, and the callee fills in the return value through that pointer (that is, large return values are actually implemented as a hidden pointer parameter).
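Two of the shared rules above can be illustrated in portable C: how a 64-bit value maps onto edx:eax, and how a structure return is, in effect, rewritten as a hidden pointer parameter. The function and type names are hypothetical, and make_big_lowered is only a hand-written approximation of the compiler’s transformation, not its actual output.

```c
#include <stdint.h>

/* (1) A 64-bit return value split across edx:eax, high half in edx. */
static void split_edx_eax(uint64_t value, uint32_t *edx, uint32_t *eax)
{
    *edx = (uint32_t)(value >> 32);
    *eax = (uint32_t)value;
}

/* (2) A "large" structure return. */
typedef struct { int32_t values[8]; } big_result;

/* What you write: */
static big_result make_big(int32_t seed)
{
    big_result r;
    for (int i = 0; i < 8; i++)
        r.values[i] = seed + i;
    return r;
}

/* Roughly what it becomes: the caller supplies storage in its own frame,
   and the callee fills it in through the hidden pointer. */
static void make_big_lowered(big_result *hidden_ret, int32_t seed)
{
    for (int i = 0; i < 8; i++)
        hidden_ret->values[i] = seed + i;
}
```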

  • __cdecl is the default calling convention for the Microsoft C/C++ compiler.  This is an entirely stack based calling convention for parameter passing; the caller cleans the arguments off of the stack when the call returns.
  • __stdcall has the same semantics as __cdecl, except that the callee (target function) cleans the arguments off of the stack instead of the caller.
  • __fastcall passes the first two register-sized arguments in ecx and edx, with the remaining arguments passed on the stack like __stdcall.  If any stack based arguments were present, the callee cleans them off of the stack.
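The three bullets can be summarized as a small table-driven C sketch. It is a simplification under stated assumptions: every argument is taken to fit in one 4-byte slot, hidden arguments (large-return pointers, “this”) are not modeled, and the names are hypothetical.

```c
#include <stdbool.h>

typedef enum { CC_CDECL, CC_STDCALL, CC_FASTCALL } calling_convention;

typedef struct {
    unsigned register_args;    /* passed in ecx, then edx */
    unsigned stack_args;       /* pushed right to left */
    unsigned callee_ret_bytes; /* N in "ret N" (0 means a plain ret) */
    bool     caller_cleans;    /* caller issues "add esp, ..." */
} arg_placement;

static arg_placement place_args(calling_convention cc, unsigned nargs)
{
    arg_placement p = { 0, nargs, 0, false };
    switch (cc) {
    case CC_CDECL:             /* all on the stack, caller cleans */
        p.caller_cleans = nargs > 0;
        break;
    case CC_STDCALL:           /* all on the stack, callee cleans */
        p.callee_ret_bytes = 4u * nargs;
        break;
    case CC_FASTCALL:          /* first two in ecx/edx, callee cleans the rest */
        p.register_args = nargs < 2 ? nargs : 2;
        p.stack_args = nargs - p.register_args;
        p.callee_ret_bytes = 4u * p.stack_args;
        break;
    }
    return p;
}
```

For three arguments, __fastcall ends in ret 4, __stdcall in ret 0Ch, and __cdecl in a plain ret followed by add esp in the caller.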

Additionally, there is a fourth calling convention implemented by CL, known as __thiscall, which is like __stdcall except that there is a hidden pointer argument representing a class object (“this” in C++) that is passed in ecx.  As you might imagine, __thiscall is used for non-static class member function calls (by default).  If you explicitly override the calling convention of member functions, then “this” becomes a hidden (first) argument that is passed according to the conventions of the overriding calling convention.  In general, __thiscall is not really used in the Windows API.
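Written in plain C, a non-static member function is just a function whose object pointer has become an explicit first parameter. Portable C cannot request __thiscall’s use of ecx, so the following sketch (with a hypothetical counter type) shows only the hidden-argument idea, not the register assignment.

```c
#include <stdint.h>

typedef struct { int32_t value; } counter;

/* C++ equivalent: int32_t counter::add(int32_t n) { value += n; return value; }
   Under __thiscall, "self" would arrive in ecx and n on the stack. */
static int32_t counter_add(counter *self, int32_t n)
{
    self->value += n;
    return self->value;
}
```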

That’s a general overview of the basic concepts behind all of the major Win32 x86 calling conventions.  In some upcoming posts, I’ll explore the implications behind the different attributes of these calling conventions, their usage cases, and how they apply to you when you are debugging or reverse engineering something.