Win32 calling conventions: __fastcall in assembler

The __fastcall calling convention is the last major C-supported Win32 (x86) calling convention that I have not covered yet. (There still exists __thiscall, which I’ll discuss later.)

__fastcall is, as you might guess from the name, a calling convention that is designed for speed. In this spirit, it attempts to borrow from many RISC calling conventions in that it tries to be register-based instead of stack-based. Unfortunately, for all but the smallest or simplest functions, __fastcall typically does not end up being a particularly stellar performer on x86, primarily due to the (comparatively) extremely limited register set that x86 sports.

This calling convention has a great deal in common with the x64 calling convention that Win64 uses. In fact, aside from the x64-specific parts of the x64 calling convention, you can think of the x64 calling convention as a logical extension of __fastcall that is designed to take advantage of the expanded register set available with x64 processors.

What this boils down to is that __fastcall will try to pass the first two pointer-sized (or smaller) arguments in the ecx and edx registers. Any additional arguments are passed on the stack as per __stdcall.
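As a rough illustration of that rule, here is a minimal sketch in C that models where each argument ends up. The names assign_fastcall_locations and arg_location are made up for this example, and the model only covers integer and pointer arguments as the Microsoft compiler treats them: scanning left to right, the first two arguments of 4 bytes or fewer get ecx and edx, and everything else (including anything wider than 4 bytes) goes on the stack.

```c
#include <stddef.h>

/* Hypothetical model, for illustration only. */
enum arg_location { ARG_IN_ECX, ARG_IN_EDX, ARG_ON_STACK };

/*
 * Walks the argument list left to right. The first two arguments that
 * fit in a 32-bit register are assigned ecx and then edx; all others
 * are marked as stack arguments. Assumes integer/pointer arguments only.
 */
void assign_fastcall_locations(const size_t *arg_sizes, size_t count,
                               enum arg_location *out)
{
    int regs_used = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        if (arg_sizes[i] <= 4 && regs_used < 2) {
            out[i] = (regs_used == 0) ? ARG_IN_ECX : ARG_IN_EDX;
            regs_used++;
        } else {
            out[i] = ARG_ON_STACK;
        }
    }
}
```

For three int arguments this model yields ecx, edx, stack. If the first argument is an 8-byte value, it goes to the stack and the next two small arguments still get the registers.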

In practice, the key things to look out for with a __fastcall function are thus:

  • The callee assumes a meaningful value in the ecx (or edx and ecx) registers. This is a tell-tale sign of __fastcall (although, you may sometimes see __thiscall make use of ecx for the this pointer).
  • The caller does not clean any arguments off the stack. Only __cdecl functions have their arguments cleaned up by the caller.
  • The callee ends in a retn (args-2)*4 instruction. In general, this is the pattern that you will see with __fastcall functions that use the stack. For __fastcall functions where no stack parameters are used, the function typically ends in a ret instruction with no stack displacement argument.
  • The callee is a short function with very few arguments. These are the most likely cases where a smart programmer will use __fastcall, as otherwise, __fastcall does not tend to buy you very much over __stdcall.
  • Functions that interface directly with assembler. Having access to ecx and edx can be a handy shortcut for a C function that is being called by something that is obviously written in assembler.

Taking these into account, let’s take a look at the same sample function and function call that we have been previously dealing with in our earlier examples, this time in __fastcall.

The function that we are going to call is declared as so:

__declspec(noinline)
int __fastcall FastcallFunction1(int a, int b, int c)
{
	return (a + b) * c;
}

This is consistent with our previous examples, save that it is declared __fastcall.
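If you want to build this function with something other than the Microsoft compiler, a sketch along the following lines may help. The FASTCALL and NOINLINE macros are names invented for this example. The GCC/Clang branch assumes a 32-bit x86 target, since the fastcall attribute only has meaning there; on other targets the macro expands to nothing and the function simply uses the platform's default convention.

```c
/* FASTCALL and NOINLINE are hypothetical helper macros for this sketch. */
#if defined(_MSC_VER)
#  define NOINLINE __declspec(noinline)
#  define FASTCALL __fastcall      /* accepted but ignored for x64 targets */
#elif defined(__GNUC__)            /* also covers Clang */
#  define NOINLINE __attribute__((noinline))
#  if defined(__i386__)
#    define FASTCALL __attribute__((fastcall))
#  else
#    define FASTCALL               /* no x86 fastcall on this target */
#  endif
#else
#  define NOINLINE
#  define FASTCALL
#endif

/* Same body as the original; only the decoration is made portable. */
NOINLINE
int FASTCALL FastcallFunction1(int a, int b, int c)
{
    return (a + b) * c;
}
```

The generated code will only resemble the listings in this post when compiling for 32-bit x86 with optimizations enabled.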

The function call that we shall make is as so:

FastcallFunction1(1, 2, 3);

With this code, we can expect the function call to look something like so in assembler:

push    3                 ; push 'c' onto the stack
push    2                 ; place a constant 2 on the stack
xor     ecx, ecx          ; move 0 into 'a' (ecx)
pop     edx               ; pop 2 off the stack and into edx.
inc     ecx               ; set 'a' -- ecx to 1 (0+1)
call    FastcallFunction1 ; make the call (a=1, b=2, c=3)

This is actually a bit different than we might expect. Here, the compiler has been a bit clever and used some basic size optimizations when setting up constants in registers: the xor/inc pair and the push/pop pair each encode in 3 bytes, whereas a mov of a 32-bit immediate into a register takes 5. These optimizations are extremely common, and given how frequently they show up, you should get used to reading them as simple constant assignments to registers. In a future series, I’ll go into some more details as to common compiler optimizations like these, but that’s a tale for a different time.

Continuing with __fastcall, here’s what the implementation of FastcallFunction1 looks like in assembler:

FastcallFunction1 proc near

c= dword ptr  4

lea     eax, [ecx+edx] ; eax = a + b
imul    eax, [esp+4]   ; eax = (eax * c)
retn    4              ; return eax;
FastcallFunction1 endp

As you can see, in this particular instance, __fastcall turns out to be a big saver in terms of instructions executed by the callee (and thus its size and, to a lesser degree, its speed). This kind of benefit is usually restricted to extremely simple functions, however.

The main things, then, to consider if you are trying to identify if a function is __fastcall or not are thus:

  • Usage of the ecx (or ecx and edx) registers in the function without loading them with explicit values before-hand. This typically indicates that they are being used as argument registers, as with __fastcall.
  • The caller does not clean any arguments off the stack (no add esp instruction to clean the stack after the call). With __fastcall, the callee always cleans the arguments (if any).
  • A ret instruction (with no stack displacement argument) terminating the function, if there are two or fewer arguments that are pointer-sized or smaller. In this case, __fastcall has no stack arguments.
  • A retn (args-2)*4 instruction terminating the function, if there are three or more arguments to the function. In this case, there are stack arguments that must be cleaned off the stack via the retn instruction.
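The last two rules reduce to a small piece of arithmetic. The helper below is a hypothetical sketch (the name fastcall_retn_bytes is mine), and it assumes every argument is pointer-sized or smaller so that each one occupies a single 4-byte slot:

```c
/*
 * Returns the number of bytes a __fastcall callee removes from the stack
 * when it returns. The first two arguments travel in ecx and edx, so only
 * arguments beyond the second occupy 4-byte stack slots. A result of 0
 * corresponds to a plain ret with no displacement.
 */
unsigned fastcall_retn_bytes(unsigned arg_count)
{
    if (arg_count <= 2) {
        return 0;
    }
    return (arg_count - 2) * 4;
}
```

For the three-argument FastcallFunction1 above, this gives 4, matching the retn 4 in the listing.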

That’s all for __fastcall. More on other calling conventions next time…

3 Responses to “Win32 calling conventions: __fastcall in assembler”

  1. OJ says:

    Hi there,

    I have a few questions.

    I am currently working on a system which was written a few years ago, and the previous programmer seems to have used _fastcall wherever possible. For example, in a single C++ class of over 300 (yes, that’s 300) methods, at least 90% of them are marked as _fastcall, and many of them also have > 20 parameters. Most of these functions are non-leaf functions. I would have thought that, after reading your post, this is particularly dangerous and that we should be seeing some issues arise as a result of parameters being mangled.

    So my questions:
    * is there a difference between _fastcall and __fastcall? If so, what is it?

    * if there isn’t, why would I not be seeing any side effects? Would the compiler be doing some form of trickery to prevent it from happening?

    * does the compiler ultimately determine what is and isn’t a candidate for __fastcall/_fastcall?

    Many thanks for your time. Your blog is always a great read. Cheers!
    OJ

  2. Skywing says:

    _fastcall and __fastcall are aliases and have the same meaning to the compiler.

    The compiler should (always) still get it right, no matter which calling convention you use for a function, as long as both the implementation and everyone calling treat the function as having the same calling convention.

    However, using __fastcall for complex, non-leaf functions is likely going to be suboptimal (if not “fatal”). The called function will likely need to save the parameters to nonvolatile locations anyway before the first subfunction call, which is a bit of extra overhead. This creates a situation that is in principle worse than if the caller had provided the parameters on the stack in the first place, which would tend to make something besides __fastcall the better calling convention in that case. Many functions that Microsoft exports as fastcall in the kernel fit the profile of small leaf functions, such as KfAcquireSpinLock or InterlockedIncrement.

    That being said, in *most* cases, the actual overhead of a particular calling convention is vanishingly small compared to the time spent in a function. The exception cases tend to be extremely small or simple leaf functions. So although there is very likely going to be some performance overhead for a 20-parameter function that is not a leaf function, if it’s a non-trivial function, the overhead of the calling convention compared to the time spent running the function itself would likely be barely noticeable.

    If you turn on link time code generation (LTCG), the compiler is allowed to create optimal, custom calling conventions for functions whose scope can be proved to never escape the current module and are never called via assembler code. For instance, in such cases you may see “strange” things like the compiler passing parameters in eax, or non-standard rules on which registers are volatile or not across a call.

    Aside from the LTCG custom calling convention for internal-only functions case, however, the compiler will just stick to what you specify as far as calling conventions go. So if you don’t have LTCG enabled, you’ll get what you ask for. If you have LTCG enabled, and the function is provably never visible to anything that would require it to really have the calling convention you asked for, then the compiler may do something clever and pick a better calling convention that it designs on the fly.

    Oh, and BTW, although you didn’t say if you’re doing this, I would avoid exporting C++ class functions outside a module. The reason is that if you change any number of things (including compiler version, STL version if you are using STL (new STLs tend to not be binary compatible with old STL version classes), alignment packing, pointer to member function disposition, or various other things), member variable offsets and class sizes may shift and result in strange memory corruption problems if every single module referencing or implementing said exported / imported classes isn’t (re)compiled as a unit. This typically leads to a maintenance headache, to say the least, in my experience.

  3. OJ says:

    Thanks very much! That was very informative.

    Thankfully the codebase isn’t exporting anything at the moment, so I can nip these issues in the bud if and when the time comes. For now I’m just concerned about the stability and quality of the existing code.

    Many thanks!