Compiler optimizer tricks in x86 assembly, part 1

The compiler is often very clever about speeding up common operations in C, in ways that may not be obvious at first glance when you see how they appear in assembler. With a bit of practice, you can train yourself to quickly identify these optimizations and see what they really do. These kinds of optimizations are especially common where constant values are involved.

With many of these little “assembly tricks”, the compiler is not simply taking into account instruction speed, but also instruction size, in order to reduce the total number of opcode bytes required to perform a given operation.

This post series attempts to provide a basic overview (and rationale) of several of the most common compiler optimizations that you might see while analyzing x86 assembly code. Knowing what these “assembly tricks” do is a great benefit when debugging or reverse engineering, as it allows you to quickly recognize certain compiler patterns and gain clues as to what a program is doing.

Without further ado, here’s a brief overview of several of these optimizations:

  1. Clearing a register value by xor reg, reg is a very common sequence in x86 code generated by a compiler. You will almost never see an instruction of the form mov reg, 0. Instead, the compiler will virtually always use the above-mentioned xor construct.

    The reasoning behind this is that the xor reg, reg construct is a very small instruction in terms of opcode bytes (2 bytes), whereas assigning a 32-bit constant with mov reg, imm32 is much more expensive in terms of opcode length (5 bytes).
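
    As a minimal sketch of what this looks like in practice (the function name here is invented for illustration), consider a trivial function that returns zero:

    /*
     * A size-conscious compiler will typically place the return value
     * in eax with "xor eax, eax" (2 opcode bytes) rather than
     * "mov eax, 0" (5 opcode bytes).
     */
    int return_zero(void)
    {
        return 0;
    }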

    The gain with this optimization is reduced code size. Reducing code size is always good, and can lead to improved performance in that it improves the cacheability of a particular chunk of code (remember, most processor cache sizes are still very small compared to main system memory or hard drives). Also, if the compiler can shrink the total image size by even one page (e.g. 4096 bytes) with optimizations like this, that’s one fewer page fault that needs to be taken when loading the program. This can be noticeable even at granularities smaller than a page: if a function can be kept within one page instead of spilling over into a neighboring page (which might not otherwise be needed by the current code path), the “extraneous” page faults, where most of the data brought in is unnecessary, can be eliminated.

    (It’s worth noting that this sort of optimization has come a long way recently, in the form of profile-guided BBT-style optimizations that reorder “hot” code paths to be on the same page in an attempt to make every byte that is in-paged from disk be as relevant as possible to the current execution path.)

  2. The constant zero register is another very common optimization technique used by the compiler. Since the value zero appears so frequently in many functions (e.g. default parameter values are often zero, and it is very common to compare values against zero or to set values to zero), the compiler will sometimes dedicate an entire register to holding the value zero throughout most of a function. Here’s an example, taken from nt!MmZeroPageThread:

    xor     esi, esi
    cmp     [ebp+var_14], esi
    [...]
    push    esi             ; WaitBlockArray
    push    esi             ; Timeout
    push    esi             ; Alertable
    push    esi             ; WaitMode
    push    8               ; WaitReason
    xor     ebx, ebx
    inc     ebx
    push    ebx             ; WaitType
    lea     eax, [ebp+Object]
    push    eax             ; Object
    push    2               ; Count
    call    _KeWaitForMultipleObjects@32

    Here, the compiler has dedicated the “esi” register to hold the constant zero value for this function. It is used in many places; in just this small part of nt!MmZeroPageThread, we can see it being used both as an operand to the “cmp” instruction for a test against zero, and to supply many of the constant-zero parameter values for a call to KeWaitForMultipleObjects.

    The gain from using a constant zero register is typically reduced code size. In x86, it often takes fewer opcode bytes to reference a register operand than an immediate constant operand. The compiler takes advantage of this fact by “paying the price” of materializing the value zero only once, storing it into a register; thereafter, it can simply refer to that register, for smaller overall code size, whenever references to constant zero are common enough. For example, in the above code fragment, the compiler saves one opcode byte for each push esi instruction over the corresponding push 0 instruction.
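
    To make the savings concrete, here is a contrived sketch (the function names are invented; this is not the actual KeWaitForMultipleObjects signature) of a call site with several constant-zero arguments:

    /* Hypothetical callee taking several parameters, most of which
       happen to be zero at the call site below. */
    static int do_wait(int block_array, int timeout, int alertable,
                       int mode, int reason)
    {
        return block_array + timeout + alertable + mode + reason;
    }

    int call_site(void)
    {
        /* A compiler optimizing for size may emit "xor esi, esi"
           once, then a 1-byte "push esi" for each zero argument
           instead of a 2-byte "push 0". */
        return do_wait(0, 0, 0, 0, 8);
    }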

  3. Fast multiplication or addition with the lea instruction is another optimization that one is likely to run into fairly frequently. The lea instruction (load effective address) was originally intended for use with pointer math, but it turns out to be quite useful for doing fast constant math on non-pointer values as well, in part due to the wide range of operations that can be performed with this instruction.

    Consider the following code fragment:

    mov     eax, [ebp+some_local]
    movzx   ecx, bx
    lea     eax, [eax+eax*4]
    lea     eax, [ecx+eax*2-30h]
    mov     [ebp+other_local], eax

    This instruction sequence may seem a bit convoluted at first, but it’s not too bad if we break it down into its constituent parts (and then combine it into one operation).

    First, we have the following operation:

    lea     eax, [eax+eax*4]

    This operation is equivalent to the following in C:

    other_local = some_local;
    other_local *= 5;

    Then, we’ve got the second lea operation:

    lea     eax, [ecx+eax*2-30h]

    In C, this might look like so:

    other_local = other_local * 2 + X - 0x30;

    (where X corresponds to the 16-bit value in bx, zero-extended into ecx by the movzx instruction).

    If we combine the two together, we get the following expression in C:

    other_local = some_local * 10 + X - 0x30;

    Now, the compiler could have used a combination of mul, add, and sub instructions to achieve the same effect. However, this would be more expensive in terms of instruction size, as those instructions are designed to work with values that are not known until runtime. By using the lea instruction, the compiler can take advantage of the fact that lea performs multiple operations with one instruction in order to reduce code size.
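
    For reference, here is a plausible C fragment (with invented variable and function names) that a compiler could translate into the five-instruction sequence above:

    /* some_local * 10 is computed as (some_local + some_local*4) * 2
       across the two lea instructions; the 16-bit value x is
       zero-extended (movzx) into ecx and folded in via lea's base
       register, and the constant -0x30 via its displacement field. */
    unsigned int combine(unsigned int some_local, unsigned short x)
    {
        return some_local * 10 + x - 0x30;
    }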

  4. The lea instruction is also useful for scaling array indices to their native type. Recall that in C, if you subscript an array, the subscript is performed in units of whole array elements; at the x86 instruction set level, however, the processor has no way of magically knowing the size of an array element when the programmer subscripts an array.

    For example, consider the following structure:

    0:000> dt nt!_MMPFN
       +0x000 u1               : __unnamed
       +0x004 PteAddress       : Ptr32 _MMPTE
       +0x008 u2               : __unnamed
       +0x00c u3               : __unnamed
       +0x010 OriginalPte      : _MMPTE
       +0x010 AweReferenceCount : Int4B
       +0x014 u4               : __unnamed ; 4 bytes
    

    In this case, sizeof(nt!_MMPFN) == 24 (0x18). Consider an array of _MMPFN structures like so:

    _MMPFN MmPfnDatabase[ array_size ];

    If the programmer wants to index MmPfnDatabase (i.e. retrieve a pointer to a particular _MMPFN element within the array), then the compiler needs to convert an index into a pointer to an _MMPFN structure contained within the array.

    For example, the programmer might write the following C code:

    _MMPFN* Pfn = &MmPfnDatabase[ PfnIndex ];

    At the x86 instruction set level, though, the compiler needs to convert PfnIndex into a pointer. This requires two operations: First, PfnIndex needs to be scaled by the size of an array element (i.e. multiplied by sizeof(_MMPFN)). Second, the resultant value needs to be added to the base of MmPfnDatabase to form a complete pointer value to the requested _MMPFN element.

    In order to accomplish this, the compiler might emit the following instruction sequence:

    mov     eax, [ebp+PfnIndex]
    mov     ecx, _MmPfnDatabase
    push    ebx
    mov     ebx, [ebp+arg_4]
    lea     eax, [eax+eax*2]
    push    esi
    lea     esi, [ecx+eax*8]
    

    Here, the lea instruction is used to take the PfnIndex and MmPfnDatabase values and combine them into an _MMPFN pointer (stored in “esi”). If we break down the individual operations performed, what’s going on here isn’t all that difficult to understand:

    1. The initial LEA operation is equivalent to multiplying PfnIndex by 3 (PfnIndex is stored into “eax”).
    2. The final LEA operation multiplies the result of the first LEA operation by 8 (which is to say that PfnIndex has now been multiplied by 24, conveniently equal to sizeof(_MMPFN)).
    3. Finally (also part of the last LEA operation), “ecx” (which was loaded with the base address of MmPfnDatabase) is added to the intermediate result, which is then placed into “esi”, forming a completed _MMPFN pointer.
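
    Written out explicitly in C, the scale-and-add decomposition looks like the sketch below (the stand-in structure is hypothetical; it merely mirrors the 24-byte size of _MMPFN on 32-bit builds):

    #include <stddef.h>

    /* Stand-in for a 24-byte structure (sizeof == 0x18). */
    typedef struct _MMPFN_LIKE {
        unsigned char bytes[24];
    } MMPFN_LIKE;

    MMPFN_LIKE *index_array(MMPFN_LIKE *base, size_t index)
    {
        /* Step 1: scale the index by the element size.  Since
           24 == 3 * 8, lea can compute this as (index + index*2)
           scaled by 8. */
        size_t offset = index * sizeof(MMPFN_LIKE);

        /* Step 2: add the scaled offset to the array base to form
           the element pointer (equivalent to &base[index]). */
        return (MMPFN_LIKE *)((unsigned char *)base + offset);
    }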

    As with using lea for constant math outside of array indexing, the advantage of lea over a series of mul and add operations is primarily code size, as lea allows several distinct operations (e.g. multiply/add, or multiply/subtract) to be combined into a single compact instruction. Most processors also provide very fast implementations of the lea instruction (as compared to the mul instruction for constant multiply operations).

    In order to differentiate between an array indexing operation and plain constant math using lea, I would recommend checking whether any of the input values, or the output value, is treated as a pointer. Usually, it’s fairly easy to make this determination.

    As an aside, if you are reverse engineering something, constructs like array indexing operations are very handy, as they definitively tell you the size of the structures comprising an array.

The next post in this series will continue this discussion and examine several more common (and more complicated) optimizations and assembly tricks that the compiler may emit on x86. Stay tuned…

9 Responses to “Compiler optimizer tricks in x86 assembly, part 1”

  1. bw says:

    xor eax,eax is a trick? :) c’mon man, show us something real

  2. Marc Sherman says:

    By “BBT-style optimizations”, does BBT stand for Branch Bias Table?

    thanks,
    Marc

  3. Skywing says:

    BBT refers to an optimization tool developed at Microsoft known as “Basic Block Tools”. It is used to perform profile-based optimization to optimize code page faults; see http://www.microsoft.com/windows/cse/bit_projects.mspx for some brief details.

  4. James Bomb says:

    xor ebx,ebx
    inc ebx
    push ebx

    *** in that case

    push 1

    *** would be smaller, and probably faster

  5. mikeb says:

    A comment on the “xor ebx, ebx/inc ebx/push ebx” vs. “push 1” comment:

    While the “push 1” sequence is smaller by 2 bytes, it’s probably slower, since it would result in a memory read/write sequence. Even if the processor can optimize out the read (since it’s been done for the instruction fetch), it’s likely that optimizations in the chip allow a push from a register to happen faster than a push from the instruction stream.

    Also, the processor would likely have the xor and inc instructions happen in parallel with the push instructions that precede them, so those cycles would not add to the overall time to push the ‘1’.

  6. Gavin says:

    The above statement contains an error.

    There is no memory read/write in “push 1”.

    Also, the wording “instruction stream” is in no way relevant to this example; I believe this term has to do with multithreading.

  7. mikeb says:

    Regarding the “constant zero register” optimization: the ia64 architecture has that baked in. Integer register r0 is fixed at 0, and float registers f0 and f1 are fixed at 0.0 and 1.0 respectively.

    I don’t know if other architectures do this, but it seems to me to be a good idea. It can’t take too much silicon, and the only drawback is that you lose a register that software could otherwise use. On something like x86 that would matter a lot; on something like ia64, there are a lot of regs, so losing one doesn’t hurt too much.

    Not that the ia64 is a raging success…

  8. d says:

    Basically every RISC architecture uses register 0 as a 0-register.