Compiler tricks in x86 assembly, part 2

A common operation in many C programs is a division by a constant quantity. This sort of operation occurs all over the place; in explicit mathematical calculations, and in other, more innocuous situations such as where the programmer does something like take a size in bytes and convert it into a count of structures.

This fundamental operation (division by a constant) is also one that the compiler will tend to optimize in ways that may seem a bit strange until you know what is going on.

Consider this simple C program:

int
__cdecl
wmain(
	int ac,
	wchar_t** av)
{
	unsigned long x = rand();

	return x / 1518;
}

In assembly, one would expect to see the following operations emitted by the compiler:

  1. A call to rand, to populate the ‘x’ variable.
  2. A div instruction to perform the requested division of ‘x’ by 1518.

However, if optimizations are turned on and the compiler is not configured to prioritize small code over performance exclusively, then the result is not what we might expect:

; int __cdecl main(int argc,
  const char **argv,
  const char *envp)
_main proc near
call    ds:__imp__rand
mov     ecx, eax
mov     eax, 596179C3h
mul     ecx
mov     eax, ecx
sub     eax, edx
shr     eax, 1
add     eax, edx
shr     eax, 0Ah ; return value in eax
retn

The first time you encounter something like this, you’ll probably be wondering something along the lines of “what was the compiler thinking when it did that, and how does that set of instructions end up dividing out ‘x’ by 1518?”.

What has happened here is that the compiler has used a trick sometimes known as magic number division to improve performance. In x86 (and in most processor architectures, actually), it is typically much more expensive to perform a division than a multiplication (or most any other basic mathematical primative, for that matter). While ordinarily, it is typically not possible to make division any faster than what the ‘div’ instruction gives you performance-wise, under special circumstances it is possible to do a bit better. Specifically, if the compiler knows that one of the operands to the division operation is a constant (specifically, the divisor), then it can engage some clever tricks in order to eliminate the expensive ‘div’ instruction and turn it into a sequence of other instructions that, while initially longer and more complicated on the surface, are actually faster to execute.

In this instance, the compiler has taken one ‘div’ instruction and converted it into a multiply, subtract, add, and two right shift operations. The basic idea behind how the code the compiler produced actually gives the correct result is that it typically involves a multiplication with a large constant, where the low 32-bits are then discarded and then remaining high 32-bits (recall that the ‘mul’ instruction returns the result in edx:eax, so the high 32-bits of the result are stored in edx) have some transform applied, typically one that includes fast division by a power of 2 via a right shift.

Sure enough, we can demonstrate that the above code works by manually comparing the result to a known value, and then manually performing each of the individual operations. For example, assume that in the above program, the call to rand returns 10626 (otherwise known as 7 * 1518, for simplicity, although the optimized code above works equally well for non-multiples of 1518).

Taking the above set of instructions, we perform the following operations:

  1. Multiply 10626 by the magic constant 0x596179C3. This yields 0x00000E7E00001006.
  2. Discard the low 32-bits (eax), saving only the high 32-bits of the multiplication (edx; 0xE7E, or 3710 in decimal).
  3. Subtract 3710 from the original input value, 10626, yielding 6916.
  4. Divide 6916 by 2^1 (bit shift right by 1), yielding 3458.
  5. Add 3710 to 3458, yielding 7168.
  6. Divide 7168 by 2^10 (bit shift right by 10), yielding 7, our final answer. (10626 / 1518 == 7)

Depending on the values being divided, it is possible that there may only be a right shift after the large multiply. For instance, if we change the above program to divide by 1517 instead of 1518, then we might see the following code generated instead:

mov     eax, 5666F0A5h
mul     ecx
shr     edx, 9

Coming up with the “magical divide” constants in the first place involves a bit of math; interested readers can read up about it on DevDev.

If you’re reverse engineering something that has these characteristics (large multiply that discards the low 32-bits and performs some transforms on the high 32-bits), then I would recommend breaking up each instruction into the corresponding mathematical operation (be aware of tricks like shift right to divide by a power of 2, or shift left to multiply by a power of 2). As a quick “litmus test”, you can always just pick an input and do the operations, and then simply compare the result to the input and see if it makes sense in the context of a division operation.

More on other optimizer tricks in a future posting…

One Response to “Compiler tricks in x86 assembly, part 2”

  1. bw says:

    AMD has its all described in its manuals