{"id":64,"date":"2007-02-10T13:10:33","date_gmt":"2007-02-10T18:10:33","guid":{"rendered":"http:\/\/www.nynaeve.net\/?p=64"},"modified":"2019-12-13T17:38:39","modified_gmt":"2019-12-13T22:38:39","slug":"compiler-optimizer-tricks-in-x86-assembly","status":"publish","type":"post","link":"http:\/\/www.nynaeve.net\/?p=64","title":{"rendered":"Compiler optimizer tricks in x86 assembly, part 1"},"content":{"rendered":"<p>The compiler is often quite clever about speeding up common operations in C (in terms of how they appear in assembler), in ways that might at first seem non-obvious.  With a bit of practice, you can train yourself to quickly identify these optimizations and see what they really do.  These kinds of optimizations are especially common where constant values are involved.<\/p>\n<p>With many of these little &#8220;assembly tricks&#8221;, the compiler is not simply taking into account instruction speed, but also instruction <em>size<\/em>, in order to reduce the total number of opcode bytes required to perform an operation.<\/p>\n<p>This post series attempts to provide a basic overview (and rationale) of several of the most common compiler optimizations that you might see while analyzing x86 assembly code.  Knowing what these &#8220;assembly tricks&#8221; do is a great benefit when debugging or reverse engineering, as it allows you to quickly recognize certain compiler patterns and gain clues as to what a program is doing.<\/p>\n<p>Without further ado, here&#8217;s a brief overview of several such optimizations:<\/p>\n<ol>\n<li>\n<p>Clearing a register value with xor <em>reg<\/em>, <em>reg<\/em> is a very common sequence in compiler-generated x86 code.  You will almost never see an instruction of the form mov <em>reg<\/em>, 0.  
Instead, the compiler will virtually always use the above-mentioned xor construct.<\/p>\n<p>The reasoning is that the xor <em>reg<\/em>, <em>reg<\/em> construct is very small in terms of opcode bytes (2 bytes), whereas assigning a constant 32-bit value via an immediate operand is typically much more expensive in terms of opcode length (say, 5 bytes).<\/p>\n<p>The gain with this optimization is reduced code size.  Reducing code size is always good, and can lead to improved performance in that it improves the cacheability of a particular chunk of code (remember, most processor cache sizes are still very small compared to main system memory or hard drives).  Also, if the compiler can shrink the total image size by even one page (e.g. 4096 bytes) with optimizations like this, that&#8217;s one less page fault that needs to be taken when loading the program.  This can be noticeable even at granularities smaller than a page: if a function can be kept within one page instead of spilling over onto a neighboring page (one that might not otherwise be needed for the current code path), the compiler can eliminate an &#8220;extraneous&#8221; page fault in which most of the data brought in is unnecessary.<\/p>\n<p>(It&#8217;s worth noting that this sort of optimization has come a long way recently, in the form of profile-guided, BBT-style optimizations that reorder &#8220;hot&#8221; code paths onto the same page in an attempt to make every byte paged in from disk as relevant as possible to the current execution path.)<\/p>\n<\/li>\n<li>\n<p>The constant zero register is another very common optimization technique used by the compiler.  Since the value zero appears so frequently in many functions (e.g. default parameter values are typically zero, and it is very common to compare values against zero or to initialize values to zero), the compiler will sometimes dedicate an entire register to holding the value zero throughout most of the function.  
Here&#8217;s an example, taken from <em>nt!MmZeroPageThread<\/em>:<\/p>\n<pre>xor     esi, esi\r\ncmp     [ebp+var_14], esi\r\n[...]\r\npush    esi             ; WaitBlockArray\r\npush    esi             ; Timeout\r\npush    esi             ; Alertable\r\npush    esi             ; WaitMode\r\npush    8               ; WaitReason\r\nxor     ebx, ebx\r\ninc     ebx\r\npush    ebx             ; WaitType\r\nlea     eax, [ebp+Object]\r\npush    eax             ; Object\r\npush    2               ; Count\r\ncall    _KeWaitForMultipleObjects@32<\/pre>\n<p>Here, the compiler has dedicated the &#8220;esi&#8221; register to holding the constant zero value for this function.  Even in this small fragment of nt!MmZeroPageThread, it is used both as an operand to the &#8220;cmp&#8221; instruction for a test against constant zero and to fill many constant zero parameter values for a call to <em>KeWaitForMultipleObjects<\/em>.<\/p>\n<p>The gain from using a constant zero register is, again, typically reduced code size.  In x86, it often takes fewer opcode bytes to reference a register operand than an immediate constant operand.  The compiler takes advantage of this fact by &#8220;paying the price&#8221; of encoding the value zero as an immediate operand only once, when storing it into the register; thereafter, if references to constant zero are common enough, it can simply refer to that register for smaller overall code size.  For example, in the above code fragment, the compiler saves one opcode byte per <em>push esi<\/em> instruction over the equivalent <em>push 0<\/em> instruction.\n<\/p>\n<\/li>\n<li>\n<p>Fast multiplication or addition with the <em>lea<\/em> instruction is another optimization that one is likely to run into frequently.  
The lea instruction (load effective address) was originally intended for use with pointer math, but it also turns out to be quite useful for doing fast constant math on non-pointer values, in part due to the wide range of operations that can be performed with this instruction.<\/p>\n<p>Consider the following code fragment:<\/p>\n<pre>mov     eax, [ebp+some_local]\r\nmovzx   ecx, bx\r\nlea     eax, [eax+eax*4]\r\nlea     eax, [ecx+eax*2-30h]\r\nmov     [ebp+other_local], eax<\/pre>\n<p>This instruction sequence may seem a bit convoluted at first, but it&#8217;s not too bad if we break it down into its constituent parts (and then combine it into one operation).<\/p>\n<p>We have the following operations:<\/p>\n<pre>lea     eax, [eax+eax*4]<\/pre>\n<p>This operation is equivalent to the following in C:<\/p>\n<pre>other_local = some_local;\r\nother_local *= 5;<\/pre>\n<p>Then, we&#8217;ve got the second <em>lea<\/em> operation:<\/p>\n<pre>lea     eax, [ecx+eax*2-30h]<\/pre>\n<p>In C, this might look like so:<\/p>\n<pre>other_local = other_local * 2 + X - 0x30;<\/pre>\n<p>&#8230;(where X corresponds to bx (and then ecx)).<\/p>\n<p>If we combine the two together, we get the following expression in C:<\/p>\n<pre>other_local = some_local * 10 + X - 0x30;<\/pre>\n<p>Now, the compiler could have used a combination of <em>mul<\/em>, <em>add<\/em>, and <em>sub<\/em> instructions to achieve the same effect.  However, this would be more expensive in terms of instruction size, as those instructions are designed to work with values that are not known until runtime.  By using the lea instruction, the compiler can take advantage of the fact that lea performs multiple operations in a single instruction in order to reduce code size.\n<\/p>\n<\/li>\n<li>\n<p>The lea instruction is also useful for scaling array indices to their native type.  
Recall that in C, subscripting an array operates in units of whole array elements; at the x86 instruction set level, however, the processor has no way of magically knowing the size of an array element when the programmer subscripts an array.<\/p>\n<p>For example, consider the following structure:<\/p>\n<pre>\r\n0:000&gt; dt nt!_MMPFN\r\n   +0x000 u1               : __unnamed\r\n   +0x004 PteAddress       : Ptr32 _MMPTE\r\n   +0x008 u2               : __unnamed\r\n   +0x00c u3               : __unnamed\r\n   +0x010 OriginalPte      : _MMPTE\r\n   +0x010 AweReferenceCount : Int4B\r\n   +0x014 u4               : __unnamed ; 4 bytes\r\n<\/pre>\n<p>In this case, sizeof(nt!_MMPFN) == 24 (0x18).  Consider an array of _MMPFN structures like so:<\/p>\n<p>_MMPFN MmPfnDatabase[ array_size ];<\/p>\n<p>If the programmer wants to index <em>MmPfnDatabase<\/em> (i.e. retrieve a pointer to a particular _MMPFN element within the array), then the compiler needs to convert an index into a pointer to an _MMPFN structure contained within the array.<\/p>\n<p>For example, the programmer might write the following C code:<\/p>\n<pre>_MMPFN* Pfn = &MmPfnDatabase[ PfnIndex ];<\/pre>\n<p>At the x86 instruction set level, though, the compiler needs to convert <em>PfnIndex<\/em> into a pointer.  This requires two operations: First, PfnIndex needs to be <em>scaled<\/em> by the size of an array element (that is, multiplied by sizeof(_MMPFN)).  
Second, the resultant value needs to be added to the base of <em>MmPfnDatabase<\/em> to form a complete pointer value to the requested _MMPFN element.<\/p>\n<p>In order to accomplish this, the compiler might emit the following instruction sequence:<\/p>\n<pre>mov     eax, [ebp+PfnIndex]\r\nmov     ecx, _MmPfnDatabase\r\npush    ebx\r\nmov     ebx, [ebp+arg_4]\r\nlea     eax, [eax+eax*2]\r\npush    esi\r\nlea     esi, [ecx+eax*8]\r\n<\/pre>\n<p>Here, the lea instruction is used to take the PfnIndex and MmPfnDatabase values and combine them into an _MMPFN pointer (stored in &#8220;esi&#8221;).  If we break down the individual operations performed, what&#8217;s going on here isn&#8217;t all that difficult to understand:<\/p>\n<ol>\n<li>The initial LEA operation is equivalent to multiplying PfnIndex by 3 (PfnIndex is stored in &#8220;eax&#8221;).<\/li>\n<li>The final LEA operation multiplies the result of the first LEA operation by 8, which is to say that PfnIndex has been multiplied by 24 in total, conveniently equal to sizeof(_MMPFN).<\/li>\n<li>Finally (also as part of the last LEA operation), &#8220;ecx&#8221; (which was loaded with the base address of MmPfnDatabase) is added to the intermediate result, which is then placed into &#8220;esi&#8221;, forming a completed _MMPFN pointer.<\/li>\n<\/ol>\n<p>As with constant math outside of array indexing, the advantage of using lea over a series of mul and add operations is primarily code size, as lea allows several distinct operations (e.g. multiply\/add, or multiply\/subtract) to be combined into a single compact instruction.  
Most processors also provide very fast implementations of the lea instruction (as compared to the mul instruction for constant multiply operations).<\/p>\n<p>To differentiate between an array indexing operation and plain constant math using lea, I would recommend checking whether any of the input values, or the output value, are treated as pointers.  Usually, it&#8217;s fairly easy to make this determination.<\/p>\n<p>As an aside, if you are reverse engineering something, constructs like array indexing operations are very handy, as they definitively tell you the size of the structures comprising an array.\n<\/p>\n<\/li>\n<\/ol>\n<p>The next post in this series will continue this discussion and examine several more common (and more complicated) optimizations and assembly tricks that the compiler may emit on x86.  Stay tuned&#8230;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The compiler is often very clever about speeding up some common operations in C (with how they might appear in assembler), in a way that might at first appear a bit non-obvious. With a bit of practice, you can train yourself to quickly identify these optimizations and see what they really do. 
These kinds of [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[2,8],"tags":[],"_links":{"self":[{"href":"http:\/\/www.nynaeve.net\/index.php?rest_route=\/wp\/v2\/posts\/64"}],"collection":[{"href":"http:\/\/www.nynaeve.net\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.nynaeve.net\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.nynaeve.net\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.nynaeve.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=64"}],"version-history":[{"count":1,"href":"http:\/\/www.nynaeve.net\/index.php?rest_route=\/wp\/v2\/posts\/64\/revisions"}],"predecessor-version":[{"id":606,"href":"http:\/\/www.nynaeve.net\/index.php?rest_route=\/wp\/v2\/posts\/64\/revisions\/606"}],"wp:attachment":[{"href":"http:\/\/www.nynaeve.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=64"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.nynaeve.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=64"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.nynaeve.net\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=64"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}