Don’t always trust the compiler… (or when reverse engineering comes in handy even when you’ve got source code)

Usually, when a program breaks, you look for a bug in the program. On rare occasions, however, compilers have been known to malfunction.

I ran into such a problem recently. At my apartment, I have a video streaming system set up, wherein I have a TV tuner plugged into a dedicated desktop box. That desktop box has been set up to run VLC noninteractively in order to stream (broadcast) TV from the TV tuner onto my apartment LAN. Then, if I want to watch TV, all I have to do is pull up VLC at a computer and tell it to display the MPEG stream I have configured to be broadcast on my local network.

This works fairly well (although VLC isn’t without its quirks), and it has the nice side effect of giving me a bit more flexibility as to where I want to watch TV without having to invest in extra hardware (beyond a TV tuner). Furthermore, I can even do silly things like put TV up on multiple monitors if I really wanted to, something not normally doable if you just use a “plain” TV set (the old fashioned way!).

Recently, though, one of my computers ceased being able to play the MPEG stream I was running over my network. Investigation showed that other computers on the LAN weren’t having problems displaying the stream; only this one system wouldn’t play it correctly. When I connected VLC to the stream, I’d get a black screen with no audio or video. I checked the VLC debug message log and found numerous instances of this message:

warning: received buffer in the future

Hmm. It seemed like VLC was having timing-related problems that were causing it to drop frames. My first reaction was that VLC had some brokenness related to the handling of large uptimes (the system in question had recently exceeded the “49.7 day boundary”, wherein the value returned by GetTickCount, a count in milliseconds of time elapsed since the system booted, wraps around to zero). I set out to prove this assumption by setting a breakpoint on kernel32!GetTickCount in the debugger and attaching VLC to the stream. While GetTickCount was occasionally called, it turned out that it wasn’t being used in the critical code path in question.
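(For reference, here’s where the “49.7 day boundary” comes from; this is a trivial standalone check of my own, not VLC code:)

/* GetTickCount returns a 32-bit count of milliseconds since boot,
   so it wraps after 2^32 milliseconds. */
#include <stdio.h>

int main(void)
{
    double days = 4294967296.0 / (1000.0 * 60 * 60 * 24);
    printf("%.2f days\n", days); /* prints 49.71 */
    return 0;
}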

So, I set out to find that log message in the VLC source code (VLC is open source). It turned out to be coming from a function related to audio decoding (aout_DecPlay). The relevant code is as follows (reformatting by me):

[...]
if ( p_buffer->start_date > mdate() +
     p_input->i_pts_delay           +
     AOUT_MAX_ADVANCE_TIME )
{
    msg_Warn( p_aout,
        "received buffer in the future ("I64Fd")",
        p_buffer->start_date - mdate() );
[...]

After logging this warning, the function in question drops the frame with the assumption that it is probably bogus due to bad timing information.

Clearly, there was nothing wrong with the stream itself, as I could still play it fine on other computers. In fact, neither restarting the VLC instance hosting the stream nor rebooting the computer hosting it did anything to resolve the problem; every computer could play the stream except for the one system (with a high uptime), which would always fail due to bad timing information.

In this case, it turns out that the mdate function is an internal VLC function used all over the place for high resolution timing. It returns a microsecond-precision counter that increases monotonically from the time VLC started (or, in the case of Win32 VLC, from the time Windows started). I continued to suspect that something was wrong here (as the only system that was failing to play the stream had a fairly high uptime). Looking into the source for mdate, there are two code paths that can be taken on Win32: one that uses GetTickCount for timing (and this code path does handle tick count wraparound), and another that uses QueryPerformanceCounter and QueryPerformanceFrequency for high resolution timing, if VLC thinks that the performance counter is slaved to the system timer clock. (Whether the latter is really a good thing to do on Windows at all is debatable; I would say no, but it appears to work for VLC.)

As I had already ruled out GetTickCount as being used in the timing-critical parts of VLC, I ignored the GetTickCount-related code path in mdate. This left the following segment of code in the Win32 version of mdate:

/**
 * Return high precision date
 *
 * Uses the gettimeofday() function when
 *  possible (1 MHz resolution) or the
 * ftime() function (1 kHz resolution).
 */
mtime_t mdate( void )
{
 /* We don't need the real date, just the value of
    a high precision timer */
 static mtime_t freq = I64C(-1);

 if( freq == I64C(-1) )
 {
  /* Extract from the Tcl source code:
   * (http://www.cs.man.ac.uk/fellowsd-bin/TIP/7.html)
   *
   * Some hardware abstraction layers use the CPU clock
   * in place of the real-time clock as a performance counter
   * reference.  This results in:
   * - inconsistent results among the processors on
   *   multi-processor systems.
   * - unpredictable changes in performance counter frequency
   *   on "gearshift" processors such as Transmeta and
   *   SpeedStep.
   * There seems to be no way to test whether the performance
   * counter is reliable, but a useful heuristic is that
   * if its frequency is 1.193182 MHz or 3.579545 MHz, it's
   * derived from a colorburst crystal and is therefore
   * the RTC rather than the TSC.  If it's anything else, we
   * presume that the performance counter is unreliable.
   */

  freq = ( QueryPerformanceFrequency( (LARGE_INTEGER *)&freq )
      && (freq == I64C(1193182) || freq == I64C(3579545) ) )
      ? freq : 0;
 }

 if( freq != 0 )
 {
  LARGE_INTEGER counter;
  QueryPerformanceCounter (&counter);

  /* Convert from (1/freq) to microsecond resolution */
  /* We need to split the division to avoid 63-bit
       overflow */
  lldiv_t d = lldiv (counter.QuadPart, freq);

  return (d.quot * 1000000)
    + ((d.rem * 1000000) / freq);
 }
[...]
}

This code isn’t all that hard to follow. The idea is that the first time around, mdate will check the performance counter frequency for the current system. If it is one of two magical values, then mdate will be configured to use the performance counter for timing. Otherwise, it is configured to use an alternate method (not shown here), which is based on GetTickCount. On the system in question, mdate was being set to use the performance counter and not GetTickCount.

Assuming that mdate has decided on using the system performance counter for timing purposes (which, again, I do not believe is a particularly good (portable) choice, though it does happen to work on my system), mdate simply divides the counter value by the frequency (the count of counter units per second), scaled to return a microsecond value (hence the constant 1000000 values). The reason why the original author split up the division into two parts is evident from the comment; it is an effort to avoid an integer overflow when performing math on large quantities (it avoids multiplying an already very large 64-bit value by 1000000 before the division, which might exceed 64 bits in the resultant quantity). (In case you were wondering, lldiv is a 64-bit version of the standard C runtime function ldiv; that is, it performs an integral 64-bit division with remainder.)
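To illustrate the point, here’s a small standalone program (my own sketch, not VLC code; the values are hypothetical, but in the same ballpark as the debugger session shown later) comparing the split division against the naive single-expression form:

#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int64_t counter = INT64_C(15416000000000); /* ~50 days at 3.579545 MHz */
    int64_t freq    = INT64_C(3579545);        /* colorburst-derived rate  */

    /* Split form, as in mdate: divide first, then scale the quotient and
       remainder separately; no intermediate value overflows 63 bits. */
    lldiv_t d = lldiv(counter, freq);
    int64_t split = d.quot * INT64_C(1000000)
                  + (d.rem * INT64_C(1000000)) / freq;

    /* Naive form: counter * 1000000 is roughly 2^63.7 and so overflows a
       signed 64-bit integer (formally undefined behavior in C; the code
       gcc emitted simply wraps around). */
    int64_t naive = counter * INT64_C(1000000) / freq;

    printf("split: %" PRId64 "\nnaive: %" PRId64 "\n", split, naive);
    return 0;
}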

Given this code, it would appear that mdate should be working fine. Just to be sure, though, I decided to double-check what was going on in the debugger. Although VLC was built with gcc (and thus doesn’t ship with WinDbg-compatible symbol files), mdate is a function exported by one of the core VLC DLLs (libvlc.dll), so there wasn’t any great difficulty in setting a breakpoint on it with the debugger.

What I found was that mdate was in fact returning a strange value (to be precise, a large negative value; mtime_t is a signed 64-bit integer). Given the expression used in the audio decoding snippet I listed above, it’s no surprise that things would break if mdate returned a negative value (and it’s a good assumption that other code in VLC would similarly break).

The relevant portions of the actual implementation of mdate that gcc built were as follows:

libvlc!mdate+0xe0:
62e20aa0 8d442428     lea     eax,[esp+0x28]
62e20aa4 890424       mov     [esp],eax
;
; QueryPerformanceCounter(&counter)
;
62e20aa7 e874640800   call    QueryPerformanceCounter
62e20aac 83ec04       sub     esp,0x4
62e20aaf b940420f00   mov     ecx,0xf4240 ; 1000000
62e20ab4 8b742428     mov     esi,[esp+0x28]
62e20ab8 8b7c242c     mov     edi,[esp+0x2c]
62e20abc 89f0         mov     eax,esi
62e20abe f7e1         mul     ecx
62e20ac0 89c1         mov     ecx,eax
62e20ac2 69c740420f00 imul    eax,edi,0xf4240 ; 1000000
62e20ac8 890c24       mov     [esp],ecx
62e20acb 8b3dcc7a2763 mov     edi,[freq.HighPart]
62e20ad1 8d3402       lea     esi,[edx+eax]
62e20ad4 897c240c     mov     [esp+0xc],edi
62e20ad8 8b15c87a2763 mov     edx,[freq.LowPart]
62e20ade 89742404     mov     [esp+0x4],esi
62e20ae2 89542408     mov     [esp+0x8],edx
;
; lldiv(...)
;
62e20ae6 e815983e00   call    lldiv
62e20aeb 8b5c2430     mov     ebx,[esp+0x30]
62e20aef 8b742434     mov     esi,[esp+0x34]
62e20af3 8b7c2438     mov     edi,[esp+0x38]
62e20af7 83c43c       add     esp,0x3c
62e20afa c3           ret

This bit of code might look a bit daunting at first, but it’s not too bad. Translated into C, it looks approximately like so:

LARGE_INTEGER counter, tmp;
lldiv_t d;

QueryPerformanceCounter(&counter);

/* 64-bit multiply by 1000000, performed in 32-bit halves: */
tmp.LowPart  = counter.LowPart  * 1000000;
tmp.HighPart = counter.HighPart * 1000000 +
    (((unsigned __int64)counter.LowPart  * 1000000) >> 32);

d = lldiv(tmp.QuadPart, freq);

return d.quot;

This code looks a little bit weird, though. It’s not exactly the same thing that we see in the VLC source code, even accounting for differences that might arise between original C source code and reverse engineered C source code; in the compiled code, the expression in the return statement has been moved before the call to lldiv.

In fact, the code has been heavily optimized. The compiler (gcc, in this case) apparently assumed some knowledge about the inner workings of lldiv, and decided that it would be safe to pre-scale the input value instead of performing post-calculations on the result of lldiv. At first glance, the calculations do appear to be equivalent; the compiler simply moved a multiply around relative to a division that used remainders. Indeed, if we write counter = q*freq + r (with 0 <= r < freq), then q*1000000 + (r*1000000)/freq and (counter*1000000)/freq produce the same result under exact integer arithmetic, so basic algebra tells us there isn’t anything wrong with doing this.

However, there’s one little complication: computers don’t really do “basic algebra”. In math, you typically assume an unlimited space for variables and intermediate values, but in computer-land, this isn’t the case. Computers approximate the set of all integers in a 32-bit (or 64-bit) number space, and as a result, there is a cap on how large (or small) an integer you can represent natively, at least without going to a lot of extra work to support truly arbitrarily large integers (as is often done in public key cryptography implementations).

Taking a closer look at this case, there is a problem; the optimization done by gcc causes one of the intermediate values of this calculation to grow very large. While the end result might be equivalent in pure math, when dealing with computers, the rules change a bit due to the fact that we are working with integers capped at 64 bits. Specifically, this ends up being a problem because the gcc-optimized version of mdate multiplies the raw value of “counter” by 1000000 (as opposed to multiplying the result of the first division by 1000000). Presumably, gcc performed this optimization because it eliminates the second division, and division is fairly expensive compared to multiplication.

Now, while one might naively assume that the original version of mdate and the instructions emitted by gcc are equivalent, with the above information in mind, it’s clear that this isn’t really the case for the entire range of values that might be returned by QueryPerformanceCounter. Specifically, if the counter value multiplied by 1000000 exceeds the range of a 64-bit integer, then the two versions of mdate will not return the same value, as in the optimized version, one of the intermediate values of the calculation will “wrap around”. (To make matters worse, mdate is dealing with signed 64-bit values here, which limits the magnitude of an integer to 63 significant bits, with one bit reserved for the sign.)
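Some back-of-the-envelope math (mine, not from the VLC source) shows how much uptime it takes to cross that threshold at the 3.579545 MHz counter frequency in use on this system:

#include <stdio.h>

int main(void)
{
    double limit   = 9223372036854775808.0;  /* 2^63 */
    double ticks   = limit / 1000000.0;      /* largest safe counter value */
    double seconds = ticks / 3579545.0;      /* at 3.579545 MHz */

    printf("%.1f days\n", seconds / 86400.0); /* prints 29.8 */
    return 0;
}

In other words, the optimized mdate starts returning garbage after roughly a month of uptime, which squares with the fact that only my high-uptime system exhibited the problem.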

This can be experimentally confirmed in the debugger, as I previously alluded to. Stepping through mdate, around the call to lldiv specifically, we can see that the intermediate value has exceeded the limits of a 63-bit integer with sign bit:

0:007>
eax=d5ebcb40 ebx=00369e99 ecx=6f602e40
edx=00369e99 esi=d5f158ce edi=00000000
eip=62e20ae6 esp=01d0fbfc ebp=00b73068
iopl=0         ov up ei ng nz na pe cy
cs=001b  ss=0023  ds=0023  es=0023
fs=003b  gs=0000  efl=00000a83
libvlc!mdate+0x126:
62e20ae6 e815983e00       call    lldiv
0:007> dd @esp
01d0fbfc  6f602e40 d5f158ce 00369e99 00000000
01d0fc0c  00000060 01d0ffa8 77e6b7d0 77e6bb00
01d0fc1c  ffffffff 77e6bafd 5d29c231 00000e05
01d0fc2c  00000000 00b73068 00000000 62e2a544
01d0fc3c  00000f08 ffffffff 01d0fd50 00b72fd8
01d0fc4c  03acc670 00a49530 01d0fd50 6b941bd2
01d0fc5c  00a49530 00000f30 00000000 00000003
01d0fc6c  00b24130 00b53e20 0000000f 00b244a8
0:007> p
eax=e1083396 ebx=00369e99 ecx=ffffffff
edx=ffffff3a esi=d5f158ce edi=00000000
eip=62e20aeb esp=01d0fbfc ebp=00b73068
iopl=0         nv up ei pl nz na po nc
cs=001b  ss=0023  ds=0023  es=0023
fs=003b  gs=0000  efl=00000206
libvlc!mdate+0x12b:
62e20aeb 8b5c2430         mov     ebx,[esp+0x30]

Using our knowledge of calling conventions, it’s easy to retrieve the arguments to lldiv from the stack: esp points at the cdecl argument list, so the first two dwords (6f602e40 d5f158ce) are the low and high halves of tmp.QuadPart, and the next two (00369e99 00000000) are freq. That is, tmp.QuadPart is 0xd5f158ce6f602e40, and freq is 0x0000000000369e99.

It’s clear that tmp.QuadPart has overflowed here; considering that the sign bit is set, it now holds a (very large) negative quantity. Since the remainder of the function does nothing that would influence the sign bit of the result, after the division, we get another (large, but closer to zero) negative value back (stored in edx:eax, the value 0xffffff3ae1083396). This is the final return value of mdate, which explains the problems I was experiencing with playing video streams; large negative values were being returned and causing sign-sensitive inequality tests on the return value of mdate (or a derivative thereof) to operate unexpectedly.
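As a sanity check, the division is easy to reproduce outside of VLC by feeding lldiv the same inputs captured above (a standalone snippet of my own):

#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* The lldiv arguments retrieved from the stack above (the cast is a
       two's-complement reinterpretation of the overflowed product). */
    int64_t tmp  = (int64_t)UINT64_C(0xd5f158ce6f602e40);
    int64_t freq = INT64_C(0x369e99); /* 3579545 */

    lldiv_t d = lldiv(tmp, freq);

    /* Should print the same large negative quotient that came back in
       edx:eax after the call (0xffffff3ae1083396). */
    printf("%" PRId64 " (0x%016" PRIx64 ")\n", d.quot, (uint64_t)d.quot);
    return 0;
}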

In this case, it turned out that VLC failing to play my video stream wasn’t really the fault of VLC; it ended up being a bug in gcc’s optimizer that caused it to make an unsafe optimization that introduces calculation errors. What’s particularly insidious about this mis-optimization is that it is invisible until the input values for the operations involved grow to a certain size, after which calculation results are wildly off. This explains why nobody else has run into this problem in VLC enough to get it fixed by now; unless you run VLC on a Windows system with a high uptime, where VLC is convinced that it can use the performance counter for timing, you would never know that the compiler had introduced a subtle (but serious) bug due to optimizations.

As for fixing this problem, there are a couple of approaches the VLC team could take. The first is to update to a more recent version of gcc (assuming newer gcc versions fix this problem; I don’t have a build environment that would let me compile all of VLC, and I haven’t really had much luck in generating a minimal repro for this problem, unfortunately). Alternatively, the function could be rewritten until gcc’s optimizer stops trying to optimize the division (and thus stops introducing calculation errors).

A better solution would be to just drop the usage of QueryPerformanceCounter entirely, though. For VLC’s usage, GetTickCount should be close enough timing-wise, and you can even increase its effective resolution to around 1ms (with good hardware) using timeBeginPeriod. GetTickCount does have the infamous 49.7-day wraparound problem, but VLC already has a workaround that works. Furthermore, on Windows Vista and later, GetTickCount64 could be used, turning the 49.7-day limit into a virtual non-issue (at least in our lifetimes, anyway).
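For illustration, here’s a minimal sketch of what a GetTickCount-based mdate might look like (my own sketch, not VLC’s actual workaround; it is not thread-safe as written, and it assumes it gets called at least once per 49.7-day wraparound period):

#include <windows.h>

typedef __int64 mtime_t;

mtime_t mdate_ticks(void)
{
    static DWORD last;                /* previous 32-bit reading */
    static unsigned __int64 totalms;  /* accumulated milliseconds */

    DWORD now = GetTickCount();
    totalms += (DWORD)(now - last);   /* unsigned delta is wrap-safe */
    last = now;

    return (mtime_t)(totalms * 1000); /* milliseconds -> microseconds */
}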

(Oh, and in case you’re wondering why I didn’t just fix this myself and submit a patch to VLC; after all, it’s open source, so why can’t I just “fix it myself”? VLC’s source code distribution is ~100MB uncompressed, and I don’t really want to spend a great deal of time finding a cygwin version that works correctly on Vista x64 with ASLR and NX enabled (cygwin’s fault, not Vista’s) just to get a build environment for VLC up and running to test any potential fix, after debugging the inevitable build environment difficulties along the way for such a large project. I might still do this at some point, perhaps to see if recent gcc versions fix this optimizer bug.)

For now, I’ve simply patched my running VLC instance in-memory to use GetTickCount instead, using the debugger. Until I restart VLC, that will have to do.

7 Responses to “Don’t always trust the compiler… (or when reverse engineering comes in handy even when you’ve got source code)”

  1. Zorm says:

    That’s a cool setup, but are you able to watch more than one channel at once?

    Also, you do realize that’s an insane amount of work for something so trivial? I think the majority of people would just give up and reboot the computer.

  2. Soren says:

    It seems you don’t like QueryPerformanceFrequency much. As you’re obviously much more into this subject than I am, I’m curious as to why.

  3. Addendum: (QueryPerformanceFrequency, QueryPerformanceCounter, etc.)

  4. Skywing says:

    Zorm: I refuse to bring down my systems for a periodic reboot just because of a problem with my video player. That’s about the last thing that you should need to reboot for.

    Soren: Well, here’s the thing. QueryPerformanceCounter/QueryPerformanceFrequency are *supposed* to be used for benchmarking things, not long-term timekeeping. The same things that make the processor performance counters good for very fine grained measurements also make them *bad* for accurate timekeeping over a long time, as there aren’t checks all over to make sure that counters are perfectly synchronized between processors, or to make up for time slippage when processors power down to low usage states, and so forth.

    Mechanisms like the one GetTickCount uses handle all of these special cases that would cause problems if you used the raw counters for long-term timekeeping. They also have a higher overhead as a result, but that’s okay unless you are using them for benchmarking.

    The performance counter mechanisms aren’t any good for added accuracy 99% of the time anyway, because the dispatcher’s time granularity is the same as GetTickCount’s, so all the trouble that you might go to in order to improve timer resolution by using the performance counter goes down the drain the moment you do anything that involves a kernel mode sleep (i.e., explicitly enter a wait state via Sleep, WaitForSingleObject, and the like; take a page fault; or do any number of other things). As a result, the performance counter is good in about two circumstances: 1) you’re doing benchmarking over a short period of time and need exact measurements, or 2) your program is busy looping constantly instead of relinquishing time slices. The first case is highly specialized, and any program that sits there dragging down system performance with useless busy looping is just plain bad by design.

    I don’t have a problem with the performance counters being used for what they are designed for, but they’re not a magical way to perform high resolution timing and scheduling.

  5. Excellent. That’s good and valuable insight. Thanks :]

  6. Matt says:

    Very nice debugging war story. I ran into an issue a few years ago that has a few points of similarity to yours, viz., timing-related code involving 64-bit values. In the unoptimized version, some inline assembly that executed rdtsc would save the two 32-bit words of its result to the stack, and then those results got used in further computations. The optimizer decided that it could load from one of those locations on the stack before it had done the relevant store… oops.

  7. […] examined all other possibilities. I guess I got blinded by certain optimization bugs I had read about in the past. Unfortunately, the bug also manifested itself even when the server was compiled with […]