Archive for the ‘Debugging’ Category

New WinDbg (6.7.5.0) released

Friday, April 27th, 2007

It’s finally here – WinDbg 6.7.5.0.

I haven’t gotten around to trying out all of the new goodies yet, but there are some nice additions. For one, .fnent now decodes unwind information in a more meaningful way on x64 (although it still doesn’t understand C scope table entries, making it less useful than SDbgExt’s !fnseh if that is what you were interested in).

Looks like they’ve finally gotten around to signing WinDbg.exe too (though, curiously, not the .msi the installer extracts), so the elevation prompts for WinDbg are now of the friendlier sort instead of the “this program will destroy your computer” sort.

There is also reportedly source server support for CVS included; I imagine that I’ll be taking a stab at that again now that it is supposedly fully baked.

In other news, the blog (and DNS) will be moving to a more ideal hosting location (read: not my apartment) as early as this weekend (if all goes according to plan, that is). It’ll be moving to a yummy new quad core Xeon box (with a real connection), a nice step up from the original hardware that it has been running on until a short while ago (good riddance). Crossing my fingers, but hopefully the random unavailability caused by hardware dying on me and Road Runner sucking should be going away Real Soon Now(tm).

How I ended up in the kernel debugger while trying to get PHP and Cacti working…

Saturday, April 14th, 2007

Some days, nothing seems to work properly. This is the sad story of how something as innocent as trying to install a statistics graphing Web application culminated in my breaking out the kernel debugger in an attempt to get things working. (I don’t seem to have a lot of luck with web applications. So much for the way of the future being “easy to develop/deploy/use” web-based applications…)

Recently, I decided to try installing Cacti in order to get some nice, pretty graphs describing resource utilization on several boxes at my apartment. Cacti is a PHP program that queries SNMP data and, with the help of a program called RRDTool, creates friendly historical graphs for you. It’s commonly used for monitoring things like network or processor usage over time.

In this particular instance, I was attempting to get Cacti working on a Windows Server 2003 x64 SP2 box. Running an amalgam of unix-ish programs on Windows is certainly “fun”, and doing it on native x64 is even more “interesting”. I didn’t expect to find myself in the kernel debugger while trying to get Cacti working, though…

To start out, the first thing I had to do was convert IIS6’s worker processes to 32-bit instead of 64-bit, as the standard PHP 5 distribution doesn’t support x64. (No, I don’t consider spending who knows how many hours to get PHP building on x64 natively a viable solution here, so I just decided to stick with the 32-bit release. I don’t particularly want to be in the habit of having to maintain and rebuild my own PHP distribution from a custom build environment each time security updates come out, either…).

This wasn’t too bad (at least not at first); a bit of searching revealed this KB article that documented an IIS metabase flag that you can set to turn on 32-bit worker processes (with the help of the adsutil.vbs script included in the IIS Adminscripts directory).
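
For reference, the command in question is something along these lines, run from a command prompt on the server (I’m going from memory of the KB article, so the exact property path and value may differ slightly):

cscript %SystemDrive%\Inetpub\AdminScripts\adsutil.vbs set W3SVC/AppPools/Enable32BitAppOnWin64 1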

One small snag here was that I happened to be running a symbol proxy in native x64 mode on this system already. Since the 32-bit vs 64-bit IIS worker process flag is an all-or-nothing option, I had to go install the 32-bit WinDbg distribution on this system and copy over the 32-bit symproxy.dll and symsrv.dll into %systemroot%\system32\inetsrv. Additionally, the registry settings used by the 64-bit symproxy weren’t directly accessible to the 32-bit version (due to a compatibility feature in 64-bit versions of Windows known as Registry Reflection), so I had to manually copy over the registry settings describing which symbol paths the symbol proxy used to the Wow64 version of HKLM\Software. No big deal so far, just a minor annoyance.
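
If you need to do the same thing, a single reg.exe invocation should copy the settings over in one shot (the key path below is purely a placeholder; substitute wherever your symproxy configuration actually lives, noting that the 32-bit view of HKLM\Software lives under Wow6432Node):

reg copy "HKLM\Software\<SymProxy settings key>" "HKLM\Software\Wow6432Node\<SymProxy settings key>" /s /f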

The first unexpected problem that cropped up happened after I had configured the 32-bit symbol proxy ISAPI filter and installed PHP; after I enabled 32-bit worker processes, IIS started tossing HTTP 500 Internal Server Error statuses whenever I tried to browse any site on the box. Hmm, not good…

After determining that everything was still completely broken even after disabling the symbol proxy and PHP ISAPI modules, I discovered some rather unpleasant-looking event log messages:

ISAPI Filter ‘%SystemRoot%\Microsoft.NET\Framework64\v2.0.50727\aspnet_filter.dll’ could not be loaded due to a configuration problem. The current configuration only supports loading images built for a x86 processor architecture. The data field contains the error number. To learn more about this issue, including how to troubleshooting this kind of processor architecture mismatch error, see http://go.microsoft.com/fwlink/?LinkId=29349.

It seemed that the problem was the wrong version of ASP.NET being loaded (still the x64 version). The link in the event message wasn’t all that helpful, but a bit of searching located yet another knowledge base article – this time, about how to switch back and forth between 32-bit and 64-bit versions of ASP.NET. After running aspnet_regiis as described in that article, IIS was once again in a more or less working state. Another problem down, but the worst was yet to come…
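
If memory serves, the relevant step from that article boils down to re-registering ASP.NET with the 32-bit aspnet_regiis (note the Framework directory rather than Framework64):

%SystemRoot%\Microsoft.NET\Framework\v2.0.50727\aspnet_regiis.exe -i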

With IIS working again, I turned towards configuring Cacti in IIS. Although at first it appeared as though everything might actually go as planned (after configuring Cacti’s database settings, I soon found myself at its PHP-based initial configuration page), such things were not meant to be. The first sign of trouble appeared after I completed the initial configuration page and attempted to log on with the default username and password. Doing so resulted in my being thrown back to the log on page, without any error messages. A username and password combination not matching the defaults did result in a logon failure error message, so something besides a credential failure was up.

After some digging around in the Cacti sources, it appeared that the way that Cacti tracks whether a user is logged in or not is via setting some values in the standard PHP session mechanism. Since Cacti was apparently pushing me back to the log on page as soon as I logged on, I guessed that there was probably some sort of failure with PHP’s session state management.

Rewind a bit to back when I installed PHP. In the interest of expediency (hah!), I decided to try out the Win32 installer package (as opposed to just the zip distribution for a manual install) for PHP. Typically, I’ve just installed PHP for IIS the manual way, but I figured that if they had an installer nowadays, it might be worth giving it a shot and save myself some of the tedium.

Unfortunately, it appears that PHP’s installer is not all that intelligent. It turns out that in the IIS ISAPI mode, PHP configures the system-wide PHP settings to point the session state file directory to the user-specific temp directory (i.e. pointing to a location under %userprofile%). This, obviously, isn’t going to work; anonymous users logged on to IIS aren’t going to have access to the temp directory of the account I used to install PHP with.
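
The fix amounts to a one-line php.ini change (the directory below is just an example; the important part is that the IIS worker process and anonymous accounts have read/write access to whatever directory you pick):

session.save_path = "C:\PHP\sessions"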

After manually setting up a proper location for PHP’s session state with the right security permissions (and reconfiguring php.ini to match), I tried logging in to Cacti again. This time, I actually got to the main screen after changing the password (hooray, progress!).

From here, all that I had left to do was some minor reconfiguring of the Windows SNMP service in order to allow Cacti to query it, set up the Cacti poller task job (Cacti is designed to poll data from its configured data sources at regular intervals), and configure my graphs in Cacti.

Configuring SNMP wasn’t all that difficult (though something I hadn’t done before with the Windows SNMP service), and I soon had Cacti successfully querying data over SNMP. All that was left to do was graph it, and I was home free…

Unfortunately, getting Cacti to actually graph the data turned out to be rather troublesome. In fact, I still haven’t even got it working, though I’ve at least learned a bit more about just why it isn’t working…

When I attempted to create graphs in Cacti, everything would appear to work okay, but no RRDTool datafiles would ever appear. No amount of messing with filesystem permissions resolved the problem, and the Cacti log files were not particularly helpful (even on debug severity). Additionally, attempting to edit graph properties in Cacti would result in that HTTP session mysteriously hanging forever more (definitely not a good sign). After searching around (unsuccessfully) for any possible solutions, I decided to try and take a closer look at what exactly was going on when my requests to Cacti got stuck.

Checking the process list after repeating the sequence that caused a particular Cacti session to hang several times, I found that there appeared to be a pair of cmd.exe and rrdtool.exe instances corresponding to each hung session. Hmm, it would appear that something RRDTool was doing was freezing and PHP was waiting for it… (PHP uses cmd.exe to call RRDTool, so I guessed that PHP would be waiting for cmd.exe, which would be waiting for RRDTool).

At first, I attempted to attach to one of the cmd processes with WinDbg. (Incidentally, it would appear that there are currently no symbols for the Wow64 versions of the Srv03SP2 ntdll, kernel32, user32, and a large number of other core DLLs with Wow64 builds available on the Microsoft symbol server for some reason. If any Microsoft people are reading this, it would be greaaaat if you could fix the public symbol server for Srv03 SP2 x64 Wow64 DLLs …) However, symbols for cmd.exe were fortunately available, so it was relatively easy to figure out what it was up to, and prove my earlier hypothesis that it was simply waiting on an rrdtool instance:

0:001:x86> ~1k
ChildEBP RetAddr
0012fac4 7d4d8bf1 ntdll_7d600000!NtWaitForSingleObject+0x15
0012fad8 4ad018ea KERNEL32!WaitForSingleObject+0x12
0012faec 4ad02611 cmd!WaitProc+0x18
0012fc24 4ad01a2b cmd!ExecPgm+0x3e2
0012fc58 4ad019b3 cmd!ECWork+0x84
0012fc70 4ad03c58 cmd!ExtCom+0x40
0012fe9c 4ad01447 cmd!FindFixAndRun+0xa9
0012fee0 4ad06cf6 cmd!Dispatch+0x137
0012ff44 4ad07786 cmd!main+0x108
0012ffc0 7d4e7d2a cmd!mainCRTStartup+0x12f
0012fff0 00000000 KERNEL32!BaseProcessInitPostImport+0x8d
0:001:x86> !peb
[...]
CommandLine: 'cmd.exe /c c:/progra~2/rrdtool/rrdtool.exe -'
[...]

Given this, the next logical step to investigate would be the RRDTool.exe process. Unfortunately, something really weird seemed to be going on with all the RRDTool.exe processes (naturally). WinDbg would give me an access denied error for all of the RRDTool PIDs in the F6 process list, despite my being a local machine administrator.

Attempting to attach to these processes failed as well:

Microsoft (R) Windows Debugger Version 6.6.0007.5
Copyright (c) Microsoft Corporation. All rights reserved.

Cannot debug pid 4904, NTSTATUS 0xC000010A
“An attempt was made to duplicate an object handle into or out of an exiting process.”
Debuggee initialization failed, NTSTATUS 0xC000010A
“An attempt was made to duplicate an object handle into or out of an exiting process.”

This is not something that you want to be seeing on a server box. This particular error means that the process in question is in the middle of being terminated, which prevents a debugger from successfully attaching. However, processes typically terminate in a timely fashion; in fact, it’s almost unheard of to actually see a process in the terminating state, since it happens so quickly. In this particular instance, though, the RRDTool processes were remaining in this half-dead state for what appeared to be an indefinite interval.

There are two things that commonly cause this kind of problem, both of which are related to the kernel:

  1. The disk hardware is not properly responding to I/O requests and they are hanging indefinitely. This can block a process from exiting while the operating system waits for an I/O to finish canceling or completing. Since this particular box was brand new (and with respectable, high-quality server hardware), I didn’t think that failing hardware was the cause here (or at least, I certainly hoped not!). Given that there were no errors in the event log about I/Os failing, and that I was still able to access files on my disks without issue, I decided to rule this possibility out.
  2. A driver (or other kernel mode code in the I/O stack) is buggy and is not allowing I/O requests to be canceled or completed, or has deadlocked itself and is not able to complete an I/O request. (You might be familiar with the latter if you’ve tried to use the 1394 mass storage support in Windows for a non-trivial length of time.) Given that I had tentatively ruled out bad hardware, this would seem to be the most likely cause here.

In either case, the frozen process would be stuck in kernel mode, so to proceed any further I would need to use the kernel debugger. I decided to start out with local kd, as that is sufficient for at least retrieving thread stacks and doing basic passive analysis of potential deadlock issues where the system is at least mostly still functional.
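
(For the curious, starting local kd is just a matter of launching kd with the -kl switch from an administrator account, assuming the Debugging Tools are installed; WinDbg can do the same thing via File, Kernel Debug, Local.)

kd -kl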

Sure enough, the stuck RRDTool process I had unsuccessfully tried to attach to was blocked in kernel mode:


lkd> !process 0n4904
Searching for Process with Cid == 1328
PROCESS fffffadfcc712040
SessionId: 0 Cid: 1328 Peb: 7efdf000 ParentCid: 1354
DirBase: 5ea6c000 ObjectTable: fffffa80041d19d0 HandleCount: 68.
Image: rrdtool.exe
[...]
THREAD fffffadfcca9a040 Cid 1328.1348 Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (Unknown) KernelMode Non-Alertable
fffffadfccf732d0 SynchronizationEvent
Impersonation token: fffffa80041db980 (Level Impersonation)
DeviceMap fffffa8001228140
Owning Process fffffadfcc712040 Image: rrdtool.exe
Wait Start TickCount 6545162 Ticks: 367515 (0:01:35:42.421)
Context Switch Count 445 LargeStack
UserTime 00:00:00.0000
KernelTime 00:00:00.0015
Win32 Start Address windbg!_imp_RegCreateKeyExW (0x0000000000401000)
Start Address 0x000000007d4d1510
Stack Init fffffadfc4a95e00 Current fffffadfc4a953b0
Base fffffadfc4a96000 Limit fffffadfc4a8f000 Call 0
Priority 10 BasePriority 8 PriorityDecrement 1
RetAddr Call Site
fffff800`01027752 nt!KiSwapContext+0x85
fffff800`0102835e nt!KiSwapThread+0x3c9
fffff800`013187ac nt!KeWaitForSingleObject+0x5a6
fffff800`012b2853 nt!IopAcquireFileObjectLock+0x6d
fffff800`01288dff nt!IopCloseFile+0xad
fffff800`01288f0e nt!ObpDecrementHandleCount+0x175
fffff800`0126ceb0 nt!ObpCloseHandleTableEntry+0x242
fffff800`0128d7a6 nt!ExSweepHandleTable+0xf1
fffff800`012899b6 nt!ObKillProcess+0x109
fffff800`01289d3b nt!PspExitThread+0xa3a
fffff800`0102e3fd nt!NtTerminateProcess+0x362
00000000`77ef0caa nt!KiSystemServiceCopyEnd+0x3
0202c9fc`0202c9fb ntdll!NtTerminateProcess+0xa

Hmm… not quite what I expected. If a buggy driver was involved, it should have at least been somewhere on the call stack, but in this particular instance all we have is ntoskrnl code, straight from the system call to the wait that isn’t coming back. Something was definitely wrong in kernel mode, but it wasn’t immediately clear what was causing it. It appeared as if the kernel was blocked on the file object lock (which, to my knowledge, is used to guard synchronous I/Os that are issued for a particular file object), but, as the file object lock is built upon KEVENTs, the usual lock diagnostics extensions (like `!locks’) would not be particularly helpful. In this instance, what appeared to be happening was that the process rundown logic in the kernel was attempting to release all still-open handles in the exiting RRDTool process, and it was (for some reason) getting stuck while trying to close a handle to a particular file object.

I could at least figure out what file was “broken”, though, by poking around in the stack of IopCloseFile:

lkd> !fileobj fffffadf`ccf73250
\Temp\php\session\sess_bkcavai8fak8antv9coq46at95
LockOperation Set Device Object: 0xfffffadfce423370 \Driver\dmio
Vpb: 0xfffffadfce864840
Access: Read Write SharedRead SharedWrite
Flags: 0x40042
Synchronous IO
Cache Supported
Handle Created
File Object is currently busy and has 1 waiters.
FsContext: 0xfffffa800390e110 FsContext2: 0xfffffa8000106a10
CurrentByteOffset: 0
Cache Data:
Section Object Pointers: fffffadfcd601c20
Shared Cache Map: fffffadfccfdebb0 File Offset: 0 in VACB number 0
Vacb: fffffadfce97fb08
Your data is at: fffff98070e80000

From here, there are a couple of options:

  1. We could look for processes with an open handle to that file and check their stacks.
  2. We could look for an IRP associated with that file object and try and trace our way back from there.

Initially, I tried the first option, but this ended up not working particularly well. I attempted to use Process Explorer to locate all processes that had a handle to that file, but this ended up failing rather miserably as Process Explorer itself got deadlocked after it opened a handle to the file. This was actually rather curious; it turned out that processes could open a handle to this “broken” file just fine, but when they tried to close the handle, they would get blocked in kernel mode indefinitely.

That being unsuccessful, I tried the second option, which is made easier by the use of `!irpfind’. Normally, this extension is very slow to operate (over a serial cable), but local kd makes it quite usable. This revealed something of value:


lkd> !irpfind -v 0 0 fileobject fffffadf`ccf73250
Looking for IRPs with file object == fffffadfccf73250
Scanning large pool allocation table for Tag: Irp? (fffffadfccdf6000 : fffffadfcce56000)
Searching NonPaged pool (fffffadfcac00000 : fffffae000000000) for Tag: Irp?
Irp [ Thread ] irpStack: (Mj,Mn) DevObj [Driver] MDL Process
fffffadfcc225380: Irp is active with 7 stacks 7 is current (= 0xfffffadfcc225600)
No Mdl: No System Buffer: Thread fffffadfccea27d0: Irp stack trace.
cmd flg cl Device File Completion-Context
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000
[...]
>[ 11, 1] 2 1 fffffadfce7b6040 fffffadfccf73250 00000000-00000000 pending
\FileSystem\Ntfs
Args: fffffadfcd70e0a0 00000000 00000000 00000000

There was an active IRP for this file object. Hopefully, it would be related to whatever was holding the file object lock for that file object. Digging a bit deeper, it’s possible to determine which thread is associated with the IRP (if it’s a thread IRP), and from there, we can grab a stack (which might just give us the smoking gun we’re looking for)…:

lkd> !irp fffffadfcc225380 1
Irp is active with 7 stacks 7 is current (= 0xfffffadfcc225600)
No Mdl: No System Buffer: Thread fffffadfccea27d0: Irp stack trace.
Flags = 00000000
ThreadListEntry.Flink = fffffadfcc2253a0
ThreadListEntry.Blink = fffffadfcc2253a0
[...]
CancelRoutine = fffff800010ba930 nt!FsRtlPrivateResetLowestLockOffset
[...]
lkd> !thread fffffadfccea27d0
THREAD fffffadfccea27d0 Cid 10f8.138c Teb: 00000000fffa1000 Win32Thread: fffffa80023cd860 WAIT: (Unknown) UserMode Non-Alertable
fffffadfccf732e8 NotificationEvent
Impersonation token: fffffa8002c62060 (Level Impersonation)
DeviceMap fffffa8002f3b7b0
Owning Process fffffadfcc202c20 Image: w3wp.exe
Wait Start TickCount 6966187 Ticks: 952 (0:00:00:14.875)
Context Switch Count 1401 LargeStack
UserTime 00:00:00.0000
KernelTime 00:00:00.0000
Win32 Start Address 0x00000000003d87d8
Start Address 0x000000007d4d1504
Stack Init fffffadfc4fbee00 Current fffffadfc4fbe860
Base fffffadfc4fbf000 Limit fffffadfc4fb8000 Call 0
Priority 10 BasePriority 8 PriorityDecrement 0
RetAddr : Call Site
fffff800`01027752 : nt!KiSwapContext+0x85
fffff800`0102835e : nt!KiSwapThread+0x3c9
fffff800`012afb38 : nt!KeWaitForSingleObject+0x5a6
fffff800`0102e3fd : nt!NtLockFile+0x634
00000000`77ef14da : nt!KiSystemServiceCopyEnd+0x3
00000000`00000000 : ntdll!NtLockFile+0xa

This might just be what we’re looking for. There’s a thread in w3wp.exe (the IIS worker process), which is blocking on a synchronous NtLockFile call for that same file object that is in the “broken” state. Since I’m running PHP in ISAPI mode, this does make sense – if PHP is doing something to that file (which it could certainly be, since it’s a PHP session state file as we saw above), then it should be in the context of w3wp.exe.

In order to get a better idea of what might be going on, I attached a user mode debugger to w3wp.exe and grabbed a user mode stack trace:


0:006> .effmach x86
Effective machine: x86 compatible (x86)
0:006:x86> ~6s
ntdll_7d600000!ZwLockFile+0x12:
00000000`7d61d82e c22800 ret 28h
0:006:x86> k
ChildEBP RetAddr
WARNING: Stack unwind information not available. Following frames may be wrong.
014edf84 023915a2 ntdll_7d600000!ZwLockFile+0x12
014edfbc 0241d886 php5ts!flock+0x82
00000000 00000000 php5ts!zend_reflection_class_factory+0xb576

It looks like that thread is indeed related to PHP; PHP is trying to acquire a file lock on the session state file. With a bit of work, we can figure out just what kind of lock it was trying to acquire.

The prototype for NtLockFile is as follows:

// NtLockFile locks a region of a file.
NTSYSAPI
NTSTATUS
NTAPI
NtLockFile(
    IN HANDLE FileHandle,
    IN HANDLE Event OPTIONAL,
    IN PIO_APC_ROUTINE ApcRoutine OPTIONAL,
    IN PVOID ApcContext OPTIONAL,
    OUT PIO_STATUS_BLOCK IoStatusBlock,
    IN PULARGE_INTEGER LockOffset,
    IN PULARGE_INTEGER LockLength,
    IN ULONG Key,
    IN BOOLEAN FailImmediately,
    IN BOOLEAN ExclusiveLock
    );

Given this prototype, we can easily deduce the arguments from a stack dump:

0:006:x86> dd @esp+4 l0n10
00000000`014edf48 000002b0 00000000 00000000 014edfac
00000000`014edf58 014edfac 014edf74 014edf7c 00000000
00000000`014edf68 00000000 00000001
0:006:x86> dq 014edf74 l1
00000000`014edf74 00000000`00000000
0:006:x86> dq 014edf7c l1
00000000`014edf7c 00000000`00000001

It seems that PHP is trying to acquire an exclusive lock for a range of 1 byte starting at offset 0 in this file, with NtLockFile configured to wait until it acquires the lock.
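
For comparison, the Win32-level call that would produce this kind of NtLockFile request looks roughly like the following (an illustrative sketch only; I haven’t checked exactly which call PHP’s flock() implementation funnels into):

#include <windows.h>

// Exclusive, blocking, 1-byte-range lock at offset 0 (illustrative sketch).
BOOL LockSessionFile(HANDLE FileHandle)
{
    OVERLAPPED Overlapped;

    ZeroMemory(&Overlapped, sizeof(Overlapped));
    Overlapped.Offset     = 0;    // LockOffset.LowPart  == 0
    Overlapped.OffsetHigh = 0;    // LockOffset.HighPart == 0

    //
    // LOCKFILE_EXCLUSIVE_LOCK        -> ExclusiveLock   = TRUE
    // (no LOCKFILE_FAIL_IMMEDIATELY) -> FailImmediately = FALSE, so we wait
    //
    return LockFileEx(FileHandle,
                      LOCKFILE_EXCLUSIVE_LOCK,
                      0,              // reserved
                      1,              // lock length (low part) == 1 byte
                      0,              // lock length (high part)
                      &Overlapped);
}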

Putting this information together, it’s now possible to surmise what is going on here:

  1. The child processes created by PHP have a file handle to the session state file (probably inherited at process creation).
  2. PHP tries to acquire an exclusive lock on part of the session state file. This takes the file object lock for that file and waits for the file to become exclusively available.
  3. The child process exits. Now, it tries to acquire the file object lock so that it can close its inherited file handle. However, the file object lock is held until PHP’s NtLockFile completes, and the child’s open handle is what is blocking PHP’s NtLockFile from completing in the first place.
  4. Deadlock! Neither thread can continue, and PHP appears to hang instead of configuring my graphs properly.

In this particular instance, it was actually possible to “recover” from the deadlock without rebooting; the IIS worker process’s wait in NtLockFile is marked as a UserMode wait, so it is possible to terminate the w3wp.exe process, which releases the file object lock and ultimately allows all the frozen processes that are trying to close a handle to the PHP session state file to finish the close handle operation and exit.

This is actually a nasty little problem; it looks like it’s possible for one user mode process to indefinitely freeze another user mode process in kernel mode via a deadlock. Although you can break the deadlock by terminating the second user mode process, the fact that a user mode process can, at all, cause the kernel to deadlock during process exit (“breakable” or not) does not appear to be a good thing to me.

Meanwhile, knowing this right now doesn’t really solve my problem. Furthermore, I suspect that there’s probably a different problem here too, as the command line that was given to RRDTool (simply “-“) doesn’t look all that valid to me. I’ll see if I can come up with some way to work around this deadlock problem, but it definitely looks like an unpleasant one. If it really is a file handle being incorrectly inherited to a child process, then it might be possible to un-mark that handle for inheritance with some work. The fact that I am having to consider making something to patch PHP to work around this is definitely not a happy one, though…
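
For what it’s worth, the API side of un-marking a handle for inheritance is trivial; the hard part would be locating the right handle from a patch or shim before PHP spawns the RRDTool child process. Assuming one had the session file handle in hand, it would be something like this (hypothetical sketch):

// Hypothetical: clear the inherit flag on PHP's session file handle so that
// child processes never receive a copy of it in the first place.
SetHandleInformation(SessionFileHandle, HANDLE_FLAG_INHERIT, 0);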

Silly me for thinking that it would just take a kernel debugger to get a web application running…

Debugger internals: Why do ntoskrnl and ntdll have type information?

Wednesday, February 14th, 2007

Johan Johansson sent me a mail asking why nt and ntdll have partial type information included. (If you’re at all experienced with debugging on Windows, you’ll know that public symbols, such as what Microsoft ships on the public symbol server at http://msdl.microsoft.com/download/symbols, don’t typically include type information; instead, one typically needs access to private symbols in order to view types.)

However, nt and ntdll are an exception to this rule on Windows XP and later. Unlike all the other PDBs shipped by Microsoft, the ones corresponding to ntdll and ntoskrnl do include type information for a seemingly arbitrary mix of types, some publicly documented and some undocumented. There is, however, a method to the madness with respect to what symbols are included in the public nt/ntdll PDBs.

To understand what symbols are chosen and why, though, it’s necessary to know a bit of history.

Way back in the days when Windows NT was still called Windows NT (and not Windows 2000 or Windows XP), the debugger scene was a much less friendly place. In those days, “remote debugging” involved dialing up your debugger person with a modem, and if you wanted to kernel debug a target, you had to run a different kernel debugger program specific to the architecture on the target computer.

Additionally, you had to use a debugger version that was newer than the operating system on the target computer or things wouldn’t work out very well. Furthermore, one had to load architecture-specific extension dlls for many kernel debugging tasks. One of the reasons for these restrictions was that for different architectures (and different OS releases), the size and layout of many internal structures used by the debugger (and debugger extension modules) to do their work varied. In other words, the size and layout of, say, EPROCESS might not be the same on Windows NT 3.1 x86 vs Windows NT 4.0 for the DEC Alpha.

When Windows 2000 was released, things became a little bit better: Windows 2000 only publicly supported x86, which reduced the number of different architectures that WinDbg needed to publicly support going forward. However, Windows XP and Windows Server 2003 reintroduced mainstream support for non-x86 architectures (first IA64 and then x64).

At some point on the road to Windows XP and Windows Server 2003, a decision was made to clean things up from the debugger perspective and introduce a more manageable way of supporting a large matrix of operating systems and target architectures.

Part of the solution devised involved providing a unified, future-compatible (where possible; obviously, new or radically redesigned OS functionality would require debugger extension changes, but things like simple structure size changes shouldn’t require such drastic measures) method for accessing data on the remote system. Since all that a debugger extension does in the common case is simply quickly reformat data on the target system into an easily human-readable format (such as the process list returned by !process), a unified way to communicate structure sizes and layouts to debugger extensions (and the debugger engine itself) would greatly reduce the headache of supporting the debugger on an ever-expanding set of platforms and operating systems. This solution involved putting the structures used by the debugger engine itself and many debugger extensions into the symbol files shipped with ntoskrnl and ntdll, and then providing a well-defined API for retrieving that type information from the perspective of a debugger extension.
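
To give a rough idea of what this looks like from the extension writer’s side of the fence, here is a minimal sketch of an extension command using the wdbgexts.h type helpers (the type and field names are just examples, and the usual ExtensionApis initialization in WinDbgExtensionDllInit is assumed):

#define KDEXT_64BIT
#include <windows.h>
#include <wdbgexts.h>

//
// Print the image name of an EPROCESS given its address, asking the symbol
// engine for the field offset instead of hardcoding it per OS build.
//
DECLARE_API( imagename )
{
    ULONG64 Process = GetExpression( args );
    ULONG   Offset  = 0;
    CHAR    ImageFileName[ 16 ] = { 0 };

    if (GetFieldOffset( "nt!_EPROCESS", "ImageFileName", &Offset ) != 0)
    {
        dprintf( "No type information for nt!_EPROCESS?\n" );
        return;
    }

    if (!ReadMemory( Process + Offset, ImageFileName,
                     sizeof( ImageFileName ) - 1, NULL ))
    {
        dprintf( "Unable to read image name at %I64x\n", Process + Offset );
        return;
    }

    dprintf( "Image: %s (ImageFileName at offset +0x%x)\n",
             ImageFileName, Offset );
}

Because the offset is resolved against the target’s symbols at runtime, the same extension binary keeps working when the structure layout changes in a new OS build or service pack.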

Fast-forward to 2007. Now, there is a single, unified debugger engine for all target platforms and architectures (the i386kd.exe and ia64kd.exe that ship with the DTW distribution are essentially the same as the plain kd.exe and are vestigial remains of a bygone era; these files are simply retained for backwards compatibility with scripts and programs that drive the debugger), real remote debugging exists, and your debugger doesn’t break every time a service pack is released. All of this is made possible in part due to the symbol information available in the ntoskrnl and ntdll PDB files. This is why you can use WinDbg 6.6.7.5 to debug Windows Vista RTM, despite the fact that WinDbg 6.6.7.5 was released months before the final RTM build was shipped.

Symbol support is also a reason why there are no ‘srv03ext’, ‘srv03fre’, or ‘srv03chk’ extension directories under your Debugging Tools for Windows folder. The nt4chk/nt4fre/w2kchk/w2kfre directories contain debugger extensions specific to that Windows build. Due to the new unified debugger architecture, there is no longer a need to tie a binary to a particular operating system build going forward. Because Windows 2000 and Windows NT 4.0 don’t include type data, however, the old directories still remain for backwards compatibility with those platforms.

So, to answer Johan’s question: All of the symbols in the ntoskrnl or ntdll PDBs should be used by the debugger engine itself, or some debugger extension, somewhere. This is the final determining factor as to what types are exposed via those PDBs, to my knowledge; whether a public debugger extension DLL (or the debugger engine itself) uses them or not.

Reverse engineer’s toolkit

Tuesday, February 13th, 2007

If you’re planning on reverse engineering something (or debugging unfamiliar code), then it’s important to have the right tools for the job. This is a short list of some of the various tools that I find useful for this line of work (some are free, others are not; they are good tools, though, so I would encourage you to support the authors); if I am going to be doing extensive reversing (or debugging) work on a system, I’ll typically install most (or all) of these tools:

  • WinDbg (Debugging Tools for Windows); the de facto debugger for Windows NT-based systems (and it’s freely available). Although there are other good debuggers out there (such as OllyDbg), it’s hard to beat the fact that Microsoft develops the debugger, and the designers have a great deal of information about and experience with the internals of the operating system. WinDbg comes with a plethora of plugin modules (extensions) that automate and improve many common tasks, and it has a well-documented and powerful interface to allow third parties (such as yourself) to program new extension modules for customized tasks. Additionally, WinDbg supports a (relatively crude, but effective nevertheless) scripting language. WinDbg can be used on 32-bit and 64-bit (IA64 and x64) targets in both user mode and kernel mode, and is the only supported facility for debugging problems at customer sites or debugging certain specialized scenarios (like certain infrastructure processes, or fullscreen processes). The Debugging Tools for Windows distribution also includes ntsd, cdb, and kd, which are different (command-line) front-ends to the same debugger engine used by WinDbg. It also includes various other useful utilities, such as Logger (an API call spy program).
  • IDA Pro, or the Interactive Disassembler. IDA is the most full-featured disassembler (by a huge margin) out there that I know of, and it supports a wide variety of target architectures and operating systems to boot (although its support for x86 is probably the most full-featured of all the target architectures that you can use it on). IDA automates many of the aspects of disassembly and reverse engineering that have historically been tedious and error-prone, such as stack variable tracking. The full version of IDA is relatively pricey (on the order of $500..$900 or so, depending on whether you need x64 support), but if you’re serious about reverse engineering, it’s a good investment. For the casual reverse engineer, DataRescue has released a previous version, IDA 4.30, for free. If you are just working with x86 or simply want to get your feet wet with reversing, the free version is a good place to start. Although tools like WinDbg include a built-in disassembler, IDA’s advanced code analysis and interactive aspects (such as allowing the user to describe and annotate types, names, and variables in a structured fashion) make it far superior for non-trivial reverse engineering projects.
  • HIEW (Hacker’s vIEW) is a powerful (console-based) hex editor / disassembler / assembler for 16-bit, 32-bit, and 64-bit (x64) targets. It understands the various Windows image formats, but it can also be used on raw binary files as well. In addition to serving all of your hex editor needs, HIEW is just the perfect tool to use when you need to make quick on-disk patches to executables (and the integrated disassembler and assembler makes creating and applying such patches on-the-fly a walk in the park compared to traditional hex editors, which would require you to manually build the opcodes yourself, a pain for non-trivial patches). HIEW includes some additional power-features as well, such as a way to create and run simple programs to decrypt sections in a file (very useful if you’re working on something that is self-decrypting, and you know how to decrypt it but don’t have a program to do so already). It also includes a fairly simple plugin extension interface to allow custom actions to be integrated with the HIEW user interface. HIEW isn’t free, although it is fairly reasonably priced (and there is a (limited) demo that you can play around with).
  • The Windows Vista SDK is an essential tool for many reverse engineering tasks. It includes extensive documentation (as well as headers) for all public Win32 APIs, and it also includes several useful utilities as well (such as link.exe /dump, otherwise known as dumpbin.exe, which can be used to quickly extract various bits of information from a binary (like a list of imports) without having to load it up into a full-featured disassembler tool). The Vista SDK also includes OleView, which can be useful for inspecting a third-party COM library, as it has integrated support for turning a type library into IDL (which can be trivially converted to a C header as needed).
  • Process Monitor, from SysInternals, is a great utility for quickly inspecting what file and registry operations a program is making. Depending on what you are trying to figure out when analyzing a program, simply looking at its file and registry activity can often save you hours of wading through disassembly or working with a debugger. Process Monitor allows you to perform (potentially filtered) in-depth logging of low-level file and registry activity, including operations that fail and the data returned by successful operations in some circumstances. Process Monitor is freely available from Microsoft’s website.
  • Process Explorer, formerly known as HandleEx, is another freely available tool from SysInternals. It allows quick and easy searching of open handles and loaded modules within a process, which is handy for initial information gathering (such as finding which process uses a DLL that you might be interested in). The ability to quickly check active processes for a DLL is also quite handy if you are trying to track down where a shared DLL (like a Windows hook DLL) is getting loaded in as a precursor to in-depth analysis.
  • SDbgExt is a WinDbg extension that I’ve developed (shameless plug). It provides a number of features that are useful for analyzing unfamiliar targets, such as the ability to import map files and create symbolic names out of them (particularly useful in conjunction with IDA’s export map file feature, if you want to synchronize your process between IDA and WinDbg), support for identifying writable function pointers within a process address space, and support for locating and displaying x64 exception/unwind handlers. SDbgExt presently only supports the 32-bit version of WinDbg. It is freely available (see the link above), and requires that WinDbg and the VC2005 CRT be installed.

There are other useful utilities out there (this is hardly an all-inclusive list), but these are the ones that I use the most often in a wide variety of situations.

Compiler optimizer tricks in x86 assembly, part 1

Saturday, February 10th, 2007

The compiler is often very clever about speeding up some common operations in C (in terms of how they appear in assembly), in ways that might at first appear a bit non-obvious. With a bit of practice, you can train yourself to quickly identify these optimizations and see what they really do. These kinds of optimizations are very common with constant values.

With many of these little “assembly tricks”, the compiler is not simply taking into account instruction speed, but also instruction size, in order to reduce the total number of opcode bytes required to perform an operation.

This post series attempts to provide a basic overview (and rationale) of several of the most common compiler optimizations that you might see while analyzing x86 assembly code. Knowing what these “assembly tricks” do is a great benefit if you are debugging or reverse engineering something, as it allows one to quickly recognize certain compiler patterns and gain clues as to what a program is doing.

Without further ado, here’s a brief overview of several of these optimizations:

  1. Clearing a register value by xor reg, reg is a very common sequence in x86 code generated by a compiler. You will almost never see an instruction of the form mov reg, 0. Instead, the compiler will virtually always use the above-mentioned xor construct.

    The reasoning behind this is that the xor reg, reg construct is a very small instruction in terms of opcode bytes (2 bytes), whereas assigning a constant 32-bit value is typically much more expensive in terms of opcode length (say, 5 bytes).

    The gain with this optimization is reduced code size. Reducing code size is always good, and can lead to improved performance in that it improves the cacheability of a particular chunk of code (remember, most processor cache sizes are still very small compared to main system memory or hard drives). Also, if the compiler can shrink down the total image size by even one page (e.g. 4096 bytes) with optimizations like this, that’s one less page fault that needs to be taken when loading the program. This can be noticeable even in lengths less than a page, if a function can be kept within one page instead of spilling over to a neighboring page (which might not be used for the current code path), which can eliminate “extraneous” page faults, where most of the data brought in is unnecessary.

    (It’s worth noting that this sort of optimization has come a long way recently, in the form of profile-guided BBT-style optimizations that reorder “hot” code paths to be on the same page in an attempt to make every byte that is in-paged from disk be as relevant as possible to the current execution path.)

  2. The constant zero register is another very common optimization technique used by the compiler. Since the value zero appears so frequently in many functions (e.g. default parameter values are typically zero, and it is very common to compare values against zero or assign values to zero), the compiler will sometimes dedicate an entire register to contain the value zero throughout most of the function. Here’s an example, taken from nt!MmZeroPageThread:

    xor     esi, esi
    cmp     [ebp+var_14], esi
    [...]
    push    esi             ; WaitBlockArray
    push    esi             ; Timeout
    push    esi             ; Alertable
    push    esi             ; WaitMode
    push    8               ; WaitReason
    xor     ebx, ebx
    inc     ebx
    push    ebx             ; WaitType
    lea     eax, [ebp+Object]
    push    eax             ; Object
    push    2               ; Count
    call    _KeWaitForMultipleObjects@32

    Here, the compiler has dedicated the “esi” register to be the constant zero value for this function. It is used in many cases; in just a small part of nt!MmZeroPageThread, for instance, we can see that it is being used as both an argument to the “cmp” instruction for a test against constant zero, and we can also see it being used to fill many constant zero parameter values for a call to KeWaitForMultipleObjects.

    The gain from using a constant zero register is typically reduced code size. In x86, in many cases, it takes less opcode bytes to assign or compare a value to a register than to an immediate constant operand. The compiler takes advantage of this fact by only “paying the price” for referencing the value zero in an immediate operand for an instruction once, by storing it into a register. Then, the compiler can simply refer to that register for smaller overall code size if references to constant zero are common enough. For example, in the above code fragment, the compiler saves one opcode byte per push esi instruction over doing a push 0 instruction.

  3. Fast multiplication or addition with the lea instruction is another fairly common optimization that one is likely to run into frequently. The lea instruction (load effective address) was originally intended for use with pointer math, but it also turns out to be quite useful for doing fast constant math on non-pointer values as well, in part due to the wide range of operations that can be done with this instruction.

    Consider the following code fragment:

    mov     eax, [ebp+some_local]
    movzx   ecx, bx
    lea     eax, [eax+eax*4]
    lea     eax, [ecx+eax*2-30h]
    mov     [ebp+other_local], eax

    This instruction sequence may seem a bit convoluted at first, but it’s not too bad if we break it down into its constituent parts (and then combine it into one operation).

    We have the following operations:

    lea     eax, [eax+eax*4]

    This operation is equivalent to the following in C:

    other_local = some_local;
    other_local *= 5;

    Then, we’ve got the second lea operation:

    lea     eax, [ecx+eax*2-30h]

    In C, this might look like so:

    other_local = other_local * 2 + X - 0x30;

    …(where X corresponds to bx (and then ecx)).

    If we combine the two together, we get the following expression in C:

    other_local = some_local * 10 + X - 0x30;

    Now, the compiler could have used a combination of mul, add, and sub instructions to achieve the same effect. However, this would be more expensive in terms of instruction size, as those instructions are designed to work with values that are not known until runtime. By using the lea instruction, the compiler can take advantage of the fact that the lea instruction can perform multiple operations with one instruction in order to reduce code size.

  4. The lea instruction is also useful for scaling array indices to their native type. Recall that in C, if you subscript an array, the subscript is performed on whole array elements; in the x86 instruction set, however, the processor has no way of magically knowing the size of an array element when the programmer subscripts an array.

    For example, consider the following structure:

    0:000> dt nt!_MMPFN
       +0x000 u1               : __unnamed
       +0x004 PteAddress       : Ptr32 _MMPTE
       +0x008 u2               : __unnamed
       +0x00c u3               : __unnamed
       +0x010 OriginalPte      : _MMPTE
       +0x010 AweReferenceCount : Int4B
       +0x014 u4               : __unnamed ; 4 bytes
    

    In this case, sizeof(nt!_MMPFN) == 24 (0x18). Consider an array of _MMPFN structures like so:

    _MMPFN MmPfnDatabase[ array_size ];

    If the programmer wants to index MmPfnDatabase (i.e. retrieve a pointer to a particular _MMPFN element within the array), then the compiler needs to convert an index into a pointer to an _MMPFN structure contained within the array.

    For example, the programmer might write the following C code:

    _MMPFN* Pfn = &MmPfnDatabase[ PfnIndex ];

    At the x86 instruction set level, though, the compiler needs to convert PfnIndex into a pointer. This requires two operations: First, PfnIndex needs to be scaled by the array element size (that is, multiplied by sizeof(_MMPFN)). Second, the resultant value needs to be added to the base of MmPfnDatabase to form a complete pointer value to the requested _MMPFN element.

    In order to accomplish this, the compiler might emit the following instruction sequence:

    mov     eax, [ebp+PfnIndex]
    mov     ecx, _MmPfnDatabase
    push    ebx
    mov     ebx, [ebp+arg_4]
    lea     eax, [eax+eax*2]
    push    esi
    lea     esi, [ecx+eax*8]
    

    Here, the lea instruction is used to take the PfnIndex and MmPfnDatabase values and combine them into an _MMPFN pointer (stored in “esi”). If we break down the individual operations performed, what’s going on here isn’t all that difficult to understand:

    1. The initial LEA operation is equivalent to multiplying PfnIndex by 3 (PfnIndex is stored into “eax”).
    2. The final LEA operation multiplies the result of the first LEA operation by 8 (which is to say that PfnIndex has effectively been multiplied by 24, which is conveniently equal to sizeof(_MMPFN)).
    3. Finally (also part of the last LEA operation), “ecx” (which was loaded with the base address of MmPfnDatabase) is added to the intermediate result, which is then placed into “esi”, forming a completed _MMPFN pointer.

    Like with performing constant math for non-array indexing, the advantage of using lea over a series of mul and add operations is primarily code size, as lea allows for several distinct operations (e.g. multiply/add, or multiply/subtract) to be combined into a single compact operation. Most processors also provide very fast implementations of the lea instruction (as compared to the mul instruction for constant multiply operations).

    In order to be able to differentiate between an array indexing operation and plain constant math using lea, I would recommend checking to see whether any of the input values are treated as pointers or if the output value is treated as a pointer. Usually, it’s fairly easy to make this determination, though.

    As an aside, if you are reverse engineering something, constructs like array index operations are very handy as they will definitively tell you the size of the structure comprising an array.

The next post in this series will continue this discussion and examine several more common (and more complicated) optimizations and assembly tricks that the compiler may emit on x86. Stay tuned…

Enabling the local kernel debugger on Vista RTM

Friday, February 9th, 2007

If you’re a kernel developer, and you’ve upgraded to Vista, then one of the changes that you may have noticed is that you can’t perform local kernel debugging anymore.

This is true even if you elevate WinDbg. If you try, you’ll get an error message stating that the debugger failed to get KD version information (error 5), which corresponds to the Win32 ERROR_ACCESS_DENIED error code.

This is due to a change from Vista RC2 to Vista RTM, where the kernel function responsible for supporting much of the local KD functionality in WinDbg (KdSystemDebugControl) was altered to require the system to be booted with /DEBUG. This is apparent if we compare RC2 to RTM.

RC2 has the following check in KdSystemDebugControl (one comparison against KdpBootedNodebug):

nt!KdSystemDebugControl:
push    0F4h
push    81841938
call    nt!_SEH_prolog4
xor     ebx,ebx
mov     dword ptr [ebp-28h],ebx
mov     dword ptr [ebp-20h],ebx
mov     dword ptr [ebp-24h],ebx
cmp     byte ptr [nt!KdpBootedNodebug],bl
je      nt!KdSystemDebugControl+0x2c ; Success
mov     eax,0C0000022h ; STATUS_ACCESS_DENIED

On Vista RTM, two additional checks were added against nt!KdPitchDebugger and nt!KdDebuggerEnabled (disregard the fact that the RTM disassembly is from the x64 version; both the x86 and x64 Vista versions have the same checks):

nt!KdSystemDebugControl:
mov     qword ptr [rsp+8],rbx
mov     qword ptr [rsp+10h],rdi
push    r12
sub     rsp,170h
mov     r10,rdx
and     dword ptr [rsp+44h],0
and     qword ptr [rsp+48h],0
and     qword ptr [rsp+50h],0
cmp     byte ptr [nt!KdpBootedNodebug],0
jne     nt!KdSystemDebugControl+0x8b7 ; Fail
cmp     byte ptr [nt!KdPitchDebugger],0
jne     nt!KdSystemDebugControl+0x8b7 ; Fail
cmp     byte ptr [nt!KdDebuggerEnabled],0
je      nt!KdSystemDebugControl+0x8b7 ; Fail

The essence of these checks is that you need to be booted with /DEBUG enabled in order for local kernel debugging to work.

There is a simple way to accomplish this, however, without the usual painful aspects of having a kernel debugger attached (e.g. breaking on user mode exceptions or breakpoints).

All you have to do is enable kernel debugging, and then disable user mode exception handling. This requires the following options to be set via BCDEdit.exe, the Vista boot configuration database manager:

  1. bcdedit /debug on. This enables kernel debugging for the booted OS configuration.
  2. bcdedit /dbgsettings <type> /start disable /noumex (where type corresponds to a usable KD type on your computer, such as 1394). This disables user mode exception handling for the kernel debugger. You should still be able to boot the system without a kernel debugger attached.
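
For example, on a machine with a 1394 debug connection, the pair of commands might look something like the following (substitute the transport and parameters that match your own setup):

bcdedit /debug on
bcdedit /dbgsettings 1394 channel:32 /start disable /noumex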

After setting these options, reboot, and you should be set. You’ll now be able to use local KD (you must still remember to elevate the debugger, though), but you won’t have user mode programs try to break into the kernel debugger when they crash without a user mode debugger attached.

Note, however, that you’ll still be able to break into the system with a kernel debugger after boot if you choose these options (and if the box crashes in kernel mode, it’ll freeze waiting for a debugger to attach). However, at least you will not have to contend with errant user mode programs causing the system to break into the kernel debugger.

Don’t always trust the compiler… (or when reverse engineering comes in handy even when you’ve got source code)

Wednesday, January 10th, 2007

Usually, when a program breaks, you look for a bug in the program. On the rare occasion, however, compilers have been known to malfunction.

I ran into such a problem recently. At my apartment, I have a video streaming system set up, wherein I have a TV tuner plugged into a dedicated desktop box. That desktop box has been set up to run VLC noninteractively in order to stream (broadcast) TV from the TV tuner onto my apartment LAN. Then, if I want to watch TV, all I have to do is pull up VLC at a computer and tell it to display the MPEG stream I have configured to be broadcast on my local network.

This works fairly well (although VLC isn’t without its quirks), and it’s got the nice side effect that I have a bit more flexibility as to where I can watch TV without having to invest in extra hardware (beyond a TV tuner). Furthermore, I can even do silly things like put TV up on multiple monitors if I really wanted to, something not normally doable if you just use a “plain” TV set (the old fashioned way!).

Recently, though, one of my computers ceased being able to play the MPEG stream I was running over my network. Investigation showed that other computers on the LAN weren’t having problems displaying the stream; only this one system in particular wouldn’t play the stream correctly. When I connected VLC to the stream, I’d get a blank black screen with no audio or video. I checked out the VLC debug message log and found numerous instances of this log message:

warning: received buffer in the future

Hmm. It seemed like VLC was having timing-related problems that were causing it to drop frames. My first reaction was that VLC had some broken-ness relating to the handling of large uptimes (this system in question had recently exceeded the “49.7 day boundary”, wherein the value returned by GetTickCount, a count in milliseconds of time elapsed since the system booted, wraps around to zero). I set out to prove this assumption by setting a breakpoint on kernel32!GetTickCount in the debugger and attaching VLC to the stream. While GetTickCount was occasionally called, it turned out that it wasn’t being used in the critical code path in question.
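
(The 49.7 day figure mentioned above simply falls out of the arithmetic: GetTickCount returns a 32-bit millisecond count, and 2^32 ms is 4,294,967,296 ms, or a little over 49.71 days.)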

So, I set out to find that log message in the VLC source code (VLC is open source). It turned out to be coming from a function relating to audio decoding (aout_DecPlay). The relevant code turned out to be as follows (reformatting by me):

[...]
if ( p_buffer->start_date > mdate() +
     p_input->i_pts_delay           +
     AOUT_MAX_ADVANCE_TIME )

{
     msg_Warn( p_aout,
      "received buffer in the future ("I64Fd")",
       p_buffer->start_date - mdate());

[...]

After logging this warning, the function in question drops the frame with the assumption that it is probably bogus due to bad timing information.

Clearly, there was nothing wrong with the stream itself, as I could still play the stream fine on other computers. In fact, neither restarting VLC on the computer hosting the stream nor restarting the computer hosting the stream itself did anything to resolve the problem; other computers could play the stream, except for one system (with a high uptime) that would always fail due to bad timing information.

In this case, it turns out that the mdate function is an internal VLC function used all over the place for high resolution timing. It returns a microsecond-precision counter that is monotonically incrementing since VLC started (or in the case of Win32 VLC, since Windows was started). I continued to suspect that something was wrong here (as the only system that was failing to play the stream had a fairly high uptime). Looking into the source for mdate, there were two code paths that could be taken on Win32; one that used GetTickCount for timing resolution (though this code path in question does handle tick count wraparound), and another path that utilizes QueryPerformanceCounter and QueryPerformanceFrequency for high resolution timing, if VLC thinks that the performance counter is slaved to the system timer clock. (Whether or not the latter is really a good thing to do period on Windows is debatable; I would say no, but it appears to work for VLC.)

As I had already ruled out GetTickCount as being used in the timing-critical parts of VLC, I ignored the GetTickCount-related code path in mdate. This left the following segment of code in the Win32 version of mdate:

/**
 * Return high precision date
 *
 * Uses the gettimeofday() function when
 *  possible (1 MHz resolution) or the
 * ftime() function (1 kHz resolution).
 */
mtime_t mdate( void )
{
 /* We don't need the real date, just the value of
    a high precision timer */
 static mtime_t freq = I64C(-1);

 if( freq == I64C(-1) )
 {
  /* Extract from the Tcl source code:
   * (http://www.cs.man.ac.uk/fellowsd-bin/TIP/7.html)
   *
   * Some hardware abstraction layers use the CPU clock
   * in place of the real-time clock as a performance counter
   * reference.  This results in:
   * - inconsistent results among the processors on
   *   multi-processor systems.
   * - unpredictable changes in performance counter frequency
   *   on "gearshift" processors such as Transmeta and
   *   SpeedStep.
   * There seems to be no way to test whether the performance
   * counter is reliable, but a useful heuristic is that
   * if its frequency is 1.193182 MHz or 3.579545 MHz, it's
   * derived from a colorburst crystal and is therefore
   * the RTC rather than the TSC.  If it's anything else, we
   * presume that the performance counter is unreliable.
   */

  freq = ( QueryPerformanceFrequency( (LARGE_INTEGER *)&freq )
      && (freq == I64C(1193182) || freq == I64C(3579545) ) )
      ? freq : 0;
 }

 if( freq != 0 )
 {
  LARGE_INTEGER counter;
  QueryPerformanceCounter (&counter);

  /* Convert to from (1/freq) to microsecond resolution */
  /* We need to split the division to avoid 63-bits
       overflow */
  lldiv_t d = lldiv (counter.QuadPart, freq);

  return (d.quot * 1000000)
    + ((d.rem * 1000000) / freq);
 }
[...]
}

This code isn’t all that hard to follow. The idea is that the first time around, mdate will check the performance counter frequency for the current system. If it is one of two magical values, then mdate will be configured to use the performance counter for timing. Otherwise, it is configured to use an alternate method (not shown here), which is based on GetTickCount. On the system in question, mdate was being set to use the performance counter and not GetTickCount.

Assuming that mdate has decided on using the system performance counter for timing purposes (which, again, I do not believe is a particularly good (portable) choice, though it does happen to work on my system), mdate simply divides out the counter value by the frequency (the count of counter units per second), adjusted to return a microsecond value (hence the constant 1000000 values). The reason why the original author split up the division into two parts is evident from the comment; it is an effort to avoid an integer overflow when performing math on large quantities (it avoids multiplying an already very large (64-bit) value by 1000000 before the division, which might exceed 64 bits in the resultant quantity). (In case you were wondering, lldiv is a 64-bit version of the standard C runtime function ldiv; that is, it performs an integral 64-bit division with remainder.)
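To make the distinction concrete, here is a minimal sketch (my own illustration, not code from VLC) contrasting the naive conversion with the split-division form used in mdate; the function and variable names are made up for the example:

#include <stdlib.h>

typedef long long mtime_t;

/*
 * Naive conversion: the intermediate (counter * 1000000) can exceed the
 * range of a signed 64-bit integer long before the counter itself is
 * anywhere near wrapping.
 */
mtime_t to_usec_naive(long long counter, long long freq)
{
    return (counter * 1000000) / freq;
}

/*
 * Split conversion, in the spirit of the VLC source: divide first, then
 * scale the quotient and remainder separately. No intermediate term can
 * overflow for realistic inputs, since the remainder is always smaller
 * than freq.
 */
mtime_t to_usec_split(long long counter, long long freq)
{
    lldiv_t d = lldiv(counter, freq);

    return (d.quot * 1000000) + ((d.rem * 1000000) / freq);
}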

Given this code, it would appear that mdate should be working fine. Just to be sure, though, I decided to double check what was going on in the debugger. Although VLC was built with gcc (and thus doesn’t ship with WinDbg-compatible symbol files), mdate is a function exported by one of the core VLC DLLs (libvlc.dll), so there wasn’t any great difficulty in setting a breakpoint on it with the debugger.

What I found was that mdate was in fact returning a strange value (to be precise, a large negative value – mtime_t is a signed 64-bit integer). Given the expression used in the audio decoding function snippet I listed above, it’s no surprise why that would break if mdate returned a negative value (and it’s a good assumption that other code in VLC would similarly break).

The relevant portion of the mdate implementation that gcc actually built is as follows:

libvlc!mdate+0xe0:
62e20aa0 8d442428     lea     eax,[esp+0x28]
62e20aa4 890424       mov     [esp],eax
;
; QueryPerformanceCounter(&counter)
;
62e20aa7 e874640800   call    QueryPerformanceCounter
62e20aac 83ec04       sub     esp,0x4
62e20aaf b940420f00   mov     ecx,0xf4240 ; 1000000
62e20ab4 8b742428     mov     esi,[esp+0x28]
62e20ab8 8b7c242c     mov     edi,[esp+0x2c]
62e20abc 89f0         mov     eax,esi
62e20abe f7e1         mul     ecx
62e20ac0 89c1         mov     ecx,eax
62e20ac2 69c740420f00 imul    eax,edi,0xf4240 ; 1000000
62e20ac8 890c24       mov     [esp],ecx
62e20acb 8b3dcc7a2763 mov     edi,[freq.HighPart]
62e20ad1 8d3402       lea     esi,[edx+eax]
62e20ad4 897c240c     mov     [esp+0xc],edi
62e20ad8 8b15c87a2763 mov     edx,[freq.LowPart]
62e20ade 89742404     mov     [esp+0x4],esi
62e20ae2 89542408     mov     [esp+0x8],edx
;
; lldiv(...)
;
62e20ae6 e815983e00   call    lldiv
62e20aeb 8b5c2430     mov     ebx,[esp+0x30]
62e20aef 8b742434     mov     esi,[esp+0x34]
62e20af3 8b7c2438     mov     edi,[esp+0x38]
62e20af7 83c43c       add     esp,0x3c
62e20afa c3           ret

This bit of code might look a bit daunting at first, but it’s not too bad. Translated into C, it looks approximately like so:

LARGE_INTEGER counter, tmp;
lldiv_t d;

QueryPerformanceCounter(&counter);

tmp.LowPart  = counter.LowPart  * 1000000;
tmp.HighPart = counter.HighPart * 1000000 +
    (((unsigned __int64)counter.LowPart  * 1000000) >> 32);

d = lldiv(tmp.QuadPart, freq);

return d.quot;

This code looks a little bit weird, though. It’s not exactly what we see in the VLC source code, even accounting for differences that might arise between original C source code and reverse engineered C source code; in the compiled code, the multiplication from the return statement has been moved to before the call to lldiv.

In fact, the code has been heavily optimized. The compiler (gcc, in this case) apparently assumed some knowledge about the inner workings of lldiv, and decided that it would be safe to pre-calculate an input value instead of performing post-calculations on the result of lldiv. At first glance the calculations do appear to be equivalent; the compiler simply moved a multiplication around relative to a division with remainder, and basic algebra tells us that there isn’t anything wrong with doing that.

However, there’s one little complication: computers don’t really do “basic algebra”. In math, you typically assume unlimited space for variables and intermediate values, but in computer-land this isn’t really the case. Computers approximate the set of all integers in a 32-bit (or 64-bit) number space, and as a result there is a cap on how large (or small) an integer you can represent natively, at least without going to a lot of extra work to support truly arbitrarily large integers (as is often done in public key cryptography implementations).

Taking a closer look at this case, there is a problem; the optimizations done by gcc cause some of the intermediate values of this calculation to grow to be very large. While the end result might be equivalent in pure math, when dealing with computers, the rules change a bit due to the fact that we are dealing with an integer with a maximum size of 64 bits. Specifically, this ends up being a problem because the gcc-optimized version of mdate multiplies the raw value of “counter” by 1000000 (as opposed to multiplying the result of the first division by 1000000). Presumably, gcc has performed this optimization as multiply is fairly cheap as far as computers go (and division is fairly expensive in comparison).

Now, while one might naively assume that the original version of mdate and the instructions emitted by gcc are equivalent, with the above information in mind, it’s clear that this isn’t really the case for the entire range of values that might be returned by QueryPerformanceCounter. Specifically, if the counter value multiplied by 1000000 exceeds the range of a 64-bit integer, then the two versions of mdate will not return the same value, as in the second version, one of the intermediate values of this calculation will “wrap around” (and in fact, to make matters worse, mdate is dealing with signed 64-bit values here, which limits the size of an integer to 63 significant bits, with one bit reserved for the representation of the integer’s sign).
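A quick back-of-the-envelope check (my own, purely illustrative) shows just how little headroom the optimized form has: with the two magic frequencies mdate accepts, the intermediate product counter * 1000000 exceeds 63 bits after roughly 89 days of uptime at 1.193182 MHz, or roughly 30 days at 3.579545 MHz, which lines up nicely with the high-uptime symptom.

#include <stdio.h>

int main(void)
{
    /* Largest counter value whose product with 1000000 still fits in a
       signed 64-bit integer. */
    const long long limit   = 0x7FFFFFFFFFFFFFFFLL / 1000000;

    /* The two "magic" performance counter frequencies mdate accepts. */
    const long long freqs[] = { 1193182LL, 3579545LL };
    int i;

    for (i = 0; i < 2; i++)
    {
        double days = (double)limit / (double)freqs[i] / 86400.0;

        printf("freq %lld Hz: intermediate overflows after ~%.1f days\n",
               freqs[i], days);
    }

    return 0;
}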

This can be experimentally confirmed in the debugger, as I previously alluded to. Stepping through mdate, around the call to lldiv specifically, we can see that the intermediate value has exceeded the limits of a 63-bit integer with sign bit:

0:007>
eax=d5ebcb40 ebx=00369e99 ecx=6f602e40
edx=00369e99 esi=d5f158ce edi=00000000
eip=62e20ae6 esp=01d0fbfc ebp=00b73068
iopl=0         ov up ei ng nz na pe cy
cs=001b  ss=0023  ds=0023  es=0023
fs=003b  gs=0000  efl=00000a83
libvlc!mdate+0x126:
62e20ae6 e815983e00       call    lldiv
0:007> dd @esp
01d0fbfc  6f602e40 d5f158ce 00369e99 00000000
01d0fc0c  00000060 01d0ffa8 77e6b7d0 77e6bb00
01d0fc1c  ffffffff 77e6bafd 5d29c231 00000e05
01d0fc2c  00000000 00b73068 00000000 62e2a544
01d0fc3c  00000f08 ffffffff 01d0fd50 00b72fd8
01d0fc4c  03acc670 00a49530 01d0fd50 6b941bd2
01d0fc5c  00a49530 00000f30 00000000 00000003
01d0fc6c  00b24130 00b53e20 0000000f 00b244a8
0:007> p
eax=e1083396 ebx=00369e99 ecx=ffffffff
edx=ffffff3a esi=d5f158ce edi=00000000
eip=62e20aeb esp=01d0fbfc ebp=00b73068
iopl=0         nv up ei pl nz na po nc
cs=001b  ss=0023  ds=0023  es=0023
fs=003b  gs=0000  efl=00000206
libvlc!mdate+0x12b:
62e20aeb 8b5c2430         mov     ebx,[esp+0x30]

Using our knowledge of calling conventions, it’s easy to retrieve the arguments to lldiv from the stack: tmp.QuadPart is 0xd5f158ce6f602e40, and freq is 0x0000000000369e99.

It’s clear that the intermediate value (tmp.QuadPart) has overflowed here; considering the sign bit is set, it now holds a (very large) negative quantity. Since the remainder of the function does nothing that would influence the sign bit of the result, after the division we get another (large, but closer to zero) negative value back (stored in edx:eax, the value 0xffffff3ae1083396). This is the final return value of mdate, which explains the problems I was experiencing with playing video streams; large negative values were being returned and causing sign-sensitive inequality tests on the return value of mdate (or a derivative thereof) to operate unexpectedly.

In this case, it turned out that VLC failing to play my video stream wasn’t really the fault of VLC; it ended up being a bug in gcc’s optimizer that caused it to make unsafe optimizations that introduce calculation errors. What’s particularly insidious about this mis-optimization is that it is invisible until the input values for the operations involved grow to a certain size, after which calculation results are wildly off. This explains why nobody else has run into this problem in VLC enough to get it fixed by now; unless you run VLC on Windows systems with a high uptime, where VLC is convinced that it can use the performance counter for timing, you would never know that the compiler had introduced a subtle (but serious) bug due to optimizations.

As to fixing this problem, there are a couple of approaches the VLC team could take. The first is to update to a more recent version of gcc (assuming newer gcc versions fix this problem; I don’t have a build environment that would let me compile all of VLC, and I haven’t had much luck in generating a minimal repro for this problem, unfortunately). Alternatively, the function could be rewritten until gcc’s optimizer decides to stop trying to optimize the division (and thus stops introducing calculation errors).

A better solution would be to just drop the usage of QueryPerformanceCounter entirely, though. For VLC’s purposes, GetTickCount should be close enough timing-wise, and you can even increase its resolution to around 1ms (with good hardware) using timeBeginPeriod. GetTickCount does have the infamous 49.7-day wraparound problem, but VLC already has a workaround that works. Furthermore, on Windows Vista and later, GetTickCount64 could be used, turning the 49.7-day limit into a virtual non-issue (at least in our lifetimes, anyway).
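To illustrate, a minimal sketch of what a GetTickCount-based mdate might look like (this is my own sketch, not VLC’s actual implementation, which keeps its own wraparound bookkeeping; it is also not thread safe as written):

#include <windows.h>

typedef __int64 mtime_t;

/*
 * Return a monotonically increasing microsecond count based on
 * GetTickCount, manually extended to 64 bits so that the 49.7-day
 * wraparound of the 32-bit tick count is survived.
 */
mtime_t mdate_tickcount(void)
{
    static DWORD   LastTicks = 0;
    static mtime_t WrapBase  = 0;   /* accumulated wraparounds, in ms */

    DWORD Ticks = GetTickCount();

    if (Ticks < LastTicks)          /* the 32-bit tick count wrapped */
        WrapBase += 0x100000000LL;

    LastTicks = Ticks;

    return (WrapBase + Ticks) * 1000; /* milliseconds -> microseconds */
}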

(Oh, and in case you’re wondering why I didn’t just fix this myself and submit a patch to VLC (after all, it’s open source, so why can’t I just “fix it myself”?): VLC’s source code distribution is ~100mb uncompressed, and I don’t really want to go spending a great deal of time to find a cygwin version that works correctly on Vista x64 with ASLR and NX enabled (cygwin’s fault, not Vista’s) so that I can get a build environment for VLC up and running so that I could test any potential fix I make (after debugging the inevitable build environment difficulties along the way for such a large project). I might still do this at some point, perhaps to see if recent gcc versions fix this optimizer bug, though.)

For now, I just patched my running VLC instance in-memory to use GetTickCount instead, using the debugger. Until I restart VLC, that will have to do.

Debugger internals: How loaded module names are communicated to the debugger

Monday, December 11th, 2006

If you’ve ever used the Win32 debugging API, you’ll notice that the WaitForDebugEvent routine, when returning a LOAD_DLL_DEBUG_EVENT style of event, gives you the address of an optional debuggee-relative string pointer containing the name of the DLL that is being loaded. In case you’ve ever wondered just where that string comes from, you’ll be comforted to know that this mechanism for communicating module name strings to the remote debugger is built upon a giant hack.
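(For the curious, retrieving that string on the debugger side looks roughly like the following sketch; the helper name is mine, error handling is minimal, and for brevity it assumes the name is Unicode rather than checking fUnicode. lpImageName, when non-NULL, is a debuggee-side address that holds the actual string pointer, which may itself be NULL.)

#include <windows.h>

/*
 * Read the DLL name reported with a LOAD_DLL_DEBUG_EVENT. Returns TRUE
 * and fills Buffer on success. A real implementation would honor
 * Info->fUnicode and read the string in small chunks.
 */
BOOL ReadLoadedDllName(HANDLE Process,
                       const LOAD_DLL_DEBUG_INFO *Info,
                       WCHAR *Buffer,
                       SIZE_T MaxChars)
{
    PVOID StringPointer = NULL;

    if (Info->lpImageName == NULL)
        return FALSE;

    /* lpImageName points into the debuggee at a location that holds the
       actual string pointer (in practice, the borrowed
       ArbitraryUserPointer slot described below). */
    if (!ReadProcessMemory(Process, Info->lpImageName, &StringPointer,
                           sizeof(StringPointer), NULL))
        return FALSE;

    if (StringPointer == NULL)
        return FALSE;

    /* The string itself also lives in the debuggee's address space. */
    if (!ReadProcessMemory(Process, StringPointer, Buffer,
                           (MaxChars - 1) * sizeof(WCHAR), NULL))
        return FALSE;

    Buffer[MaxChars - 1] = L'\0';
    return TRUE;
}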

To give a bit of background information on how loading of DLLs works, most of the heavy lifting with respect to loading DLLs (referred to as “mapping an image”) is done by the memory manager subsystem in kernel mode – specifically, in the “MiMapViewOfImageSection” internal routine. This routine is responsible for taking a section object (known as a file mapping object in the Win32 world) that represents a PE image on disk, and setting up the in-memory layout of the PE image in the specified process address space (in the case of Win32, always the address space of the caller). This includes setting up PE image subsections with the correct alignment, zero-filling “bss”-style sections, and setting up the protections of each PE image subsection. It is also responsible for supplying the “magic” necessary to allow shared PE subsections to work. All of this behavior is controlled by the SEC_IMAGE flag being passed to NtMapViewOfSection (this behavior is visible to Win32 via passing SEC_IMAGE to CreateFileMapping, and can be used to achieve the same result of “just” mapping an image in-memory without going through the loader). Internally, the loader routine in NTDLL (LdrLoadDll and its associated subfunctions, which are called by the LoadLibrary family of routines in kernel32) utilizes NtMapViewOfSection to create the in-memory layout of the DLL being requested. After performing this task, the user-mode NTDLL-based loader then performs tasks such as applying base relocations, resolving imports to other modules (and loading dependent modules if necessary), allocating TLS data slots, making DLL initializer callouts, and so forth.
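As the parenthetical above notes, this image-mapping support can be exercised directly from Win32 without involving the loader at all; here is a small sketch (error handling kept minimal, and the DLL path is just an example) that maps a DLL as an image, so that no imports are resolved and no DllMain is ever called:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Execute access is requested because the section will be mapped
       with image (executable) protections. */
    HANDLE File = CreateFileW(L"C:\\Windows\\System32\\user32.dll",
                              GENERIC_READ | GENERIC_EXECUTE,
                              FILE_SHARE_READ, NULL, OPEN_EXISTING,
                              0, NULL);

    if (File == INVALID_HANDLE_VALUE)
        return 1;

    /* SEC_IMAGE asks the memory manager to lay the file out according to
       its PE headers (subsection alignment, protections, shared
       subsections) rather than as a flat data mapping. */
    HANDLE Section = CreateFileMappingW(File, NULL,
                                        PAGE_READONLY | SEC_IMAGE,
                                        0, 0, NULL);

    if (Section != NULL)
    {
        PVOID Base = MapViewOfFile(Section, FILE_MAP_READ, 0, 0, 0);

        printf("Image mapped at %p (no imports resolved, no DllMain)\n",
               Base);

        if (Base != NULL)
            UnmapViewOfFile(Base);

        CloseHandle(Section);
    }

    CloseHandle(File);
    return 0;
}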

Now, the way that the debugger is notified of module load events is via a kernel mode hook that is called by NtMapViewOfSection (DbgkMapViewOfSection). This hook is responsible for detecting if a debugger (user mode or kernel mode) is present, and if so, forwarding the event to the debugger.

This is all well and good, but there’s a catch here. Both the user mode and kernel mode debuggers display the full path name to the DLL being loaded, but we’re now at the wrong level of abstraction, so to speak, to retrieve this information. All MiMapViewOfSection has is a handle to a section object (in actuality, a PSECTION_OBJECT and not a handle at this point). Now, the section object *does* have a reference to the PFILE_OBJECT associated with the file backing the section object (the reference is stored in the CONTROL_AREA of the section object), but there isn’t necessarily a good way to get the original filename that was passed to LoadLibrary out of the FILE_OBJECT (for starters, at this point, that path has already been converted to a native path instead of a Win32 path, and there is some potential ambiguity when trying to convert native paths back to Win32 paths).

To work around this little conundrum, the solution the developers chose is to temporarily borrow a field of the NT_TIB portion of the TEB of the calling thread for use as a way to signal the name of a DLL that is being loaded (if SEC_IMAGE is being passed to NtMapViewOfSection). Specifically, NT_TIB.ArbitraryUserPointer is temporarily replaced with a string pointer (in Windows NT, this is always a unicode string) to the original filename passed to LdrLoadDll. Normally, the ArbitraryUserPointer field is reserved exclusively for use by user mode as a sort of “free TLS slot” that is available at a known location for every thread. Although this particular value is rarely used in Windows, the loader does make the effort to preserve its value across calls to LdrLoadDll. This works (since the loader knows that none of the code that it is calling will use NT_TIB.ArbitraryUserPointer), so long as you don’t have cross-thread accesses to a different thread’s NT_TIB.ArbitraryUserPointer (to date, I have never seen a program that tries to do this – and a good thing too, or it would randomly fail when DLLs are being loaded). Because the original value of NT_TIB.ArbitraryUserPointer is restored, the calling thread is typically none-the-wiser that this substitution has been performed.

Disassembling the part of the NTDLL loader responsible for mapping the DLL into the address space via NtMapViewOfSection (a subroutine named “LdrpMapViewOfDllSection” on Windows Vista), we can see this behavior in action:

ntdll!LdrpMapViewOfDllSection:
[...]
;
; Find the TEB address for the current thread.
; esi = NtCurrentTeb()->NtTib.Self
;
77f0e2ee 648b3518000000  mov     esi,dword ptr fs:[18h]
77f0e2f5 8365fc00        and     dword ptr [ebp-4],0
77f0e2f9 57              push    edi
77f0e2fa bf00000020      mov     edi,20000000h
77f0e2ff 857d18          test    dword ptr [ebp+18h],edi
77f0e302 c745f804000000  mov     dword ptr [ebp-8],4
77f0e309 0f85ce700400    jne     LdrpMapViewOfDllSection+0x26

ntdll!LdrpMapViewOfDllSection+0x42:
77f0e30f 8b4514          mov     eax,dword ptr [ebp+14h]
;
; Save away the previous ArbitraryUserPointer value.
;
; ebx = Teb->NtTib.ArbitraryUserPointer
77f0e312 8b5e14          mov     ebx,dword ptr [esi+14h]
77f0e315 6a04            push    4
77f0e317 ff7518          push    dword ptr [ebp+18h]
;
; Set the ArbitraryUserPointer value to the string pointer
; referring to the DLL name passed to LdrLoadDll.
; Teb->NtTib.ArbitraryUserPointer = (PVOID)DllNameString;
; 
77f0e31a 894614          mov     dword ptr [esi+14h],eax
77f0e31d 6a01            push    1
77f0e31f ff7510          push    dword ptr [ebp+10h]
77f0e322 33c0            xor     eax,eax
77f0e324 50              push    eax
77f0e325 50              push    eax
77f0e326 50              push    eax
77f0e327 ff750c          push    dword ptr [ebp+0Ch]
77f0e32a 6aff            push    0FFFFFFFFh
77f0e32c ff7508          push    dword ptr [ebp+8]
;
; Call NtMapViewOfSection to map the image and perform the
; debugger notification.
;
77f0e32f e830180300      call    NtMapViewOfSection
77f0e334 857d18          test    dword ptr [ebp+18h],edi
77f0e337 5f              pop     edi
;
; Restore the previous value of
; Teb->NtTib.ArbitraryUserPointer.
;
77f0e338 895e14          mov     dword ptr [esi+14h],ebx
77f0e33b 5e              pop     esi
77f0e33c 894514          mov     dword ptr [ebp+14h],eax
77f0e33f 5b              pop     ebx
77f0e340 0f85bc700400    jne     LdrpMapViewOfDllSection+0x75

Sure enough, the user mode loader uses the current thread’s NT_TIB.ArbitraryUserPointer to communicate the DLL name string pointer (in this context, the “eax” value loaded into NT_TIB.ArbitraryUserPointer is the DLL name string pointer). We can easily verify this in the debugger:

Breakpoint 0 hit
eax=0017ecfc ebx=00000000 ecx=0017ecd8
edx=774951b4 esi=c0000135 edi=0017ed80
eip=773fe2e5 esp=0017ec10 ebp=0017ed18
iopl=0         nv up ei pl zr na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b
gs=0000             efl=00000246
ntdll!LdrpMapViewOfDllSection:
773fe2e5 8bff            mov     edi,edi
0:000> g 773fe31a 
eax=001db560 ebx=00000000 ecx=0017ecd8
edx=774951b4 esi=7ffdf000 edi=20000000
eip=773fe31a esp=0017ebf0 ebp=0017ec0c
iopl=0         nv up ei pl zr na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b
gs=0000             efl=00000246
ntdll!LdrpMapViewOfDllSection+0x4d:
773fe31a 894614          mov     dword ptr [esi+14h],eax
0:000> du @eax
001db560  "C:\Windows\system32\CLBCatQ.DLL"

Looking in the kernel, we can clearly see the call to DbgkMapViewOfSection:

ntoskrnl!NtMapViewOfSection+0x21a:
0060a9b6 50              push    eax
0060a9b7 8b55e0          mov     edx,dword ptr [ebp-20h]
0060a9ba 8b4dd8          mov     ecx,dword ptr [ebp-28h]
0060a9bd e86e1c0100      call    ntoskrnl!DbgkMapViewOfSection

Additionally, we can see the references to NT_TIB in DbgkMapViewOfSection:

ntoskrnl!DbgkMapViewOfSection+0x65:
;
; Load eax with the address of the current thread's
; KTHREAD object.
;
; Here, fs refers to the KPCR.
;    +0x120 PrcbData         : _KPRCB
;  (in KPRCB)
;    +0x004 CurrentThread    : Ptr32 _KTHREAD
;
0061c695 64a124010000    mov     eax,dword ptr fs:[00000124h]
;
; Load esi with the address of the current thread's
; user mode PTEB.
;
; Here, we have the following layout in KTHREAD:
;    +0x084 Teb              : Ptr32 Void
;
0061c69b 8bb084000000    mov     esi,dword ptr [eax+84h]
0061c6a1 eb02            jmp     DbgkMapViewOfSection+0x75
ntoskrnl!DbgkMapViewOfSection+0x75:
0061c6a5 3bf3            cmp     esi,ebx
0061c6a7 7421            je      DbgkMapViewOfSection+0x9a
0061c6a9 3b8a44010000    cmp     ecx,dword ptr [edx+144h]
0061c6af 7519            jne     DbgkMapViewOfSection+0x9a
0061c6b1 56              push    esi
0061c6b2 e82c060200      call    DbgkpSuppressDbgMsg
0061c6b7 85c0            test    eax,eax
0061c6b9 0f85bf000000    jne     DbgkMapViewOfSection+0x144
0:000> u
ntoskrnl!DbgkMapViewOfSection+0x8f:
;
; Recall that 14 is the offset of the
; ArbitraryUserPointer member in NT_TIB,
; and that NT_TIB is the first member of TEB.
;
;    +0x000 NtTib            : _NT_TIB
;  (in NT_TIB)
;    +0x014 ArbitraryUserPointer : Ptr32 Void
;
0061c6bf 83c614          add     esi,14h
;
; [ebp-90h] is now the current thread's value of
; NtCurrentTeb()->NtTib.ArbitraryUserPointer
;
0061c6c2 89b570ffffff    mov     dword ptr [ebp-90h],esi

Thus is the story of how the filename that you pass to LoadLibrary ends up being communicated to the debugger, in a rather roundabout and hackish way.

It is also worth noting that the kernel cannot trust the user mode supplied filename for use with opening the file handle to the DLL that is passed to the debugger process, because the kernel opens that handle with ZwOpenFile, which bypasses normal security checks. As a result, the kernel needs to retrieve the filename by querying the section’s associated PFILE_OBJECT anyway, although for different purposes than providing the filename to the debugger.

An introduction to kernrate (the Windows kernel profiler)

Thursday, December 7th, 2006

One useful utility for tracking down performance problems that you might not have heard of until now is kernrate, the Windows kernel profiler. This utility currently ships with the Windows Server 2003 Resource Kit Tools package (though you can use kernrate on Windows XP as well) and is freely downloadable. Currently, you’ll have to match the version of kernrate to your processor architecture, so if you are running your processor in x64 mode with an x64 Windows edition, then you’ll have to dig up an x64 version of kernrate (the one that ships with the Srv03 resource kit tools is x86); KrView (see below) ships with an x64 compatible version of kernrate.

Kernrate requires that you have the SeProfilePrivilege assigned (which is typically only granted to administrators), so in most cases you will need to be a local administrator on your system in order to use it. This privilege allows access to the (undocumented) profile object system services. These APIs allow programmatic access to sample the instruction pointer at certain intervals (typically, a profiler program selects the timer interrupt for use with instruction pointer sampling). This allows you to get a feel for what the system is doing over time, which is in turn useful for identifying the cause of performance issues where a particular operation appears to be processor bound and taking longer than you would like.

There are a multitude of options that you can give kernrate (and you are probably best served by experimenting with them a bit on your own), so I’ll just cover the common ones that you’ll need to get started (use “kernrate -?” to get a list of all supported options).

Kernrate can be used to profile both user mode and kernel mode performance issues. By default, it operates only on kernel mode code, but you can override this via the -a (and -av) options, which cause kernrate to include user mode code in its profiling operations in addition to kernel mode code. Additionally, by default, kernrate operates over the entire system at once; to get meaningful results when profiling user mode code, you’ll want to specify a process (or group of processes) to profile, with the “-p pid” and/or “-n process-name” arguments. (The process name is the first 8 characters of the process’s main executable filename.)

To terminate collection of profiling data, use Ctrl-C. (On pre-Windows-Vista systems where you might be running kernrate.exe via runas, remember that Ctrl-C does not work on console processes started via runas.) Additionally, you can use the “-s seconds” argument to specify that profiling should be automagically stopped after a given count of seconds have elapsed.
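Putting the common options together, a typical user mode profiling run might look something like the following (the pid and duration are just examples):

kernrate -a -p 1234 -s 30

This samples both kernel mode and user mode code in process 1234 for 30 seconds and then prints the collected totals.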

If you run kernrate on kernel mode code only, or just specify a process (or group of processes) as described above, you’ll notice that you get a whole lot of general system-wide output (information about interrupt counts, global processor time usage, context switch counts, I/O operation counts) in addition to output about which modules used a noteworthy amount of processor time. Here’s an example output of running kernrate on just the kernel on my system, as described above (including just the module totals):

D:\Programs\Utilities>kernrate
Kernrate User-Specified Command Line:
kernrate


Kernel Profile (PID = 0): Source= Time,
Using Kernrate Default Rate of 25000 events/hit
Starting to collect profile data

***> Press ctrl-c to finish collecting profile data
===> Finished Collecting Data, Starting to Process Results

------------Overall Summary:--------------

[...]

OutputResults: KernelModuleCount = 153
Percentage in the following table is based on
the Total Hits for the Kernel

Time   197 hits, 25000 events per hit --------
 Module    Hits   msec  %Total  Events/Sec
intelppm     67        980    34 %     1709183
ntkrnlpa     52        981    26 %     1325178
win32k       35        981    17 %      891946
hal          19        981     9 %      484199
dxgkrnl       6        980     3 %      153061
nvlddmkm      6        980     3 %      153061
fanio         3        981     1 %       76452
bcm4sbxp      2        981     1 %       50968
portcls       2        980     1 %       51020
STAC97        2        980     1 %       51020
bthport       1        981     0 %       25484
BTHUSB        1        981     0 %       25484
Ntfs          1        980     0 %       25510

Using kernrate in this fashion is a good first step towards profiling a performance problem (especially if you are working with someone else’s program), as it quickly allows you to narrow down a processor hog to a particular module. While this is useful as a first step, however, it doesn’t really give you a whole lot of information about what specific code in a particular module is taking up a lot of processor time.

To dig in deeper as to the cause of the problem (beyond just tracing it to a particular module), you’ll need to use the “-z module-name” option. This option tells kernrate to “zoom in” on a particular module; that is, for the given module, kernrate will resolve instruction pointer samples within the module down to individual functions. This level of granularity is often what you’ll need for tracking down a performance issue (at least as far as profiling is concerned). You can repeat the “-z” option multiple times to “zoom in” on multiple modules (useful if the problem you are tracking down involves high processor usage across multiple DLLs or binaries).

Because kernrate is resolving instruction pointer samples down to a more granular level than modules (with the “-z” option), you’ll need to tell it how to load symbols for all affected modules (otherwise, the granularity of the profiler output will typically be very poor, often restricted to just exported functions). There are two ways to do this. First, you can use the “-j symbol-path” command line option; this option tells kernrate to pass a particular symbol path to DbgHelp for use with loading symbols. I recommend the second option, however, which is to configure your _NT_SYMBOL_PATH beforehand so that it points to a valid DbgHelp symbol path. This relieves you of having to manually tell kernrate a symbol path every time you execute it.
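For example, a symbol path along the following lines (the local cache directory is arbitrary) lets kernrate, and anything else that uses DbgHelp, resolve symbols against the Microsoft symbol server with a local downstream cache:

set _NT_SYMBOL_PATH=srv*C:\SymbolCache*http://msdl.microsoft.com/download/symbols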

Continuing with the example I gave above, we might be interested in just what the “win32k” module (the Win32 kernel mode support driver for USER/GDI) is doing that was taking up 17% of the processor time spent in kernel mode on my system (for the interval that I was profiling). To do that, we can use the following command line (the output has been truncated to include only the information that we are interested in):

D:\Programs\Utilities>kernrate -z win32k

Kernrate User-Specified Command Line:
kernrate -z win32k


Kernel Profile (PID = 0): Source= Time,
Using Kernrate Default Rate of 25000 events/hit
CallBack: Finished Attempt to Load symbols for
90a00000 \SystemRoot\System32\win32k.sys

Starting to collect profile data

***> Press ctrl-c to finish collecting profile data
===> Finished Collecting Data, Starting to Process Results

------------Overall Summary:--------------

[...]

OutputResults: KernelModuleCount = 153
Percentage in the following table is based on the
Total Hits for the Kernel

Time   2465 hits, 25000 events per hit --------
 Module      Hits   msec  %Total  Events/Sec
ntkrnlpa     1273      14799    51 %     2150483
win32k        388      14799    15 %      655449
intelppm      263      14799    10 %      444286
hal           236      14799     9 %      398675
bcm4sbxp       66      14799     2 %      111494
spsys          55      14799     2 %       92911
nvlddmkm       48      14799     1 %       81086
STAC97         31      14799     1 %       52368

[...]


===> Processing Zoomed Module win32k.sys...


----- Zoomed module win32k.sys (Bucket size = 16 bytes,
Rounding Down) --------
Percentage in the following table is based on the
Total Hits for this Zoom Module

Time   388 hits, 25000 events per hit --------
 Module                  Hits   msec  %Total  Events/Sec
xxxInternalDoPaint         44      14799    10 %       74329
XDCOBJ::bSaveAttributes    20      14799     4 %       33786
DelayedDestroyCacheDC      20      14799     4 %       33786
HANDLELOCK::vLockHandle    15      14799     3 %       25339
mmxAlphaPerPixelOnly       15      14799     3 %       25339
XDCOBJ::RestoreAttributes  13      14799     2 %       21960
DoTimer                    12      14799     2 %       20271
_SEH_prolog4               11      14799     2 %       18582
memmove                     9      14799     2 %       15203
_GetDCEx                    6      14799     1 %       10135
HmgLockEx                   6      14799     1 %       10135
XDCOBJ::bCleanDC            5      14799     1 %        8446
XEPALOBJ::ulIndexToRGB      5      14799     1 %        8446
HmgShareCheckLock           4      14799     0 %        6757
RGNOBJ::bMerge              4      14799     0 %        6757

[...]

This should give you a feel for the kind of information that you’ll get from kernrate. Although the examples I gave were profiling kernel mode code, the whole process works the same way for user mode if you use the “-p” or “-n” options as I mentioned earlier. In conjunction with a debugger, the information that kernrate gives you can often be a great help in narrowing down CPU usage performance problems (or at the very least point you in the general direction as to where you’ll need to do further research).

There are also a variety of other options that are available in kernrate, such as features for gathering information about “hot” locks that have a high degree of contention, and support for launching new processes under the profiler. There is also support for outputting the raw sampled profile data, which can be used to graph the output (such as you might see used with tools like KrView).

Although kernrate doesn’t have all the “bells and whistles” of some of the high-end profiling tools (like Intel’s VTune), it’s often enough to get the job done, and it’s also available to you at no extra cost (and can be quickly and easily deployed to help find the source of a problem). I’d highly recommend giving it a shot if you are trying to analyze a performance problem and don’t already have a profiling solution that you are using.

Frame pointer omission (FPO) optimization and consequences when debugging, part 2

Wednesday, December 6th, 2006

This series is about frame pointer omission (FPO) optimization and how it impacts the debugging experience.

  1. Frame pointer omission (FPO) and consequences when debugging, part 1.
  2. Frame pointer omission (FPO) and consequences when debugging, part 2.

Last time, I outlined the basics as to just what FPO does, and what it means in terms of generated code when you compile programs with or without FPO enabled. This article builds on the last, and lays out just what the impacts of having FPO enabled (or disabled) are when you end up having to debug a program.

For the purposes of this article, consider the following example program with several do-nothing functions that shuffle stack arguments around and call each other. (For the purposes of this posting, I have disabled global optimizations and function inlining.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

__declspec(noinline)
void
f3(
   int* c,
   char* b,
   int a
   )
{
   *c = a * 3 + (int)strlen(b);

   __debugbreak();
}

__declspec(noinline)
int
f2(
   char* b,
   int a
   )
{
   int c;

   f3(
      &c,
      b + 1,
      a - 3);

   return c;
}

__declspec(noinline)
int
f1(
   int a,
   char* b
   )
{
   int c;
   
   c = f2(
      b,
      a + 10);

   c ^= (int)rand();

   return c + 2 * a;
}

int
__cdecl
wmain(
   int ac,
   wchar_t** av
   )
{
   int c;

   c = f1(
      (int)rand(),
      "test");

   printf("%d\\n",
      c);

   return 0;
}

If we run the program and break in to the debugger at the hardcoded breakpoint, with symbols loaded, everything is as one might expect:

0:000> k
ChildEBP RetAddr  
0012ff3c 010015ef TestApp!f3+0x19
0012ff4c 010015fe TestApp!f2+0x15
0012ff54 0100161b TestApp!f1+0x9
0012ff5c 01001896 TestApp!wmain+0xe
0012ffa0 77573833 TestApp!__tmainCRTStartup+0x10f
0012ffac 7740a9bd kernel32!BaseThreadInitThunk+0xe
0012ffec 00000000 ntdll!_RtlUserThreadStart+0x23

Regardless of whether FPO optimization is turned on or off, since we have symbols loaded, we’ll get a reasonable call stack either way. The story is different, however, if we do not have symbols loaded. Looking at the same program, with FPO optimizations enabled and symbols not loaded, we get somewhat of a mess if we ask for a call stack:

0:000> k
ChildEBP RetAddr  
WARNING: Stack unwind information not available.
Following frames may be wrong.
0012ff4c 010015fe TestApp+0x15d8
0012ffa0 77573833 TestApp+0x15fe
0012ffac 7740a9bd kernel32!BaseThreadInitThunk+0xe
0012ffec 00000000 ntdll!_RtlUserThreadStart+0x23

Comparing the two call stacks, we lost three of the call frames entirely in the output. The only reason we got anything slightly reasonable at all is that WinDbg’s stack trace mechanism has some intelligent heuristics to guess the location of call frames in a stack where frame pointers are used.

If we look back to how call stacks are setup with frame pointers (from the previous article), the way a program trying to walk the stack on x86 without symbols works is by treating the stack as a sort of linked list of call frames. Recall that I mentioned the layout of the stack when a frame pointer is used:

[ebp-01]   Last byte of the last local variable
[ebp+00]   Old ebp value
[ebp+04]   Return address
[ebp+08]   First argument...

This means that if we are trying to perform a stack walk without symbols, the way to go is to assume that ebp points to a “structure” that looks something like this:

typedef struct _CALL_FRAME
{
   struct _CALL_FRAME* Next;
   void*               ReturnAddress;
} CALL_FRAME, * PCALL_FRAME;

Note how this corresponds to the stack layout relative to ebp that I described above.

A very simple stack walk function designed to walk frames that are compiled with frame pointer usage might then look like so (using the _AddressOfReturnAddress intrinsic to find “ebp”, assuming that the old ebp is 4 bytes before the address of the return address):

#include <stdio.h>
#include <windows.h>
#include <intrin.h>

LONG
StackwalkExceptionHandler(
   PEXCEPTION_POINTERS ExceptionPointers
   )
{
   if (ExceptionPointers->ExceptionRecord->ExceptionCode
      == EXCEPTION_ACCESS_VIOLATION)
      return EXCEPTION_EXECUTE_HANDLER;

   return EXCEPTION_CONTINUE_SEARCH;
}

void
stackwalk(
   void* ebp
   )
{
   PCALL_FRAME frame = (PCALL_FRAME)ebp;

   printf("Trying ebp %p\\n",
      ebp);

   __try
   {
      for (unsigned i = 0;
          i < 100;
          i++)
      {
         if ((ULONG_PTR)frame & 0x3)
         {
            printf("Misaligned frame\\n");
            break;
         }

         printf("#%02lu %p  [@ %p]\\n",
            i,
            frame,
            frame->ReturnAddress);

         frame = frame->Next;
      }
   }
   __except(StackwalkExceptionHandler(
      GetExceptionInformation()))
   {
      printf("Caught exception\\n");
   }
}

#pragma optimize("y", off)
__declspec(noinline)
void printstack(
   )
{
   void* ebp = (ULONG*)_AddressOfReturnAddress()
     - 1;

   stackwalk(
      ebp);
}
#pragma optimize("", on)

If we recompile the program, disable FPO optimizations, and insert a call to printstack inside the f3 function, the console output is something like so:

Trying ebp 0012FEB0
#00 0012FEB0  [@ 0100185C]
#01 0012FED0  [@ 010018B4]
#02 0012FEF8  [@ 0100190B]
#03 0012FF2C  [@ 01001965]
#04 0012FF5C  [@ 01001E5D]
#05 0012FFA0  [@ 77573833]
#06 0012FFAC  [@ 7740A9BD]
#07 0012FFEC  [@ 00000000]
Caught exception

In other words, without using any symbols, we have successfully performed a stack walk on x86.

However, this all breaks down when a function somewhere in the call stack does not use a frame pointer (i.e. was compiled with FPO optimizations enabled). In this case, the assumption that ebp always points to a CALL_FRAME structure is no longer valid, and the call stack is either cut short or is completely wrong (especially if the function in question repurposed ebp for some other use besides a frame pointer). Although it is possible to use heuristics to try and guess what is really a call/return address record on the stack, this is really nothing more than an educated guess, and it tends to be at least slightly wrong (and typically missing one or more frames entirely).

Now, you might be wondering why you might care about doing stack walk operations without symbols. After all, you have symbols for the Microsoft binaries that your program will be calling (such as kernel32) available from the Microsoft symbol server, and you (presumably) have private symbols corresponding to your own program for use when you are debugging a problem.

Well, the answer to that is that you will end up needing to record stack traces without symbols in the course of normal debugging for a wide variety of problems. The reason for this is that there is a lot of support baked into NTDLL (and NTOSKRNL) to assist in debugging a class of particularly insidious problems: handle leaks (and other problems where the wrong handle value is getting closed somewhere and you need to find out why), memory leaks, and heap corruption.

These (very useful!) debugging features offer options that allow you to configure the system to log a stack trace on each heap allocation, heap free, or each time a handle is opened or closed. Now the way these features work is that they will capture the stack trace in real time as the heap operation or handle operation happens, but instead of trying to break into the debugger to display the results of this output (which is undesirable for a number of reasons), they save a copy of the current stack trace in-memory and then continue execution normally. To display these saved stack traces, the !htrace, !heap -p, and !avrf commands have functionality that locates these saved traces in-memory and prints them out to the debugger for you to inspect.
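As a concrete example, the user mode stack trace database (and, if desired, page heap) can be switched on per-image with gflags from Debugging Tools for Windows, and the saved traces examined later from the debugger; the image name, heap address, and handle value below are just placeholders:

gflags.exe /i MyApp.exe +ust +hpa

0:000> !heap -p -a 0x02341000
0:000> !htrace 0x7f4

(Handle tracing itself is typically enabled from the debugger with “!htrace -enable” before reproducing the problem.)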

However, NTDLL/NTOSKRNL needs a way to create these stack traces in the first place, so that it can save them for later inspection. There are a couple of requirements here:

  1. The functionality to capture stack traces must not rely on anything layered above NTDLL or NTOSKRNL. This already means that anything as complicated as downloading and loading symbols via DbgHelp is instantly out of the picture, as those functions are layered far above NTDLL / NTOSKRNL (and indeed, they must make calls into the same functions that would be logging stack traces in the first place in order to find symbols).
  2. The functionality must work when symbols for everything on the call stack are not even available to the local machine. For instance, these pieces of functionality must be deployable on a customer computer without giving that computer access to your private symbols in some fashion. As a result, even if there was a good way to locate symbols where the stack trace is being captured (which there isn’t), you couldn’t even find the symbols if you wanted to.
  3. The functionality must work in kernel mode (for saving handle traces), as handle tracing is partially managed by the kernel itself and not just NTDLL.
  4. The functionality must use a minimum amount of memory to store each stack trace, as operations like heap allocation, heap deallocation, handle creation, and handle closure are extremely frequent operations throughout the lifetime of the process. As a result, options like just saving the entire thread stack for later inspection when symbols are available cannot be used, since that would be prohibitively expensive in terms of memory usage for each saved stack trace.

Given all of these restrictions, the code responsible for saving stack traces needs to operate without symbols, and it must furthermore be able to save stack traces in a very concise manner (without using a great deal of memory for each trace).

As a result, on x86, the stack trace saving code in NTDLL and NTOSKRNL assumes that all functions in the call frame use frame pointers. This is the only realistic option for saving stack traces on x86 without symbols, as there is insufficient information baked into each individual compiled binary to reliably perform stack traces without assuming the use of a frame pointer at each call site. (The 64-bit platforms that Windows supports solve this problem with the use of extensive unwind metadata, as I have covered in a number of past articles.)
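Incidentally, essentially the same frame-pointer-based capture is available to your own code via RtlCaptureStackBackTrace (CaptureStackBackTrace), declared in recent SDK headers; on older SDKs you may need to locate it in ntdll yourself. A compact trace is then nothing more than an array of return addresses, along the lines of this sketch:

#include <windows.h>
#include <stdio.h>

/*
 * Capture a compact stack trace (just an array of return addresses),
 * much like the traces NTDLL saves for heap and handle logging. On x86
 * this walks the ebp chain, so FPO-compiled frames in the middle of the
 * stack will be missed or will cut the trace short.
 */
void SaveCompactTrace(void)
{
    PVOID  Trace[32];
    ULONG  Hash = 0;
    USHORT Frames;
    USHORT i;

    Frames = RtlCaptureStackBackTrace(1,  /* skip this function's frame */
                                      32, /* maximum frames to capture  */
                                      Trace,
                                      &Hash);

    for (i = 0; i < Frames; i++)
        printf("#%02u %p\n", i, Trace[i]);
}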

So, the functionality exposed by pageheap’s stack trace logging and handle tracing is how stack traces captured without symbols end up mattering to you, the developer with symbols for all of your binaries, when you are trying to debug a problem. If you make sure to disable FPO optimization on all of your code, then you’ll be able to use tools like pageheap’s stack tracing on heap operations, UMDH (the user mode dump heap utility), and handle tracing to track down heap-related and handle-related problems. The best part of these features is that you can even deploy them on a customer site without having to install a full debugger (or run your program under a debugger), only later taking a minidump of your process for examination in the lab. All of them rely on FPO optimizations being disabled (at least on x86), though, so remember to turn FPO optimizations off on your release builds for the increased debuggability of these tough-to-find problems in the field.
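With the Microsoft compiler, disabling FPO is just a matter of adding the /Oy- switch, which overrides the frame pointer omission that /O1, /O2, and /Ox would otherwise enable on x86 (in the Visual Studio project settings, this corresponds to setting “Omit Frame Pointers” to “No” under C/C++ optimization options). For example:

cl /O2 /Oy- /Zi myapp.cpp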