An introduction to kernrate (the Windows kernel profiler)

One useful utility for tracking down performance problems that you might not have heard of until now is kernrate, the Windows kernel profiler. This utility currently ships with the Windows Server 2003 Resource Kit Tools package (though you can use kernate on Windows XP is well) and is freely downloadable. Currently, you’ll have to match the version of kernrate you want to use with your processor architecture, so if you are using your processor in x64 mode with an x64 Windows edition, then you’ll have to dig up an x64 version of kernrate (the one that ships with the Srv03 resource kit tools is x86); KrView (see below) ships with an x64 compatible version of kernrate.

Kernrate requires that you have the SeProfilePrivilege assigned (which is typically only granted to administrators), so in most cases you will need to be a local administrator on your system in order to use it. This privilege allows access to the (undocumented) profile object system services. These APIs allow programmatic access to sample the instruction pointer at certain intervals (typically, a profiler program selects the timer interrupt for use with instruction pointer sampling). This allows you to get a feel for what the system is doing over time, which is in turn useful for identifying the cause of performance issues where a particular operation appears to be processor bound and taking longer than you would like.

There are a multitude of options that you can give kernrate (and you are probably best served by experimenting with them a bit on your own), so I’ll just cover the common ones that you’ll need to get started (use “kernrate -?” to get a list of all supported options).

Kernrate can be used to profile both user mode and kernel mode performance issues. By default, it operates only on kernel mode code, but you can override this via the -a (and -av) options, which cause kernrate to include user mode code in its profiling operations in addition to kernel mode code. Additionally, by default, kernrate operates over the entire system at once; to get meaningful results with profiling user mode code, you’ll want to specify a process (or group of processes) to profile, with the “-p pid” and/or “-n process-name” arguments. (The process name is the first 8 characters of a process’s main executable filename.)

To terminate collection of profiling data, use Ctrl-C. (On pre-Windows-Vista systems where you might be running kernrate.exe via runas, remember that Ctrl-C does not work on console processes started via runas.) Additionally, you can use the “-s seconds” argument to specify that profling should be automagically stopped after a given count of seconds have elapsed.

If you run kernrate on kernel mode code only, or just specify a process (or group of processes) as described above, you’ll notice that you get a whole lot of general system-wide output (information about interrupt counts, global processor time usage, context switch counts, I/O operation counts) in addition to output about which modules used a noteworthy amount of processor time. Here’s an example output of running kernrate on just the kernel on my system, as described above (including just the module totals):

D:\\Programs\\Utilities>kernrate
Kernrate User-Specified Command Line:
kernrate


Kernel Profile (PID = 0): Source= Time,
Using Kernrate Default Rate of 25000 events/hit
Starting to collect profile data

***> Press ctrl-c to finish collecting profile data
===> Finished Collecting Data, Starting to Process Results

------------Overall Summary:--------------

[...]

OutputResults: KernelModuleCount = 153
Percentage in the following table is based on
the Total Hits for the Kernel

Time   197 hits, 25000 events per hit --------
 Module    Hits   msec  %Total  Events/Sec
intelppm     67        980    34 %     1709183
ntkrnlpa     52        981    26 %     1325178
win32k       35        981    17 %      891946
hal          19        981     9 %      484199
dxgkrnl       6        980     3 %      153061
nvlddmkm      6        980     3 %      153061
fanio         3        981     1 %       76452
bcm4sbxp      2        981     1 %       50968
portcls       2        980     1 %       51020
STAC97        2        980     1 %       51020
bthport       1        981     0 %       25484
BTHUSB        1        981     0 %       25484
Ntfs          1        980     0 %       25510

Using kernrate in this fashion is a good first step towards profiling a performance problem (especially if you are working with someone else’s program), as it quickly allows you to narrow down a processor hog to a particular module. While this is useful as a first step, however, it doesn’t really give you a whole lot of information about what specific code in a particular mode is taking a lot of processor time.

To dig in deeper as to the cause of the problem (beyond just tracing it to a particular module), you’ll need to use the “-z module-name” option. This option tells kernrate to “zoom in” on a particular module; that is, for the given module, kernrate will track instruction pointer locations within the module to individual functions. This level of granularity is often what you’ll need for tracking down a performance issue (at least as far as profiling is concerned). You can repeat the “-z” option multiple times to “zoom in” to multiple modules (useful if the problem you are tracking down involves high processor usage across multiple DLLs or binaries).

Because kernrate is resolving instruction pointer sampling down to a more granular level than modules (with the “-z” option), you’ll need to tell it how to load symbols for all affected modules (otherwise, the granularity for profiler output will typically be very poor, often restricted to just exported functions). There are two ways to do this. First, you can use the “-j symbol-path” command line option; this option tells kernrate to pass a particular symbol path to DbgHelp for use with loading symbols. I recommend the second option, however, which is to configure your _NT_SYMBOL_PATH before-hand so that it points to a valid DbgHelp symbol path. This relieves you of having to manually tell kernrate a symbol path every time you execute it.

Continuing with the example I gave above, we might be interested in just what the “win32k” (the Win32 kernel mode support driver for USER/GDI) module is doing that was taking up 17% of the processor time spent in kernel mode on my system (for the interval that I was profiling). To do that, we can use the following command line (the output has been truncated only include information that we are interested in):

D:\\Programs\\Utilities>kernrate -z win32k

Kernrate User-Specified Command Line:
kernrate -z win32k


Kernel Profile (PID = 0): Source= Time,
Using Kernrate Default Rate of 25000 events/hit
CallBack: Finished Attempt to Load symbols for
90a00000 \\SystemRoot\\System32\\win32k.sys

Starting to collect profile data

***> Press ctrl-c to finish collecting profile data
===> Finished Collecting Data, Starting to Process Results

------------Overall Summary:--------------

[...]

OutputResults: KernelModuleCount = 153
Percentage in the following table is based on the
Total Hits for the Kernel

Time   2465 hits, 25000 events per hit --------
 Module      Hits   msec  %Total  Events/Sec
ntkrnlpa     1273      14799    51 %     2150483
win32k        388      14799    15 %      655449
intelppm      263      14799    10 %      444286
hal           236      14799     9 %      398675
bcm4sbxp       66      14799     2 %      111494
spsys          55      14799     2 %       92911
nvlddmkm       48      14799     1 %       81086
STAC97         31      14799     1 %       52368

[...]


===> Processing Zoomed Module win32k.sys...


----- Zoomed module win32k.sys (Bucket size = 16 bytes,
Rounding Down) --------
Percentage in the following table is based on the
Total Hits for this Zoom Module

Time   388 hits, 25000 events per hit --------
 Module                  Hits   msec  %Total  Events/Sec
xxxInternalDoPaint         44      14799    10 %       74329
XDCOBJ::bSaveAttributes    20      14799     4 %       33786
DelayedDestroyCacheDC      20      14799     4 %       33786
HANDLELOCK::vLockHandle    15      14799     3 %       25339
mmxAlphaPerPixelOnly       15      14799     3 %       25339
XDCOBJ::RestoreAttributes  13      14799     2 %       21960
DoTimer                    12      14799     2 %       20271
_SEH_prolog4               11      14799     2 %       18582
memmove                     9      14799     2 %       15203
_GetDCEx                    6      14799     1 %       10135
HmgLockEx                   6      14799     1 %       10135
XDCOBJ::bCleanDC            5      14799     1 %        8446
XEPALOBJ::ulIndexToRGB      5      14799     1 %        8446
HmgShareCheckLock           4      14799     0 %        6757
RGNOBJ::bMerge              4      14799     0 %        6757

[...]

This should give you a feel for the kind of information that you’ll get from kernrate. Although the examples I gave were profiling kernel mode code, the whole process works the same way for user mode if you use the “-p” or “-n” options as I mentioned earlier. In conjunction with a debugger, the information that kernrate gives you can often be a great help in narrowing down CPU usage performance problems (or at the very least point you in the general direction as to where you’ll need to do further research).

There are also a variety of other options that are available in kernrate, such as features for gathering information about “hot” locks that have a high degree of contention, and support for launching new processes under the profiler. There is also support for outputting the raw sampled profile data, which can be used to graph the output (such as you might see used with tools like KrView).

Although kernrate doesn’t have all the “bells and whistles” of some of the high-end profiling tools (like Intel’s vTune), it’s often enough to get the job done, and it’s also available to you at no extra cost (and can be quickly and easily deployed to help find the source of a problem). I’d highly recommend giving it a shot if you are trying to analyze a performance problem and don’t already have a profiling solution that you are using.

6 Responses to “An introduction to kernrate (the Windows kernel profiler)”

  1. […] Previously, I wrote up an introduction to kernrate (the Windows kernel profiler). In that post, I erroneously stated that the KrView distribution includes a version of kernrate that works on x64. Actually, it supports IA64 and x86, but not x64. A week or two ago, I had a problem on an x64 box of mine that I wanted to help track down using kernrate. Unfortunately, I couldn’t actually find a working kernrate for Windows x64 (Srv03 / Vista). […]

  2. Atul Marathe says:

    Hello,

    Nice article. Can you please provide more information about the bucket size and the round up/down counters used? Googling for this didn’t give any concrete information regarding how to determine the best bucket size. Is the default good enough / how will changing the bucket size really help – for I didn’t see much difference in the kernrate output neither could I gather much inference from the rounding up / down of the hits in either case.

    if you can shed some light on the kernrate parameters, especially for a typical device driver, that would be good.

    thanks.

  3. Steve says:

    You can actually get kernrate in any wdk, in the tools directory for any supported CPU (Ia64/x86/amd64)

    Usuallly here: \WINDDK\6001\tools\other\i386

  4. Bill says:

    I have a .pdb file and kernrate doesn’t seem to be able to resolve the symbols within this file. Is there any way to use kernrate if I don’t have a .dbg for my module?

  5. Brad says:

    Bill, you will have better luck getting kernrate to load your PDB files if you copy kernrate.exe and up to date versions of dbghelp.dll and symsrv.dll (from Debugging Tools for Windows) into a different directory, and run it from there. Otherwise, it uses the version of dbghelp.dll in system32, which is pretty old.

  6. Fernand says:

    Interesting tool, even if being obsoleted by MS’s next generation of PerformanceAnalysis tools.
    In my case I actually intended to use it (probably naively) to get me a kind of code coverage metric (at least a relative one) for applications where there is no source available. I chose MS Excel for simplicity (although for this one there is symbol tables available), and ran an automated script, with the hope that there would be a repeat pattern in the (“anonymous” mostly) buckets for every iteration, or at least against the disassembled bytes, but nothing of the kind…

    I used “kernrate -ps -z EXCEL -rd”

    If “kernrate” is not appropriate for that, is there other sampling profilers for WinXP suitable for that purpose?