Archive for April, 2007

WordPress can be an incredible pain in the ‘p’ tag…

Monday, April 30th, 2007

I decided to go check that the frontpage of the blog still passed the W3C XML validator today (something I try to remember to do at least semi-regularly) and, to my dismay, there were validation failures all over the place.

It seems that WordPress is not particularly intelligent about determining where it should and should not place <p> tags; depending on how you place your whitespace around the beginning of a list (ol/li) or tt (or other tag whose contents tend to span multiple lines), it seems to have quite an affinity for spewing completely bogus open or close p tags, or for closing tags in the wrong order (such as opening a p tag before a user-specified li tag, then “helpfully” automagically closing the p tag before the closing li tag the user writes in the post).

The worst part of the whole thing is that the breaking tags are autogenerated, and they’re controlled by what sort of whitespace (e.g. blank lines) you have near your opening and closing tags. Because the “helpful” autogenerated tags aren’t visible at “design time” for a post, you’re all but limited to blind trial and error to get it working right. Sometimes, you even need to put seemingly bogus </p> tags in at “design time” to match unbalanced tags emitted by WordPress automagically at display time.
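One way to take some of the guesswork out of this is to check the rendered markup for mismatched tags locally before going back to the validator. What follows is only a minimal sketch using Python's standard html.parser module; the class name, function name, and sample markup are made up for illustration, and it is not how WordPress or the W3C validator work internally.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag, so they are not tracked on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagBalanceChecker(HTMLParser):
    """Records close tags that do not match the most recently opened tag."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, innermost last
        self.errors = []  # human-readable mismatch descriptions

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        elif tag in self.stack:
            # Closed out of order: report every tag that was skipped over.
            while self.stack[-1] != tag:
                skipped = self.stack.pop()
                self.errors.append(f"<{skipped}> not closed before </{tag}>")
            self.stack.pop()
        else:
            self.errors.append(f"</{tag}> has no matching open tag")


def check_markup(markup):
    """Return a list of tag mismatch descriptions (empty if balanced)."""
    checker = TagBalanceChecker()
    checker.feed(markup)
    checker.close()
    errors = list(checker.errors)
    errors.extend(f"<{tag}> never closed" for tag in reversed(checker.stack))
    return errors


if __name__ == "__main__":
    # The misordering described above: p opened before li, closed before </li>.
    for problem in check_markup("<ol><p><li>item</p></li></ol>"):
        print(problem)
```

Feeding it the rendered page source (rather than the post as typed) is the point, since the offending tags only exist after WordPress has processed the post.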

I love how this turns writing blog posts that render properly into debugging something that is influenced by how I use whitespace. Argh!

Excuse me, while I go back to figuring out the right combination of blank lines to fix the rest of the blog’s tag close mismatch failures…

Blog move finished…

Sunday, April 29th, 2007

The blog’s been (finally) moved to a reliable box, at a reliable location (including all dependent services), so that should be the last of the intermittent downtime.

So, same url, but minus the random downtime.

New WinDbg (6.7.5.0) released

Friday, April 27th, 2007

It’s finally here – WinDbg 6.7.5.0.

I haven’t gotten around to trying out all of the new goodies yet, but there are some nice additions. For one, .fnent now decodes unwind information in a more meaningful way on x64 (although it still doesn’t understand C scope table entries, making it less useful than SDbgExt’s !fnseh if that is what you were interested in).

Looks like they’ve finally gotten around to signing WinDbg.exe too (though, curiously, not the .msi the installer extracts), so the elevation prompts for WinDbg are now of the friendlier sort instead of the “this program will destroy your computer” sort.

There is also reportedly source server support for CVS included; I imagine that I’ll be taking a stab at that again now that it is supposedly fully baked.

In other news, the blog (and DNS) will be moving to a more ideal hosting location (read: not my apartment) as early as this weekend (if all goes according to plan, that is). It’ll be moving to a yummy new quad core Xeon box (with a real connection), a nice step up from the original hardware that it has been running on until a short while ago (good riddance). Crossing my fingers, but hopefully the random unavailability caused by hardware dying on me and Road Runner sucking should be going away Real Soon Now(tm).

A brief discussion of Windows Vista’s IE Protected Mode (and user/process level security)

Wednesday, April 25th, 2007

I was discussing the recent QuickTime bug on Matasano Chargen, and the question of whether it would work in the presence of IE7 + Vista and protected mode came up. I figured a more in-depth explanation as to just what IE7’s Protected Mode actually does might be in order, hence this posting.

One of the new features introduced with Internet Explorer 7 on Windows Vista is something called “Protected Mode” (or “IE Protected Mode”). It’s an on-by-default security feature that is sold as something that greatly hardens Internet Explorer against meaningful exploitation, even if an exploitable hole in IE (or a component of IE, such as an ActiveX control) is found.

Let’s dig in a little bit deeper as to what IE Protected Mode is (and isn’t), and what it means for you.

First things first. Protected mode is not related to the “enhanced security configuration” that was introduced in Windows Server 2003 in any way. The “enhanced security configuration” IE included with Srv03 is, at its core, just a set of more restrictive (i.e. locked down) default settings with regard to things like scripting, downloading files, and so forth. Protected mode does not rely on locking down security zone settings to the point where you cannot download files or run any scripts by default, and is completely unrelated to the IE hardening work done in the default Srv03 configuration. I’d imagine that protected mode will probably be included in Longhorn Server, but the underlying technologies are very different, and are designed to address different “market segments” (“enhanced security configuration” being just a set of more restrictive defaults, whereas protected mode is a fundamental rethink of how the browser interacts with the rest of the operating system).

Protected mode is a feature that is designed to make “surfing the web a safer experience” for end users. Unlike Srv03, where a locked down IE might fly because you are ostensibly not supposed to be doing lots of fancy web-browser-ish things from a server box, end users are clearly not going to take kindly to not being permitted to download files, run JavaScript, and so forth in the default configuration.

The way protected mode takes a stab at making things better for the end users of the world is to build upon the new “integrity level” security mechanism that has been introduced into the Windows NT security model starting with Vista, with the goal of making the web browser an “untrusted” process that cannot perform “dangerous” things.

To understand what this means, it’s necessary to know what these new-fangled “integrity levels” in Vista are all about. Integrity levels are assigned to a token representing a user, and tokens are assigned to a process (and can be impersonated by a thread, as is typically done by something like an IPC server process that needs to act on behalf of a lesser-privileged caller). What’s meaningful about integrity levels is that they allow you to partition what we know of as a “user” into something with multiple different “trust levels” (low, medium, high, with several other infrequently-used levels), such that a thread or a process running at a certain integrity level (or “trust level”) cannot “interfere” with something running at a higher integrity level.

The way this is implemented is by an additional level of security check that is performed when some kind of access rights check is performed. This additional check compares the integrity level of the caller (i.e. the thread or process token’s integrity level) with a new type of field in the security descriptor of the target object (called the “mandatory label”) that specifies what sorts of access a caller of a certain integrity level is allowed to request. The “mandatory label” allows an integrity level to be associated with an object for security checks, and allows three basic policies (lower integrity levels cannot request read access, lower integrity levels cannot request write access, lower integrity levels cannot request execute access) to be set, comparing the integrity level of a caller against the integrity level specified with an object’s security descriptor. (Only these three generic access rights may be “guarded” by the integrity level in this way; there is no granularity to allow object specific access rights to be given specific minimum caller integrity levels).

The default settings in most places do not allow write access to be granted to processes of a lower integrity level, and the default minimum integrity level is usually “medium”. The new, label/integrity level access check is performed before conventional ACL-based checks.
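To make the label check a bit more concrete, here is a toy model of it in Python. This is strictly a sketch of the comparison as described above, not the kernel's actual implementation: the numeric integrity levels and policy bits mirror the SECURITY_MANDATORY_*_RID and SYSTEM_MANDATORY_LABEL_* constants from winnt.h, but the Access flags and the label_allows function are made up for illustration, and the conventional ACL check that would follow is left out entirely.

```python
from enum import IntEnum, IntFlag


class IntegrityLevel(IntEnum):
    # Values mirror the SECURITY_MANDATORY_*_RID constants in winnt.h.
    LOW = 0x1000
    MEDIUM = 0x2000
    HIGH = 0x3000
    SYSTEM = 0x4000


class LabelPolicy(IntFlag):
    # Values mirror the SYSTEM_MANDATORY_LABEL_* policy bits in winnt.h.
    NO_WRITE_UP = 0x1
    NO_READ_UP = 0x2
    NO_EXECUTE_UP = 0x4


class Access(IntFlag):
    # Toy stand-ins for the generic rights; NOT the real GENERIC_* masks.
    READ = 0x1
    WRITE = 0x2
    EXECUTE = 0x4


# Which generic right each policy bit guards.
_GUARDED = {
    LabelPolicy.NO_READ_UP: Access.READ,
    LabelPolicy.NO_WRITE_UP: Access.WRITE,
    LabelPolicy.NO_EXECUTE_UP: Access.EXECUTE,
}


def label_allows(caller_level, object_level, policy, requested):
    """Return True if the label lets the request proceed to the ACL check."""
    if caller_level >= object_level:
        return True  # the label only restricts callers below the object
    for bit, right in _GUARDED.items():
        if policy & bit and requested & right:
            return False
    return True


if __name__ == "__main__":
    # A medium integrity object carrying only the "no write up" policy.
    policy = LabelPolicy.NO_WRITE_UP
    for wanted in (Access.READ, Access.WRITE):
        allowed = label_allows(IntegrityLevel.LOW, IntegrityLevel.MEDIUM,
                               policy, wanted)
        print(wanted.name, "allowed" if allowed else "denied")
```

With only the “no write up” policy set, the model refuses a low integrity caller write access to a medium integrity object but still lets a read request through to the ordinary ACL check.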

In this respect, integrity levels are an attempt to inject something of a sort of process-level security into the NT security model.

If you’re at all familiar with how NT security works, this may be a bit new to you. NT is based upon user-level security, where processes (and threads, in the case of impersonation) run under the context of a user, and derive their security rights (i.e. what securable objects they have access to – files, directories, registry keys, and so forth) and privileges (i.e. the ability to shut down the system, the ability to load a driver, the ability to bypass ACL checks for backup/restore, and so forth) from the user context they run under. The thinking behind this sort of model is that each distinct user on a system will run as, well, a different user. Processes from one user cannot interfere with processes (or files, directories, and so forth) belonging to a different user, without being granted access to do so (i.e. via an ACL, or by special, administrator-level privileges). The “operating system” (i.e. the services and programs that support the system) conceptually runs as yet another “user”, and is thus ostensibly protected from adverse modifications by malicious users on the system. Each user thus exists in a sort of sandbox, unable to interfere with any other user. Conversely, any process running as a particular user can do anything to any other process (or file or directory) owned by that same user; there is no protection within a user security context.

Obviously, this is a gross oversimplification of the NT security model, but it gets the point across (or so I hope!): The security system in NT revolves around the user as the means to control access in a meaningful fashion. This does make sense in environments like large corporate settings, where many users share the same computer (or where computers are centrally managed), such that users cannot interfere with each other, and ostensibly cannot attack their computers (i.e. the operating system) because they are running as “plain users” without administrator access and cannot perform “dangerous” tasks.

Unfortunately, in the era of the Internet, exploitable software bugs, and computers with end users that run code they do not entirely trust, this model isn’t quite as good as we would like. Because the user is the security boundary here, if an attacker can run code under your user account, they have full access to all of the processes, files, directories (and so forth) that are accessible to that user. And if that user account happened to be a computer administrator account, then things are even worse; now the attacker has free rein over the entire computer, and everything on it (including all other users present on the box).

Clearly, this isn’t such a great situation, especially given the reality that many users run untrusted code (or more generally, buggy or exploitable code) on a frequent basis. In this Internet-enabled age, user-level security as it has been traditionally implemented isn’t really enough.

There are still ways to make things “work” with user-level security; namely, to give each human several user accounts, specific to the task that they are doing. For example, I might have one user account that I use for browsing and games, and another user account that I use for accessing top secret corporate documents. If the user account that I use to browse the Internet with gets compromised somehow, such as by my running an exploitable program and getting “owned”, then my top secret corporate documents are still safe; the malicious code running under the Internet-browsing-and-games account doesn’t have access to do anything to my secret documents, since they are owned by a different account and the default ACL protects them from other users.

Of course, this is a tough pill to expect end users to swallow; having to switch user accounts as they switch between tasks of differing importance is at best inconvenient and at worst confusing and problematic (for example, if I want to download a file from the Internet for use with my top secret corporate documents, I have to go to (comparatively) a lot of trouble to give it to my other user, and doing so opens an implicit trust relationship between my secret-documents-user and my less-trusted-Internet-browsing user, that the program I just downloaded is 1) not inherently malicious, 2) not tampered with or compromised, and 3) not full of exploitable holes that would put my documents at risk anyway the moment my secret-documents-user runs it). Clearly, while you could theoretically still get by with user level access in today’s world, as a single user, doing so as it is implemented in Windows today is a major pain (and even with everyone’s best intentions, few people I have seen really follow through completely with the concept and do not share programs or files between their users in any way whatsoever).

(Note that I am not suggesting that things like running as nonadmin or breaking tasks up into different users are a lost cause, just that to get things truly right and secure, it is a much more difficult job than one might expect initially, so much so that most “joe users” will not stand a chance at doing it perfectly. I’m also not trying to knock on user-level security as just outright flawed, useless, or broken, but the fact remains there are problems in today’s world that merit additional consideration.)

Whew, that’s a rather long segue into user-level security. Anyways, protected mode is Microsoft’s first real attempt to tackle this problem – the fact that user level security does not always provide fine enough granularity, in the face of untrusted or buggy programs – in a consumer-level system, in such a way that is palatable to “joe users”. The way that it works is to leverage the integrity level mechanism to create an additional security barrier between the user’s web browser (i.e. Internet Explorer in protected mode) and the rest of the user’s files and programs. This is done by assigning the IE process a low integrity level. Following with what we know of integrity levels above, this means that the IE process will be denied access (by the security manager in the kernel) to do things like overwrite your documents, place malicious programs in your “startup” Start Menu directory, overwrite executables in your system directory (of course, if you were already running as a plain user, it wouldn’t be able to do this anyway…), and so forth. This is clearly a good thing. In case the implications haven’t fully sunk in yet:

If an attacker compromises a low integrity process, they should not be able to destroy your data or install a trojan (or other malicious code) on your system*.

(*: This is, of course, barring implementation errors, design oversights, and local privilege escalation holes. The latter may prove to be an especially important sticking point, as many companies (Microsoft included) have often “looked down” upon local privilege escalation bugs as relatively not important to fix in a timely fashion. Perhaps the introduction of process-level security control will help add impetus to shatter the idea that leaving local privilege escalation holes sitting around is okay.)

Now, this is a very important departure from where we have been traditionally with user level access control. Combining per process access control with per user access control allows us to do a lot more to protect users from malicious code and buggy software (in other words, protecting users from themselves), in a fashion that is much easier to deal with from a user perspective.

However, I think it would be premature to say that we’re “all the way there” yet. Protected mode and low integrity level processes are definitely a great step in the right direction, but there still remain issues to be solved. For example, as I alluded to previously, the default configuration allows medium integrity objects to still be opened for read access by low integrity callers. This means that, for example, if an attacker compromises an IE process running in protected mode, they still do have a chance at doing some damage. For instance, even though an attacker might not be able to destroy your data, per se, he or she can still read it (and thus steal it). So, to continue with our previous example of someone who works with top secret corporate documents, an attacker might not be able to wipe out the company financial records, or joe user’s credit card numbers, but he or she could still steal them (and post them on the Internet for anyone to see, or what have you). In other words, an attacker who compromises a low integrity process can’t destroy all your data (as would be the case if there were no process-level security and we were dealing with just one user account), but he or she can still read it and steal it.

There are other things to watch out for, too, with protected mode. Don’t get into the habit of clicking “OK” on that “are you sure you want this program to run outside of IE Protected Mode” dialog box, or you’re setting yourself up to be burned by clever malware. And certainly never click the “don’t ask me again” check box on the consent dialog, or you’re just begging for some piece of malware to abuse your implicit consent without you even realizing that something’s gone wrong. (In case you’re wondering, the mechanism in IE that allows processes to elevate to medium integrity involves an appcompat hook on CreateProcess that requests a medium integrity process (ieuser.exe) to prompt the user for consent, with the medium integrity process creating the process if the user agrees. So user interaction is still required there, though we know how much users love to click “OK” on all those pesky security warnings. Oh, and there is also some hardening that has been done in win32k.sys to prevent lower integrity processes from sending “dangerous” window messages to higher integrity processes (even WM_USER and friends are disabled by default across an integrity level boundary), so “shatter attacks” ought not to work against the consent dialog. Note that if you bypass the appcompat hook, the new process created is also low integrity, and won’t be able to write to anywhere outside of the “low integrity” sandbox writable files and directories.)

So, while running IE in protected mode does, in some respects, limit the damage that can be done if you get compromised, I would still recommend not running untrusted programs under the same user account as your important documents (if you really care that much). Perhaps in a future release, we’ll see a solution that addresses the idea of not allowing untrusted programs to read arbitrary user data as well (such should be possible with proper use of the new integrity level mechanisms, although I suspect the true difficulty shall be in getting third party applications to play nicely as we continue to try and place control of the user’s documents more firmly in the hands of the actual user instead of in any arbitrary application that runs on the box).

Sorry about the downtime…

Wednesday, April 25th, 2007

The box I have been hosting the blog on has been on its proverbial last legs as of late, hanging about every 12 hours or less. So, I decided to move the whole thing over to another box. I’m already in the middle of another project to eventually host the blog at a real location, but that’s still in the works and not ready yet. Not wanting to go to the trouble to configure WordPress and everything on another box just temporarily, I decided to give the VMware Converter a try and move the entire box into a VM image on a more stable computer at my apartment.

So far, I’m pretty pleased with the results; it took only a couple of minutes to install the converter and start the conversion process (which took about 1.5 hrs total to complete). Surprisingly enough, the thing mostly just worked out of the box after being VM-ized; I had to install VMware Tools, and reconfigure network settings a bit, but other than that, the whole experience was rather seamless.

For now, the blog should be a bit more stable (now living as a VM on a relatively new and, ah, higher quality tower server that I recently acquired to run most of the services for my apartment).

And as far as VMware Converter goes, color me impressed; I expected to run into a lot of snags, but the user experience was quite good. Definitely a great timesaver for retiring old, failing hardware without having to go through the trouble of reconstructing a new install to perform the tasks of the old one.

Sometimes, a cheap hack does the trick after all (or how I worked around some mysterious USB/Bluetooth issues)

Monday, April 16th, 2007

My relatively new Dell XPS M1710 laptop (running Vista x64) has an annoying problem that I’ve recently tracked down.

It appears that when you connect a USB 1.1 device (such as a USB keyboard) to it, and then disconnect that device, Bluetooth and the built-in smart card reader tend to break the next time the computer goes through a suspend/resume cycle. This is fairly annoying for me, as I use a USB keyboard at work (and I also use Bluetooth extensively); it got to the point where pretty much every other time I went to the office and came home, Bluetooth and the smart card reader would break.

The internal Bluetooth transceiver and the built-in smart card reader are both connected to the system via an internal USB hub. It seems like this is becoming a popular thing nowadays among laptop manufacturers, connecting “internal” peripherals on laptops via USB instead of them being on-board and hardwired to the PCI bus or the like. Anyways, when the Bluetooth transceiver and smart card reader get into the broken state, I’ll typically get an “A USB device attached to the system is not functioning” notification, and the internal USB hub shows up as not started in device manager (with a problem code of CM_PROB_FAILED_START – code 10). Occasionally, the Bluetooth transceiver itself shows up with a CM_PROB_FAILED_START error, but the vast majority of the time it is the USB hub that fails.

I’ve done a bit of searching for a real fix for this problem, and the closest I’ve found is this KB article which describes a problem that does sound like mine – Bluetooth breaks after suspending – but the hotfix isn’t publicly available yet. I suppose I could have called PSS and tried to talk them into getting me a copy of that hotfix to try out, but I tend to try and avoid muddling through the various tiers of technical support until I really have no other options. Furthermore, there’s no guarantee that the hotfix would actually solve the underlying problem; the article doesn’t make any mention of internal USB hubs acting flaky in conjunction with a Bluetooth transceiver, only the Bluetooth module itself.

Until recently, to get out of this state, I typically either had to suspend/resume the laptop again (although this is dangerous*), reboot entirely, or remove and reinstall the devnode associated with the USB hub in device manager. (*Trying to suspend/resume in this case doesn’t always work. Sometimes, it won’t fix the problem, and more than a couple of times it has resulted in Vista hanging while trying to suspend, forcing a hard reboot.)

None of these solutions are particularly desirable; nobody likes having to shut everything down and reboot all the time with a laptop (kind of defeats the point of that nice sleep feature, doesn’t it?). Removing and reinstalling the devnode is less painful than a reboot, but only slightly; waiting for everything on the internal USB hub to get reinstalled after doing that takes up to a minute or two of disk thrashing while INFs are searched, and if things are going to take that long then it would almost be faster just to shut down and reboot anyway.

Eventually, though, I discovered that simply disabling and reenabling the devnode corresponding to the USB hub would resolve the problem. This is definitely a much nicer solution than any of the above; it’s (relatively) quick and less painful, and it doesn’t involve me sitting around and waiting for either a shutdown or reboot, or for a bunch of devices to get reinstalled by PnP.

Unfortunately, that workaround is still a rather manual process. It’s not too bad if I just keep an elevated Device Manager window open for easy access, but among other things this prevents me from, say, unlocking my computer with a smart card after an unsuspend (as the Bluetooth transceiver and smart card reader share the same USB hub that tends to flake out after a suspend/resume, after a USB 1.1 device is removed).

So, I set about seeing whether I could try and automate this process. I could have tried to track down the root cause a bit more, but as far as I can tell, the problem is either 1) a bug in Vista’s USB support (perhaps or perhaps not an x64 specific bug), or 2) a bug in firmware/hardware relating to the internal USB hub itself, or 3) a bug in firmware/hardware relating to the Bluetooth transceiver, or 4) a bug in the Bluetooth drivers for my laptop. None of those possibilities would be something that I could easily solve (as far as the root cause goes), even if I managed to track down the originating cause of the problem. As a result, I decided that the most time-effective solution would be to just try and automate the process of disabling and reenabling the devnode associated with the USB hub (if it breaks), or the Bluetooth transceiver (if it breaks instead). Restarting the devnode is comparatively fast when viewed in light of the other workarounds, and in any case it is fast enough to be a viable way to at least alleviate the symptoms of this problem.

After doing some brief research, it looked like the way to go here was a combination of the CM_Xxx APIs and the SetupDi APIs. Although there’s a relatively large amount of indirection you have to go through to restart an individual devnode (and the CM_Xxx APIs / setupapi are not particularly easy to use in the first place), there happens to be a WDK sample that has the capability to do just what I want – DevCon. DevCon is a console equivalent to the Device Manager MMC snapin; it’s capable of enumerating device nodes, installing/removing them, updating drivers, disabling/reenabling devnodes, and soforth.

Sure enough, I verified that DevCon’s `restart’ command was sufficient to restart the broken devnode (in a fashion similar to disabling and reenabling it in Device Manager), with the end result of causing the USB hub to start working again.
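For reference, driving DevCon's restart command from a script might look something like the sketch below. The helper names are made up, the hardware id in the usage lines is a placeholder rather than a real device's id, and actually running it requires devcon.exe to be available on an elevated Windows prompt; treat it as an illustration of the command's shape only.

```python
import subprocess


def devcon_restart_command(hardware_id, devcon_path="devcon.exe"):
    """Build the argument list for `devcon restart <hardware id pattern>`."""
    if not hardware_id:
        raise ValueError("hardware_id must not be empty")
    return [devcon_path, "restart", hardware_id]


def restart_device(hardware_id, devcon_path="devcon.exe"):
    """Run DevCon (Windows only, elevated) and return its exit code."""
    result = subprocess.run(devcon_restart_command(hardware_id, devcon_path))
    return result.returncode


if __name__ == "__main__":
    # Placeholder hardware id; substitute the id Device Manager reports.
    print(devcon_restart_command(r"USB\VID_0000&PID_0000"))
```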

At this point, all I had to do was come up with a good way to locate the broken devnodes, a good way to know when the computer went in/out of sleep (as a sleep cycle causes the problem to occur), and then inject the DevCon code responsible for restarting a device into my program. To make a long story short, I ended up keying the program based on the hardware ids of the internal USB hub and Bluetooth transceiver such that it would check for all devnodes that 1) matched a hardware id in that list, and 2) were in a disabled state with a problem code of CM_PROB_FAILED_START. Detecting a sleep transition is fairly easy for a service (services receive notifications of PnP/Power events through the callback registered via RegisterServiceCtrlHandlerEx), and as I wanted the program to function continuously, a service seemed like the logical way to run it anyway.

An hour or two later and I had my workaround program done. It’s certainly not pretty, and it doesn’t do anything to fix the root cause of the problem, but as far as treating the symptoms goes, it gets the job done. The basic way the program works is that every time the system resumes from suspend, it will poll all devnodes every second for 30 seconds, looking for a devnode that failed to start and is one of the known list of problematic hardware ids. If it finds such a device, it attempts to restart it (thereby working around the root cause of the problem and alleviating the symptoms). Polling for a fixed time after resume isn’t pretty by any means, but it can occasionally take a bit for one of the devnodes to show as broken when it’s not working, so this works around that.
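The polling logic described above boils down to something like the following. This is a sketch in Python for readability, not the actual DevBouncer source (which is built on the native CM_Xxx/SetupDi APIs); the DevNode type and the enumerate_devnodes and restart callables are hypothetical stand-ins for the real device enumeration and restart code.

```python
import time
from dataclasses import dataclass

CM_PROB_FAILED_START = 10  # Device Manager's "code 10"


@dataclass
class DevNode:
    hardware_ids: tuple  # hardware ids reported for the devnode
    problem_code: int    # 0 when the device has no problem


def find_broken(devnodes, watched_ids):
    """Return devnodes on the watch list that failed to start."""
    watched = {hw_id.lower() for hw_id in watched_ids}
    return [
        node for node in devnodes
        if node.problem_code == CM_PROB_FAILED_START
        and any(hw_id.lower() in watched for hw_id in node.hardware_ids)
    ]


def bounce_after_resume(enumerate_devnodes, restart, watched_ids,
                        attempts=30, interval=1.0, sleep=time.sleep):
    """Poll after a resume and restart every broken watched devnode found.

    enumerate_devnodes and restart are caller-supplied (hypothetical) hooks.
    Returns the number of restart attempts made.
    """
    restarted = 0
    for _ in range(attempts):
        for node in find_broken(enumerate_devnodes(), watched_ids):
            restart(node)
            restarted += 1
        sleep(interval)
    return restarted
```

Polling for the whole window, rather than stopping at the first hit, matches the observation that a devnode can take a while to show up as broken after the resume.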

While you could safely call it a giant hack, the program does get the job done; now, I can unlock my laptop via smart card or use Bluetooth almost immediately after a resume, even if the breakage-after-USB1-device-is-unplugged problem strikes, and all without having to manually futz around in device manager every time.

In case anyone’s interested, I’ve put the source code for the program (“Broken Device Bouncer”, or DevBouncer for short) up. It’s fairly hardcoded to be specific to my machine, so if you wanted to use it for some reason, you’d need to rebuild it.

How I ended up in the kernel debugger while trying to get PHP and Cacti working…

Saturday, April 14th, 2007

Some days, nothing seems to work properly. This is the sad story of how something as innocent as trying to install a statistics graphing Web application culminated in my breaking out the kernel debugger in an attempt to get things working. (I don’t seem to have a lot of luck with web applications. So much for the way of the future being “easy to develop/deploy/use” web-based applications…)

Recently, I decided to try installing Cacti in order to get some nice, pretty graphs describing resource utilization on several boxes at my apartment. Cacti is a PHP program that queries SNMP data and, with the help of a program called RRDTool, creates friendly historical graphs for you. It’s commonly used for monitoring things like network or processor usage over time.

In this particular instance, I was attempting to get Cacti working on a Windows Server 2003 x64 SP2 box. Running an amalgam of unix-ish programs on Windows is certainly “fun”, and doing it on native x64 is even more “interesting”. I didn’t expect to find myself in the kernel debugger while trying to get Cacti working, though…

To start out, the first thing I had to do was convert IIS6’s worker processes to 32-bit instead of 64-bit, as the standard PHP 5 distribution doesn’t support x64. (No, I don’t consider spending who knows how many hours to get PHP building on x64 natively a viable solution here, so I just decided to stick with the 32-bit release. I don’t particularly want to be in the habit of having to then maintain and rebuild my own PHP distribution from a custom build environment each time security updates come out either…).

This wasn’t too bad (at least not at first); a bit of searching revealed this KB article that documented an IIS metabase flag that you can set to turn on 32-bit worker processes (with the help of the adsutil.vbs script included in the IIS Adminscripts directory).

One small snag here was that I happened to be running a symbol proxy in native x64 mode on this system already. Since the 32-bit vs 64-bit IIS worker process flag is an all-or-nothing option, I had to go install the 32-bit WinDbg distribution on this system and copy over the 32-bit symproxy.dll and symsrv.dll into %systemroot%\system32\inetsrv. Additionally, the registry settings used by the 64-bit symproxy weren’t directly accessible to the 32-bit version (due to a compatibility feature in 64-bit versions of Windows known as registry redirection), so I had to manually copy over the registry settings describing which symbol paths the symbol proxy used to the Wow64 version of HKLM\Software. No big deal so far, just a minor annoyance.
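As a small illustration of where those settings had to end up, the sketch below maps a key path under HKLM\Software to the separate location that 32-bit processes see (the Wow6432Node subkey). The function name and the example key are hypothetical, and the mapping is deliberately simplified: it ignores the fact that some subkeys are shared between the two views.

```python
def wow64_software_path(key_path):
    r"""Map HKLM\Software\<rest> to the location 32-bit processes see."""
    parts = key_path.split("\\")
    if len(parts) < 2 or parts[0].upper() not in ("HKLM", "HKEY_LOCAL_MACHINE"):
        raise ValueError("expected a path under HKLM")
    if parts[1].lower() != "software":
        return key_path  # only the Software subtree is handled in this sketch
    if len(parts) > 2 and parts[2].lower() == "wow6432node":
        return key_path  # already the 32-bit view
    return "\\".join(parts[:2] + ["Wow6432Node"] + parts[2:])


if __name__ == "__main__":
    # Hypothetical key name, used purely as an example.
    print(wow64_software_path(r"HKLM\Software\ExampleVendor\ExampleApp"))
```

On Windows itself, Python's winreg module exposes the KEY_WOW64_32KEY and KEY_WOW64_64KEY access flags for opening one view or the other explicitly, which avoids hardcoding the Wow6432Node name.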

The first unexpected problem that cropped up happened after I had configured the 32-bit symbol proxy ISAPI filter and installed PHP; after I enabled 32-bit worker processes, IIS started tossing HTTP 500 Internal Server Error statuses whenever I tried to browse any site on the box. Hmm, not good…

After determining that everything was still completely broken even after disabling the symbol proxy and PHP ISAPI modules, I discovered some rather unpleasant-looking event log messages:

ISAPI Filter ‘%SystemRoot%\Microsoft.NET\Framework64\v2.0.50727\aspnet_filter.dll’ could not be loaded due to a configuration problem. The current configuration only supports loading images built for a x86 processor architecture. The data field contains the error number. To learn more about this issue, including how to troubleshooting this kind of processor architecture mismatch error, see http://go.microsoft.com/fwlink/?LinkId=29349.

It seemed that the problem was the wrong version of ASP.NET being loaded (still the x64 version). The link in the event message wasn’t all that helpful, but a bit of searching located yet another knowledge base article – this time, about how to switch back and forth between 32-bit and 64-bit versions of ASP.NET. After running aspnet_regiis as described in that article, IIS was once again in a more or less working state. Another problem down, but the worst was yet to come…

With IIS working again, I turned towards configuring Cacti in IIS. Although at first it appeared as though everything might actually go as planned (after configuring Cacti’s database settings, I soon found myself at its PHP-based initial configuration page), such things were not meant to be. The first sign of trouble appeared after I completed the initial configuration page and attempted to log on with the default username and password. Doing so resulted in my being thrown back to the log on page, without any error messages. A username and password combination not matching the defaults did result in a logon failure error message, so something besides a credential failure was up.

After some digging around in the Cacti sources, it appeared that the way that Cacti tracks whether a user is logged in or not is via setting some values in the standard PHP session mechanism. Since Cacti was apparently pushing me back to the log on page as soon as I logged on, I guessed that there was probably some sort of failure with PHP’s session state management.

Rewind a bit to back when I installed PHP. In the interest of expediency (hah!), I decided to try out the Win32 installer package (as opposed to just the zip distribution for a manual install) for PHP. Typically, I’ve just installed PHP for IIS the manual way, but I figured that if they had an installer nowadays, it might be worth giving it a shot and saving myself some of the tedium.

Unfortunately, it appears that PHP’s installer is not all that intelligent. It turns out that in the IIS ISAPI mode, PHP configures the system-wide PHP settings to point the session state file directory to the user-specific temp directory (i.e. pointing to a location under %userprofile%). This, obviously, isn’t going to work; anonymous users logged on to IIS aren’t going to have access to the temp directory of the account I used to install PHP with.

After manually setting up a proper location for PHP’s session state with the right security permissions (and reconfiguring php.ini to match), I tried logging in to Cacti again. This time, I actually got to the main screen after changing the password (hooray, progress!).
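The kind of misconfiguration described above can be spotted mechanically. The following is a minimal Python sketch, assuming the relevant php.ini setting is `session.save_path` and that it is written as a plain `name = value` line; the paths and the account name are invented for illustration and are not the ones from my system.

```python
import re


def session_save_path(php_ini_text):
    """Return the last uncommented session.save_path value, or None."""
    value = None
    for line in php_ini_text.splitlines():
        line = line.strip()
        if line.startswith(";"):
            continue  # php.ini comment line
        match = re.match(r'session\.save_path\s*=\s*"?([^"]*)"?\s*$', line)
        if match:
            # An optional "N;" or "N;MODE;" prefix may precede the directory.
            value = match.group(1).split(";")[-1]
    return value


def is_under(path, parent):
    """Case-insensitive check that a Windows-style path lies under parent."""
    def norm(p):
        return p.replace("/", "\\").rstrip("\\").lower()
    path, parent = norm(path), norm(parent)
    return path == parent or path.startswith(parent + "\\")


# Hypothetical php.ini fragment and profile directory.
SAMPLE_INI = r'''
; session.save_path = "C:\ignored"
session.save_path = "C:\Documents and Settings\installer\Local Settings\Temp\php\session"
'''
PROFILE_DIR = r"C:\Documents and Settings\installer"

save_path = session_save_path(SAMPLE_INI)
print(save_path, is_under(save_path, PROFILE_DIR))
```

A session directory that lands under one account's profile is a hint that other accounts (such as the IIS anonymous user) will not be able to write there.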

From here, all that I had left to do was some minor reconfiguring of the Windows SNMP service in order to allow Cacti to query it, set up the Cacti poller task job (Cacti is designed to poll data from its configured data sources at regular intervals), and configure my graphs in Cacti.

Configuring SNMP wasn’t all that difficult (though something I hadn’t done before with the Windows SNMP service), and I soon had Cacti successfully querying data over SNMP. All that was left to do was graph it, and I was home free…

Unfortunately, getting Cacti to actually graph the data turned out to be rather troublesome. In fact, I still haven’t even got it working, though I’ve at least learned a bit more about just why it isn’t working…

When I attempted to create graphs in Cacti, everything would appear to work okay, but no RRDTool datafiles would ever appear. No amount of messing with filesystem permissions resolved the problem, and the Cacti log files were not particularly helpful (even on debug severity). Additionally, attempting to edit graph properties in Cacti would result in that HTTP session mysteriously hanging indefinitely (definitely not a good sign). After searching around (unsuccessfully) for any possible solutions, I decided to try and take a closer look at what exactly was going on when my requests to Cacti got stuck.

Checking the process list after repeating the sequence that caused a particular Cacti session to hang several times, I found that there appeared to be a pair of cmd.exe and rrdtool.exe instances corresponding to each hung session. Hmm, it would appear that something RRDTool was doing was freezing and PHP was waiting for it… (PHP uses cmd.exe to call RRDTool, so I guessed that PHP would be waiting for cmd.exe, which would be waiting for RRDTool).

At first, I attempted to attach to one of the cmd processes with WinDbg. (Incidentally, it would appear that there are currently no symbols for the Wow64 versions of the Srv03SP2 ntdll, kernel32, user32, and a large number of other core DLLs with Wow64 builds available on the Microsoft symbol server for some reason. If any Microsoft people are reading this, it would be greaaaat if you could fix the public symbol server for Srv03 SP2 x64 Wow64 DLLs …) However, symbols for cmd.exe were fortunately available, so it was relatively easy to figure out what it was up to, and prove my earlier hypothesis that it was simply waiting on an rrdtool instance:

0:001:x86> ~1k
ChildEBP RetAddr
0012fac4 7d4d8bf1 ntdll_7d600000!NtWaitForSingleObject+0x15
0012fad8 4ad018ea KERNEL32!WaitForSingleObject+0x12
0012faec 4ad02611 cmd!WaitProc+0x18
0012fc24 4ad01a2b cmd!ExecPgm+0x3e2
0012fc58 4ad019b3 cmd!ECWork+0x84
0012fc70 4ad03c58 cmd!ExtCom+0x40
0012fe9c 4ad01447 cmd!FindFixAndRun+0xa9
0012fee0 4ad06cf6 cmd!Dispatch+0x137
0012ff44 4ad07786 cmd!main+0x108
0012ffc0 7d4e7d2a cmd!mainCRTStartup+0x12f
0012fff0 00000000 KERNEL32!BaseProcessInitPostImport+0x8d
0:001:x86> !peb
[...]
CommandLine: 'cmd.exe /c c:/progra~2/rrdtool/rrdtool.exe -'
[...]

Given this, the next logical step to investigate would be the RRDTool.exe process. Unfortunately, something really weird seemed to be going on with all the RRDTool.exe processes (naturally). WinDbg would give me an access denied error for all of the RRDTool PIDs in the F6 process list, despite my being a local machine administrator.

Attempting to attach to these processes failed as well:

Microsoft (R) Windows Debugger Version 6.6.0007.5
Copyright (c) Microsoft Corporation. All rights reserved.

Cannot debug pid 4904, NTSTATUS 0xC000010A
"An attempt was made to duplicate an object handle into or out of an exiting process."
Debuggee initialization failed, NTSTATUS 0xC000010A
"An attempt was made to duplicate an object handle into or out of an exiting process."

This is not something that you want to be seeing on a server box. This particular error (STATUS_PROCESS_IS_TERMINATING) means that the process in question is in the middle of being terminated, which prevents a debugger from successfully attaching. Processes typically terminate in a timely fashion; in fact, it’s almost unheard of to actually see a process in the terminating state, since it happens so quickly. In this particular instance, however, the RRDTool processes were remaining in this half-dead state for what appeared to be an indefinite interval.
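For reference, an NTSTATUS value like the one above can be pulled apart by hand. This is a small Python sketch of the standard NTSTATUS bit layout (severity in the top two bits, then a customer bit, a facility field, and a 16-bit code); the function name is my own, and the only symbolic name it knows is the one for 0xC000010A, STATUS_PROCESS_IS_TERMINATING.

```python
SEVERITY_NAMES = ["success", "informational", "warning", "error"]

# Deliberately tiny lookup table: just the status seen in this post.
KNOWN_STATUS = {0xC000010A: "STATUS_PROCESS_IS_TERMINATING"}


def decode_ntstatus(status):
    """Split a 32-bit NTSTATUS into its documented bit fields."""
    return {
        "severity": SEVERITY_NAMES[(status >> 30) & 0x3],
        "customer": bool((status >> 29) & 0x1),
        "facility": (status >> 16) & 0xFFF,
        "code": status & 0xFFFF,
        "name": KNOWN_STATUS.get(status, "unknown"),
    }


decoded = decode_ntstatus(0xC000010A)
print(decoded)
```

The severity field comes out as "error" with facility 0, which is what you would expect for a core NT status code rather than something defined by a third-party component.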

There are two things that commonly cause this kind of problem, and both of them are related to the kernel:

  1. The disk hardware is not properly responding to I/O requests and they are hanging indefinitely. This can block a process from exiting while the operating system waits for an I/O to finish canceling or completing. Since this particular box was brand new (and with respectable, high-quality server hardware), I didn’t think that failing hardware was the cause here (or at least, I certainly hoped not!). Given that there were no errors in the event log about I/Os failing, and that I was still able to access files on my disks without issue, I decided to rule this possibility out.
  2. A driver (or other kernel mode code in the I/O stack) is buggy and is not allowing I/O requests to be canceled or completed, or has deadlocked itself and is not able to complete an I/O request. (You might be familiar with the latter if you’ve tried to use the 1394 mass storage support in Windows for a non-trivial length of time.) Given that I had tentatively ruled out bad hardware, this would seem to be the most likely cause here.

Since the frozen process would be stuck in kernel mode in either case, I would need to use the kernel debugger to proceed any further. I decided to start out with local kd, as that is sufficient for at least retrieving thread stacks and doing basic passive analysis of potential deadlock issues where the system is at least mostly still functional.

Sure enough, the stuck RRDTool process I had unsuccessfully tried to attach to was blocked in kernel mode:


lkd> !process 0n4904
Searching for Process with Cid == 1328
PROCESS fffffadfcc712040
SessionId: 0 Cid: 1328 Peb: 7efdf000 ParentCid: 1354
DirBase: 5ea6c000 ObjectTable: fffffa80041d19d0 HandleCount: 68.
Image: rrdtool.exe
[...]
THREAD fffffadfcca9a040 Cid 1328.1348 Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (Unknown) KernelMode Non-Alertable
fffffadfccf732d0 SynchronizationEvent
Impersonation token: fffffa80041db980 (Level Impersonation)
DeviceMap fffffa8001228140
Owning Process fffffadfcc712040 Image: rrdtool.exe
Wait Start TickCount 6545162 Ticks: 367515 (0:01:35:42.421)
Context Switch Count 445 LargeStack
UserTime 00:00:00.0000
KernelTime 00:00:00.0015
Win32 Start Address windbg!_imp_RegCreateKeyExW (0x0000000000401000)
Start Address 0x000000007d4d1510
Stack Init fffffadfc4a95e00 Current fffffadfc4a953b0
Base fffffadfc4a96000 Limit fffffadfc4a8f000 Call 0
Priority 10 BasePriority 8 PriorityDecrement 1
RetAddr Call Site
fffff800`01027752 nt!KiSwapContext+0x85
fffff800`0102835e nt!KiSwapThread+0x3c9
fffff800`013187ac nt!KeWaitForSingleObject+0x5a6
fffff800`012b2853 nt!IopAcquireFileObjectLock+0x6d
fffff800`01288dff nt!IopCloseFile+0xad
fffff800`01288f0e nt!ObpDecrementHandleCount+0x175
fffff800`0126ceb0 nt!ObpCloseHandleTableEntry+0x242
fffff800`0128d7a6 nt!ExSweepHandleTable+0xf1
fffff800`012899b6 nt!ObKillProcess+0x109
fffff800`01289d3b nt!PspExitThread+0xa3a
fffff800`0102e3fd nt!NtTerminateProcess+0x362
00000000`77ef0caa nt!KiSystemServiceCopyEnd+0x3
0202c9fc`0202c9fb ntdll!NtTerminateProcess+0xa

Hmm… not quite what I expected. If a buggy driver was involved, it should have at least been somewhere on the call stack, but in this particular instance all we have is ntoskrnl code, straight from the system call to the wait that isn’t coming back. Something was definitely wrong in kernel mode, but it wasn’t immediately clear what was causing it. It appeared as if the kernel was blocked on the file object lock (which, to my knowledge, is used to guard synchronous I/Os that are issued for a particular file object), but, as the file object lock is built upon KEVENTs, the usual lock diagnostics extensions (like '!locks') would not be particularly helpful. In this instance, what appeared to be happening was that the process rundown logic in the kernel was attempting to release all still-open handles in the exiting RRDTool process, and it was (for some reason) getting stuck while trying to close a handle to a particular file object.
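As an aside, two of the numbers in the `!process` output above are easy to sanity check. The sketch below (Python, with names of my own choosing) converts the decimal PID given to the debugger as 0n4904 into the hex Cid it reports, and converts the "Ticks" count into a wall-clock duration. The 15.625 ms tick interval is an assumption on my part; it happens to reproduce both of the Ticks lines shown in this post.

```python
from datetime import timedelta

# Assumed clock tick interval: 15.625 ms, expressed in whole microseconds.
TICK_MICROSECONDS = 15625


def ticks_to_duration(ticks):
    """Convert a kernel tick count into a timedelta (under the assumption above)."""
    return timedelta(microseconds=ticks * TICK_MICROSECONDS)


pid_decimal = 4904                   # "0n4904" means decimal 4904
cid_hex = format(pid_decimal, "x")   # the debugger prints Cids in hex

stuck_for = ticks_to_duration(367515)
print(cid_hex, stuck_for)
```

In other words, the exiting thread had been sitting in that wait for a little over an hour and a half, which is rather longer than a process exit should take.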

I could at least figure out what file was “broken”, though, by poking around in the stack of IopCloseFile:

lkd> !fileobj fffffadf`ccf73250
\Temp\php\session\sess_bkcavai8fak8antv9coq46at95
LockOperation Set Device Object: 0xfffffadfce423370 \Driver\dmio
Vpb: 0xfffffadfce864840
Access: Read Write SharedRead SharedWrite
Flags: 0x40042
Synchronous IO
Cache Supported
Handle Created
File Object is currently busy and has 1 waiters.
FsContext: 0xfffffa800390e110 FsContext2: 0xfffffa8000106a10
CurrentByteOffset: 0
Cache Data:
Section Object Pointers: fffffadfcd601c20
Shared Cache Map: fffffadfccfdebb0 File Offset: 0 in VACB number 0
Vacb: fffffadfce97fb08
Your data is at: fffff98070e80000
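The Flags value in the `!fileobj` output can be decoded by hand as a cross-check on the text the extension prints beneath it. This Python sketch uses a partial table of the FO_* flag bits from the WDK headers (only a handful of entries, so anything else is reported as unknown bits); the helper name is my own.

```python
# Partial table of FILE_OBJECT flag bits (FO_* values from the WDK headers).
FO_FLAGS = {
    0x00000001: "FO_FILE_OPEN",
    0x00000002: "FO_SYNCHRONOUS_IO",
    0x00000004: "FO_ALERTABLE_IO",
    0x00000008: "FO_NO_INTERMEDIATE_BUFFERING",
    0x00000010: "FO_WRITE_THROUGH",
    0x00000020: "FO_SEQUENTIAL_ONLY",
    0x00000040: "FO_CACHE_SUPPORTED",
    0x00040000: "FO_HANDLE_CREATED",
}


def decode_fo_flags(value):
    """Return (names of known bits that are set, mask of bits not in the table)."""
    names = [name for bit, name in sorted(FO_FLAGS.items()) if value & bit]
    unknown = value & ~sum(FO_FLAGS)
    return names, unknown


flag_names, unknown_bits = decode_fo_flags(0x40042)
print(flag_names, hex(unknown_bits))
```

The three bits that are set line up with the "Synchronous IO", "Cache Supported", and "Handle Created" lines in the output. The synchronous I/O flag is the interesting one, since synchronous file objects are the ones that serialize operations on the file object lock.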

From here, there are a couple of options:

  1. We could look for processes with an open handle to that file and check their stacks.
  2. We could look for an IRP associated with that file object and try and trace our way back from there.

Initially, I tried the first option, but this ended up not working particularly well. I attempted to use Process Explorer to locate all processes that had a handle to that file, but this ended up failing rather miserably as Process Explorer itself got deadlocked after it opened a handle to the file. This was actually rather curious; it turned out that processes could open a handle to this “broken” file just fine, but when they tried to close the handle, they would get blocked in kernel mode indefinitely.

With that approach unsuccessful, I tried the second option, which is made easier by the use of '!irpfind'. Normally, this extension is very slow to operate (over a serial cable), but local kd makes it quite usable. This revealed something of value:


lkd> !irpfind -v 0 0 fileobject fffffadf`ccf73250
Looking for IRPs with file object == fffffadfccf73250
Scanning large pool allocation table for Tag: Irp? (fffffadfccdf6000 : fffffadfcce56000)
Searching NonPaged pool (fffffadfcac00000 : fffffae000000000) for Tag: Irp?
Irp [ Thread ] irpStack: (Mj,Mn) DevObj [Driver] MDL Process
fffffadfcc225380: Irp is active with 7 stacks 7 is current (= 0xfffffadfcc225600)
No Mdl: No System Buffer: Thread fffffadfccea27d0: Irp stack trace.
cmd flg cl Device File Completion-Context
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000
[...]
>[ 11, 1] 2 1 fffffadfce7b6040 fffffadfccf73250 00000000-00000000 pending
\FileSystem\Ntfs
Args: fffffadfcd70e0a0 00000000 00000000 00000000

There was an active IRP for this file object. With any luck, it would be related to whatever was holding the file object lock for that file object. Digging a bit deeper, it’s possible to determine what thread is associated with the IRP (if it’s a thread IRP), and from there, we can grab a stack (which might just give us the smoking gun we’re looking for):

lkd> !irp fffffadfcc225380 1
Irp is active with 7 stacks 7 is current (= 0xfffffadfcc225600)
No Mdl: No System Buffer: Thread fffffadfccea27d0: Irp stack trace.
Flags = 00000000
ThreadListEntry.Flink = fffffadfcc2253a0
ThreadListEntry.Blink = fffffadfcc2253a0
[...]
CancelRoutine = fffff800010ba930 nt!FsRtlPrivateResetLowestLockOffset
[...]
lkd> !thread fffffadfccea27d0
THREAD fffffadfccea27d0 Cid 10f8.138c Teb: 00000000fffa1000 Win32Thread: fffffa80023cd860 WAIT: (Unknown) UserMode Non-Alertable
fffffadfccf732e8 NotificationEvent
Impersonation token: fffffa8002c62060 (Level Impersonation)
DeviceMap fffffa8002f3b7b0
Owning Process fffffadfcc202c20 Image: w3wp.exe
Wait Start TickCount 6966187 Ticks: 952 (0:00:00:14.875)
Context Switch Count 1401 LargeStack
UserTime 00:00:00.0000
KernelTime 00:00:00.0000
Win32 Start Address 0x00000000003d87d8
Start Address 0x000000007d4d1504
Stack Init fffffadfc4fbee00 Current fffffadfc4fbe860
Base fffffadfc4fbf000 Limit fffffadfc4fb8000 Call 0
Priority 10 BasePriority 8 PriorityDecrement 0
RetAddr : Call Site
fffff800`01027752 : nt!KiSwapContext+0x85
fffff800`0102835e : nt!KiSwapThread+0x3c9
fffff800`012afb38 : nt!KeWaitForSingleObject+0x5a6
fffff800`0102e3fd : nt!NtLockFile+0x634
00000000`77ef14da : nt!KiSystemServiceCopyEnd+0x3
00000000`00000000 : ntdll!NtLockFile+0xa

This might just be what we’re looking for. There’s a thread in w3wp.exe (the IIS worker process), which is blocking on a synchronous NtLockFile call for that same file object that is in the “broken” state. Since I’m running PHP in ISAPI mode, this does make sense – if PHP is doing something to that file (which it could certainly be, since it’s a PHP session state file as we saw above), then it should be in the context of w3wp.exe.
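The pending IRP found earlier agrees with this picture. The bracketed pair on its current stack location, "[ 11, 1]", is the (major, minor) function code in hex. The Python sketch below looks those up in a deliberately tiny table of values from the WDK headers (0x11 is IRP_MJ_LOCK_CONTROL, and its minor code 0x01 is IRP_MN_LOCK); the parsing helper is my own and only understands this one line format.

```python
import re

# Small subset of IRP function codes from the WDK headers.
IRP_MAJOR = {0x11: "IRP_MJ_LOCK_CONTROL"}
LOCK_CONTROL_MINOR = {
    0x01: "IRP_MN_LOCK",
    0x02: "IRP_MN_UNLOCK_SINGLE",
    0x03: "IRP_MN_UNLOCK_ALL",
    0x04: "IRP_MN_UNLOCK_ALL_BY_KEY",
}


def decode_stack_location(line):
    """Extract and name the hex (major, minor) pair from an IRP stack line."""
    match = re.search(r"\[\s*([0-9a-fA-F]+),\s*([0-9a-fA-F]+)\]", line)
    if match is None:
        return None
    major, minor = (int(group, 16) for group in match.groups())
    major_name = IRP_MAJOR.get(major, "unknown")
    minor_name = "unknown"
    if major_name == "IRP_MJ_LOCK_CONTROL":
        minor_name = LOCK_CONTROL_MINOR.get(minor, "unknown")
    return major_name, minor_name


irp_line = ">[ 11, 1] 2 1 fffffadfce7b6040 fffffadfccf73250 00000000-00000000 pending"
decoded_irp = decode_stack_location(irp_line)
print(decoded_irp)
```

A pending byte-range lock request sitting in NTFS is exactly what a thread blocked inside NtLockFile would be expected to leave behind.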

To see what was going on from the user mode side, I attached a user mode debugger to w3wp.exe and grabbed a stack trace for that thread:


0:006> .effmach x86
Effective machine: x86 compatible (x86)
0:006:x86> ~6s
ntdll_7d600000!ZwLockFile+0x12:
00000000`7d61d82e c22800 ret 28h
0:006:x86> k
ChildEBP RetAddr
WARNING: Stack unwind information not available. Following frames may be wrong.
014edf84 023915a2 ntdll_7d600000!ZwLockFile+0x12
014edfbc 0241d886 php5ts!flock+0x82
00000000 00000000 php5ts!zend_reflection_class_factory+0xb576

It looks like that thread is indeed related to PHP; PHP is trying to acquire a file lock on the session state file. With a bit of work, we can figure out just what kind of lock it was trying to acquire.

The prototype for NtLockFile is as follows:

// NtLockFile locks a region of a file.
NTSYSAPI
NTSTATUS
NTAPI
NtLockFile(
    IN HANDLE FileHandle,
    IN HANDLE Event OPTIONAL,
    IN PIO_APC_ROUTINE ApcRoutine OPTIONAL,
    IN PVOID ApcContext OPTIONAL,
    OUT PIO_STATUS_BLOCK IoStatusBlock,
    IN PULARGE_INTEGER LockOffset,
    IN PULARGE_INTEGER LockLength,
    IN ULONG Key,
    IN BOOLEAN FailImmediately,
    IN BOOLEAN ExclusiveLock
    );

Given this, we can easily deduce the arguments off a stack dump:

0:006:x86> dd @esp+4 l0n10
00000000`014edf48 000002b0 00000000 00000000 014edfac
00000000`014edf58 014edfac 014edf74 014edf7c 00000000
00000000`014edf68 00000000 00000001
0:006:x86> dq 014edf74 l1
00000000`014edf74 00000000`00000000
0:006:x86> dq 014edf7c l1
00000000`014edf7c 00000000`00000001

It seems that PHP is trying to acquire an exclusive lock for a range of 1 byte starting at offset 0 in this file, with NtLockFile configured to wait until it acquires the lock.
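The mapping from the raw dump to the prototype is mechanical enough to script. The following Python sketch parses the `dd` and `dq` output shown above and pairs each 32-bit stack slot with the corresponding parameter name from the prototype, in order (which is how stdcall arguments are laid out on the x86 stack). The parsing helper is my own and assumes exactly the output format shown here.

```python
PARAMS = [
    "FileHandle", "Event", "ApcRoutine", "ApcContext", "IoStatusBlock",
    "LockOffset", "LockLength", "Key", "FailImmediately", "ExclusiveLock",
]

DD_OUTPUT = """
00000000`014edf48 000002b0 00000000 00000000 014edfac
00000000`014edf58 014edfac 014edf74 014edf7c 00000000
00000000`014edf68 00000000 00000001
"""

DQ_OUTPUT = """
00000000`014edf74 00000000`00000000
00000000`014edf7c 00000000`00000001
"""


def parse_dump(text):
    """Parse dd/dq style lines into (address, [values]) tuples."""
    rows = []
    for line in text.strip().splitlines():
        tokens = [int(tok.replace("`", ""), 16) for tok in line.split()]
        rows.append((tokens[0], tokens[1:]))
    return rows


# Flatten the dd rows into one list of stack slots and name them.
stack_words = [value for _, values in parse_dump(DD_OUTPUT) for value in values]
args = dict(zip(PARAMS, stack_words))

# LockOffset and LockLength are pointers; follow them using the dq output.
qwords = {address: values[0] for address, values in parse_dump(DQ_OUTPUT)}
lock_offset = qwords[args["LockOffset"]]
lock_length = qwords[args["LockLength"]]

print(hex(args["FileHandle"]), lock_offset, lock_length,
      bool(args["FailImmediately"]), bool(args["ExclusiveLock"]))
```

FailImmediately being zero is what makes the call block rather than return an error, and ExclusiveLock being nonzero is what makes it an exclusive (rather than shared) byte-range lock.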

Putting this information together, it’s now possible to surmise what is going on here:

  1. The child processes created by PHP have a file handle to the session state file (probably there from process creation inheritance).
  2. PHP tries to acquire an exclusive lock on part of the session state file. This takes the file object lock for that file and waits for the file to become exclusively available.
  3. The child process exits. As part of exiting, it tries to acquire the file object lock so that it can close its file handle. However, PHP’s pending NtLockFile call is holding the file object lock, and that call cannot complete until the child process releases its handle.
  4. Deadlock! Neither thread can continue, and PHP appears to hang instead of configuring my graphs properly.

In this particular instance, it was actually possible to “recover” from the deadlock without rebooting; the IIS worker process’s wait in NtLockFile is marked as a UserMode wait, so it is possible to terminate the w3wp.exe process, which releases the file object lock and ultimately allows all the frozen processes that are trying to close a handle to the PHP session state file to finish the close handle operation and exit.
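The situation can be modeled as a small "wait-for" graph, where each entry says what a given party is waiting on (or, for a lock, who is currently holding it up). This is only a Python sketch of my hypothesis as described above, with node names I made up; it is not derived from anything the kernel actually reports.

```python
def find_cycle(waits_on, start):
    """Follow wait-for edges from start; return the cycle found, or None."""
    seen = []
    node = start
    while node in waits_on:
        if node in seen:
            return seen[seen.index(node):]
        seen.append(node)
        node = waits_on[node]
    return None


# Hypothesized dependencies, one outgoing edge per node.
waits_on = {
    "rrdtool exit thread": "file object lock",
    "file object lock": "w3wp NtLockFile thread",   # held on behalf of this call
    "w3wp NtLockFile thread": "byte-range lock",
    "byte-range lock": "rrdtool exit thread",       # needs the child's handle closed
}

deadlock = find_cycle(waits_on, "rrdtool exit thread")
print(deadlock)

# Terminating w3wp.exe removes its wait (and its hold on the file object lock).
after_kill = {k: v for k, v in waits_on.items()
              if "w3wp" not in k and "w3wp" not in v}
print(find_cycle(after_kill, "rrdtool exit thread"))
```

With all four edges present, the walk comes back around to where it started. Removing the w3wp.exe thread from the picture breaks the cycle, which matches what happened when I killed the worker process.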

This is actually a nasty little problem; it looks like it’s possible for one user mode process to indefinitely freeze another user mode process in kernel mode via a deadlock. Although you can break the deadlock by terminating the second user mode process, the fact that a user mode process can, at all, cause the kernel to deadlock during process exit (“breakable” or not) does not appear to be a good thing to me.

Meanwhile, knowing this right now doesn’t really solve my problem. Furthermore, I suspect that there’s probably a different problem here too, as the command line that was given to RRDTool (simply “-“) doesn’t look all that valid to me. I’ll see if I can come up with some way to work around this deadlock problem, but it definitely looks like an unpleasant one. If it really is a file handle being incorrectly inherited by a child process, then it might be possible to un-mark that handle for inheritance with some work. The fact that I am having to consider making something to patch PHP to work around this is definitely not a happy one, though…
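To illustrate what "un-marking a handle for inheritance" means, here is a minimal Python sketch using the standard library's `os.set_inheritable` on a temporary file. This only demonstrates the concept; it is not a patch for PHP. In native Win32 code the equivalent step would be clearing HANDLE_FLAG_INHERIT with SetHandleInformation, and where exactly that would need to happen inside PHP is something I have not worked out.

```python
import os
import tempfile


def clear_inherit(fd):
    """Mark a file descriptor as non-inheritable and report the new state."""
    os.set_inheritable(fd, False)
    return os.get_inheritable(fd)


with tempfile.TemporaryFile() as session_file:
    fd = session_file.fileno()

    # Simulate a handle that child processes would inherit.
    os.set_inheritable(fd, True)
    inheritable_before = os.get_inheritable(fd)

    # Un-mark it, so that newly created children no longer receive it.
    inheritable_after = clear_inherit(fd)

print(inheritable_before, inheritable_after)
```

Clearing the flag only affects child processes created afterwards; a child that has already inherited the handle keeps its copy until it closes it or exits.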

Silly me for thinking that it would just take a kernel debugger to get a web application running…