
Archive for January, 2007

Hooray for bad hardware… (and upcoming blog maintenance)

Wednesday, January 31st, 2007

Well, I just spent the past hour and a half trying to resurrect the box that hosts the blog after it froze again. I finally succeeded in getting it working (cross my fingers…). At this point, though, I think that it’s time to start working on moving the blog to a more reliable location. The current box has locked up a couple of times this month already (to the point where I can’t break in with either KD or SAC), which is worrisome. Furthermore, as the current box also happens to be routing for my apartment LAN, it locking up also means that the primary Internet connection for my entire apartment will go as well – ick. The fact that it took about an hour and a half of disassembling it and putting things back together to get it running is definitely something I take as a strong sign that it’s going to take a turn for the worse rather soon. Granted, this box was donated to me, so I suppose I can’t complain too much, but it’s time to get something more reliable.

Since I’m going to be switching boxes for the blog anyway (as the current box looks as if it is going to die, permanently, any day now), I am going to move it to a more capable hosting location (i.e. not my apartment) while I’m at it. Furthermore, I’ll probably be going ahead and upgrading to the next major WordPress version soon as well now that it’s out of beta. Assuming that goes well on the new blog server, it’ll be running 2.1 when I cut it over from the old box.

Hopefully, this should end up with the blog being a bit more responsive (the burst data rate at my apartment is a bit less than I would like for hosting the site), and (finally!) drive a stake through the periodic but short outages I’ve had due to flaky hardware over the past few months.

Posted in Blogging | 4 Comments »

Thoughts on PatchGuard (otherwise known as Kernel Patch Protection)

Monday, January 29th, 2007

Recently, there has been a fair bit of press about PatchGuard. I’d like to clarify a couple of things (and clear up some common misconceptions that appear to be floating around out there).

First of all, these opinions are my own, based on the information that I have available to me at this time, and are not sponsored by either Microsoft or my employer. Furthermore, these views are based on PatchGuard as it is implemented today, and do not relate to any theoretical extensions to PatchGuard that may occur sometime in the future. That being said, here’s some of what I think about PatchGuard:

PatchGuard is (mostly) not a security system.

Although some people out there might try to tell you this, I don’t buy it. The thing about PatchGuard is that it protects the kernel (and a couple of other Microsoft-supplied core kernel libraries) from being patched. PatchGuard also protects a couple of kernel-related processor registers (MSRs) that are used in conjunction with functionality like making system calls. However, this doesn’t really equate to improving computer security directly. Some persons out there would like to claim that PatchGuard is the next great thing in the anti-malware/anti-rootkit war, but they’re pretty much just wrong, and here’s why:

Malware doesn’t need to patch the kernel in the vast majority of all cases. Virtually all of the “interesting” things out there that malware does on compromised computers (create botnets, blast out spam emails, and the like) don’t require anything that is even remotely related to what PatchGuard blocks in terms of kernel patch prevention. Even in the rootkit case, most of the hiding of nefarious happenings could be done with clever uses of documented APIs (such as creating threads in other processes, or filesystem filters) without even having to patch the kernel at all. I will admit that many rootkits out there now do simply patch the kernel, but I attribute this mostly to 1) a lack of knowledge about how the kernel works on the part of rootkit authors, and 2) the fact that it might be “easier” to simply patch the kernel and introduce race conditions that crash rootkit’d computers rarely than to do things the right way. Once things like system call hooks in rootkits start to run afoul of PatchGuard, though, rootkit authors have innumerable other choices that are completely unguarded by PatchGuard. As a result, it’s not really correct (in my opinion) to call PatchGuard an anti-malware technology.
Malware authors are more agile than Microsoft. What I mean when I say “more agile” is that malware vendors can much more easily release new software versions without the “burdens” of regression testing, quality assurance, and the like. After all, if you’re already in the malicious software business, it probably doesn’t matter if 5% of your customers (sorry, victims) will have systems that crash with your software installed because you didn’t fully test your releases and fix a corner case bug. On the other hand, Microsoft does have to worry about this sort of thing (and Microsoft’s install base for Windows is huge), which means that Microsoft needs to be very, very careful about releasing updates to something as “dangerous” as PatchGuard. I say “dangerous” in the sense that if a PatchGuard version is released with a bug, it could very well cause some number of Microsoft’s customers to bluescreen on boot, which would clearly not fly very well. Given the fact that Microsoft can’t really keep up with malware authors (many of whom are dedicated to writing malicious code with financial incentives to keep doing so) when it comes to the “cat and mouse” game with PatchGuard, it doesn’t make sense to try and use PatchGuard to stop malware.
PatchGuard is targeted at vendors that are accountable to their customers. This point seems to be often overlooked, but PatchGuard works by making it painful for vendors to patch the kernel. This pain comes in the fact that ISVs who choose to bypass PatchGuard are at risk of causing customer computers to bluescreen on boot en masse the next time that Microsoft releases a PatchGuard update. For malware authors, this is really much less of an issue; one compromised computer is the same as another, and many flavors of malware out there will try to do things like break automatic updates anyway. Furthermore, it is generally easier for malware authors to push out new versions of software to victims (that might counter a PatchGuard update) than it is for most ISVs to deploy updates to their paying customers, who often tend to be stubborn on the issue of software updates, typically insisting on internal testing before mass deployment.
By the time malware is in a position to have to deal with PatchGuard as a potential blocking point, the victim has already lost. In order for PatchGuard to matter to malware, said malware must be able to run kernel level code on the victim’s computer. At this point, guarding the computer is really somewhat of a moot point, as the victim’s passwords, personal information, saved credit card numbers, secret documents, and whatnot are already toast in that the malware already has complete access to anything on the system. Now, there are a couple of scenarios (like trojaning a library computer kiosk) where the information desired by an attacker is not yet available, and the attacker has to hide his or her malicious code until a user comes around to type it in on a keyboard, but for the typical home or business computer scenario, the game is over the moment malicious code gets kernel level access to the box in question.

Now, for a bit of clarification. I said that PatchGuard was mostly not a security system. There is one real security benefit that PatchGuard confers on customers, and that is the fact that PatchGuard keeps out ill-written software that does things like patch system calls and then fail to validate parameters correctly, resulting in new security issues being introduced. This is actually a real issue, believe it or not. Now, not everything that hooks the kernel introduces race conditions, security problems, or the like; it is possible (though difficult) to do so in a safe and correct manner in many circumstances, depending on what it is exactly that you’re trying to do. However, most of the software out there does tend to do things incorrectly, and in the process often inadvertently introduces security holes (typically local privilege escalation issues).

PatchGuard is not a DRM mechanism.

People who try to tell you that PatchGuard is DRM-related likely don’t really understand what it does. Even with PatchGuard installed, it’s still possible to extend the kernel with custom drivers. Additionally, the things that PatchGuard protects are only related to the new HD DRM mechanisms in Vista in a very loose sense. Many of the DRM provisions in Windows Vista are implemented in the form of drivers and not the kernel image itself, and are thus not protected by PatchGuard. If nothing else, third party drivers are responsible for the negotiation and lowest level I/O as relating to most of the new HD DRM schemes, and present an attractive target for DRM-bypass attempts that PatchGuard has no jurisdiction over. Unless Microsoft supplies all multimedia-output-related drivers in the system, this is how it will have to stay for the foreseeable future, and it would be extremely difficult for Microsoft to protect just DRM-related drivers with PatchGuard in a non-trivially bypassable fashion.

Preventing poorly written code from hooking the kernel is good for Windows customers.

There is a whole lot of bad code out there that hooks things incorrectly, and typically introduces race conditions and crash conditions. I have a great deal of first hand experience with this, as our primary product here at Positive has run afoul of third party software that hooks things and causes hard to debug random breakage that we get the blame for from a customer support perspective. The most common offenders that cause us pain are third party hooks that inject themselves into networking-related code and cause issues by subtly breaking API semantics (for instance, we’ve run into certain versions of McAfee’s LSPs where if you call select in a certain way (yes, I know that select is hardly optimal on Windows), the LSP will close a garbage handle value, sometimes nuking something important in the process). We’ve also run into a ton of cases where poorly behaved Internet Explorer add-ons will cause various corruption and other issues that we tend to get the blame for when our (rather complicated, code-wise, being a full-fledged VPN client) browser-based product is used in a partially corrupted iexplore process and eventually falls over and crashes. Another common trouble case relates to third party software that attempts to hook the CreateProcess* family of routines in order to propagate hooked code to child processes, and in the process manages to break legitimate usage of CreateProcess in certain obscure scenarios.

In our case, though, this is all just third-party “junkware” that breaks stuff in usermode. When you get to poorly written code that breaks the kernel, things get just that much worse; instead of a process blowing up, now the entire system hangs, crashes (i.e. bluescreens), experiences subtle filesystem corruption, or has local guest-to-kernel privilege escalation vulnerabilities introduced. Not only are the consequences of failure that much worse with hooking in kernel mode, but you can guess who gets the blame when a Windows box bluescreens: Microsoft, even though that sleazy anti-virus (or whatnot) software you just installed was really to blame. Microsoft claims that there are a significant number of crashes on x86 Windows from third parties hooking the kernel wrong, and based on my own analysis of certain anti-virus software (and my own experience with hooking things blowing up our code in usermode), I don’t have any basis for disagreeing with Microsoft’s claim. From this perspective, PatchGuard is a good thing for consumers, as it represents a non-trivial attempt to force the industry to clean house and fix code that is at best questionable.

PatchGuard makes it significantly more difficult for ISVs out there which provide value on top of Windows through the use of careful hooking (or other non-blessed means of extending the kernel).

Despite the dangers involved in kernel hooking, it is in fact possible to do it right in many circumstances, and there really are products out there which require kernel-level alterations that simply can’t be done without hooking (or other aspects that are blocked by PatchGuard). Note that, for the most part, I consider AV software as not in this class of software, and if nothing else, Microsoft’s Live OneCare demonstrates that it is perfectly feasible to deploy an AV solution without kernel hacks. Nonetheless, there are a number of niche solutions that revolve around things like kernel-level patching, which are completely shut out by PatchGuard. This presents an unfortunate negative atmosphere to any ISV that falls into this zone of having deployed technology that is now blocked in principle by PatchGuard, just because XYZ large anti-virus (the most common offenders, though there are certainly plenty of non-AV programs out there that are similarly “kernel-challenged”) vendor couldn’t be bothered to clean up their code and do things the correct way.

From this perspective, PatchGuard is damaging to customers (and affected ISVs), as they are essentially forced to stay away from x64 (or take the risky road of trying to play cat-and-mouse with Microsoft and hack PatchGuard). This is unfortunate, and until recently, Microsoft has provided an outward appearance that they were taking a hard-line, stonewall stance against anything that ran afoul of PatchGuard, regardless of whether the program in question really had a “legitimate” need to alter the kernel. Fortunately, for ISVs and customers, Microsoft appears to have very recently (or at least, very recently started saying the opposite thing in public) warmed up to the fact that completely shutting out a subset of ISVs and Windows customers isn’t such a great idea, and has reversed its previous statements regarding its willingness to cooperate with ISVs that run into problems with PatchGuard. I see this as a completely positive thing myself (keeping in mind that at work here, we don’t ship anything that conflicts with PatchGuard), as it signals that Microsoft is willing to work with ISVs that have “legitimate” needs that are being blocked by PatchGuard.

There is no giant conspiracy in Microsoft to shut out the ISVs of the world.

While I may not completely agree with the new reality that PatchGuard presents to ISVs, I have no illusions that Microsoft is trying to take over the world with PatchGuard. Like it or not, Microsoft is quite aware that the success of the Windows platform depends on the applications (and drivers) that run on top of it. As a result, it would be grossly stupid of Microsoft to try and leverage PatchGuard as a way to shut out other vendors entirely; customers don’t like changing vendors and ripping out all the experience and training that they have with an existing install base, and so you can bet that Microsoft trying to take over the software market “by force” with the use of PatchGuard wouldn’t go over well with Windows customers (or help Microsoft in its sales case for future Windows upgrades featuring PatchGuard), not to mention the legal minefield that Microsoft would be waltzing into should they attempt to do such a thing. Now, that being said, I do believe that somebody “dropped the ball”, so to speak, as far as cooperating with ISVs when PatchGuard was initially deployed. Things do appear to be improving from the perspective of Microsoft’s willingness to work with ISVs on PatchGuard, however, which is a great first step in the right direction (though it will remain to be seen how well Microsoft’s current stance will work out with the industry).

PatchGuard does represent, on some level, Microsoft exerting control over what users do with their hardware.

This is the main aspect of PatchGuard that I am most uncomfortable with, as I am of the opinion that when I buy a computer and pay for software, I should absolutely be permitted to do what I want with it, even if that involves “dangerous” things like patching my own box’s kernel. In this regard, I think that Microsoft is attempting to play the “benevolent dictator” with respect to kernel software; drivers that are dangerous to the reliability of Windows computers, on average, are being blocked by Microsoft. Now, I do trust Microsoft and its code a whole lot more than most ISVs out there (I know first-hand how much more seriously Microsoft considers issues like reliability and security at the present day (I’m not talking about Windows 95 here…) than a whole lot of other software vendors out there). Still, I don’t particularly like the fact that PatchGuard is akin to Microsoft telling me that “sorry, that new x64 computer and Windows license you bought won’t let you do things that we have classified as dangerous to system stability”. For example, I can’t write a program on Windows x64 that patches things blocked by PatchGuard, even for research and education purposes. I can understand that Microsoft is finding themselves up against the wall here against an uncooperative industry that doesn’t want to clean up its act, but still, it’s my computer, so I had damn well better be able to use it how I like. (No, I don’t consider the requirement of having a kernel debugger attached at boot time as an acceptable one, especially in automated scenarios.)

PatchGuard could make it more difficult to prove that a system is uncompromised, or to analyze a known compromised system for clues about an attack, if the malware involved is clever enough.

If you are trying to analyze a compromised (or suspected compromised) system for information about an attack, PatchGuard represents a dangerous unknown: a deliberately obfuscated chunk of code with sophisticated anti-analysis/anti-debugging/anti-reverse-engineering that ships with the operating system. This makes it very difficult to look at a system and determine if, say, it’s really PatchGuard that is running every so often, or perhaps some malware that has hijacked PatchGuard for nefarious purposes. Without digging through layer upon layer of obfuscation and anti-debugging code, definitively saying that PatchGuard on a system is uncompromised is just plain not do-able. In this respect, PatchGuard’s obfuscation presents a potentially attractive place for malicious code to hide and avoid being picked up in the course of post-compromise forensics on a running system.

PatchGuard will not be obfuscation-based forever.

Eventually, I would expect that it will be replaced by hardware-enforced (hypervisor) based systems that utilize hardware-supported virtualization technology in new processors to provide a “ring -1” for code presently guarded by PatchGuard to execute in. This approach would move the burden of guarding kernel code to the processor itself, instead of the current “cat and mouse” game in software that exists with PatchGuard, as PatchGuard executes at the same privilege isolation level as code that might try to subvert it. Note that, in a hypervisor based system, hardware drivers would ideally be unable to cause damage (in terms of things like memory corruption and the like) to the kernel itself, which might eventually allow the system to continue functioning even if a driver fails. Of course, if drivers rely on being able to rewrite the kernel, this goal is clearly unattainable, and PatchGuard helps to ensure that in the future, there won’t be a backwards compatibility nightmare caused by a plethora of third-party drivers that rely on being able to directly alter the behavior of the kernel. (I suspect that x64 will supplant x86 in terms of being the operating system execution environment in the not-too-distant future.) In usermode, x86 will likely live on for a very long time, but as x64 processors can execute x86 code at native speed, this is likely to not be an issue.

When PatchGuard is hypervisor-backed, it won’t be feasible to simply patch it out of existence, which means that ISVs will either have to comply with Microsoft’s requirements or find a way to evade PatchGuard entirely.

Overall, I think that there are both significant positive (and negative) aspects to PatchGuard. Whether it will turn out to be the best business decision (or the best experience for customers) remains to be seen; in an ideal world, though, only developers that really understand the full implications of what they are doing would patch the kernel (and only in safe ways), and things like PatchGuard would be unnecessary. I fear that it has become too much to expect every programmer to do the right thing, though; one need only look at the myriad security vulnerabilities in software all over to see how little so many programmers care about correctness.

Posted in Security, Windows | 5 Comments »

Programming against the x64 exception handling support, part 6: Frame consolidation unwinds

Sunday, January 14th, 2007

In the last post in the programming x64 exception handling series, I described how collided unwinds were implemented, and just how they operate. That just about wraps up the guts of unwinding (finally), except for one last corner case: So-called frame consolidation unwinds.

Consolidation unwinds are a special form of unwind that is indicated to RtlUnwindEx with a special exception code, STATUS_UNWIND_CONSOLIDATE. This exception code changes RtlUnwindEx’s behavior slightly; it suppresses the usual behavior of replacing the unwound context’s Rip value with the TargetIp argument to RtlUnwindEx.
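To make that difference concrete, here is a minimal, portable sketch of that last step as I read it: a normal unwind has its resume address overwritten with TargetIp, while a consolidation unwind keeps whatever Rip the unwind produced. The types and the function name (MiniContext, MiniExceptionRecord, sketch_finish_unwind) are stand-ins invented for illustration, not the real ntdll definitions, and the numeric value used for STATUS_UNWIND_CONSOLIDATE is an assumption that should be checked against ntstatus.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value; verify against ntstatus.h before relying on it. */
#define SKETCH_STATUS_UNWIND_CONSOLIDATE 0x80000029u

/* Hypothetical, stripped-down stand-ins for CONTEXT and EXCEPTION_RECORD. */
typedef struct MiniContext {
    uint64_t Rip;
} MiniContext;

typedef struct MiniExceptionRecord {
    uint32_t ExceptionCode;
} MiniExceptionRecord;

/*
 * Models only the final decision: where execution resumes once the
 * context has been unwound. Everything else RtlUnwindEx does is omitted.
 */
void sketch_finish_unwind(MiniContext *unwound,
                          const MiniExceptionRecord *record,
                          uint64_t target_ip)
{
    assert(unwound != NULL && record != NULL);

    if (record->ExceptionCode != SKETCH_STATUS_UNWIND_CONSOLIDATE) {
        /* Ordinary unwind: resume at the caller-supplied TargetIp. */
        unwound->Rip = target_ip;
    }
    /* Consolidation unwind: Rip is left alone here and chosen later. */
}
```

With any other exception code, Rip ends up equal to the target address; with the consolidation code, Rip is untouched at this stage.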

As far as RtlUnwindEx goes, that’s all that there is to consolidation unwinds. There’s a bit more that goes on with this special form of unwind, though. Specifically, as with longjump style unwinds, there is special logic contained within RtlRestoreContext (used by RtlUnwindEx to realize the final, unwound execution context) that detects the consolidation unwind case (by virtue of the ExceptionCode member of the ExceptionRecord argument), and enables a special code path. As it is currently implemented, some of the logic relating to unwind consolidation is also resident within the pseudofunction RcFrameConsolidation. This function is tightly coupled with RtlRestoreContext; it is only separated into another logical function for purposes of describing RtlRestoreContext and RcFrameConsolidation in the unwind metadata for ntdll (or ntoskrnl).

The gist of what RtlRestoreContext/RcFrameConsolidation does in this case is essentially the following:

Make a local copy of the passed-in context record.
ExceptionRecord->ExceptionInformation[0] is treated as a callback function, and that callback is called with a single argument pointing to the ExceptionRecord provided to RtlRestoreContext.
The return value of the callback function is treated as a new Rip value to place in the context that is about to be restored.
The context is restored as normal after the Rip value is updated based on the callback’s decision.

The callback routine pointed to by ExceptionRecord->ExceptionInformation[0] has the following signature:

// Returns a new RIP value to be used in the restored context.
typedef ULONG64 (* PFRAME_CONSOLIDATION_ROUTINE)(
   __in PEXCEPTION_RECORD ExceptionRecord
   );
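The steps and the callback signature above can be put together in a small, portable simulation. This is only a sketch under stated assumptions: the Sketch* types are cut-down stand-ins for CONTEXT and EXCEPTION_RECORD, sketch_restore_context returns the context it would restore instead of actually loading registers, the status value is assumed (check ntstatus.h), and the example callback’s use of ExceptionInformation[1] as the continuation address is a convention invented here, not something the OS or the CRT is known to do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value; verify against ntstatus.h before relying on it. */
#define SKETCH_STATUS_UNWIND_CONSOLIDATE 0x80000029u
#define SKETCH_MAX_PARAMETERS 15

/* Hypothetical, cut-down stand-ins for EXCEPTION_RECORD and CONTEXT. */
typedef struct SketchExceptionRecord {
    uint32_t  ExceptionCode;
    uintptr_t ExceptionInformation[SKETCH_MAX_PARAMETERS];
} SketchExceptionRecord;

typedef struct SketchContext {
    uint64_t Rip;
    uint64_t Rsp;
} SketchContext;

/* Same shape as PFRAME_CONSOLIDATION_ROUTINE: returns the new Rip. */
typedef uint64_t (*SketchConsolidationRoutine)(SketchExceptionRecord *record);

/* Returns the context that would be restored; no registers are touched. */
SketchContext sketch_restore_context(const SketchContext *context,
                                     SketchExceptionRecord *record)
{
    assert(context != NULL);

    /* Step 1: work on a local copy of the passed-in context. */
    SketchContext local = *context;

    if (record != NULL &&
        record->ExceptionCode == SKETCH_STATUS_UNWIND_CONSOLIDATE) {
        /* Step 2: ExceptionInformation[0] holds the callback. */
        SketchConsolidationRoutine callback =
            (SketchConsolidationRoutine)record->ExceptionInformation[0];

        /* Step 3: the callback's return value becomes the new Rip. */
        local.Rip = callback(record);
    }

    /* Step 4: the real routine would now restore this context. */
    return local;
}

/* Example callback; reading slot 1 is this sketch's own convention. */
uint64_t sketch_pick_continuation(SketchExceptionRecord *record)
{
    return (uint64_t)record->ExceptionInformation[1];
}
```

Storing sketch_pick_continuation in slot 0 and an address in slot 1 yields a restored context whose Rip is that address, while the caller’s original context is left unmodified because only the local copy is edited.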

After the confusing interaction between multiple instances of the same function that is collided unwinds, frame consolidation unwinds may seem a little bit anti-climactic.

Frame consolidation unwinds are typically used in conjunction with C++ try/catch/throw support. Specifically, when a C++ exception is thrown, a consolidation unwind is executed within a function that contains a catch handler. The frame consolidation callback is used by the CRT to actually invoke the various catch filters and handlers (these do not necessarily directly correspond to standard C SEH scope levels). The C++ exception handling routines use the additional ExceptionInformation fields available in a consolidation unwind in order to pass information about the exception object itself; this usage of the ExceptionInformation fields does not have any special support in the OS-level unwind routines, however. Once the C++ exception is going to cross a function-level boundary, in my observations, it is converted into a normal exception for purposes of unwinding and exception handling. Then, consolidation unwinds are used again to invoke any catch handlers encountered within the next function (if applicable).

Essentially, consolidation unwinds can be thought of as a normal unwind, with a conditionally assigned TargetIp whose value is not determined until after all unwind handlers have been called, and the specified context has been unwound. In most circumstances, this functionality is not particularly useful (or critical) when speaking in terms of programming the raw OS-level exception support directly. Nonetheless, knowing how it works is still useful, if only for debugging purposes.

I’m not covering the details of how all of the C++ exception handling framework is built on top of the unwind handling routines in this post series; there is no further OS-level “awareness” of C++ exceptions within the OS’s unwind related support routines, however.

Next up: Tying it all together, or using x64’s improved exception handling support to our gain in the real world.

Posted in NT Internals, Programming, Windows | 1 Comment »

Things that don’t quite work right in Windows Vista

Friday, January 12th, 2007

Having switched to Windows Vista full-time, I’ve now had an opportunity to run into most of the little “daily annoyances” that detract from the general experience (at least, my experience, anyway – your mileage may vary). Many of these problems existed in Windows XP, but that doesn’t change the fact that they’re annoying and/or frustrating. Some of them are new to Vista. Following is a list of some of the big annoyances that bother me on a regular basis:

Vista’s 1394 support is mostly broken. Specifically, the 1394 storage support (a la sbp2port.sys) in Vista is pretty much horrific. This sad state of affairs is carried over from Windows XP and Windows Server 2003, and isn’t really new to Vista, but it’s highly frustrating nonetheless. It seems that the 1394 storage support is prone to randomly breaking (in my experience), and when it breaks, it breaks badly; usually, you end up with either a bluescreen in sbp2port.sys due to it corrupting one of its internal data structures, or the I/O system grinds to a halt as all IRPs that make their way to a 1394 storage device get “stuck” forever, essentially permanently pinning (and freezing) any process that touches a 1394 storage device. Since there are an amazing number of things that do this indirectly by asking for information about storage volumes, this typically manifests itself as an inability to do practically anything (logon, logoff, start programs, and so forth).
This problem is particularly prone to happening when you disconnect a 1394 device, but I’ve also seen it spontaneously happen without user interaction (and without cables becoming loose, as far as I can tell). I’ve experienced these problems on Windows XP, Windows Server 2003, and Windows Vista (x64 and x86), across a wide range of 1394 chipsets and 1394 storage devices.

In order to recover from this breakage, a hard reboot is usually necessary, since shutting down is virtually impossible due to any I/Os hitting a 1394 device getting frozen indefinitely.

This is really a shame, as 1394 is a pretty nice bus for storage devices. Other uses of 1394 on Windows are fairly stable; kernel debugging works fine, and networking used to work fine, although Vista removes support for IP/1394 (this negatively impacted me, as it was fairly handy for transferring large amounts of data between laptops which typically have 1394 but not gigabit ethernet; not the end of the world, but it is a feature that I used which disappeared out from under me).
Hybrid sleep takes forever to complete the low power transition on a laptop with 2GB (or 1.5GB) of RAM. Microsoft advertises it as powering the system down in a few seconds for a laptop, but as far as I can tell, this is pure marketing fiction. It regularly takes 30-45 seconds (sometimes more) for hybrid sleep to finish doing its thing and transition the system to a low power state. I’ve observed this on at least two respectable laptops, so I’m fairly sure that this isn’t just bad hardware, but rather a limitation of the hibernate-based technology. Still, despite the annoyance factor, I think hybrid sleep is an overall improvement (especially with being able to swap out batteries without fear if your system went into suspend due to low power).
The Windows Vista RDP client has an ill-advised default behavior relating to the selection of the account domain name sent to the remote system when logging on. Unlike previous RDP client versions, the Vista RDP client requires you to enter credentials before connecting. The problem here is that if you don’t explicitly specify a domain name as part of your username, and you aren’t connecting to a target system with its NetBIOS name, then your logon will virtually always fail. This is because the RDP client will prepend “hostname\” to the username sent to the remote system, where “hostname” is the hostname you tell the RDP client to connect to. This results in all sorts of stupidly broken scenarios, where mstsc provides an IP address as your logon domain, or a FQDN for a computer that isn’t a member of that domain, or a different FQDN that leads to a computer but doesn’t match the computer name. All of these will result in a failed logon, for no particularly good reason. The workaround to this is fairly simple; specify “\username” as your username in mstsc, and the local (SAM) account database of the remote system is used to log you on. Still, there is almost no reason to not default to this behavior…
Credential saving in IE7 is unintuitive at best. Specifically, even if you check the “save my credentials” box, the credentials are mysteriously not used unless you explicitly put the target site in your Intranet zone. This is highly annoying until you figure out what’s going on, as the credential manager UI in the control panel shows your credentials as being saved, but they’re never actually used. At the very least, there needs to be some kind of UI feedback on the save credentials dialog if your current settings dictate that your saved credentials will never actually be used.
Switching users will kill RAS (dial-up) connections. There used to be a nicely documented way for how to suppress this behavior on pre-Vista systems, but from disassembling and debugging Vista, it’s abundantly clear that there is no possible way to suppress this behavior in Vista (at least without patching rasmans.dll to not kill connections on logon/logoff). Vista does have an explicit exemption for a connection shared by Internet Connection Sharing (ICS), that prevents it from being killed on logon/logout, but even this is stupidly buggy: there are no checks done to make sure that dependent connections of the ICS RAS connection aren’t killed on logout. This is especially ironic, given all of the work that has been put into developing routing compartments for Vista and “Longhorn” (culminating in the nice advantages of them being completely ignored on Vista). For me, this is a real problem, as I often use a dependent RAS connection setup for remote Internet access: A dial-up connection through Bluetooth to my cell phone (EVDO) for physical Internet access, and then an L2TP/IPSec link on top of that for security. Unfortunately, this means that no matter what I do, each time I switch users in Vista, my mobile Internet access gets nuked. I’ve been pondering writing something to patch out this stupidness in rasmans.dll…
Something seems to cause Vista to hold on to IP addresses after their associated adapter has gone down. I’ve noticed this with my Broadcom NetXtreme 57xx gigabit NIC; my previous laptop had a similar problem with its Broadcom 100mbit NIC as well. I don’t know if this is a Broadcom problem or an OS-level problem, but what ends up happening is that tcpip still claims ownership of the IP address that you had assigned to you while that interface is disconnected (media unplugged). Normally, this isn’t a big problem, but if you have a setup where VPN’ing in and plugging into Ethernet will net you the same static IP on a network, you’ll occasionally run into bizarre problems where the VPN connection doesn’t get an IP address. This tends to be a result of Vista claiming that same IP address that you would be assigned via the VPN on an ethernet interface that is now media-disconnected (but was previously connected to the network in question and had the IP address in question). This problem isn’t new to Vista; I’ve seen it on Windows XP as well. I’ve pretty much given up on retaining the same IP for remoting and just being plugged directly into my network as a result of this problem…
You can’t use runas on explorer or IE anymore. This means you’re forced to use Fast User Switching for things that UAC doesn’t have a convenient UI for, which also means that you’re in trouble if you’re using RAS for Internet access. Doh.
Managing RAS connections is a royal pain. In order to make a copy of a RAS connection, rename it from the default name (“original name – Copy”), and edit the properties (presumably, you wanted to actually change the connection without just duplicating it for no reason, right?), you need to go through no less than three elevation prompts in rapid succession. This isn’t so terrible if you’re logged on as an admin, but having to type your admin password three different times (if you’re logged on as a limited user, the “recommended” configuration) is just stupidly redundant and annoying.
There is no way that I can tell to change the default domain name used in UAC elevation prompts (if you’re a non-admin user providing over-the-shoulder (OTS) credentials). This sucks if you’re on a domain, as the typical “recommended” scenario is that you would be logged on with a domain account (and a limited user) on the local system. If you need to perform an administrative task, then you elevate using a local admin account (presumably, you don’t give all your users domain admin accounts). Unfortunately, Vista always defaults to your domain as the logon domain, instead of the local domain (for elevation prompts). This means that in the “recommended”, “secure” configuration, you have to type computername\ before your account name every single time you get an elevation prompt. This is a basic convenience thing that really needs a way to change the default logon domain, as you pretty much always use one or the other, and if you aren’t using the network domain for elevation, you’re stuck doing completely redundant extra work each time you elevate.
You can’t scroll down in the bottom pane of the “Networking” tab in Task Manager anymore, if you have enough adapters to cause a scroll bar to appear. Each time Task Manager refreshes, the listbox scrolls to the top and blows away whatever you were looking at. This (minor, but still annoying) issue is new to Vista.
There is no way to modify Terminal Server connection ACLs. The tool to do this (tscc.msc) isn’t shipped on “workstation” systems. It’s there on Windows Server 2003, but not Windows XP. I suppose this is a “business decision” by Microsoft, but I happen to want to be able to make myself able to perform session connections from the command line or Task Manager without going through the switch user GUI. This is a pretty minor gripe, but it’s still something that wastes a bit of my time each time I need to switch sessions.

Okay, enough ranting about things that aren’t quite the way I would like in Vista. Despite these shortcomings (which you may or may not agree about), I still think that Vista’s worth having over XP (there are a whole lot of things that *do* work right in Vista, and plenty of useful new additions to boot). Nothing’s perfect, though, and improving awareness of issues can only improve the chances of them getting fixed, someday.

Oh, and if anyone’s got any suggestions or information about how to work around some of the problems I’ve talked about here, I’d be happy to hear them.

Posted in Windows | 6 Comments »

Don’t always trust the compiler… (or when reverse engineering comes in handy even when you’ve got source code)

Wednesday, January 10th, 2007

Usually, when a program breaks, you look for a bug in the program. On the rare occasion, however, compilers have been known to malfunction.

I ran into such a problem recently. At my apartment, I have a video streaming system set up, wherein I have a TV tuner plugged into a dedicated desktop box. That desktop box has been set up to run VLC noninteractively in order to stream (broadcast) TV from the TV tuner onto my apartment LAN. Then, if I want to watch TV, all I have to do is pull up VLC at a computer and tell it to display the MPEG stream I have configured to be broadcast on my local network.

This works fairly well (although VLC isn’t without its quirks), and it’s got the nice side effect that I have a bit more flexibility as to where I want to watch TV without having to invest in extra hardware (beyond a TV tuner). Furthermore, I can even do silly things like put TV up on multiple monitors if I really wanted to, something not normally doable if you just use a “plain” TV set (the old fashioned way!).

Recently, though, one of my computers ceased being able to play the MPEG stream I was running over my network. Investigation showed that other computers on the LAN weren’t having problems displaying the stream; only this one system in particular wouldn’t play the stream correctly. When I connected VLC to the stream, I’d get a blank black screen with no audio or video. I checked out the VLC debug message log and found numerous instances of this log message:

warning: received buffer in the future

Hmm. It seemed like VLC was having timing-related problems that were causing it to drop frames. My first reaction was that VLC had some broken-ness relating to the handling of large uptimes (this system in question had recently exceeded the “49.7 day boundary”, wherein the value returned by GetTickCount, a count in milliseconds of time elapsed since the system booted, wraps around to zero). I set out to prove this assumption by setting a breakpoint on kernel32!GetTickCount in the debugger and attaching VLC to the stream. While GetTickCount was occasionally called, it turned out that it wasn’t being used in the critical code path in question.
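For reference, the “49.7 day boundary” falls straight out of the counter width: GetTickCount returns a 32-bit count of milliseconds, so it can only hold 2^32 ms before wrapping. Here is a minimal sketch of that arithmetic in plain C (no Windows headers; counter_wrap_days is a name I made up for illustration, not anything from VLC or the Windows API):

```c
#include <assert.h>

/* Days until a free-running counter of the given width wraps to zero,
   assuming it advances at ticks_per_second. Hypothetical helper. */
double counter_wrap_days(unsigned bits, double ticks_per_second)
{
    assert(ticks_per_second > 0.0);

    double range = 1.0;
    for (unsigned i = 0; i < bits; i++)
        range *= 2.0;                 /* range = 2^bits ticks */

    return range / ticks_per_second / 86400.0;   /* 86400 seconds per day */
}
```

Calling counter_wrap_days(32, 1000.0) models a 32-bit millisecond counter and lands a little over 49.7 days.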

So, I set out to find that log message in the VLC source code (VLC is open source). It turned out to be coming from a function relating to audio decoding (aout_DecPlay). The relevant code turned out to be as follows (reformatting by me):

[...]
if ( p_buffer->start_date > mdate() +
     p_input->i_pts_delay           +
     AOUT_MAX_ADVANCE_TIME )

{
     msg_Warn( p_aout,
      "received buffer in the future ("I64Fd")",
       p_buffer->start_date - mdate());

[...]

After logging this warning, the function in question drops the frame with the assumption that it is probably bogus due to bad timing information.

Clearly, there was nothing wrong with the stream itself, as I could still play the stream fine on other computers. In fact, restarting VLC on the computer hosting the stream, or restarting the computer hosting the stream itself, did nothing to resolve the problem; other computers could play the stream, except for one system (with a high uptime) that would always fail due to bad timing information.

In this case, it turns out that the mdate function is an internal VLC function used all over the place for high resolution timing. It returns a microsecond-precision counter that is monotonically incrementing since VLC started (or in the case of Win32 VLC, since Windows was started). I continued to suspect that something was wrong here (as the only system that was failing to play the stream had a fairly high uptime). Looking into the source for mdate, there were two code paths that could be taken on Win32; one that used GetTickCount for timing resolution (though this code path in question does handle tick count wraparound), and another path that utilizes QueryPerformanceCounter and QueryPerformanceFrequency for high resolution timing, if VLC thinks that the performance counter is slaved to the system timer clock. (Whether or not the latter is really a good thing to do period on Windows is debatable; I would say no, but it appears to work for VLC.)

As I had already ruled out GetTickCount as being used in the timing-critical parts of VLC, I ignored the GetTickCount-related code path in mdate. This left the following segment of code in the Win32 version of mdate:

/**
 * Return high precision date
 *
 * Uses the gettimeofday() function when
 *  possible (1 MHz resolution) or the
 * ftime() function (1 kHz resolution).
 */
mtime_t mdate( void )
{
 /* We don't need the real date, just the value of
    a high precision timer */
 static mtime_t freq = I64C(-1);

 if( freq == I64C(-1) )
 {
  /* Extract from the Tcl source code:
   * (http://www.cs.man.ac.uk/fellowsd-bin/TIP/7.html)
   *
   * Some hardware abstraction layers use the CPU clock
   * in place of the real-time clock as a performance counter
   * reference.  This results in:
   * - inconsistent results among the processors on
   *   multi-processor systems.
   * - unpredictable changes in performance counter frequency
   *   on "gearshift" processors such as Transmeta and
   *   SpeedStep.
   * There seems to be no way to test whether the performance
   * counter is reliable, but a useful heuristic is that
   * if its frequency is 1.193182 MHz or 3.579545 MHz, it's
   * derived from a colorburst crystal and is therefore
   * the RTC rather than the TSC.  If it's anything else, we
   * presume that the performance counter is unreliable.
   */

  freq = ( QueryPerformanceFrequency( (LARGE_INTEGER *)&freq )
      && (freq == I64C(1193182) || freq == I64C(3579545) ) )
      ? freq : 0;
 }

 if( freq != 0 )
 {
  LARGE_INTEGER counter;
  QueryPerformanceCounter (&counter);

  /* Convert to from (1/freq) to microsecond resolution */
  /* We need to split the division to avoid 63-bits
       overflow */
  lldiv_t d = lldiv (counter.QuadPart, freq);

  return (d.quot * 1000000)
    + ((d.rem * 1000000) / freq);
 }
[...]
}

This code isn’t all that hard to follow. The idea is that the first time around, mdate will check the performance counter frequency for the current system. If it is one of two magical values, then mdate will be configured to use the performance counter for timing. Otherwise, it is configured to use an alternate method (not shown here), which is based on GetTickCount. On the system in question, mdate was being set to use the performance counter and not GetTickCount.

Assuming that mdate has decided on using the system performance counter for timing purposes (which, again, I do not believe is a particularly good (portable) choice, though it does happen to work on my system), then mdate simply divides out the counter value by the frequency (count of counter units per second), adjusted to return a microsecond value (hence the constant 1000000 values). The reason why the original author split up the division into two parts is evident from the comment; it is an effort to avoid an integer overflow when performing math on large quantities (it avoids multiplying an already very large (64-bit) value by 1000000 before the division, which might then exceed 64 bits in the resultant quantity). (In case you were wondering, lldiv is a 64-bit version of the standard C runtime function ldiv; that is, it performs an integral 64-bit division with remainder.)

Given this code, it would appear that mdate should be working fine. Just to be sure, though, I decided to double check what was going on in the debugger. Although VLC was built with gcc (and thus doesn’t ship with WinDbg-compatible symbol files), mdate is a function exported by one of the core VLC DLLs (libvlc.dll), so there wasn’t any great difficulty in setting a breakpoint on it with the debugger.
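To make the difference concrete, here is a minimal sketch in portable C that puts the split division from the source next to the “scale first, then divide” form it is trying to avoid. The function names are mine, not VLC’s, and the naive version is deliberately written with unsigned arithmetic so that its wraparound is well-defined C rather than undefined signed overflow:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Divide first, then scale quotient and remainder separately
   (the shape of the expression in the VLC source). */
int64_t ticks_to_us_split(int64_t counter, int64_t freq)
{
    assert(counter >= 0 && freq > 0);

    lldiv_t d = lldiv(counter, freq);
    return d.quot * 1000000 + (d.rem * 1000000) / freq;
}

/* Scale first, then divide. The multiply silently wraps modulo 2^64
   once counter * 1000000 no longer fits. */
uint64_t ticks_to_us_naive_wrapping(uint64_t counter, uint64_t freq)
{
    assert(freq > 0);

    return (counter * UINT64_C(1000000)) / freq;
}
```

With a 3579545 Hz counter, both functions agree for a counter value worth 100 seconds, but for a value worth 60 days only the split version still returns the expected 5184000000000 microseconds.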

What I found was that mdate was in fact returning a strange value (to be precise, a large negative value – mtime_t is a signed 64-bit integer). Given the expression used in the audio decoding function snippet I listed above, it’s no surprise why that would break if mdate returned a negative value (and it’s a good assumption that other code in VLC would similarly break).

The relevant code portions for the actual implementation of mdate that gcc built were as follows:

libvlc!mdate+0xe0:
62e20aa0 8d442428     lea     eax,[esp+0x28]
62e20aa4 890424       mov     [esp],eax
;
; QueryPerformanceCounter(&counter)
;
62e20aa7 e874640800   call    QueryPerformanceCounter
62e20aac 83ec04       sub     esp,0x4
62e20aaf b940420f00   mov     ecx,0xf4240 ; 1000000
62e20ab4 8b742428     mov     esi,[esp+0x28]
62e20ab8 8b7c242c     mov     edi,[esp+0x2c]
62e20abc 89f0         mov     eax,esi
62e20abe f7e1         mul     ecx
62e20ac0 89c1         mov     ecx,eax
62e20ac2 69c740420f00 imul    eax,edi,0xf4240 ; 1000000
62e20ac8 890c24       mov     [esp],ecx
62e20acb 8b3dcc7a2763 mov     edi,[freq.HighPart]
62e20ad1 8d3402       lea     esi,[edx+eax]
62e20ad4 897c240c     mov     [esp+0xc],edi
62e20ad8 8b15c87a2763 mov     edx,[freq.LowPart]
62e20ade 89742404     mov     [esp+0x4],esi
62e20ae2 89542408     mov     [esp+0x8],edx
;
; lldiv(...)
;
62e20ae6 e815983e00   call    lldiv
62e20aeb 8b5c2430     mov     ebx,[esp+0x30]
62e20aef 8b742434     mov     esi,[esp+0x34]
62e20af3 8b7c2438     mov     edi,[esp+0x38]
62e20af7 83c43c       add     esp,0x3c
62e20afa c3           ret

This bit of code might look a bit daunting at first, but it’s not too bad. Translated into C, it looks approximately like so:

LARGE_INTEGER counter, tmp;
lldiv_t d;

QueryPerformanceCounter(&counter);

/* 64-bit multiply by 1000000 built from 32-bit pieces;
   anything carried out past bit 63 is lost */
tmp.LowPart  = counter.LowPart  * 1000000;
tmp.HighPart = counter.HighPart * 1000000 +
    (((unsigned __int64)counter.LowPart  * 1000000) >> 32);

d = lldiv(tmp.QuadPart, freq);

return d.quot;

This code looks a little bit weird, though. It’s not exactly the same thing that we see in the VLC source code, even accounting for differences that might arise between original C source code and reverse engineered C source code; in the compiled code, the expression in the return statement has been moved before the call to lldiv.
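The low-part/high-part arithmetic in the translation above is nothing more exotic than a 64-bit multiply assembled from 32-bit operations, which is what a 32-bit x86 target has to do. Here is a rough sketch of the same sequence in portable C (the function names are hypothetical, and the comments map each step back to the instruction I believe it corresponds to):

```c
#include <assert.h>
#include <stdint.h>

/* Multiply a 64-bit value by a 32-bit constant using 32-bit pieces.
   The result is truncated to 64 bits, exactly like the compiled code. */
uint64_t mul64x32_limbs(uint64_t value, uint32_t k)
{
    uint32_t lo = (uint32_t)value;
    uint32_t hi = (uint32_t)(value >> 32);

    uint64_t lo_product = (uint64_t)lo * k;          /* mul ecx -> edx:eax   */
    uint32_t out_lo = (uint32_t)lo_product;
    uint32_t out_hi = hi * k                         /* imul eax,edi,0xf4240 */
                    + (uint32_t)(lo_product >> 32);  /* lea esi,[edx+eax]    */

    return ((uint64_t)out_hi << 32) | out_lo;
}

/* Sanity check: the limb version agrees with a native wrapping multiply. */
void check_limbs(uint64_t value)
{
    assert(mul64x32_limbs(value, 1000000u) == value * UINT64_C(1000000));
}
```

The point of the exercise is that nothing in this sequence checks for a carry out of the high half; the product is simply reduced modulo 2^64.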

In fact, the code has been heavily optimized. The compiler (gcc, in this case) apparently assumed some knowledge about the inner workings of lldiv, and decided that it would be safe to pre-calculate an input value instead of performing post-calculations on the result of lldiv. The calculations do appear to be equivalent, at first; the compiler simply moved a multiply around relative to a division that used remainders. Basic algebra tells us that there isn’t anything wrong with doing this.

However, there’s one little complication: computers don’t really do “basic algebra”. Normally, in math, you typically assume an unlimited space for variables and intermediate values, but in computer-land, this isn’t really the case. Computers approximate the set of all integer values in a 32-bit (or 64-bit) number-space, and as a result, there is a cap on how large (or small) of an integer you can represent natively, at least without going to a lot of extra work to support truly arbitrarily large integers (as is often done in public key cryptography implementations).

Taking a closer look at this case, there is a problem; the optimizations done by gcc cause some of the intermediate values of this calculation to grow to be very large. While the end result might be equivalent in pure math, when dealing with computers, the rules change a bit due to the fact that we are dealing with an integer with a maximum size of 64 bits. Specifically, this ends up being a problem because the gcc-optimized version of mdate multiplies the raw value of “counter” by 1000000 (as opposed to multiplying the result of the first division by 1000000). Presumably, gcc has performed this optimization as multiply is fairly cheap as far as computers go (and division is fairly expensive in comparison).

Now, while one might naively assume that the original version of mdate and the instructions emitted by gcc are equivalent, with the above information in mind, it’s clear that this isn’t really the case for the entire range of values that might be returned by QueryPerformanceCounter. Specifically, if the counter value multiplied by 1000000 exceeds the range of a 64-bit integer, then the two versions of mdate will not return the same value, as in the second version, one of the intermediate values of this calculation will “wrap around” (and in fact, to make matters worse, mdate is dealing with signed 64-bit values here, which limits the size of an integer to 63 significant bits, with one bit reserved for the representation of the integer’s sign).
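It’s possible to put a rough number on “high uptime” here. The sketch below (helper names are mine) works out how long a counter can run at a given frequency before counter * 1000000 stops fitting in a signed 64-bit integer. It assumes the counter starts at zero at boot and ticks at exactly the reported frequency, which is an idealization:

```c
#include <assert.h>
#include <stdint.h>

/* Largest counter value whose product with 1000000 still fits in a
   signed 64-bit integer. */
int64_t max_safe_counter(void)
{
    return INT64_MAX / 1000000;
}

/* Whole days of counting at freq ticks per second before
   counter * 1000000 overflows a signed 64-bit integer. */
int64_t days_until_overflow(int64_t freq)
{
    assert(freq > 0);

    return max_safe_counter() / freq / 86400;
}
```

For the two frequencies that mdate accepts, this gives 29 whole days at 3579545 Hz and 89 whole days at 1193182 Hz, so a machine only has to stay up for about a month before the mis-optimized code starts returning garbage.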

This can be experimentally confirmed in the debugger, as I previously alluded to. Stepping through mdate, around the call to lldiv specifically, we can see that the intermediate value has exceeded the limits of a 63-bit integer with sign bit:

0:007>
eax=d5ebcb40 ebx=00369e99 ecx=6f602e40
edx=00369e99 esi=d5f158ce edi=00000000
eip=62e20ae6 esp=01d0fbfc ebp=00b73068
iopl=0         ov up ei ng nz nape cy
cs=001b  ss=0023  ds=0023  es=0023
fs=003b  gs=0000  efl=00000a83
libvlc!mdate+0x126:
62e20ae6 e815983e00       call    lldiv
0:007> dd @esp
01d0fbfc  6f602e40 d5f158ce 00369e99 00000000
01d0fc0c  00000060 01d0ffa8 77e6b7d0 77e6bb00
01d0fc1c  ffffffff 77e6bafd 5d29c231 00000e05
01d0fc2c  00000000 00b73068 00000000 62e2a544
01d0fc3c  00000f08 ffffffff 01d0fd50 00b72fd8
01d0fc4c  03acc670 00a49530 01d0fd50 6b941bd2
01d0fc5c  00a49530 00000f30 00000000 00000003
01d0fc6c  00b24130 00b53e20 0000000f 00b244a8
0:007> p
eax=e1083396 ebx=00369e99 ecx=ffffffff
edx=ffffff3a esi=d5f158ce edi=00000000
eip=62e20aeb esp=01d0fbfc ebp=00b73068
iopl=0         nv up ei pl nz na po nc
cs=001b  ss=0023  ds=0023  es=0023
fs=003b  gs=0000  efl=00000206
libvlc!mdate+0x12b:
62e20aeb 8b5c2430         mov     ebx,[esp+0x30]

Using our knowledge of calling conventions, it’s easy to retrieve the arguments to lldiv from the stack: tmp.QuadPart is 0xd5f158ce6f602e40, and freq is 0x0000000000369e99.

It’s clear that the scaled counter value (tmp.QuadPart) has overflowed here; considering the sign bit is set, it now holds a (very large) negative quantity. Since the remainder of the function does nothing that would influence the sign bit of the result, after the division, we get another (large, but closer to zero) negative value back (stored in edx:eax, the value 0xffffff3ae1083396). This is the final return value of mdate, which explains the problems I was experiencing with playing video streams; large negative values were being returned and causing sign-sensitive inequality tests on the return value of mdate (or a derivative thereof) to operate unexpectedly.
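The numbers in the debugger output can be double checked without VLC at all. Here is a small sketch in portable C that feeds the two stack arguments into an ordinary signed 64-bit division (the helper names are mine; memcpy is used to reinterpret the bit pattern so as not to lean on an implementation-defined conversion):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as a signed two's complement value. */
int64_t as_signed(uint64_t bits)
{
    int64_t value;
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* The quotient that lldiv produces for the arguments seen on the stack. */
int64_t mdate_result_from_dump(void)
{
    int64_t tmp  = as_signed(UINT64_C(0xd5f158ce6f602e40)); /* first argument */
    int64_t freq = INT64_C(0x369e99);                       /* 3579545 Hz     */

    assert(tmp < 0 && freq == 3579545);
    return tmp / freq;   /* truncates toward zero, like lldiv */
}
```

The result is -846628113514, which is the same bit pattern as the 0xffffff3ae1083396 left in edx:eax. Read as microseconds, that is a clock sitting roughly 9.8 days before zero.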

In this case, it turned out that VLC failing to play my video stream wasn’t really the fault of VLC; it ended up being a bug in gcc’s optimizer that caused it to make unsafe optimizations that introduce calculation errors. What’s particularly insidious about this mis-optimization is that it is invisible until the input values for the operations involved grow to a certain size, after which calculation results are wildly off. This explains why nobody else has run into this problem in VLC enough to get it fixed by now; unless you run VLC on Windows systems with a high uptime, where VLC is convinced that it can use the performance counter for timing, you would never know that the compiler had introduced a subtle (but serious) bug due to optimizations.

As to fixing this problem, there are a couple of approaches the VLC team could take. The first is to update to a more recent version of gcc (if newer gcc versions fix this problem; I don’t have a build environment that would let me compile all of VLC, and I haven’t really had much luck in generating a minimal repro for this problem, unfortunately). Alternatively, the function could be rewritten until gcc’s optimizer decided to stop trying to optimize the division (and thus introduce calculation errors).

A better solution would be to just drop the usage of QueryPerformanceCounter entirely, though. For VLC’s usage, GetTickCount should be close enough timing-wise, and you can even increase the resolution of GetTickCount up to around 1ms (with good hardware) using timeBeginPeriod. GetTickCount does have the infamous 49.7-day wraparound problem, but VLC does have a workaround that works. Furthermore, on Windows Vista and later, GetTickCount64 could be used, turning the 49.7-day limit into a virtual non-issue (at least in our lifetimes, anyway).
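For illustration, here is one common way to paper over a wrapping 32-bit tick counter: widen it to 64 bits by watching for the raw value going backwards. This is only a sketch under my own assumptions (the tick_extender type and tick_extend function are hypothetical, and I am not claiming this is how VLC’s workaround is written); it also assumes single-threaded use and that it is called at least once per wrap period:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* State for widening a wrapping 32-bit millisecond counter to 64 bits. */
typedef struct {
    uint32_t last;    /* most recent raw sample        */
    uint64_t high;    /* accumulated multiples of 2^32 */
    int      primed;  /* nonzero once a sample is seen */
} tick_extender;

/* Feed in a raw 32-bit sample (e.g. from GetTickCount) and get back a
   64-bit count that keeps increasing across wraps. */
uint64_t tick_extend(tick_extender *st, uint32_t raw)
{
    assert(st != NULL);

    if (st->primed && raw < st->last)
        st->high += UINT64_C(1) << 32;   /* counter wrapped since last call */

    st->last = raw;
    st->primed = 1;
    return st->high + raw;
}
```

A zero-initialized tick_extender is a valid starting state. On Vista and later, GetTickCount64 makes this kind of bookkeeping unnecessary.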

(Oh, and in case you’re wondering why I didn’t just fix this myself and submit a patch to VLC (after all, it’s open source, so why can’t I just “fix it myself”?): VLC’s source code distribution is ~100mb uncompressed, and I don’t really want to go spending a great deal of time to find a cygwin version that works correctly on Vista x64 with ASLR and NX enabled (cygwin’s fault, not Vista’s) so that I can get a build environment for VLC up and running so that I could test any potential fix I make (after debugging the inevitable build environment difficulties along the way for such a large project). I might still do this at some point, perhaps to see if recent gcc versions fix this optimizer bug, though.)

For now, I just patched my running VLC instance in-memory to use GetTickCount instead, using the debugger. Until I restart VLC, that will have to do.

Posted in Debugging, Reverse Engineering | 8 Comments »

Programming against the x64 exception handling support, part 5: Collided unwinds

Tuesday, January 9th, 2007

Previously, I discussed the internal workings of RtlUnwindEx. While that posting covered most of the inner details regarding unwind support, I didn’t fully cover some of the corner cases.

Specifically, I haven’t yet discussed just what a “collided unwind” really is, other than providing vague hints as to its existence. A collided unwind occurs when an unwind handler initiates a secondary unwind operation in the context of an unwind notification callback. In other words, a collided unwind is what occurs when, in the process of a stack unwind, one of the call frames changes the target of an unwind. This has several implications and requirements in order to operate as one might expect:

Some unwind handlers that were on the original unwind path might no longer be called, depending on the new unwind target.
The current unwind call stack leading into RtlUnwindEx will need to be interrupted.
The new unwind operation should pick up where the old unwind operation left off. That is, the new unwind operation shouldn’t start unwinding the exception handler stack; instead, it must unwind the original stack, starting from the call frame after the unwind handler which initiated the new unwind operation.

Because of these conditions, the implementation of collided unwinds is a bit more complicated than one might expect. The main difficulty here is that the second unwind operation is initiated within the call stack of an existing unwind operation, but what the unwind handler “wants” to do is to unwind the stack that was already being unwound, except to a different target and with different parameters.

From an unwind handler’s perspective, all that needs to be done to accomplish this is to make a call to RtlUnwindEx in the context of an unwind handler callback for an unwind operation, and RtlUnwindEx magically takes care of all of the work necessary to make the collided unwind “just work”.

Allowing this sort of unwind operation to “just work” requires a bit of creative thinking from the perspective of RtlUnwindEx, however. The main difficulty here is that RtlUnwindEx, when called from the unwind handler, somehow needs a way to recover the original context that was being unwound in order to “pick up” where the original call to RtlUnwindEx “left off” (when it called an unwind handler that initiated a collided unwind). Because there is no provision for passing a context record to RtlUnwindEx and indicating that RtlUnwindEx should use it as a starting point for an unwind operation (RtlUnwindEx always initiates the unwind in the current call stack), this poses a problem; how is RtlUnwindEx to recover the original unwind parameters from where it should initiate the “real” unwind?

The way that Microsoft decided to solve this problem is an elegant little hack of sorts. The solution all comes down to that mysterious exception handler around RtlpExecuteHandlerForUnwind: RtlpUnwindHandler. Recall from the previous article that RtlUnwindEx calls RtlpExecuteHandlerForUnwind in order to invoke an exception handler for unwind purposes, and that RtlpExecuteHandlerForUnwind sets up an exception handler (RtlpUnwindHandler) before calling the requested exception handler for unwind. At the time, these extra steps (the use of RtlpExecuteHandlerForUnwind, and its exception handler) probably looked a bit redundant, and in the process of a “conventional” unwind operation, the extra work that RtlUnwindEx goes through before calling an unwind handler doesn’t even come into play as adding any value.

That all changes when a collided unwind occurs, however. In the collided unwind case, RtlpExecuteHandlerForUnwind and RtlpUnwindHandler are critical to solving the problem of how to recover the original unwind parameters so that RtlUnwindEx can perform an unwind operation on the correct call stack. In order to understand just how RtlpUnwindHandler and friends come into play with a collided unwind, it’s necessary to take a closer look at just what RtlUnwindEx will do when it is called from the context of an unwind handler.

Since RtlUnwindEx always begins a call frame unwind from the currently active call stack, the second call to RtlUnwindEx will start unwinding the call stack of the unwind handler that called RtlUnwindEx. But wait, you might say – this isn’t what is supposed to happen! It turns out that unwinding the unwind handler’s call stack will actually lead up to the “right thing” happening, through a bit of clever use of how “conventional” unwind operations work. To better understand what I mean, it’s helpful to look at the stack of a secondary call to RtlUnwindEx (initiating a collided unwind operation). For this purpose, I’ve put together a small program that initiates a collided unwind (more on how and why you might see a collided unwind in the “real world” later). I’ve set a breakpoint on RtlUnwindEx, and skipped forward until I encountered the nested call to RtlUnwindEx that was initiating a collided unwind operation:

0:000> k
Child-SP          Call Site
00000000`0012e058 ntdll!RtlUnwindEx
00000000`0012e060 ntdll!local_unwind+0x1c
00000000`0012e540 TestApp!`FaultingFunction2'::`1'::fin$2+0x34
00000000`0012e570 ntdll!_C_specific_handler+0x140
00000000`0012e5e0 ntdll!RtlpExecuteHandlerForUnwind+0xd
00000000`0012e610 ntdll!RtlUnwindEx+0x236
00000000`0012ec90 TestApp!UnwindExceptionHandler2+0xf8
00000000`0012f1b0 TestApp!`FaultingFunction2'::`1'::filt$1+0xe
[...]

At this point, given what we know about RtlUnwindEx, it will start unwinding the stack downward. Since the target of the collided unwind will by definition be lower in the stack than the unwind handler’s stack pointer itself, RtlUnwindEx will continue unwinding downward, calling unwind handlers (if any) for each successive frame. Taking a look at the call stack, we can determine that there are no frames with an exception handler marked for unwind (denoted by a [ U ] in the !fnseh output):

0:000> !fnseh ntdll!RtlUnwindEx
ntdll!RtlUnwindEx L295 22,0A [   ]  (none)
0:000> !fnseh ntdll!local_unwind
ntdll!local_unwind L24 07,02 [   ]  (none)
0:000> !fnseh 00000000`01001f04 
1001ed0 L3a 06,02 [   ]  (none)
0:000> !fnseh ntdll!_C_specific_handler+0x140
ntdll!_C_specific_handler L16a 20,0C [   ]  (none)

(Here, 00000000`01001f04 corresponds to TestApp!`FaultingFunction2'::`1'::fin$2+0x34).

Because none of these call frames have an exception handler marked for unwind callbacks, we can surmise that RtlUnwindEx will blissfully unwind past all of these call frames just as one might expect. At this point, RtlUnwindEx is still unwinding the “wrong” stack though; we’d like it to be unwinding the stack passed to the original call to RtlUnwindEx, and not the unwind/exception handler call stack.

Something that one might not immediately expect happens when RtlUnwindEx reaches the next frame, however. Remember that the current call frame is now _C_specific_handler – the C-language exception handler for the current function that was originally being unwound after an exception occurred. This means that the next call frame will be the original RtlUnwindEx, or more precisely, RtlpExecuteHandlerForUnwind.

This is where RtlpExecuteHandlerForUnwind and RtlpUnwindHandler get to shine. If we take a look at the next call frame in the debugger, we see that it is indeed RtlpExecuteHandlerForUnwind, and that it also has (as expected) an exception handler marked for unwind support: RtlpUnwindHandler.

0:000> !fnseh ntdll!RtlpExecuteHandlerForUnwind+0xd
ntdll!RtlpExecuteHandlerForUnwind L13 04,01 [EU ]
  ntdll!RtlpUnwindHandler (assembler/unknown)

Because this call frame does have an exception handler that supports unwind callouts, it will be returned to RtlUnwindEx by RtlVirtualUnwind. This, in turn, will lead to RtlUnwindEx calling RtlpUnwindHandler, as registered by RtlpExecuteHandlerForUnwind in the original call stack (by RtlUnwindEx). We can verify this in the debugger:

0:000> bp ntdll!RtlpUnwindHandler
0:000> g
Breakpoint 1 hit
ntdll!RtlpUnwindHandler:
00000000`779507e0 488b4220  mov rax,qword ptr [rdx+20h]
0:000> k
Child-SP          Call Site
00000000`0012d9a8 ntdll!RtlpUnwindHandler
00000000`0012d9b0 ntdll!RtlpExecuteHandlerForUnwind+0xd
00000000`0012d9e0 ntdll!RtlUnwindEx+0x236
00000000`0012e060 ntdll!local_unwind+0x1c
00000000`0012e540 TestApp!`FaultingFunction2'::`1'::fin$2+0x34
00000000`0012e570 ntdll!_C_specific_handler+0x140
00000000`0012e5e0 ntdll!RtlpExecuteHandlerForUnwind+0xd
00000000`0012e610 ntdll!RtlUnwindEx+0x236
00000000`0012ec90 TestApp!UnwindExceptionHandler2+0xf8
00000000`0012f1b0 TestApp!`FaultingFunction2'::`1'::filt$1+0xe
[...]

This is where things start to get a little interesting. From the discussion in the previous article, we know that RtlpUnwindHandler essentially does the following:

Retrieve the PDISPATCHER_CONTEXT argument that RtlpExecuteHandlerForUnwind (the original instance, from the original unwind operation initiated by the first call to RtlUnwindEx) saved on its stack. This is done via the use of the EstablisherFrame argument to RtlpUnwindHandler.
Copy the contents of RtlpExecuteHandlerForUnwind’s DISPATCHER_CONTEXT over the DISPATCHER_CONTEXT of the current RtlUnwindEx instance, through the PDISPATCHER_CONTEXT argument provided to RtlpUnwindHandler. Note that the TargetIp member of the DISPATCHER_CONTEXT is not copied from RtlpExecuteHandlerForUnwind’s DISPATCHER_CONTEXT.
Return the manifest ExceptionCollidedUnwind constant to the caller (RtlpExecuteHandlerForUnwind, which will in turn return this value to RtlUnwindEx).
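The three steps above can be sketched roughly in C. To be clear about what is assumed here: the structure below is a drastically simplified, hypothetical stand-in for DISPATCHER_CONTEXT (only a handful of representative fields, not the real layout), the function name is mine, and the first step (locating the saved pointer through EstablisherFrame) is skipped by simply passing the original context in as a parameter. The value 3 for the disposition matches ExceptionCollidedUnwind in the EXCEPTION_DISPOSITION enumeration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for DISPATCHER_CONTEXT. */
typedef struct {
    uint64_t control_pc;
    uint64_t image_base;
    uint64_t establisher_frame;
    uint64_t target_ip;
    void    *context_record;
} dispatcher_context_sketch;

enum { DISPOSITION_COLLIDED_UNWIND = 3 };   /* ExceptionCollidedUnwind */

/* Overwrite the nested unwind's dispatcher context with the original
   one, keeping only the nested unwind's own TargetIp. */
int unwind_handler_sketch(const dispatcher_context_sketch *original,
                          dispatcher_context_sketch *current)
{
    assert(original != NULL && current != NULL);

    uint64_t new_target_ip = current->target_ip;  /* must survive the copy */
    *current = *original;                         /* adopt original state  */
    current->target_ip = new_target_ip;

    return DISPOSITION_COLLIDED_UNWIND;
}
```

The thing to notice is the asymmetry: everything describing *where the unwind currently is* comes from the original operation, while the one field describing *where the unwind is going* stays with the new one.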

After all this is done, control returns to RtlUnwindEx. Because RtlpExecuteHandlerForUnwind returned ExceptionCollidedUnwind, though, a previously unused code path is activated. This code path (as described previously) copies the contents of the DISPATCHER_CONTEXT structure whose address was passed to RtlpExecuteHandlerForUnwind back into the internal state of RtlUnwindEx (including the context record), and then attempts to re-start unwinding of the current stack frame.

If you’ve been paying attention so far, then you probably understand what is going to happen next.

Because RtlpUnwindHandler copied the DISPATCHER_CONTEXT from the original call to RtlUnwindEx over the DISPATCHER_CONTEXT from the current (collided unwind) call to RtlUnwindEx, the current instance of RtlUnwindEx now has access to all of the state information that the original RtlUnwindEx instance had placed into the PDISPATCHER_CONTEXT passed to RtlpExecuteHandlerForUnwind. Most importantly, this includes access to the original context record describing the call frame that the original instance of RtlUnwindEx was in the process of unwinding.

Since all of this information has now been copied over the current RtlUnwindEx instance’s internal state, in effect, the current instance of RtlUnwindEx will (for the next unwind iteration) start unwinding the stack where the original RtlUnwindEx instance stopped; in other words, the stack being unwound “jumps” from the currently active call stack to the exception (or other) call stack that was originally being unwound.

At this point, the second instance of RtlUnwindEx is all set up to unwind the call stack to the new unwind target frame (and target instruction pointer; remember that TargetIp was omitted from the copying performed on the PDISPATCHER_CONTEXT in RtlpUnwindHandler) like a “conventional” unwind. The rest is, as they say, history.

Now that we know how collided unwinds work, it is important to know when one would ever see such a thing (after all, interrupting an unwind in-progress is a fairly invasive and atypical operation).

It turns out that collided unwinds are not quite as far-fetched as they might seem; the easiest way to cause such an event is to do something sleazy like execute a return/goto/continue/break to transfer control out of a __finally block. This, in effect, requires that the compiler stop the current unwind operation and transfer control to the target location (which is usually within the function that contained the __finally that the programmer jumped out of). Nevertheless, the compiler still has to deal with the fact that it has been called in the context of an unwind operation, and as such it needs a way to “break out” of the unwind call stack. This is done by executing a “local unwind”, or an unwind to a location within the current function. In order to do this, the compiler calls a small, runtime-supplied helper function known as local_unwind. This function is described below, and is essentially an extremely thin wrapper around RtlUnwindEx that, in practice, adds no value other than providing some default argument values (and scratch space on the stack for RtlUnwindEx to use to store a CONTEXT structure):

0:000> uf ntdll!local_unwind
ntdll!local_unwind:
00000000`7796f580 4881ecd8040000 sub  rsp,4D8h
00000000`7796f587 4d33c0         xor  r8,r8
00000000`7796f58a 4d33c9         xor  r9,r9
00000000`7796f58d 4889642420     mov  qword ptr [rsp+20h],rsp
00000000`7796f592 4c89442428     mov  qword ptr [rsp+28h],r8
00000000`7796f597 e844b0fdff     call ntdll!RtlUnwindEx
00000000`7796f59c 4881c4d8040000 add  rsp,4D8h
00000000`7796f5a3 c3             ret
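
Read back into C, the disassembly above amounts to something like the following sketch. To keep it portable, RtlUnwindEx is replaced by a hypothetical mock (`model_RtlUnwindEx`) that merely records its arguments and returns, and `MODEL_CONTEXT` is an arbitrary-sized placeholder for the real CONTEXT structure; the real RtlUnwindEx would not return on success.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder for CONTEXT; the size is arbitrary in this model. */
typedef struct MODEL_CONTEXT { uint8_t Scratch[64]; } MODEL_CONTEXT;

/* Arguments seen by the mock, so the defaults can be inspected. */
typedef struct MODEL_UNWIND_CALL {
    uint64_t    TargetFrame;
    uint64_t    TargetIp;
    const void *ExceptionRecord;
    const void *ReturnValue;
    const void *HistoryTable;
    int         HadContext;
} MODEL_UNWIND_CALL;

MODEL_UNWIND_CALL g_last_unwind_call;

/* Hypothetical mock of RtlUnwindEx: records its arguments and returns. */
static void model_RtlUnwindEx(uint64_t TargetFrame, uint64_t TargetIp,
                              void *ExceptionRecord, void *ReturnValue,
                              MODEL_CONTEXT *OriginalContext, void *HistoryTable)
{
    g_last_unwind_call.TargetFrame     = TargetFrame;
    g_last_unwind_call.TargetIp        = TargetIp;
    g_last_unwind_call.ExceptionRecord = ExceptionRecord;
    g_last_unwind_call.ReturnValue     = ReturnValue;
    g_last_unwind_call.HistoryTable    = HistoryTable;
    g_last_unwind_call.HadContext      = (OriginalContext != NULL);
}

/* TargetFrame and TargetIp pass straight through (rcx, rdx); everything
   else is defaulted, and the scratch CONTEXT lives on the local stack. */
void model_local_unwind(uint64_t TargetFrame, uint64_t TargetIp)
{
    MODEL_CONTEXT scratch;

    model_RtlUnwindEx(TargetFrame, TargetIp,
                      NULL,      /* r8:  no exception record  */
                      NULL,      /* r9:  no return value      */
                      &scratch,  /* [rsp+20h]: scratch space  */
                      NULL);     /* [rsp+28h]: no history table */
}
```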

When the compiler calls local_unwind as a result of the programmer breaking out of a __finally block in some fashion, then execution will eventually end up in RtlUnwindEx. From there, RtlUnwindEx eventually detects the operation as a collided unwind, once it unwinds past the original call to the original unwind handler that started the new unwind operation via local_unwind.

As a result, breaking out of a __finally block instead of allowing it to run to completion (which may result in control being transferred out of the “current function”, from the programmer’s perspective, and “into” the next function in the call stack for unwind processing) is how every-day programs can end up causing a collided unwind.

Next time: More unwind esoterica, including details on how RtlUnwindEx and RtlRestoreContext lay the groundwork used to build C++ exception handling support.

Posted in NT Internals, Programming, Windows | 2 Comments »

Programming against the x64 exception handling support, part 4: Unwind internals (RtlUnwindEx implementation)

Monday, January 8th, 2007

In the previous article in this series, I discussed the external interface exposed by RtlUnwindEx (and some of how unwinding works at a high level). This posting continues that discussion, and aims to provide insight into the internal workings of RtlUnwindEx (and as such, the inner details of all of the different aspects of unwind support on x64 Windows).

As previously described, the main behavior of RtlUnwindEx is to systematically unwind call frames (with the help of RtlVirtualUnwind) until a specific call frame, which is identified by the TargetFrame argument, is reached. RtlUnwindEx is also responsible for all interactions with language exception handlers for purposes of unwind operations. Additionally, RtlUnwindEx also imposes various validations and restrictions on execution contexts being unwound, and on the behavior of exception handlers being called for an unwind operation.

The first order of business within RtlUnwindEx is to capture the execution context at the time of the call to RtlUnwindEx (specifically, the execution context inside RtlUnwindEx, not of the caller of RtlUnwindEx). This is done with the aid of two helper functions, RtlpGetStackLimits (which retrieves the bounds of the stack for the current thread from the NT_TIB region of the current thread’s TEB), and RtlCaptureContext (which records the complete execution context of its caller within a standard CONTEXT structure). Additionally, if an unwind table is supplied, a special flag is set in it that optimizes the behavior of subsequent calls to RtlLookupFunctionTable for lookups that are unwind-driven (this is a behavior new to Windows Vista, and is a further attempt to improve the performance of unwind support on x64).

If the caller did not supply an EXCEPTION_RECORD argument, RtlUnwindEx will create the default STATUS_UNWIND exception record at this time and substitute it for what would have otherwise been a caller-supplied EXCEPTION_RECORD block. The exception record is initialized with an ExceptionAddress pointing to the Rip value captured previously by RtlCaptureContext, and with no parameters. Additionally, an initial ExceptionFlags value of EXCEPTION_UNWINDING is set, to later indicate to any exception handlers that might be called that an unwind operation is in progress (the EXCEPTION_RECORD pointer, either caller supplied or locally allocated by RtlUnwindEx in the absence of a caller-supplied value, corresponds exactly to the EXCEPTION_RECORD argument passed to any LanguageHandler that is called during unwind processing).

In the event that the caller of RtlUnwindEx did not supply a TargetFrame argument (indicating that the requested unwind operation is an exit unwind), then the EXCEPTION_EXIT_UNWIND flag is set within RtlUnwindEx’s internal ExceptionFlags value. An exit unwind is a special form of unwind where the “target” of the unwind is unknown; in other words, the caller does not have a valid target frame pointer to supply to RtlUnwindEx. Initiating an exit unwind is normally dangerous unless the caller has special knowledge of an unwind handler in the call stack that will halt the unwind operation prematurely (either by initiating a secondary unwind, which leads to what is called a collided unwind, or by exiting the thread entirely). The reason for this restriction is that, because RtlUnwindEx doesn’t have a clear “stopping point” at which to halt the unwind cycle, it will happily unwind past the end of the stack (typically resulting in an access violation) unless an unwind handler along the way does something to halt the unwind. Most unwind operations are not exit unwinds.
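
The setup described in the last two paragraphs can be sketched in portable C as follows. This is a model, not the ntdll implementation: the `MODEL_` names are hypothetical, the record structure is heavily trimmed, and the numeric values are what I believe the SDK headers define for EXCEPTION_UNWINDING, EXCEPTION_EXIT_UNWIND and STATUS_UNWIND. The model also assumes the initial flags are computed the same way whether or not the caller supplied a record.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_EXCEPTION_UNWINDING   0x02u
#define MODEL_EXCEPTION_EXIT_UNWIND 0x04u
#define MODEL_STATUS_UNWIND         0xC0000027u

/* Trimmed-down stand-in for EXCEPTION_RECORD. */
typedef struct MODEL_EXCEPTION_RECORD {
    uint32_t ExceptionCode;
    uint32_t ExceptionFlags;
    uint64_t ExceptionAddress;
    uint32_t NumberParameters;
} MODEL_EXCEPTION_RECORD;

/* Returns the record that unwind handlers will see, and computes the
   initial internal flags image. A target_frame of zero means "no target
   supplied", i.e. an exit unwind. */
MODEL_EXCEPTION_RECORD *model_prepare_unwind_record(
    MODEL_EXCEPTION_RECORD *caller_record,
    MODEL_EXCEPTION_RECORD *local_record,
    uint64_t captured_rip,
    uint64_t target_frame,
    uint32_t *exception_flags)
{
    MODEL_EXCEPTION_RECORD *record = caller_record;

    assert(local_record != NULL && exception_flags != NULL);

    if (record == NULL) {
        /* No caller-supplied record: build the default STATUS_UNWIND one. */
        local_record->ExceptionCode    = MODEL_STATUS_UNWIND;
        local_record->ExceptionFlags   = 0;
        local_record->ExceptionAddress = captured_rip;
        local_record->NumberParameters = 0;
        record = local_record;
    }

    *exception_flags = MODEL_EXCEPTION_UNWINDING;
    if (target_frame == 0) {
        *exception_flags |= MODEL_EXCEPTION_EXIT_UNWIND;
    }
    return record;
}
```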

At this point, RtlUnwindEx is set up to enter the main loop of the unwind algorithm, which essentially involves repeated calls to RtlVirtualUnwind, and then to unwind handlers (if present). This main loop involves multiple steps:

The RUNTIME_FUNCTION entry for the current frame (given by the Rip member of the context record captured above, and later updated in this loop) is located via RtlLookupFunctionEntry. If no function entry is present, then RtlUnwindEx will load Context->Rip with a ULONG64 value located at Context->Rsp, and then increment Context->Rsp by 8. The behavior when there is no RUNTIME_FUNCTION entry present accounts for leaf functions, for which unwind metadata is optional. If the current frame is a leaf function, then control skips forward to step 8.
Assuming that a RUNTIME_FUNCTION was found, RtlUnwindEx makes a copy of the current execution context that will be unwound – something I call the “unwind context”. After duplicating the context (via the RtlpCopyContext helper function, which only duplicates the non-volatile context), RtlVirtualUnwind is called (with the unwind context), and requested to return the address of any associated language handler that is marked for unwind support. RtlVirtualUnwind thus returns several useful pieces of information: a language handler supporting unwind (if any), an updated context describing the caller of the requested call frame, a language-handler-specific (i.e. C scope table) data pointer associated with the requested call frame (if any), and the stack pointer of the call frame being unwound (the establisher frame). These pieces of information are used later in communication with a returned exception handler with unwind support, if one exists.
After calling RtlVirtualUnwind to establish the context of the next location on the stack frame (now contained within the “unwind context” location), RtlUnwindEx performs some validation of the returned EstablisherFrame value. Specifically, the EstablisherFrame value is ensured to be 8-byte aligned and within the stack limits of the current thread (in kernel mode, there is also special support for handling the case of an unwind occurring within the context of a DPC, which may operate under a secondary stack). If either of these conditions does not hold true, a STATUS_BAD_STACK exception is raised, indicating that the stack pointer in the requested call frame is damaged or corrupted. Additionally, if a TargetFrame value is specified (that is, the unwind operation is not an exit unwind), then the TargetFrame value is validated to be greater than or equal to the EstablisherFrame value returned by RtlVirtualUnwind. This is, in effect, a sanity check designed to ensure that the unwind target actually refers to a previous call frame and not one that has already been unwound. If this check fails, then a STATUS_BAD_STACK exception is raised.
If a language handler was returned by RtlVirtualUnwind, then RtlUnwindEx sets up for a call to the language handler. This involves the initial setup of a DISPATCHER_CONTEXT structure created on the stack of RtlUnwindEx. The DISPATCHER_CONTEXT structure describes some internal state that RtlUnwindEx shares with all participants in the unwind process, such as language handlers being called for unwind. It contains all of the state information necessary to coordinate operation between RtlUnwindEx and any language handler. Furthermore, it is also instrumental in the processing of collided unwinds; more on that later. The newly initialized DISPATCHER_CONTEXT contains two fields of significance, initially: the TargetIp field (which is simply a copy of the TargetIp argument to RtlUnwindEx), and the ScopeIndex field (which is zero initialized). Both of these fields are unused by RtlUnwindEx itself, and are simply available for the convenience of language handlers being called for an unwind operation. If no language handler was present for the requested call frame, then control skips forward to step 8.
At this point, RtlUnwindEx is ready to make a call to an unwind handler. This first involves a quick check to determine whether the end of the unwind chain has been reached, through comparing the current frame’s EstablisherFrame value with the TargetFrame argument to RtlUnwindEx. If the two frame pointers match exactly, then the ExceptionFlags value passed in to the unwind handler has an additional bit set, EXCEPTION_TARGET_UNWIND. This flag bit lets the unwind handler know that it is the “last stop” in the unwind process (in other words, that there will be no further frame unwinds after the unwind handler finishes processing). At this point, the ReturnValue argument passed to RtlUnwindEx is copied into the Rax register image in the active context for the current frame (not the unwound context, which refers to the previous frame). Then, the last remaining fields of the DISPATCHER_CONTEXT structure are initialized based on the internal state of RtlUnwindEx; the image base, handler data, instruction pointer (ControlPc), function entry, establisher frame, and language handler values previously returned by RtlLookupFunctionEntry and RtlVirtualUnwind are copied into the DISPATCHER_CONTEXT structure, along with a pointer to the context record describing the execution state at the current frame. After the ExceptionFlags member of RtlUnwindEx’s EXCEPTION_RECORD structure is set, the stack-based exception flags image (from which the copy in the EXCEPTION_RECORD was made) has the EXCEPTION_TARGET_UNWIND and EXCEPTION_COLLIDED_UNWIND flags cleared, to ensure that these flags are not inadvertently passed to an exception routine in a future loop iteration.
After preparing the DISPATCHER_CONTEXT for the unwind handler call, RtlUnwindEx makes a call to a small helper function, RtlpExecuteHandlerForUnwind. RtlpExecuteHandlerForUnwind is an assembly-language routine whose prototype matches that of the language specific handler, given below:
```
typedef EXCEPTION_DISPOSITION (*PEXCEPTION_ROUTINE) (
    IN PEXCEPTION_RECORD               ExceptionRecord,
    IN ULONG64                         EstablisherFrame,
    IN OUT PCONTEXT                    ContextRecord,
    IN OUT struct _DISPATCHER_CONTEXT* DispatcherContext
);
```
RtlpExecuteHandlerForUnwind is fairly straightforward. All it does is store the DispatcherContext argument on the stack, and then make a call to the LanguageHandler member in the DISPATCHER_CONTEXT structure. RtlpExecuteHandlerForUnwind then returns the return value of the LanguageHandler itself.

While this may seem like a rather useless helper routine at first, RtlpExecuteHandlerForUnwind actually does add some value, although it might not be immediately apparent unless one looks closely. Specifically, RtlpExecuteHandlerForUnwind registers an exception/unwind handler for itself (RtlpUnwindHandler). RtlpUnwindHandler does not go through _C_specific_handler; in other words, it is a raw exception handler registration. Like RtlpExecuteHandlerForUnwind, RtlpUnwindHandler is a raw assembly language routine. It, too, is fairly simple (and as a language-level exception handler routine, RtlpUnwindHandler is compatible with the LanguageHandler prototype described above); RtlpUnwindHandler uses the EstablisherFrame argument given to a LanguageHandler routine to locate the saved pointer to the DISPATCHER_CONTEXT on the stack of RtlpExecuteHandlerForUnwind, and then copies most of the DISPATCHER_CONTEXT structure passed to RtlpExecuteHandlerForUnwind over the DISPATCHER_CONTEXT structure that was passed to RtlpUnwindHandler itself (conspicuously omitted from the copy is the TargetIp member of the DISPATCHER_CONTEXT structure, for reasons that will become clear later). After performing the copy of the DISPATCHER_CONTEXT structure, RtlpUnwindHandler returns the manifest ExceptionCollidedUnwind constant. Although one might naively assume that all of this just leads up to protecting against the case of an unwind handler throwing an exception, it actually has a much more common (and significant) use; more on that later.
After RtlpExecuteHandlerForUnwind returns, RtlUnwindEx decides what course of action to pursue based on the return value. There are two legal return values from an exception handler called for unwind, ExceptionContinueSearch (the general “success” return), and ExceptionCollidedUnwind. If any other value is returned, then RtlUnwindEx raises a STATUS_INVALID_DISPOSITION exception, indicating that an unwind handler has returned an illegal value (this is rarely seen in practice, as most unwind handlers are compiler generated, and therefore always get the return value correct). If ExceptionContinueSearch is returned, and the current EstablisherFrame doesn’t match the TargetFrame argument, then the unwind context and the context for the “current frame” are swapped (this positions the current frame context as referring to the context of the next function in the call chain, which will then be duplicated and unwound in the next loop iteration). If ExceptionCollidedUnwind is returned, then the execution path is a little bit more complicated. In the collided unwind case, all of the internal state information that RtlUnwindEx had previously copied into the DISPATCHER_CONTEXT structure passed to RtlpExecuteHandlerForUnwind is copied back out of the DISPATCHER_CONTEXT structure. RtlVirtualUnwind is then executed to determine the next lowest call frame using the context copied out of the DISPATCHER_CONTEXT structure, the EXCEPTION_COLLIDED_UNWIND flag is set, and control is transferred to step 5. This step may initially seem strange, but it will become clear after it is explained later.
If control reaches this point, then a frame has been successfully unwound, and any applicable unwind handler has been notified of the unwind operation. The next step is a re-validation of the EstablisherFrame value (as it may have changed in the collided unwind case). Assuming that EstablisherFrame is valid, if its value does not match the TargetFrame argument, then control is transferred to step 1. Otherwise, if there is a match, then the loop terminates. (If the EstablisherFrame is not valid, and is not the expected TargetFrame value, then either the unwind exception record is raised as an exception, or a STATUS_BAD_FUNCTION_TABLE exception is raised.)
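
To make the control flow of the loop easier to follow, here is a deliberately simplified, portable C model of it. Everything in it is hypothetical scaffolding: a flat array of `MODEL_FRAME` entries (innermost frame first) stands in for what RtlLookupFunctionEntry and RtlVirtualUnwind would report, error conditions are returned as codes instead of being raised as exceptions, and collided unwinds are reported but not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_EXCEPTION_UNWINDING     0x02u
#define MODEL_EXCEPTION_TARGET_UNWIND 0x20u

enum { MODEL_CONTINUE_SEARCH = 1, MODEL_COLLIDED_UNWIND = 3 };
enum { MODEL_OK = 0, MODEL_BAD_STACK = -1,
       MODEL_INVALID_DISPOSITION = -2, MODEL_COLLIDED_NOT_MODELLED = -3 };

typedef int (*MODEL_HANDLER)(uint64_t establisher_frame, uint32_t flags);

typedef struct MODEL_FRAME {
    uint64_t      EstablisherFrame;
    MODEL_HANDLER Handler;          /* NULL: no unwind handler (skip to step 8) */
} MODEL_FRAME;

/* Sample handler that records how it was called. */
uint32_t g_last_flags;
int      g_handler_calls;

int model_recording_handler(uint64_t establisher_frame, uint32_t flags)
{
    (void)establisher_frame;
    g_last_flags = flags;
    g_handler_calls++;
    return MODEL_CONTINUE_SEARCH;
}

int model_unwind_loop(const MODEL_FRAME *frames, size_t count,
                      uint64_t stack_low, uint64_t stack_high,
                      uint64_t target_frame, size_t *frames_unwound)
{
    size_t i;

    *frames_unwound = 0;
    for (i = 0; i < count; i++) {
        uint64_t frame = frames[i].EstablisherFrame;
        uint32_t flags = MODEL_EXCEPTION_UNWINDING;

        /* Step 3: alignment, stack limits, and "target not already passed". */
        if ((frame & 7) != 0 || frame < stack_low || frame >= stack_high ||
            (target_frame != 0 && target_frame < frame)) {
            return MODEL_BAD_STACK;
        }

        /* Steps 4-7: call the handler, flagging the final frame. */
        if (frames[i].Handler != NULL) {
            int disposition;

            if (frame == target_frame) {
                flags |= MODEL_EXCEPTION_TARGET_UNWIND;
            }
            disposition = frames[i].Handler(frame, flags);
            if (disposition == MODEL_COLLIDED_UNWIND) {
                return MODEL_COLLIDED_NOT_MODELLED;
            }
            if (disposition != MODEL_CONTINUE_SEARCH) {
                return MODEL_INVALID_DISPOSITION;
            }
        }

        /* Step 8: stop at the target, otherwise move to the next frame. */
        if (frame == target_frame) {
            return MODEL_OK;
        }
        (*frames_unwound)++;
    }
    return MODEL_BAD_STACK;   /* ran out of frames without reaching the target */
}
```

With four frames and the third one as the target, the model unwinds two frames and calls the target frame's handler with both the unwinding and target-unwind bits set.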

At this point, RtlUnwindEx has arrived at its target frame, and all intermediary unwind handlers have been called. It is now time to transfer control to the unwind point. The ReturnValue argument is again loaded into the current frame’s context (Rax register), and if the exception code supplied by the RtlUnwindEx caller via the ExceptionRecord argument does not match STATUS_UNWIND_CONSOLIDATE, the Rip value in the current frame’s context is replaced with the TargetIp argument.

The final task is to realize the finalized context; this is done by calling RtlRestoreContext, passing it the current frame’s context and the ExceptionRecord argument (or the default exception record constructed if no ExceptionRecord argument was supplied). RtlRestoreContext will in most cases simply copy the given context into the currently active register set, although in two special cases (if a STATUS_LONGJUMP or STATUS_UNWIND_CONSOLIDATE exception code is set in the optional ExceptionRecord argument), this behavior deviates from the norm. In the long jump case (as previously documented), the ExceptionRecord argument is assumed to contain a pointer to a jmp_buf, which contains a nonvolatile register set to restore on top of the unwound context supplied by RtlUnwindEx. The unwind consolidate case is rather more complicated, and will be discussed in a future posting.

For reference, I have posted some annotated, reverse engineered C and assembler code describing the internal operations of RtlUnwindEx and several of its helper functions (such as RtlpUnwindHandler). This C code is based off of the Windows Vista implementation of RtlUnwindEx, and as such takes advantage of new Windows Vista-specific optimizations to unwind handling. Specifically, the “Unwind” flag in the UNWIND_HISTORY_TABLE structure is new in Windows Vista (although the size of the structure has not changed; there used to be empty alignment padding at that offset in previous Windows versions). This flag is used as a hint to RtlLookupFunctionEntry, in order to expedite lookup of function entries for some commonly referenced functions in the unwind path. Between the provided comments and the above description of the overall functionality of RtlUnwindEx, the inner workings of it should begin to become clear. There are some aspects (in particular, collided unwind) that are a bit more complicated than one might initially imagine; I’ll discuss collided unwinds (and more) in the next posting in this series.

It would be best to call the system version of RtlUnwindEx instead of reimplementing it for general purpose use (which I have done here primarily to illustrate how unwind works on x64 Windows). There have been improvements made to RtlUnwindEx between Windows Server 2003 SP1 x64 and Windows Vista x64, so it would be unwise to assume that RtlUnwindEx will remain devoid of new performance or feature additions forever.

Next up: Collided unwinds, and other things that go “bump” in the dark when you use compiler exception handling and unwind support.

Posted in NT Internals, Programming, Reverse Engineering, Windows | 2 Comments »

Programming against the x64 exception handling support, part 3: Unwind internals (RtlUnwindEx interface)

Sunday, January 7th, 2007

Previously, I provided a brief overview of what each of the core APIs relating to x64’s extensive data-driven unwind support were, and when you might find them useful.

This post focuses on discussing the interface-level details of RtlUnwindEx, and how they relate to procedure unwinding on Windows (x64 versions, specifically, though most of the concepts apply to other architectures in principle).

The main workhorse of unwind support on x64 Windows is RtlUnwindEx. As previously described, this routine encapsulates all of the work necessary to restore execution context to a prior point in the call stack (relying on RtlVirtualUnwind for this task). RtlUnwindEx also implements all of the logic relating to interactions with unwind/exception handlers during the unwind process (which is essentially the value added by RtlUnwindEx on top of what RtlVirtualUnwind implements).

In order to understand the inner workings of how unwinding works, it is first necessary to understand the high level theory behind how RtlUnwindEx is used (as RtlUnwindEx is at the heart of unwind support on Windows). Although there have been previously posted articles that touch briefly on how unwind is implemented, none that I have seen include all of the details, which is something that this segment of the x64 exception handling series shall attempt to correct.

For the moment, it is simpler to just consider the unwind half of exception handling. The nitty-gritty, exhaustive details of how exceptions are handled and dispatched will be discussed in a future posting; for now, assume that we are only interested in the unwind code path.

When a procedure unwind is requested, by any place within the system, the first order of business is a call to RtlUnwindEx. The prototype for RtlUnwindEx was provided in a previous posting, but in an effort to ensure that everyone is on the same page with this discussion, here’s what it looks like for x64:

VOID
NTAPI
RtlUnwindEx(
   __in_opt ULONG64               TargetFrame,
   __in_opt ULONG64               TargetIp,
   __in_opt PEXCEPTION_RECORD     ExceptionRecord,
   __in     PVOID                 ReturnValue,
   __out    PCONTEXT              OriginalContext,
   __in_opt PUNWIND_HISTORY_TABLE HistoryTable
   );

These parameters deserve perhaps a bit more explanation.

TargetFrame describes the stack pointer (rsp) value for the target of the unwind operation. In normal circumstances, this is always the EstablisherFrame argument to an exception handler that is handling an exception. In the context of an exception handler, EstablisherFrame refers to the stack pointer of the caller of the function that caused the exception being inspected. Likewise, in this context, TargetFrame refers to the stack pointer of the function that the call stack should be unwound to. Given data-driven unwind semantics, one might initially think that this argument is unnecessary (after all, one might assume that RtlUnwindEx could simply invoke RtlVirtualUnwind in order to determine the expected stack pointer value for the next function on the call stack), but it is actually required. The reason is that RtlUnwindEx supports unwinding past multiple procedure frames; that is, RtlUnwindEx can be used to unwind to a function that is several levels down in the call stack, instead of the immediately lower function in the call stack. Note that the TargetFrame argument must match exactly the expected stack pointer value of the target function in the call stack.
Observant readers may pick up on the SAL annotation describing the TargetFrame argument and notice that it is marked as optional. In general, TargetFrame is always supplied; it can be omitted in one specific circumstance, which is known as an exit unwind; more on that later.
TargetIp serves a similar purpose as TargetFrame; it describes the instruction pointer value that execution should be unwound to. TargetIp must be an instruction in the same function on the call stack that corresponds to the target stack frame described by TargetFrame. This argument is supplied as a particular function may have multiple points that could be resumed in response to an exception (this is typically the case if there are multiple try/except clauses).
Like TargetFrame, the TargetIp argument is also optional (though in most cases, it will be present). Specifically, if a frame consolidation unwind operation is being executed, then the TargetIp argument will be ignored by RtlUnwindEx and may be set to zero if desired (it will, however, still be passed to unwind handlers for use as they see fit). This specialized unwind operation will be discussed later, along with C++ exception support.
ExceptionRecord is an optional argument describing the reason for an unwind operation. This is typically the same exception record that was indicated as the cause of an exception (if the caller is an exception handler), although it does not strictly have to be as such. If no exception record is supplied, RtlUnwindEx constructs a default exception record to pass on to unwind handlers, with an exception code of STATUS_UNWIND and an exception address referring to an instruction within RtlUnwindEx itself.
ReturnValue describes a pointer-sized value that is to be placed in the return value register at the completion of an unwind operation, just before control is transferred to the newly unwound context. The interpretation of this value is entirely up to the routine being unwound into. In practice, the Microsoft C/C++ compiler does not use the return value at all in typical cases. Usually, the Microsoft C/C++ compiler will indicate the exception code that caused the exception as the return value, but due to how unwinding across functions works with try/except, there is no language-level support for retrieving the return value of a function that has been unwound due to an exception. As a result, in most circumstances, the return value placed in the unwound execution context based on this argument is ignored.
OriginalContext describes an out-only pointer to a context record that is updated with the execution context as procedure call frames are unwound. In practice, as RtlUnwindEx does not ever “return” to its caller, this value is typically only provided as a way for a caller to supply its own storage to be used as scratch space by RtlUnwindEx during the intermediate unwind operations comprising an unwind to the target call frame. Typically, the context record passed in to an exception handler from the exception dispatcher is supplied. Because the initial contents of the OriginalContext argument are not used, however, this argument need not necessarily be the context record passed in from the exception dispatcher.
HistoryTable describes a cache used to improve the performance of repeated function entry lookups via RtlLookupFunctionEntry. Under normal circumstances, this is the same history table passed in from the exception dispatcher to an exception handler, although it could also be a caller-allocated structure as well. This argument can also be safely omitted entirely, although if a non-trivial set of call frames are being unwound, passing in even a newly-initialized history table may improve performance.

Given all of the above information, RtlUnwindEx performs a procedure call unwind by performing a successive sequence of RtlVirtualUnwind calls (to determine the execution context of the next call frame in the call stack), followed by a call to the registered language handler for the call frame (if one exists and is marked for unwinding support). In most cases where there is a language unwind handler, it will point to _C_specific_handler, which internally searches all of the internal exception handling scopes (e.g. try/except or try/finally constructs), calling “finally” handlers as need be. There may also be internal unwind handlers that are present in the scope table for a particular function, such as for C++ destructor support (assuming asynchronous C++ exception handling has been enabled). Most users will thus interact with unwind handlers in the form of a “finally” handler in a try/finally construct in a function whose language handler refers to _C_specific_handler.

If RtlUnwindEx encounters a “leaf function” during the unwind process (a leaf function is a function that does not use the stack and calls no subfunctions), then it is possible that there will be no matching RUNTIME_FUNCTION entry for the current call frame returned by RtlLookupFunctionEntry. In this case, RtlUnwindEx assumes that the return address of the current call frame is at the current value of Rsp (and that the current call frame has no unwind or exception handlers). Because the x64 calling convention enforces hard rules as to what functions without RUNTIME_FUNCTION registrations can do with the stack, this is a valid assumption for RtlUnwindEx to make (and a necessary assumption, as there is no way to call RtlVirtualUnwind on a function with no matching RUNTIME_FUNCTION entry). The current call frame’s value of Rsp (in the context record describing the current call frame, not the register value of rsp itself within RtlUnwindEx) is dereferenced to locate the call frame’s return address (Rip value), and the saved Rsp value is then adjusted accordingly (increased by 8 bytes).

When RtlUnwindEx locates the endpoint frame of the unwind, a special flag (EXCEPTION_TARGET_UNWIND) is set in the ExceptionFlags member of the EXCEPTION_RECORD passed to the language handler. This flag indicates to the language handler (and possibly any C-language scope handlers) that the handler is being called as the “final destination” of the unwind operation. The Microsoft C/C++ compiler does not expose functionality to detect whether a “finally” handler is being called in the context of a target unwind or if the “finally” handler is simply being called as an intermediate step towards the unwind target.
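
From the point of view of a hand-written language handler, telling these cases apart comes down to testing bits in ExceptionFlags. The sketch below is a portable model with hypothetical `MODEL_` names; the bit values are the ones I believe winnt.h uses for the EXCEPTION_UNWINDING, EXCEPTION_EXIT_UNWIND, EXCEPTION_TARGET_UNWIND and EXCEPTION_COLLIDED_UNWIND flags.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_EXCEPTION_UNWINDING       0x02u
#define MODEL_EXCEPTION_EXIT_UNWIND     0x04u
#define MODEL_EXCEPTION_TARGET_UNWIND   0x20u
#define MODEL_EXCEPTION_COLLIDED_UNWIND 0x40u

/* Any of these bits means the handler is being called for an unwind. */
#define MODEL_EXCEPTION_UNWIND                                    \
    (MODEL_EXCEPTION_UNWINDING | MODEL_EXCEPTION_EXIT_UNWIND |    \
     MODEL_EXCEPTION_TARGET_UNWIND | MODEL_EXCEPTION_COLLIDED_UNWIND)

typedef enum MODEL_PHASE {
    MODEL_PHASE_DISPATCH,             /* searching for a handler    */
    MODEL_PHASE_INTERMEDIATE_UNWIND,  /* unwinding through a frame  */
    MODEL_PHASE_TARGET_UNWIND         /* unwinding to the last frame */
} MODEL_PHASE;

MODEL_PHASE model_classify_handler_call(uint32_t exception_flags)
{
    if ((exception_flags & MODEL_EXCEPTION_UNWIND) == 0) {
        return MODEL_PHASE_DISPATCH;
    }
    if ((exception_flags & MODEL_EXCEPTION_TARGET_UNWIND) != 0) {
        return MODEL_PHASE_TARGET_UNWIND;
    }
    return MODEL_PHASE_INTERMEDIATE_UNWIND;
}
```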

After the last unwind handler (if applicable) has been called, RtlUnwindEx restores the execution context that has been continually updated by successive calls to RtlVirtualUnwind. This restoration is performed by a call to RtlRestoreContext (a documented, exported function), which simply transfers a given context record to the thread’s execution context (thus “realizing” it).

RtlUnwindEx does not return a value to its caller. In fact, it typically does not return to its caller at all; the only “return” path for RtlUnwindEx is in the case where the passed-in execution context is corrupted (typically due to a bogus stack pointer), or if an exception handler does something illegal (such as returning an unrecognized EXCEPTION_DISPOSITION value). In these cases, RtlUnwindEx will raise a noncontinuable exception describing the problem (via RtlRaiseStatus). These error conditions are usually fatal (and are indicative of something being seriously corrupted in the process), and virtually always result in the process being terminated. As a result, it is atypical for a caller of RtlUnwindEx to attempt to handle these error cases with an exception handler block.

In the case where RtlUnwindEx performs the requested unwind successfully, a new execution context describing the state at the requested (unwound) call frame is directly realized, and as such RtlUnwindEx does not ever truly return in the success case.

Although RtlUnwindEx is principally used in conjunction with exception handling, there are other use cases implemented by the Microsoft C/C++ compiler which internally rely upon RtlUnwindEx in unrelated capacities. Specifically, RtlUnwindEx implements the core of the standard setjmp and longjmp routines provided by the Microsoft C runtime library (assuming the exception-safe versions of these are enabled by use of the <setjmpex.h> header file).

In the exception-safe setjmp/longjmp case, the jmp_buf argument essentially contains an abridged version of the execution context (specifically, volatile register values are omitted). For the curious, the x64 version of the jmp_buf structure is described as _JUMP_BUFFER in setjmp.h under the _M_AMD64_ section. When longjmp is called, the routine constructs an EXCEPTION_RECORD with STATUS_LONGJUMP as the exception code, sets up one exception information parameter (which is a pointer to the jmp_buf), and passes control to RtlUnwindEx. In this particular instance, the ReturnValue argument of RtlUnwindEx is significant; it corresponds to the value that is seemingly returned by setjmp when control is being transferred to the saved setjmp context as part of a longjmp call (somewhat similar in principle to how the UNIX fork system call indicates whether it is returning to the child process or the parent process). The internal operations of RtlUnwindEx are identical whether it is being used for the implementation of setjmp/longjmp, or for conventional exception-handler-based triggering of procedure call frame unwinding.
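
The “second return” behavior of setjmp can be seen from the caller's side with a small standard C example. This sketch uses the portable <setjmp.h> header rather than <setjmpex.h>, so it shows only the observable return value semantics, not the RtlUnwindEx-based implementation; observe_setjmp and jump_back are hypothetical names chosen for illustration.

```c
#include <setjmp.h>

/* Transfer control back to the saved checkpoint; never returns. */
static void jump_back(jmp_buf env, int value)
{
    longjmp(env, value);
}

/*
 * Save a checkpoint, jump back to it with `jump_value`, and report what
 * setjmp appeared to return the second time. The switch is used because
 * standard C only permits setjmp in a few expression contexts, such as
 * the controlling expression of a switch or if statement.
 */
int observe_setjmp(int jump_value)
{
    jmp_buf checkpoint;

    switch (setjmp(checkpoint)) {
    case 0:
        /* Direct return from setjmp: now trigger the "second return". */
        jump_back(checkpoint, jump_value);
        return -1; /* not reached */
    case 1:
        return 1;
    case 42:
        return 42;
    default:
        return -2; /* some other value was delivered */
    }
}
```

Calling observe_setjmp(42) yields 42, while observe_setjmp(0) yields 1, since the C standard requires setjmp to return 1 when longjmp is given a value of 0. This keeps the direct return (0) distinguishable from a return via longjmp, much as with the fork analogy above.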

However, there are differences that appear when RtlUnwindEx restores the execution context via RtlRestoreContext. There is special support inside RtlRestoreContext for STATUS_LONGJUMP exceptions with one exception information parameter; if this situation is detected, then RtlRestoreContext internally reinitializes portions of the passed-in context record based on the jmp_buf pointer stored in the exception information parameter block of the exception record provided to RtlRestoreContext by RtlUnwindEx. After this special-case partial reinitialization of the context record is complete, RtlRestoreContext realizes the context record as normal (causing execution control to be transferred to the stored Rip value). This can be seen as a hack and a violation of abstraction layers: there is intended to be a logical separation between operating system level SEH support and language level SEH support, and this special support in RtlRestoreContext blurs the distinction between the two for C language support with the Microsoft C/C++ compiler. This layering violation is not the most egregious in the x64 exception handling scene, however.

This concludes the basic overview of the interface provided by RtlUnwindEx. There are some things that I have not yet covered, such as exit unwinds, collided unwinds, or the deep integration and support for C++ try/catch, and some of the highly unsavory things done in the name of C++ exception support. Next time: A walkthrough of the complete internal implementation of RtlUnwindEx, including undocumented, never-before-seen (or barely documented) corner cases like exit unwinds or collided unwinds (the internals of C++ exception support from the perspective of RtlUnwindEx are reserved for a future posting, due to size considerations).