Archive for the ‘Networking’ Category

Power management observations with my XV6800

Tuesday, December 11th, 2007

As I previously mentioned, I recently switched to a full-blown Windows Mobile-based phone. One of the interesting conundrums that I’ve run into along the way is feeling out what’s actually killer on the battery life for the device and what isn’t.

It’s actually kind of surprising how seemingly innocent little things can have a dramatic impact on battery life in a situation like an always-on handheld phone. For example, one of the main power saving features of modern CDMA cell networks and packet switched data connections is the ability for the device to go into dormant mode, a state in which the device relinquishes its active radio link resources and essentially goes into a standby mode that is in some respects akin to how the device operates when it is otherwise idle and waiting for a call. This is beneficial for the device’s battery life (because while in dormant mode, much of the logic related to maintaining the active high-rate radio link can be powered off while the network link is idle). Dormant mode is also beneficial for network operators, in that it allows the resources previously dedicated to a particular device to be used by other users that are actively sending or receiving data.

When data is ready to be sent on either end of the link, the device will “wake up” and exit dormant mode in order to send/receive data, keeping the link fully active until the session has been idle for enough time for the dormant idle timer to expire. (Entering and exiting dormant mode is mostly seamless, but may involve delays of 1-2 seconds, at least in my experience, so doing so on every packet would be highly undesirable.)

What all of this means is that there are some significant gains to be had in terms of radio power consumption if network traffic can be limited to where it is really necessary. There are a couple of things that I had initially running on my device that didn’t really fit these criteria. For example, I have an always-active SSH connection to a screen session that, among other things, runs my console SILC client, which is based off of irssi (an IRC client). One gotcha that I encountered there is that in the default configuration, this client has a clock on the ncurses console UI. The clock is updated every minute with the current time, which is (normally) handy as you can at-a-glance compare the current timestamp with message timestamps in the backlog.

However, this becomes problematic in my case, because the clock forces the device out of dormant mode every minute in order to repaint that portion of the console UI. (Aside from the clock, the remainder of the SILC/irssi UI is static until there is actual chat activity, which would otherwise make it relatively well suited for this situation.) Fortunately, it turned out to be relatively easy to remove the clock from the client’s UI, which prevents the SSH session from causing network activity every minute while I’ve got the SILC client up.
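To get a feel for how much a once-a-minute repaint can cost, here is a rough model in Python. It is only a sketch: the function name, the 20 second dormant idle timer, and the one-hour window are assumptions picked for illustration, not values measured on the XV6800 or taken from the carrier’s network.

```python
def active_seconds(packet_times, dormant_timeout, horizon):
    """Seconds the radio link stays fully active within [0, horizon).

    Model: every packet wakes the link (or restarts the idle timer if the
    link is already up), and the link goes dormant once it has been idle
    for dormant_timeout seconds.
    """
    total = 0.0
    window_start = None
    window_end = None
    for t in sorted(p for p in packet_times if 0 <= p < horizon):
        if window_end is not None and t <= window_end:
            # Link is still up: the packet just restarts the idle timer.
            window_end = t + dormant_timeout
        else:
            # Link was dormant: account for the previous active window.
            if window_end is not None:
                total += min(window_end, horizon) - window_start
            window_start = t
            window_end = t + dormant_timeout
    if window_end is not None:
        total += min(window_end, horizon) - window_start
    return total


# One hour of clock repaints, one per minute (hypothetical 20 s idle timer).
print(active_seconds(range(0, 3600, 60), dormant_timeout=20, horizon=3600))  # 1200.0

# The same hour with only three packets of real chat activity.
print(active_seconds([10, 15, 500], dormant_timeout=20, horizon=3600))  # 45.0
```

Under these assumed numbers the clock alone keeps the link up for a third of the hour, and if the idle timer were 60 seconds or longer the device would never go dormant at all.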

Another network power management “gotcha” that I ran into was the built-in L2TP/IPsec VPN client. It turns out that while an L2TP/IPsec link is active, the device was, in my case, unable to enter dormant mode for any non-trivial length of time. I suspect that this is caused by the default behavior of L2TP to send keep-alive messages every minute to ensure that the session is still alive (at the very least, packet captures VPN-server-side did not seem to indicate any traffic destined to the device’s projected IP Address).

In any case, this unpleasant side effect of the L2TP/IPsec client ruled out using it for full network connectivity in an always-on fashion. Unfortunately, dialing the VPN link on-demand has its own share of pitfalls, as the device seems to make the VPN link the default gateway. When I configured IMAPv4 mail to be fetched every so often via demand-dialing the VPN link (which would alleviate the difficulty with dormant mode not operating as desired with the VPN link always on), my SSH session would often die if either side happened to try and send data while Pocket Outlook was polling IMAP mail.

My solution in this case was to fall back to using SSH port-forwards through PocketPuTTY for IMAP mail (after I fixed SSH port forwards to work reliably, that is, of course). Unlike the L2TP/IPsec link, the SSH link doesn’t inherently force the device out of dormant mode without there being real activity, which allows me to continue to leave the SSH connection always on without sacrificing my battery life.

These two changes seem to have fixed the worst offenders in terms of battery drain, but there’s still plenty of room for improvement. For example, PocketPuTTY will tell the remote end of the session that the window has resized when you switch the device from portrait to landscape mode (or vice versa), such as when pulling out the device’s built in keyboard. While this is a very handy feature if you’re actually using the SSH session (as it allows the remote console to resize itself appropriately, which is how SILC/irssi magically reshapes itself to fit the screen when the device keyboard is engaged), it does present the annoyance that pulling out the keyboard while PocketPuTTY is not active will still cause the packet radio link to be fully established to send the terminal resize notification and receive the resulting redraw data.

This behavior could, for instance, be further optimized to defer sending such notifications until the PocketPuTTY window becomes the foreground window (potentially aborting the resize notification entirely if the screen size is switched back to how it previously was before PocketPuTTY is activated).
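A minimal sketch of that deferral logic, written in Python rather than as a PocketPuTTY patch. The class and method names are hypothetical, and `send_resize` stands in for whatever actually emits the SSH window-change request:

```python
class DeferredResizeNotifier:
    """Hold back terminal resize notifications while in the background."""

    def __init__(self, send_resize, initial_size):
        self._send = send_resize          # hypothetical hook: callable(cols, rows)
        self._remote_size = initial_size  # size the remote end last heard about
        self._pending = None              # most recent size seen while hidden
        self._foreground = False

    def on_resize(self, cols, rows):
        if self._foreground:
            self._flush((cols, rows))
        else:
            # Only the latest size matters; older pending sizes are dropped.
            self._pending = (cols, rows)

    def on_foreground(self):
        self._foreground = True
        if self._pending is not None:
            self._flush(self._pending)

    def on_background(self):
        self._foreground = False

    def _flush(self, size):
        self._pending = None
        # Skip the notification entirely if the size ended up unchanged.
        if size != self._remote_size:
            self._send(*size)
            self._remote_size = size


sent = []
notifier = DeferredResizeNotifier(lambda c, r: sent.append((c, r)), (40, 30))
notifier.on_resize(53, 20)   # keyboard pulled out while in the background
notifier.on_resize(40, 30)   # keyboard pushed back in
notifier.on_foreground()
print(sent)                  # [] : nothing was sent, so the radio stayed dormant
```

The design choice here is to remember only the last size seen while hidden, so that any number of open/close cycles collapses into at most one notification.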

One of the nice things about having a (relatively) easily end-user-programmable device instead of a locked-down, DRM’ified BREW-based handset is that there’s actually opportunity to change these sorts of things instead of just blithely accepting what’s offered with the device with no real hope of improving it.

How to not write code for a mobile device

Monday, December 10th, 2007

Earlier this week, I got a shiny, brand-new XV6800 to replace my aging (and rather limited in terms of running user-created programs) BREW-ified phone.

After setting up ActiveSync, IPsec, and all of the other usual required configuration settings, the next step was, of course, to install the baseline minimum set of applications on the device to get by. Close to the top of this list is an SSH client (being able to project arbitrary console programs over SSH and screen to the device is, at the very least, a workable enough solution in the interim for programs that are not feasible to run directly on the device, such as my SILC client). I’ve found PuTTY a fairly acceptable client for “big Windows”, and there just so happened to be a Windows Mobile port of it. Great, or so I thought…

While PocketPuTTY does technically work, I noticed a couple of deficiencies with it pretty quickly:

  1. The page up / page down keys are treated the same as the up arrow / down arrow keys when sent to the remote system. This sucks, because I can’t access my scrollback buffer in SILC easily without page up / page down. Other applications can tell the difference between the two keys (obviously, or there wouldn’t be much of a point to having the keys at all), so this one is clearly a PocketPuTTY bug.
  2. There is no way to restart a session once it is closed, or at least, not that I’ve found. Normally, on “big Windows”, there’s a “Restart Session” menu option in the window menu of the PuTTY session, but (as far as I can tell) there’s no such equivalent to the window menu on PocketPuTTY. There is a “Tools” menu, although it has some rather useless menu items (e.g. About) instead of some actually useful menu items, like “Restart Session”.
  3. Running PocketPuTTY seems to have a significantly negative effect on battery life. This is really unfortunate for me, since the expected use is to leave an SSH session to a terminal running screen for a long period of time. (Note that this was resolved, partially, by locating a slightly more maintained copy of PocketPuTTY.)
  4. SSH port forward support seems to be fairly broken, in that as soon as a socket is cleaned up, all receives in the process break until you restart it. This is annoying, but workable if one can go without SSH port forwards.

Most of these problems are actually not all that difficult to fix, and since the source code is available, I’m actually planning on trying my hand at doing just that, since I expect that this is an app that I’ll make fairly heavy use of.

The battery life problem (the third item above) is the one I really want to call attention to among these deficiencies, however. My intent here is not to bash the PocketPuTTY folks (and I’m certainly happy that they’ve at least gotten a (semi)-working port of PuTTY with source code out, so that other people can improve on it from there), but rather to point out some things that should really just not be done when you’re writing code that is intended to run on a mobile device (especially if the code is intended to run exclusively on a mobile device).

On a portable device, one of the things that most users really expect is long battery life. Though this particular point certainly holds true for laptops as well, it is arguably even more important for converged mobile phone devices. After all, most people consider their phone an “always on” contact mechanism, and unexpectedly running out of battery life is extremely annoying in this aspect. Furthermore, if your mobile phone has the capability to run all sorts of useful programs on it, but doing so eats up your battery in a couple of hours, then there is really not that much point in having that capability at all, is there?

Returning to PocketPuTTY, one of the main problems (at least with the version I initially used) was, again, that PocketPuTTY would reduce battery life significantly. Looking around in the source code for possible causes, I noticed the following gem, which turned out to be the core of the network read loop for the program.

Yes, there really is a Sleep(1) spin loop in there, in a software port that is designed to run on battery powered devices. For starters, that blows any sort of processor power management completely out of the water. Mobile devices have a lot of different components vying for power, and the easiest (and most effective) way to save on power (and thus battery life) is to not turn those components on. Of course, it becomes difficult to do that for power hungry components like the 400MHz CPU in my XV6800 if there’s a program that has an always-ready-to-run thread…
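To make the difference concrete, here is a small Python sketch of the two approaches. This is not the PocketPuTTY source; the function names are made up for illustration, and a local socket pair stands in for the SSH connection. The polling reader wakes up roughly every millisecond to ask whether anything has arrived, while the blocking reader sleeps in the kernel until there is actually data.

```python
import select
import socket
import threading
import time


def read_by_polling(sock):
    """Anti-pattern: wake up about every millisecond to check for data."""
    wakeups = 0
    while True:
        wakeups += 1
        ready, _, _ = select.select([sock], [], [], 0)  # zero timeout: just a check
        if ready:
            return sock.recv(4096), wakeups
        time.sleep(0.001)


def read_by_blocking(sock):
    """Block until the socket is readable: a single wakeup in total."""
    select.select([sock], [], [])  # no timeout: sleep until data arrives
    return sock.recv(4096), 1


def demo(reader, delay=0.2):
    """Deliver one message after `delay` seconds and count the reader's wakeups."""
    a, b = socket.socketpair()
    try:
        timer = threading.Timer(delay, b.sendall, args=(b"hello",))
        timer.start()
        data, wakeups = reader(a)
        timer.join()
        return data, wakeups
    finally:
        a.close()
        b.close()


if __name__ == "__main__":
    print("polling: ", demo(read_by_polling))
    print("blocking:", demo(read_by_blocking))
```

The exact wakeup count for the polling reader depends on the timer granularity of the platform, but it grows with how long the connection sits idle, which is exactly the wrong property for a session that is meant to sit idle for hours.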

Fortunately, there happened to be a newer revision of PocketPuTTY floating about with the issue fixed (although getting ahold of the source code for that version proved to be slightly more difficult, as the original maintainer of the project seems to have disappeared). I did eventually manage to get into contact with someone who had been maintaining their own set of improvements and grab a not-so-crusty-old source tree from them to do my own improvements, primarily for the purposes of fixing some of the annoyances I mentioned previously (thus beginning my initial foray into Windows CE development).

Linux features that I’d like to see in Windows: iptables

Wednesday, November 7th, 2007

One of the things that I really miss about Linux-based boxen when I’m working with Windows from time to time is the fact that the built-in Windows firewall capabilities are just downright anemic when compared to the power and flexibility of iptables.

Sure, there’s Windows Firewall, RRAS, Advanced TCP/IP Filtering (which is anything but advanced), and IPSec Policies that come with Windows and allow you to firewall things off. Unfortunately, while Windows Firewall and RRAS (with respect to “Basic Firewall” in Windows Server 2003) do a passable job of an inbound host firewall, there is really just nothing that comes with Windows that is reasonably good at managing a complicated network (e.g. multi-machine) firewall.

RRAS has built-in static packet filtering, but it’s downright ridiculously limited given that it is ostensibly oriented towards network administrators (who should, theoretically, know what they’re doing). You essentially have the option of creating either an allow list with a default deny, or a deny list with a default allow, and that’s it. (There’s also not really any support for stateful packet filtering in this mode of RRAS, as is available from Basic Firewall, although you can at least differentiate between established and non-established TCP packets. Barely.)

IPSec Policies are slightly less limited than RRAS static packet filtering, but they’re still nowhere near expressive enough for any sort of non-trivial network firewall configuration. You can at least mix and match allow and deny rules, but the ordering is only based on netmask and, as far as I know, is otherwise not user controllable.

Iptables and ip_conntrack, on the other hand, are highly expressive and allow one to comparatively easily create rules that are either downright impossible or extremely difficult to do (e.g. requiring convoluted use of both RRAS static packet filtering and IPSec Policies) in any manageable fashion with the built-in Windows firewall tools. As an added bonus, they also have highly flexible NAT capabilities built-in that easily integrate and cooperate with firewall rules.
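A large part of that expressiveness comes from something very simple: an ordered chain of rules, evaluated first-match-wins, with a default policy at the end. The Python sketch below is a deliberately simplified model of that idea, not iptables itself (no connection tracking, no NAT, no chains or jumps); the `Rule` and `evaluate` names and the example addresses are made up for illustration.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    action: str                   # "ACCEPT" or "DROP"
    src: str = "0.0.0.0/0"        # source network in CIDR notation
    dport: Optional[int] = None   # None matches any destination port

    def matches(self, src_ip, dport):
        in_net = ipaddress.ip_address(src_ip) in ipaddress.ip_network(self.src)
        return in_net and (self.dport is None or self.dport == dport)


def evaluate(rules, policy, src_ip, dport):
    """Return the action of the first matching rule, else the default policy."""
    for rule in rules:
        if rule.matches(src_ip, dport):
            return rule.action
    return policy


# Allow SSH from the 10.0.0.0/8 network, except for one untrusted subnet,
# and drop everything else.
rules = [
    Rule("DROP", src="10.0.5.0/24", dport=22),
    Rule("ACCEPT", src="10.0.0.0/8", dport=22),
]

print(evaluate(rules, "DROP", "10.0.5.7", 22))     # DROP
print(evaluate(rules, "DROP", "10.1.2.3", 22))     # ACCEPT
print(evaluate(rules, "DROP", "192.168.1.1", 22))  # DROP
```

Because the administrator controls the order, allow and deny rules can be interleaved freely, which is the piece that a pure allow list or pure deny list cannot express.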

Now, it’s not really a matter of there being anything that is technically wrong or deficient with the Windows networking stack that would prevent there being a reasonably high quality firewall, but more that just nobody has gone out and done it and shipped it with the platform. (No, I don’t count the “personal firewall” type things that ship with XYZ AV/“Home Security” product as anywhere in this category. I don’t trust those far enough to not create security holes, much less act competently as a firewall.)

There are a number of various third party firewall packages out there, but I tend to be fairly suspicious of installing third party code on my boxes in general, much less third party kernel level code that is facing the network outside of any firewall or packet filtering. Most of them don’t seem to have anywhere near the sort of capabilities that iptables provides, anyway.

Video card + 3D game = dropped 802.11 (wifi) link?

Thursday, August 30th, 2007

It seems I have no luck at all with bizarre problems lately.

So, I’m on vacation, and there’s an 802.11g network here that I’m using for my network link. Where I’ve got my laptop, it’s really right on the fringe of coverage, just barely able to maintain a stable link. The manufacturer’s 802.11 diagnostics utility reports signal as -87 ~ -88 dBm, and noise as -87 dBm in this configuration; not exactly ideal RF conditions. (There exist no other SSID-broadcasting 802.11 networks in close proximity, at least as far as I can tell without downloading some sort of more specialized (and capable) link sniffing software.)

Anyways, the connection is usable like this, although it’s as I said on the very edge of tipping over into lossy or just plain lost link land. I’ve determined that, for instance, using Bluetooth A2DP to my headphones injects enough traffic into the 2.4GHz RF environment to push the 802.11 link to doing something like ~10-20% packet loss (not exactly all that friendly for TCP). Without A2DP streaming audio, packet loss is somewhere around ~1% or less, which is usable, although the link doesn’t appear to be able to do more than half a megabit reliably. Again, not all that great, but it at least works.. mostly.

That is, until I start any program that kicks the video card (a GeForce 7950 Go GTX) into high gear. If I do that, the 802.11 diagnostics utility immediately registers somewhere in the neighborhood of a 4-6 dB increase in noise, which is enough to completely swamp the already relatively weak net signal and cause the link to drop entirely shortly after. I guess that the video card is not quite sufficiently shielded relative to the 802.11 antenna on my laptop. (In this particular case, my box is the only system on the 802.11 link; it’s definitely not a problem related to someone else on the same AP.)
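The arithmetic behind that is worth spelling out, since decibel figures are easy to underestimate. The short Python sketch below uses the readings quoted above; the helper names are my own, and the figures are only as accurate as the diagnostics utility’s reporting.

```python
def snr_db(signal_dbm, noise_dbm):
    """Signal-to-noise ratio in dB: a difference of two dBm readings."""
    return signal_dbm - noise_dbm


def db_to_power_ratio(db):
    """Convert a change in dB into a linear power ratio."""
    return 10 ** (db / 10)


# Idle video card: signal of -87 to -88 dBm against a -87 dBm noise floor.
print(snr_db(-87, -87), snr_db(-88, -87))          # 0 -1

# Busy video card: the noise floor rises by 4 to 6 dB.
print(snr_db(-87, -87 + 4), snr_db(-88, -87 + 6))  # -4 -7

# What a 4 or 6 dB rise means in terms of actual noise power.
print(round(db_to_power_ratio(4), 2), round(db_to_power_ratio(6), 2))  # 2.51 3.98
```

In other words, a link that was already sitting at roughly 0 dB of margin ends up with the noise power multiplied by about 2.5 to 4 times, so it is not too surprising that it falls over.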

To better illustrate the problem, I took a screenshot of the combined signal and noise graph history from the 802.11 diagnostics utility; you can very clearly see when I had a program using the video card up and running:

802.11 Noise Graph Screenshot

Now, normally, this isn’t a problem when I use 802.11 at my apartment, even with playing games or watching video over 802.11, but then again I’m normally well into the “green” as far as signal goes there, and my apartment 802.11 link uses a superior radio technology to boot compared to where I’m staying right now (802.11n as opposed to 802.11g).

So, naturally, frustration ensues. I can do whatever I want on the link, except (of course!) play games. And, of course, since it’s vacation, that sucks all the more – so much for retiring from a long day of plane flights to blast some bad guys in a video game.

Well, that’s overstating it a bit. I can always sit my laptop down elsewhere, but the problem was curious enough that I decided to investigate why my network connection was magically disappearing the moment I tried to play any game. (Update: It actually appears to vary based on how intensively the game uses the video card, which makes sense as it’s a laptop video card that’s supposed to conserve power as much as possible. For example, starting Quake 3 only creates enough noise to cause significant packet loss, but World of Warcraft causes the video card to generate enough noise above that to result in the link flaking out entirely.)

It’s actually a kind of interesting problem, though, if one that caught me totally off-guard. I have to wonder what other operations will generate enough internal noise in a laptop to noticeably affect 802.11. It also makes you wonder if someone might be able to guess when you’re, say, slacking off and playing games on the office 802.11 link by observing indications of link quality from your machine (irrespective of whether your traffic is encrypted or not)…

Analysis of a networking problem: The case of the mysterious SMB connection resets (or “How to not design a network protocol”)

Monday, December 4th, 2006

Recently, I had the unpleasant task of troubleshooting a particularly strange problem at work, in which a particular SMB-based file server would disconnect users if more than one user attempted to simultaneously initiate a file transfer. This would typically manifest as a file transfer (i.e. drag and drop file copy) initiated by Explorer failing with a “The specified network name is no longer available.” (error #64) dialog box. As this particular problem involved SMB traffic going through a fair amount of custom code we had written, being a VPN/remote access company (this particular problem was seemingly only occurring when using our VPN software), we (naturally) assumed that the issue was caused by some sort of bug in our software that was doing something to upset either the Windows SMB client or SMB server.

So, first things first; after doing initial troubleshooting, we obtained some packet captures of the problematic SMB server (and SMB clients), detailing just what was happening on the network when the problem occurred.

Unfortunately, in this particular case, the packet captures did not really end up providing a whole lot of useful information; they did, however, succeed in raising more questions. What we observed is that in the middle of an SMB datastream, the SMB server would just mysteriously send a TCP RST packet, thereby forcibly closing the TCP connection on which SMB was running. This corresponded exactly with one of the file share clients getting the dreaded error #64 dialog, but there was no clue as to what the problem was. In this particular case, there was no packet loss to speak of, and nothing else to indicate some kind of connectivity problem; the SMB server just simply sent an RST seemingly out of the blue to one of the SMB clients shortly after a different SMB client attempted to initiate a file transfer.

To make matters worse, there was no correlation at all as to what the SMB client whose connection got killed was doing when the connection got reset. The client could be either submitting a read request for more data, waiting for a previously sent read request to finish processing, or doing any other operation; the SMB server would just mysteriously close the connection.

In this particular case, the problem would also only occur when SMB was used in conjunction with our VPN software. When the SMB server was accessed over the LAN, the SMB connection would operate fine in the presence of multiple concurrent users. Additionally, when the SMB server was used in conjunction with alternative remote access methods other than our standard VPN system, the problem would mysteriously vanish.

By this time, this problem was starting to look like a real nightmare. The information we had said that there was some kind of problem that was preventing SMB from being used with our VPN software (which obviously would need to be fixed, and quickly), and yet gave no actual leads as to what might cause the problem to occur. According to logs and packet captures, the SMB server would just arbitrarily reset connections of users connecting to SMB servers when used in conjunction with our VPN software.

Fortunately, we did eventually manage to duplicate the problem in-house with our own internal test network. This eventually turned out to be key to solving the problem, but did not (at least initially) provide any immediate new information. While it did prevent us from having to bother the customer this problem was originally impacting while troubleshooting, it did not immediately get us closer to understanding the problem. In the meantime, our industrious quality assurance group had engaged Microsoft in an attempt to see if there was any sort of known issue with SMB that might possibly explain this problem.

After spending a significant amount of time doing exhaustive code reviews of all of our code in the affected network path, and banging our collective heads on the wall while trying to understand just what might be causing the SMB server to kill off users of our VPN software, we eventually ended up hooking up a kernel debugger to an SMB server machine exhibiting this problem in order to see if I could find anything useful by debugging the SMB server (which is a kernel mode driver known as srv.sys). After not getting anywhere with that initially, I decided to try and trace the problem from the root of its observable effects; the reset TCP connection. Through the contact that our quality assurance group had made with Microsoft Product Support Services (PSS), Microsoft had supplied us with a couple of hotfixes for tcpip.sys (the Windows TCP stack) for various issues that ultimately turned out to be unrelated to the underlying trouble with SMB (and did not end up alleviating our problem). Although the hotfixes we received didn’t end up resolving our problem, we decided to take a closer look at what was happening inside the TCP state machine when the SMB connections were being reset.

This turned out to hit the metaphorical jackpot. I had set a breakpoint on every function in tcpip.sys that is responsible for flagging TCP connections for reset, and the call stack caught the SMB server (srv.sys) red-handed:

kd> k
ChildEBP RetAddr  
f4a4fa98 f4cad286 tcpip!SendRSTFromTCB
f4a4fac0 f4cb0ee3 tcpip!CloseTCB+0xbc
f4a4fad0 f4cb0ec7 tcpip!TryToCloseTCB+0x38
f4a4faf4 f4cace69 tcpip!TdiDisconnect+0x205
f4a4fb40 f4cabbed tcpip!TCPDisconnect+0xfd
f4a4fb5c 804e37f7 tcpip!TCPDispatchInternalDeviceControl+0x14d
f4a4fb6c f4c7d132 nt!IopfCallDriver+0x31
f4a4fb84 f4c7cd93 netbt!TdiDisconnect+0x10a
f4a4fbb4 f4c7d0b8 netbt!TcpDisconnect+0x40
f4a4fbd4 f4c7d017 netbt!DisconnectLower+0x42
f4a4fc14 f4c7d11f netbt!NbtDisconnect+0x339
f4a4fc44 f4c7ba5a netbt!NTDisconnect+0x4b
f4a4fc60 804e37f7 netbt!NbtDispatchInternalCtrl+0xb4
f4a4fc70 f43fb9c1 nt!IopfCallDriver+0x31
f4a4fc7c f441e8db srv!StartIoAndWait+0x1b
f4a4fcb0 f441f13e srv!SrvIssueDisconnectRequest+0x4d
f4a4fccc f43eed3e srv!SrvDoDisconnect+0x18
f4a4fce4 f440d3b5 srv!SrvCloseConnection+0xec
f4a4fd18 f44007ae srv!SrvCloseConnectionsFromClient+0x163
f4a4fd88 f43fba98 srv!BlockingSessionSetupAndX+0x274

As it turns out, the SMB server was explicitly disconnecting the pre-existing SMB client when the second SMB client tried to set up a session with the SMB server. This explains why even though the pre-existing SMB client was operating normally, not breaking the SMB protocol and running with no packet loss, it would mysteriously have its connection reset for apparently no good reason.

Further analysis on packet captures revealed that there was always a correlation to one client sending an SMB_SESSION_SETUP_ANDX command to the SMB server and all other clients on the SMB server being abortively disconnected. After digging around a bit more in srv.sys, it became clear that what was happening is that the SMB server would explicitly kill all SMB connections from an IPv4 address that was opening a new SMB session (via SMB_SESSION_SETUP_ANDX), except for the new SMB TCP connection. It turns out that this behavior is engaged if the SMB client sets the “VcNumber” field of the SMB_SESSION_SETUP_ANDX request to a zero value (which the Windows SMB redirector, mrxsmb.sys, does), and the client enables extended security on the connection (this is typically true for modern Windows versions, say Windows XP and Windows Server 2003).

This explains the problem that we were seeing. In one of the configurations, this customer was set up to have remote clients NAT’d behind a single LAN IP address. This, when combined with the SMB redirector sending a zero VcNumber field, resulted in the SMB server killing everyone else’s SMB connections (behind the NAT) when a new remote client opened an SMB connection through the NAT. Additionally, this also fit with an additional piece of information that we eventually uncovered from running into this trouble with customers; one customer in particular had an old Windows NT 4.0 fileserver which always worked perfectly over the VPN, but their newer Windows Server 2003 boxes would tend to experience these random disconnects. This was because NT4 doesn’t support the extended security options in the SMB protocol that Windows Server 2003 does, further lending credence to this particular theory. (It turns out that the code path that disconnects users for having a zero VcNumber is only active when extended security is being negotiated on an SMB session.)
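The behavior is easy to reproduce in a toy model. The Python sketch below is only a simulation of what we observed from the outside (the class and method names are invented, and it is in no way the actual srv.sys logic): on a session setup with a zero VcNumber and extended security, the server drops every other connection that shares the new connection’s source address.

```python
class SmbServerModel:
    """Toy model of the observed zero-VcNumber disconnect behavior."""

    def __init__(self):
        self.connections = {}  # connection id -> source IP as seen by the server
        self._next_id = 1

    def accept(self, src_ip):
        conn_id = self._next_id
        self._next_id += 1
        self.connections[conn_id] = src_ip
        return conn_id

    def session_setup(self, conn_id, vc_number, extended_security=True):
        """Process a session setup; return the ids of connections that got reset."""
        if vc_number != 0 or not extended_security:
            return []
        src_ip = self.connections[conn_id]
        victims = [cid for cid, ip in self.connections.items()
                   if ip == src_ip and cid != conn_id]
        for cid in victims:
            del self.connections[cid]  # the server sends these clients a RST
        return victims


server = SmbServerModel()
nat_ip = "192.0.2.10"  # documentation address standing in for the NAT's LAN IP

alice = server.accept(nat_ip)
print(server.session_setup(alice, vc_number=0))  # []

bob = server.accept(nat_ip)                      # a second user behind the same NAT
print(server.session_setup(bob, vc_number=0))    # [1]: alice's connection is reset
```

Passing `extended_security=False` models the NT4 case, where the same sequence leaves the first connection alone.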

Additionally, someone else at work managed to dig up knowledge base article 301673, otherwise titled “You cannot make more than one client connection over a NAT device”. Evidently, the problem is actually documented (and has been so since Windows 2000), though the listed workaround of using NetBIOS over TCP doesn’t seem to actually work. (Aside, it’s pretty silly that of all things, Microsoft is recommending that people fall back to NetBIOS. Haven’t we been trying to get rid of NetBIOS for years and years now…?).

Looking around a bit on Google for documentation about this particular problem, I ran into this article about SMB/CIFS which documented the shortcoming relating to SMB and NAT. Specifically:

Whenever a new transport-layer connection is created, the client is supposed to assign a new VC number. Note that the VcNumber on the initial connection is expected to be zero to indicate that the client is starting from scratch and is creating a new logical session. If an additional VC is given a VcNumber of zero, the server may assume that any existing connections with that same client are now bogus, and shut them down.

Why do such a thing?

The explanation given in the LANMAN documentation, the Leach/Naik IETF draft, and the SNIA doc is that clients may crash and reboot without first closing their connections. The zero VcNumber is the client’s signal to the server to clean up old connections. Reasonable or not, that’s the logic behind it. Unfortunately, it turns out that there are some annoying side-effects that result from this behavior. It is possible, for example, for one rogue application to completely disrupt SMB filesharing on a system simply by sending Session Setup requests with a zero VcNumber. Connecting to a server through a NAT (Network Address Translation) gateway is also problematic, since the NAT makes multiple clients appear to be a single client by placing them all behind the same IP address.

So, at least we (finally) found out just what was causing the mysterious resets. Apparently, from a protocol design standpoint, SMB is deliberately incompatible with NAT.

As someone who has designed network protocols before, this just completely blows my mind. I cannot think of any good reason to possibly justify breaking NAT, especially with the incredible proliferation of NAT devices in everyday life (it seems like practically everyone has a cable/DSL/wireless router type device that does NAT today, not to mention the increased pressure to reuse IP addresses as pressure on the limited IPv4 address space grows). Not to mention that the Windows file sharing protocol has to be one of the most widely used networking protocols in the world. Breaking NAT for that (evidently just for the sake of breaking NAT!) just seems like such an incredibly horrible, not-well-thought-out design decision. I am usually fairly impressed by how well Microsoft designs their software (particularly their kernel software), but this particular little part of the SMB protocol design is just uncharacteristically completely wrong.

Even more unbelievably, the stated reason that Microsoft gave for this behavior is an optimization to handle the case where the user’s computer has crashed, rebooted, and reconnected to the server before the SMB server’s TCP stack has noticed that the crashed SMB client’s TCP connection has gone away. Evidently, conserving server resources in the case of a client crash is more important than being compatible with NAT. One has to wonder how unstable the Microsoft SMB redirector must have been at the time that this “feature” was added to the SMB protocol to make anyone in their right mind consider such an absolutely ridiculous, mind-bogglingly bad tradeoff.

To date, we haven’t had a whole lot of luck in trying to get Microsoft to fix this “minor” problem in SMB. I’ll post an update if we ever succeed, but as of now, things are unfortunately not looking very promising on that front.

VMware Server and RDP don’t always play nicely together.

Wednesday, July 5th, 2006

Steve already stole my thunder (well, if that makes sense, since it was my paper anyway) by posting my analysis of this earlier, but I figure that it is also worth discussing here.

 Recently, I finally* got a new development box at work – multiproc, x64 capable (with the ability to run 64-bit VMs too!), lots of RAM, generally everything you would want in a really nice development box.  Needless to say, I was rather excited to see what I could do with it.  The first thing I had in mind was setting up a dedicated set of VMs to run my test network on and host various dedicated services such as our symbol server here at the office.

 (*: There is a long, sad story behind this.  For a long time, I’ve been having a VM running on an ancient ~233MHz box that nobody else at the office wanted (for obvious reasons!).  I had been trying to get a replacement box that didn’t suck so much to put this VM (and others) on to run full time, but just about everything that could possibly go wrong with requesting a purchase from work did go wrong, resulting in it being delayed on the order of over half a year…).

The box came with Windows XP Professional x64 Edition installed, so I figured that I might as well use the install instead of blowing it away and putting Windows Server 2003 on for now.  As it turned out, this came around to bite me later.  After installing all of the usual things (service packs, hotfixes, and so forth), I went to grab the latest VMware Server installer so that I could put the box to work running my library of VMs.  Everything seemed to be going okay at the start, until I began to do things that were a bit outside the box, so to speak.  Here, I wanted to have my XP x64 box route through a VM running on the same computer.  Why on earth would I possibly want to do that, you ask?  Well, I have an internal VPN-based network that overlays the office network here at work and connects all of the VMs I have running on various boxes at the office.  I wanted to be able to interconnect all of those VMs with various services (in particular, lots and lots of storage space) running on the beefy x64 box over this trusted VPN network instead of the public office network (which I have for testing purposes designated the untrusted Internet network).  If I have the x64 box routing through something that is connected to the entire overlay network, then I don’t need to worry about creating connections to every single other VM in existence to grant access to those resources.  (At this point, our x64 support is still in beta, and XP doesn’t have a whole lot of useful support for dedicated VPN links.)

 Anyways, things start to get weird when I finally get this setup going.  The first problem I run into is that sometimes on boot, all of the VMs that I had configured to autostart would appear to hang on startup – I would have to go to Task Manager and kill the vmware-vmx.exe processes, then restart the vmserverdWin32 service before I could get them to come up properly.  After a bit of poking around, I noticed a suspicious Application Eventlog entry that seemed to correlate with when this problem happened on a boot:

Event Type: Information
Event Source: VMware Registration Service
Event Category: None
Event ID: 1000
Date:  6/13/2006
Time:  2:10:06 PM
User:  N/A
Computer: MOGHEDIEN
Description:
vmserverdWin32 took too much time to initialize.

Hmm… that doesn’t look good.  Well, digging a bit deeper, it turns out that VMware Server has several different service components, and apparently there are dependencies between them.  However, the VMware Server developers neglected to properly assign dependencies between all of the services; instead, they appear to have just let the services start in whatever order and have a timeout window in which the services are supposed to establish communication with each other.  Unfortunately, this tends to randomly break on some configurations (like mine, apparently).

Fortunately, the fix for this problem turned out to be fairly easy.  Using sc.exe, the command line service configuration app (which used to ship with the SDK, but now ships with Windows XP and later – a handy tool to remember), I added an SCM dependency between the main VMware launcher service (“vmserverdWin32”) and the VMware authorization service (“VMAuthdService”):

C:\Documents and Settings\Administrator>sc config vmserverdWin32 depend= RPCSS/VMAuthdService
[SC] ChangeServiceConfig SUCCESS
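
To make the effect of that dependency concrete, here is a minimal Python sketch of dependency-ordered startup.  It is only a model, not how the SCM or VMware Server is implemented: the dependency map is a simplified assumption that contains just the three services named in the command above, and the function name start_order is made up for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Simplified, assumed dependency map: service -> services that must start first.
# The vmserverdWin32 entry mirrors "depend= RPCSS/VMAuthdService" from the
# sc.exe command; the other two are treated as having no dependencies here,
# which is an assumption made purely to keep the model small.
DEPENDS_ON = {
    "RPCSS": [],
    "VMAuthdService": [],
    "vmserverdWin32": ["RPCSS", "VMAuthdService"],
}


def start_order(depends_on):
    """Return one start order in which every service follows its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    # vmserverdWin32 always comes after both services it depends on.
    print(start_order(DEPENDS_ON))
```

With an explicit dependency, the ordering is guaranteed up front instead of being left to a startup race that has to resolve itself inside a timeout window.
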
After fixing the service dependencies, everything seemed to be okay, but of course, that wasn’t really the case…

When I went home later that day, I decided to VPN into the office and RDP into my new development box in order to change some hardware settings on one of my VMs.  In this particular case, some of the VPN traffic from my apartment to the development box at the office happened to pass through that router VM which I had running on the development box.  Whenever I tried to RDP into the development box, the connection would hang right after I entered valid logon credentials at the winlogon prompt, and stay hung until TCP gave up and broke off the connection.  This happened every single time I tried to RDP into my new box, but the office connection was fine (I could still connect to other things at the office while this was happening).  Definitely not cool.  So, I opened a session on our development server at the office and decided to try an experiment – ping my new dev box from it while I try to RDP in.  The initial results of this experiment were not at all what I expected; my dev box responded to pings the whole time, even while it was apparently unreachable over RDP and the TCP connection was timing out.  The next time I tried RDPing in, I ran a ping from my home box to my dev box, and the pings were dropped while I was trying to make the RDP session connection to the console session after providing valid logon credentials, and yet the box still responded to pings from a different box at the office.

After poking around a bit more, I determined that every single VM on my brand new dev box would just freeze and stop responding whenever I tried to RDP into my dev box from home (but not from the office).  To make matters even more strange, I could connect to a different box at the office, and bounce RDP through that box to my new dev server and things would work fine.  Well, that sucks – what’s going on here?  A partial explanation stems from how exactly I had set up the routing on my new dev box; the default gateway was set to my router VM (running on that box) using one of the VMnet virtual NICs, but I had left the physical NIC on the box still bound to TCP (without a default gateway set however).  So, for traffic destined to the office subnet, there is no need for packets to traverse the router VM – but for traffic from the VPN connection to my home, packets are routed through the router VM.
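
The routing split described above can be sketched as a longest-prefix route lookup.  This is a rough Python model under stated assumptions: every address and subnet below is hypothetical (none of them come from my actual network), and a real Windows routing table also weighs interface metrics, which this ignores.

```python
import ipaddress

# Hypothetical routing table for the dev box: (destination, next hop, interface).
# All addresses are invented for illustration.
ROUTES = [
    (ipaddress.ip_network("192.168.1.0/24"), "on-link", "physical NIC"),
    (ipaddress.ip_network("10.10.0.0/24"), "on-link", "VMnet NIC"),
    # The default route points at the router VM via the VMnet virtual NIC.
    (ipaddress.ip_network("0.0.0.0/0"), "10.10.0.1 (router VM)", "VMnet NIC"),
]


def pick_route(dest):
    """Return the matching route with the longest prefix for dest."""
    addr = ipaddress.ip_address(dest)
    matches = [route for route in ROUTES if addr in route[0]]
    return max(matches, key=lambda route: route[0].prefixlen)


if __name__ == "__main__":
    # A peer on the (hypothetical) office subnet is reached directly.
    print(pick_route("192.168.1.50"))
    # Replies to a (hypothetical) home VPN client fall through to the
    # default route and therefore traverse the router VM.
    print(pick_route("172.16.5.9"))
```

In this model, only traffic that falls through to the default route depends on the router VM being alive, which matches the symptom that office machines could still reach the box while my home connection could not.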

 Given this information, it seemed that I had at least found why the problem was happening, on some level – whenever I tried to RDP into my new dev box over the VPN, all of the VMs on my new dev box would freeze.  Because traffic through the VPN to my new dev box is routed through a VM on the new dev box, the RDP connection stalls and times out (because the router VM has hung).

 At this point, I had to turn to a debugger to understand what was going on.  Popping the vmware-vmx.exe process corresponding to the router VM open in the debugger and comparing call stacks between when it was running normally and when it was frozen while I was trying to RDP in pointed to the thread that emulated the virtual CPU becoming blocked on an internal NtUser call to win32k.sys.  At this point, I couldn’t really do a whole lot more without resorting to a kernel debugger, making that my next step.

 With the help of kd, I was able to track down the problem a bit further; the vmware CPU simulator thread was blocking on acquiring the global win32k NtUser lock that almost all NtUser calls acquire at the start of their implementation.  With the `!locks’ command, I was able to track down the owner of the lock – which happened to be (surprise!) a Terminal Server thread in CSRSS for the console session.  This thread was waiting on a kernel event, which turns out to be signalled when the RDP TCP transport driver receives data from the network.  So, we have a classical deadlock situation; the router VM is blocking on win32k’s internal NtUser global lock, and there is a CSRSS thread that is holding the win32k internal NtUser global lock while waiting on network I/O (from the RDP client).  Because the RDP client (me at home connecting through the VPN) needs to route traffic through the router VM to reach the RDP TCP transport on my new dev box, everything appears to freeze until the TCP connection times out.

 Unfortunately, there isn’t really a very good solution to this problem.  Installing Windows Server 2003 would have helped, in my case, because then VMware Server and its services would be running on session 0, and RDP connections would be diverted to new Terminal Server sessions (with their own per-session-instanced win32k NtUser locks), thus avoiding the deadlock (unless you happened to connect to Terminal Server using the `/console’ option).
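
The deadlock, and the reason per-session locks would avoid it, can be modeled in a few lines of Python.  This is a toy sketch rather than anything resembling win32k: the function and variable names are invented, an ordinary threading.Lock stands in for the NtUser global lock, an Event stands in for the "RDP data received" kernel event, and a short timeout stands in for the TCP connection timing out.

```python
import threading


def simulate(shared_lock, tcp_timeout=0.3):
    """Return True if the RDP-side thread received its data before timing out."""
    rdp_lock = threading.Lock()  # stand-in for the lock held by the CSRSS thread
    # With shared_lock=True the VM thread needs the very same lock (the XP
    # console-session case); otherwise it gets its own (per-session locks).
    vm_lock = rdp_lock if shared_lock else threading.Lock()
    lock_held = threading.Event()
    data_arrived = threading.Event()
    result = {}

    def csrss_thread():
        with rdp_lock:
            lock_held.set()
            # Holds the lock while waiting on network data from the RDP client.
            result["rdp_got_data"] = data_arrived.wait(tcp_timeout)

    def router_vm_thread():
        lock_held.wait()
        # The virtual CPU thread must take the lock before it can make
        # progress and forward the packet the RDP side is waiting for.
        with vm_lock:
            data_arrived.set()

    threads = [
        threading.Thread(target=csrss_thread),
        threading.Thread(target=router_vm_thread),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["rdp_got_data"]


if __name__ == "__main__":
    print("shared lock:     ", simulate(shared_lock=True))
    print("per-session lock:", simulate(shared_lock=False))
```

With the shared lock, the data can only be forwarded after the waiting thread has already given up and released the lock, which is the same shape as everything freezing until the TCP connection times out.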

 So there you have it – why VMware Server and RDP can make a bad mix sometimes.  This is a real shame, too, because RDPing into a box and running the VMware Server console client “locally” is sooo superior to running the VMware Server console client over the network (updates *much* faster, even over a LAN).

 If you’re interested, I did a writeup of most of the technical details of the actual debugging (with WinDbg and kd) of this problem that you can look at here – you are encouraged to do so if you want to see some of the steps I took in the debugger to further analyze the problem.

In the future, I’ll try not to gloss over some of the debugger steps so much in blog posts; for this time, I had already written the writeup beforehand, and didn’t want to just reformat the whole thing for an entire blog post.

 Whew, that was a long second post – hopefully, future ones won’t be quite so long-winded (if you consider that a bad thing).  Hopefully, future posts won’t be written at 1am just before I go to sleep, too…