VMware Server and RDP don’t always play nicely together.

Steve already stole my thunder (well, if that makes sense, since it was my paper anyway) by posting my analysis of this earlier, but I figure that it is also worth discussing here.

Recently, I finally* got a new development box at work – multiproc, x64 capable (with the ability to run 64-bit VMs too!), lots of RAM, generally everything you would want in a really nice development box.  Needless to say, I was rather excited to see what I could do with it.  The first thing I had in mind was setting up a dedicated set of VMs to run my test network on and to host various dedicated services, such as our symbol server here at the office.

(*: There is a long, sad story behind this.  For a long time, I had a VM running on an ancient ~233MHz box that nobody else at the office wanted (for obvious reasons!).  I had been trying to get a replacement box that didn’t suck so much to run this VM (and others) on full time, but just about everything that could possibly go wrong with requesting a purchase from work did go wrong, resulting in the order being delayed by over half a year…)

The box came with Windows XP Professional x64 Edition installed, so I figured that I might as well use that install instead of blowing it away and putting Windows Server 2003 on for now.  As it turned out, this came back to bite me later.  After installing all of the usual things (service packs, hotfixes, and so forth), I went to grab the latest VMware Server installer so that I could put the box to work running my library of VMs.  Everything seemed to be going okay at the start, until I began to do things that were a bit outside the box, so to speak.

Here, I wanted to have my XP x64 box route through a VM running on the same computer.  Why on earth would I possibly want to do that, you ask?  Well, I have an internal VPN-based network that overlays the office network here at work and connects all of the VMs I have running on various boxes at the office.  I wanted to interconnect all of those VMs with various services (in particular, lots and lots of storage space) running on the beefy x64 box over this trusted VPN network instead of the public office network (which, for testing purposes, I have designated the untrusted Internet network).  If the x64 box routes through something that is connected to the entire overlay network, then I don’t need to worry about creating connections to every single other VM in existence to grant access to those resources.  (At this point, our x64 support is still in beta, and XP doesn’t have a whole lot of useful support for dedicated VPN links.)

Anyways, things started to get weird when I finally got this setup going.  The first problem I ran into was that, on some boots, all of the VMs I had configured to autostart would appear to hang on startup – I would have to go to Task Manager and kill the vmware-vmx.exe processes, then restart the vmserverdWin32 service before I could get them to come up properly.  After a bit of poking around, I noticed a suspicious Application event log entry that seemed to correlate with the boots where this problem happened:

Event Type: Information
Event Source: VMware Registration Service
Event Category: None
Event ID: 1000
Date:  6/13/2006
Time:  2:10:06 PM
User:  N/A
Computer: MOGHEDIEN
Description:
vmserverdWin32 took too much time to initialize.

Hmm… that doesn’t look good.  Digging a bit deeper, it turns out that VMware Server has several different service components, and there are dependencies between them.  However, the VMware Server developers neglected to properly declare those dependencies to the service control manager; instead, they appear to have just let the services start in whatever order, with a timeout window during which the services are supposed to establish communication with each other.  Unfortunately, this tends to randomly break on some configurations (like mine, apparently).

Fortunately, the fix for this problem turned out to be fairly easy.  Using sc.exe, the command line service configuration app (which used to ship with the SDK, but now ships with Windows XP and later – a handy tool to remember), I made the main VMware launcher service (“vmserverdWin32”) depend on the VMware authorization service (“VMAuthdService”).  Note that the depend= argument replaces the entire dependency list, so the existing RPCSS dependency has to be restated, with a forward slash separating the entries:

C:\Documents and Settings\Administrator>sc config vmserverdWin32 depend= RPCSS/VMAuthdService
[SC] ChangeServiceConfig SUCCESS
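
If you’d rather make the same change programmatically – say, from an install script – the SCM API can do it directly.  Below is a minimal sketch in C, under the same assumptions as above (the VMware service names, and that RPCSS should stay on the dependency list); note that ChangeServiceConfig takes the dependency list as a double-null-terminated string:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    // Double-null-terminated list of dependencies; the string literal's
    // implicit terminator supplies the second null at the end.
    static const char Depends[] = "RPCSS\0VMAuthdService\0";
    SC_HANDLE Scm, Svc;

    Scm = OpenSCManagerA(NULL, NULL, SC_MANAGER_CONNECT);
    if (!Scm)
    {
        printf("OpenSCManager failed: %lu\n", GetLastError());
        return 1;
    }

    Svc = OpenServiceA(Scm, "vmserverdWin32", SERVICE_CHANGE_CONFIG);
    if (!Svc)
    {
        printf("OpenService failed: %lu\n", GetLastError());
        CloseServiceHandle(Scm);
        return 1;
    }

    // SERVICE_NO_CHANGE / NULL leave every other setting alone; only the
    // dependency list is replaced.
    if (!ChangeServiceConfigA(Svc, SERVICE_NO_CHANGE, SERVICE_NO_CHANGE,
                              SERVICE_NO_CHANGE, NULL, NULL, NULL,
                              Depends, NULL, NULL, NULL))
        printf("ChangeServiceConfig failed: %lu\n", GetLastError());
    else
        printf("Dependency list updated.\n");

    CloseServiceHandle(Svc);
    CloseServiceHandle(Scm);
    return 0;
}

Either way, `sc qc vmserverdWin32’ will show the resulting dependency list, so you can verify that the change took.
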
After fixing the service dependencies, everything seemed to be okay, but of course, that wasn’t really the case…

When I went home later that day, I decided to VPN into the office and RDP into my new development box in order to change some hardware settings on one of my VMs.  In this particular case, some of the VPN traffic from my apartment to the development box at the office happened to pass through that router VM running on the development box.  Whenever I tried to RDP into the development box, the connection would freeze; after I entered valid logon credentials at the winlogon prompt, the RDP session would hang until TCP gave up and broke off the connection.  This happened every single time I tried to RDP into my new box, but the office connection itself was fine (I could still connect to other things at the office while this was happening).  Definitely not cool.

So, I opened a session on our development server at the office and decided to try an experiment – ping my new dev box from it while I tried to RDP in.  The initial results were not at all what I expected; my dev box responded to pings from the office the whole time, even while it was apparently unreachable over RDP and the TCP connection was timing out.  The next time I tried RDPing in, I also ran a ping from my home box to my dev box.  Those pings were dropped while the RDP connection to the console session was stalled after providing valid logon credentials – and yet the box still responded to pings from the other box at the office.

After poking around a bit more, I determined that every single VM on my brand new dev box would freeze and stop responding whenever I tried to RDP into my dev box from home (but not from the office).  To make matters even stranger, I could connect to a different box at the office and bounce RDP through that box to my new dev server, and things would work fine.  Well, that sucks – what’s going on here?  A partial explanation stems from how exactly I had set up the routing on my new dev box: the default gateway was set to the router VM (running on that box) via one of the VMnet virtual NICs, but I had left the physical NIC on the box still bound to TCP/IP (without a default gateway set, however).  So, traffic destined for the office subnet has no need to traverse the router VM – but traffic to and from the VPN connection to my home is routed through the router VM.
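
To make the split concrete, the effective routing looked something like this (a simplified sketch; the addresses here are made up purely for illustration):

Destination              Gateway                     Interface
0.0.0.0/0 (default)      192.168.77.1 (router VM)    VMnet virtual NIC
10.0.0.0/24 (office)     directly connected          physical NIC

Traffic destined for the office subnet goes straight out the physical NIC; everything else – including my VPN traffic from home – has to pass through the router VM first.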

Given this information, it seemed that I had at least found why the problem was happening, on some level: whenever I try to RDP into my new dev box over the VPN, all of the VMs on the box freeze – including the router VM.  Because traffic from the VPN to my new dev box is routed through that (now-hung) router VM, the RDP connection stalls and times out.

At this point, I had to turn to a debugger to understand what was going on.  Popping open the vmware-vmx.exe process corresponding to the router VM in the debugger and comparing call stacks between when it was running normally and when it was frozen (while I was trying to RDP in) pointed to the thread that emulates the virtual CPU becoming blocked on an internal NtUser call into win32k.sys.  I couldn’t really get much further without resorting to a kernel debugger, so that became my next step.
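
(For reference, the comparison itself doesn’t require anything fancy; with WinDbg attached to the vmware-vmx.exe process, something along these lines is enough to snapshot every thread’s call stack in both the healthy and frozen states:

~*kv

Here `~*’ applies the command to all threads and `kv’ dumps each thread’s stack; diffing the two snapshots makes the stuck virtual CPU thread stand out.)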

With the help of kd, I was able to track the problem down a bit further; the VMware CPU simulator thread was blocked acquiring the global win32k NtUser lock that almost all NtUser calls acquire at the start of their implementation.  With the `!locks’ command, I was able to track down the owner of the lock – which happened to be (surprise!) a Terminal Server thread in CSRSS for the console session.  That thread was waiting on a kernel event, which turns out to be signalled when the RDP TCP transport driver receives data from the network.  So, we have a classic deadlock: the router VM is blocked on win32k’s internal NtUser global lock, and a CSRSS thread is holding that lock while waiting on network I/O (from the RDP client).  Because the RDP client (me at home, connecting through the VPN) needs to route traffic through the router VM to reach the RDP TCP transport on my new dev box, everything appears to freeze until the TCP connection times out.
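
To make the shape of the problem concrete, here is a minimal user-mode sketch of the same deadlock pattern.  To be clear, none of this is win32k’s actual code – a critical section stands in for the global NtUser lock, an event stands in for the RDP transport’s “data arrived” signal, and two threads stand in for the CSRSS thread and the VM’s virtual CPU thread:

#include <windows.h>
#include <stdio.h>

CRITICAL_SECTION g_UserLock;  // stand-in for win32k's global NtUser lock
HANDLE g_DataArrived;         // stand-in for the RDP transport's data event

// Stand-in for the CSRSS Terminal Server thread: it acquires the lock and
// then waits for network data before releasing it.
DWORD WINAPI CsrssLikeThread(LPVOID Unused)
{
    EnterCriticalSection(&g_UserLock);
    WaitForSingleObject(g_DataArrived, INFINITE); // blocked on "network I/O"
    LeaveCriticalSection(&g_UserLock);
    return 0;
}

// Stand-in for the vmware-vmx virtual CPU thread: it makes an "NtUser call"
// that tries to acquire the same lock.  In the real scenario, this thread is
// also what drives the router VM, so the network data that would release the
// first thread can never arrive while it is stuck here.
DWORD WINAPI VcpuLikeThread(LPVOID Unused)
{
    EnterCriticalSection(&g_UserLock); // never returns while deadlocked
    LeaveCriticalSection(&g_UserLock);
    return 0;
}

int main(void)
{
    HANDLE Threads[2];

    InitializeCriticalSection(&g_UserLock);
    g_DataArrived = CreateEventA(NULL, FALSE, FALSE, NULL);

    Threads[0] = CreateThread(NULL, 0, CsrssLikeThread, NULL, 0, NULL);
    Sleep(100); // give the first thread time to grab the lock
    Threads[1] = CreateThread(NULL, 0, VcpuLikeThread, NULL, 0, NULL);

    // Neither thread can make progress; in the real scenario, the cycle is
    // only broken when the RDP TCP connection finally times out.
    if (WaitForMultipleObjects(2, Threads, TRUE, 5000) == WAIT_TIMEOUT)
        printf("Still deadlocked after 5 seconds, as expected.\n");

    return 0;
}

Strictly speaking, neither the sketch nor the real case is a permanent deadlock – a timeout (here WaitForMultipleObjects’, there TCP’s) is what eventually breaks the cycle.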

 Unfortunately, there isn’t really a very good solution to this problem.  Installing Windows Server 2003 would have helped, in my case, because then VMware Server and its services would be running on session 0, and RDP connections would be diverted to new Terminal Server sessions (with their own per-session-instanced win32k NtUser locks), thus avoiding the deadlock (unless you happened to connect to Terminal Server using the `/console’ option).
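
As a point of reference, the console-session behavior is exactly what the RDP client’s /console switch asks for explicitly.  On Windows Server 2003, a plain connection like the first line below gets a fresh Terminal Server session and sidesteps the problem, while the second deliberately attaches to the console session and would bring the deadlock right back (“mydevbox” is just a placeholder name here):

mstsc /v:mydevbox
mstsc /v:mydevbox /console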

So there you have it – why VMware Server and RDP can sometimes make a bad mix.  This is a real shame, too, because RDPing into a box and running the VMware Server console client “locally” is sooo superior to running it over the network (it updates *much* faster, even over a LAN).

If you’re interested, I did a writeup of most of the technical details of the actual debugging (with WinDbg and kd) of this problem, which you can look at here if you want to see some of the steps I took in the debugger to analyze the problem further.

In the future, I’ll try not to gloss over the debugger steps so much in blog posts; this time, I had already written the writeup beforehand, and didn’t want to just reformat the whole thing into an entire blog post.

Whew, that was a long second post – hopefully, future ones won’t be quite so long-winded (if you consider that a bad thing).  And hopefully they won’t be written at 1am just before I go to sleep, either…

9 Responses to “VMware Server and RDP don’t always play nicely together.”

  1. […] Be sure to read my earlier posting for an interoperability problem with RDP if you try to connect to a console session, as this problem may limit its usefulness if you do not use full Terminal Server (a reason to consider installing it on Windows Server 2003 instead of Windows XP). […]

  2. Tim Kent says:

    Thanks, I was wondering why my VMware Server machine was doing this!

  3. Jonathan says:

Thank you too! I figured it was something like that, but your explanation explains things that I wouldn’t have been able to…

  4. A.Lizard says:

Try VNC instead, for accessing the BUILT IN VMware Server ***VNC server***. Tell it to connect to [IP]:5900. I’d enclose a screenshot if there were any place to put one here; I tried it from Remote Desktop Viewer 0.5.1 (GNOME GUI for rdesktop) and I’ve got a W98SE guest onscreen right now. Note that a VNC server on the guest VM isn’t necessary… just as well, because I didn’t install any.

I don’t know if you can run different ports for different VMware Server guests; check the docs for VMware Server.

BTW, RDP works great for accessing the built-in RDP server in VirtualBox. For that, if you want to access different guests from separate sessions, assign each guest VM a different host number. That’s how I RDP into my Windows XP guest VM.

    Both VMware Server and VirtualBox are hosted on my Linux workstation right now. However, this should work on a Windows host. Have fun and good luck.

  5. Vima says:

    Thank you BIIG
    Thanks, I was wondering why my VMware Server machine was doing this
    regards

  6. Mux says:

Thank you, I’m experiencing the same issue, and now I at least know what is behind all this.

My environment is somewhat bigger: a juicy quad-core workstation running ESX4, with a Linux guest using Openswan for IPsec and routing to my workplace network, some Windows servers, and an XP x86 VM for downloads.

Said XP VM “bluescreens” about 80% of the time when I RDP into it from my local network at home, when the session was previously disconnected from my workplace computer. No problems at all connecting to a disconnected session from my workplace, only the other way around.

I found a “workaround” for it, namely accessing the console screen via ESX Manager, unlocking the disconnected session, and then RDPing into it. Quite annoying, though.

  7. Mux says:

Oh, and to supplement my previous post:
    I just checked for the same issue on my workplace’s ESXi4 server.
    (The machine above is my private one.)

I’ve also got an XP x86 VM there, dying all over me when trying to connect from home (to a previously disconnected session that was RDP’d to from the local workplace LAN).
    But who wants to work from home anyway… ;)

  8. Rei says:

Thank you!!! You saved me from reinstalling the new VM Server.
    My VM stopped working for no apparent reason last week, after 2 years. I have no idea what happened. I ran your command on the command line and it works!!

  9. Michael says:

    Thanks for this :) And I like the use of the names of Aes Sedai and Forsaken in your environment :)