Analysis of a networking problem: The case of the mysterious SMB connection resets (or “How to not design a network protocol”)

Recently, I had the unpleasant task of troubleshooting a particularly strange problem at work, in which a particular SMB-based file server would disconnect users if more than one user attempted to simultaneously initiate a file transfer. This would typically show up as a file transfer (i.e. a drag and drop file copy) initiated by Explorer failing with a “The specified network name is no longer available.” (error #64) dialog box. As a VPN/remote access company, we had a fair amount of our own custom code sitting in the path of this SMB traffic, and the problem was seemingly only occurring when our VPN software was in use. We therefore (naturally) assumed that the issue was caused by some sort of bug in our software that was doing something to upset either the Windows SMB client or the SMB server.

So, first things first: after doing some initial troubleshooting, we obtained packet captures from the problematic SMB server (and SMB clients), detailing just what was happening on the network when the problem occurred.

Unfortunately, in this particular case, the packet captures did not really end up providing a whole lot of useful information; they did, however, succeed in raising more questions. What we observed is that in the middle of an SMB datastream, the SMB server would just mysteriously send a TCP RST packet, thereby forcibly closing the TCP connection on which SMB was running. This corresponded exactly with one of the file share clients getting the dreaded error #64 dialog, but there was no clue as to what the problem was. In this particular case, there was no packet loss to speak of, and nothing else to indicate some kind of connectivity problem; the SMB server just simply sent an RST seemingly out of the blue to one of the SMB clients shortly after a different SMB client attempted to initiate a file transfer.
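For readers who want to pick resets out of their own captures, here is a minimal sketch of how an RST segment can be recognized from a raw TCP header. It relies only on the standard TCP header layout (the flags byte sits at offset 13, and RST is bit 0x04). The `build_header` helper is hypothetical and exists only to fabricate headers for the demonstration; it is not how our captures were taken.

```python
import struct

TCP_RST = 0x04  # RST bit within the TCP flags byte
TCP_ACK = 0x10  # ACK bit within the TCP flags byte


def is_reset(tcp_header: bytes) -> bool:
    """Return True if the raw TCP header has the RST flag set."""
    if len(tcp_header) < 20:
        raise ValueError("truncated TCP header")
    flags = tcp_header[13]  # byte 13 carries the CWR..FIN flag bits
    return bool(flags & TCP_RST)


def build_header(src_port: int, dst_port: int, flags: int) -> bytes:
    """Hypothetical demo helper: a 20-byte TCP header with zeroed
    sequence/ack numbers, window, checksum, and urgent pointer."""
    data_offset = 5 << 4  # 5 32-bit words, no options
    return struct.pack("!HHIIBBHHH", src_port, dst_port,
                       0, 0, data_offset, flags, 0, 0, 0)


# A reset from the server's SMB port, and an ordinary ACK for contrast.
reset_segment = build_header(445, 50000, TCP_RST | TCP_ACK)
normal_segment = build_header(445, 50000, TCP_ACK)
print(is_reset(reset_segment), is_reset(normal_segment))
```

In practice a capture tool's display filter does the same job; the point is only that the reset is an explicit flag in the segment, not something inferred from silence.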

To make matters worse, there was no correlation at all as to what the SMB client whose connection got killed was doing when the connection got reset. The client could be either submitting a read request for more data, waiting for a previously sent read request to finish processing, or doing any other operation; the SMB server would just mysteriously close the connection.

In this particular case, the problem would also only occur when SMB was used in conjunction with our VPN software. When the SMB server was accessed over the LAN, the SMB connection would operate fine in the presence of multiple concurrent users. Additionally, when the SMB server was used in conjunction with alternative remote access methods other than our standard VPN system, the problem would mysteriously vanish.

By this time, this problem was starting to look like a real nightmare. The information we had said that there was some kind of problem that was preventing SMB from being used with our VPN software (which obviously would need to be fixed, and quickly), and yet gave no actual leads as to what might cause the problem to occur. According to logs and packet captures, the SMB server would just arbitrarily reset the connections of users who reached it through our VPN software.

Fortunately, we did eventually manage to duplicate the problem in-house with our own internal test network. This eventually turned out to be key to solving the problem, but did not (at least initially) provide any new information. While it did save us from having to bother the customer originally affected by this problem while troubleshooting, it did not immediately get us closer to understanding the problem. In the meantime, our industrious quality assurance group had engaged Microsoft in an attempt to see if there was any sort of known issue with SMB that might possibly explain this problem.

After spending a significant amount of time doing exhaustive code reviews of all of our code in the affected network path, and banging our collective heads on the wall while trying to understand just what might be causing the SMB server to kill off users of our VPN software, we eventually ended up hooking up a kernel debugger to an SMB server machine exhibiting this problem, so that I could see whether debugging the SMB server itself (which is a kernel mode driver known as srv.sys) would turn up anything useful. After not getting anywhere with that initially, I decided to try to trace the problem from the root of its observable effects: the reset TCP connection. Through the contact that our quality assurance group had made with Microsoft Product Support Services (PSS), Microsoft had supplied us with a couple of hotfixes for tcpip.sys (the Windows TCP stack) for various issues that ultimately turned out to be unrelated to the underlying trouble with SMB. Although those hotfixes did not resolve our problem, we decided to take a closer look at what was happening inside the TCP state machine when the SMB connections were being reset.

This turned out to hit the metaphorical jackpot. I had set a breakpoint on every function in tcpip.sys that is responsible for flagging TCP connections for reset, and the call stack caught the SMB server (srv.sys) red-handed:

kd> k
ChildEBP RetAddr  
f4a4fa98 f4cad286 tcpip!SendRSTFromTCB
f4a4fac0 f4cb0ee3 tcpip!CloseTCB+0xbc
f4a4fad0 f4cb0ec7 tcpip!TryToCloseTCB+0x38
f4a4faf4 f4cace69 tcpip!TdiDisconnect+0x205
f4a4fb40 f4cabbed tcpip!TCPDisconnect+0xfd
f4a4fb5c 804e37f7 tcpip!TCPDispatchInternalDeviceControl+0x14d
f4a4fb6c f4c7d132 nt!IopfCallDriver+0x31
f4a4fb84 f4c7cd93 netbt!TdiDisconnect+0x10a
f4a4fbb4 f4c7d0b8 netbt!TcpDisconnect+0x40
f4a4fbd4 f4c7d017 netbt!DisconnectLower+0x42
f4a4fc14 f4c7d11f netbt!NbtDisconnect+0x339
f4a4fc44 f4c7ba5a netbt!NTDisconnect+0x4b
f4a4fc60 804e37f7 netbt!NbtDispatchInternalCtrl+0xb4
f4a4fc70 f43fb9c1 nt!IopfCallDriver+0x31
f4a4fc7c f441e8db srv!StartIoAndWait+0x1b
f4a4fcb0 f441f13e srv!SrvIssueDisconnectRequest+0x4d
f4a4fccc f43eed3e srv!SrvDoDisconnect+0x18
f4a4fce4 f440d3b5 srv!SrvCloseConnection+0xec
f4a4fd18 f44007ae srv!SrvCloseConnectionsFromClient+0x163
f4a4fd88 f43fba98 srv!BlockingSessionSetupAndX+0x274

As it turns out, the SMB server was explicitly disconnecting the pre-existing SMB client when the second SMB client tried to set up a session with the SMB server. This explains why the pre-existing SMB client would mysteriously have its connection reset for apparently no good reason, even though it was operating normally, not breaking the SMB protocol, and running with no packet loss.

Further analysis of the packet captures revealed that there was always a correlation between one client sending an SMB_SESSION_SETUP_ANDX command to the SMB server and all other clients on the SMB server being abortively disconnected. After digging around a bit more in srv.sys, it became clear that the SMB server would explicitly kill all SMB connections from an IPv4 address that was opening a new SMB session (via SMB_SESSION_SETUP_ANDX), except for the new SMB TCP connection itself. It turns out that this behavior is engaged if the SMB client sets the “VcNumber” field of the SMB_SESSION_SETUP_ANDX request to zero (which the Windows SMB redirector, mrxsmb.sys, does) and the client enables extended security on the connection (this is typically true for modern Windows versions, such as Windows XP and Windows Server 2003).
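To make the rule concrete, here is a toy model of the behavior as I understand it from the debugger and the captures. This is only a sketch and not the actual srv.sys logic: the `ToySmbServer` class, its method names, and the connection ids are all hypothetical, and the only facts taken from the investigation are the three conditions (same source IPv4 address, zero VcNumber, extended security).

```python
from dataclasses import dataclass, field


@dataclass
class ToySmbServer:
    # Hypothetical bookkeeping: source IPv4 address -> open connection ids.
    connections: dict = field(default_factory=dict)

    def accept(self, client_ip: str, conn_id: int) -> None:
        """Record a new TCP connection from client_ip."""
        self.connections.setdefault(client_ip, set()).add(conn_id)

    def session_setup(self, client_ip: str, conn_id: int,
                      vc_number: int, extended_security: bool) -> list:
        """Handle a session setup; return the connection ids that get reset."""
        # The cleanup path is only taken for VcNumber == 0 with
        # extended security negotiated.
        if vc_number != 0 or not extended_security:
            return []
        peers = self.connections.get(client_ip, set())
        # Every other connection from the same source address is killed;
        # only the connection carrying the new session survives.
        victims = sorted(c for c in peers if c != conn_id)
        peers.difference_update(victims)
        return victims


server = ToySmbServer()
server.accept("10.0.0.1", 1)   # existing transfer in progress
server.accept("10.0.0.1", 2)   # second connection from the same address
server.accept("10.0.0.2", 3)   # unrelated client on a different address
killed = server.session_setup("10.0.0.1", 2, vc_number=0,
                              extended_security=True)
print(killed)  # [1]
```

Note that the server in this model has no way to tell whether connections 1 and 2 belong to the same machine; the source address is the only identity it uses.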

This explains the problem that we were seeing. In one of the configurations, this customer was set up to have remote clients NAT’d behind a single LAN IP address. This, when combined with the SMB redirector sending a zero VcNumber field, resulted in the SMB server killing everyone else’s SMB connections (behind the NAT) when a new remote client opened an SMB connection through the NAT. This also fit with another piece of information that we eventually uncovered from running into this trouble with customers: one customer in particular had an old Windows NT 4.0 fileserver which always worked perfectly over the VPN, but their newer Windows Server 2003 boxes would tend to experience these random disconnects. This was because NT4 doesn’t support the extended security options in the SMB protocol that Windows Server 2003 does, further lending credence to this particular theory. (It turns out that the code path that disconnects users for having a zero VcNumber is only active when extended security is being negotiated on an SMB session.)
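The NAT half of the problem can be sketched in a few lines as well. The following is a hypothetical illustration, not our VPN code: `translate` is a made-up round-robin mapping from internal clients to external addresses, and all addresses are invented. It shows that with a single translated address every client looks identical to the server, whereas a pool of addresses (an assumption used here only for contrast) keeps them distinguishable.

```python
def translate(clients, nat_pool):
    """Hypothetical NAT: assign each internal client an external address,
    round-robin over the available pool."""
    return {client: nat_pool[i % len(nat_pool)]
            for i, client in enumerate(clients)}


def clients_sharing_an_address(mapping):
    """Group clients by the address the server sees; keep only the
    addresses that more than one client hides behind."""
    groups = {}
    for client, external in mapping.items():
        groups.setdefault(external, []).append(client)
    return {ext: members for ext, members in groups.items()
            if len(members) > 1}


remote_clients = ["172.16.0.10", "172.16.0.11", "172.16.0.12"]

# Single-address (overload) NAT: all three collapse onto one source IP.
overload = translate(remote_clients, ["192.168.0.254"])

# Pool of three addresses: each client keeps a distinct source IP.
pooled = translate(remote_clients,
                   ["192.168.0.251", "192.168.0.252", "192.168.0.253"])

print(clients_sharing_an_address(overload))
print(clients_sharing_an_address(pooled))
```

Any clients that end up in the same group are exactly the ones whose connections the server would tear down when one of them opens a new session with a zero VcNumber.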

Additionally, someone else at work managed to dig up knowledge base article 301673, titled “You cannot make more than one client connection over a NAT device”. Evidently, the problem is actually documented (and has been so since Windows 2000), though the listed workaround of using NetBIOS over TCP doesn’t seem to actually work. (As an aside, it’s pretty silly that of all things, Microsoft is recommending that people fall back to NetBIOS. Haven’t we been trying to get rid of NetBIOS for years and years now?)

Looking around a bit on Google for documentation about this particular problem, I ran into this article about SMB/CIFS which documented the shortcoming relating to SMB and NAT. Specifically:

Whenever a new transport-layer connection is created, the client is supposed to assign a new VC number. Note that the VcNumber on the initial connection is expected to be zero to indicate that the client is starting from scratch and is creating a new logical session. If an additional VC is given a VcNumber of zero, the server may assume that any existing connections with that same client are now bogus, and shut them down.

Why do such a thing?

The explanation given in the LANMAN documentation, the Leach/Naik IETF draft, and the SNIA doc is that clients may crash and reboot without first closing their connections. The zero VcNumber is the client’s signal to the server to clean up old connections. Reasonable or not, that’s the logic behind it. Unfortunately, it turns out that there are some annoying side-effects that result from this behavior. It is possible, for example, for one rogue application to completely disrupt SMB filesharing on a system simply by sending Session Setup requests with a zero VcNumber. Connecting to a server through a NAT (Network Address Translation) gateway is also problematic, since the NAT makes multiple clients appear to be a single client by placing them all behind the same IP address.

So, at least we (finally) found out just what was causing the mysterious resets. Apparently, from a protocol design standpoint, SMB is deliberately incompatible with NAT.

As someone who has designed network protocols before, this just completely blows my mind. I cannot think of any good reason to possibly justify breaking NAT, especially with the incredible proliferation of NAT devices in everyday life (it seems like practically everyone has a cable/DSL/wireless router type device that does NAT today, and the pressure to reuse IP addresses only increases as the limited IPv4 address space fills up). On top of that, the Windows file sharing protocol has to be one of the most widely used networking protocols in the world. Breaking NAT for that (evidently just for the sake of breaking NAT!) just seems like such an incredibly horrible, not-well-thought-out design decision. Normally, I am fairly impressed by how well Microsoft designs their software (particularly their kernel software), but this particular little part of the SMB protocol design is just uncharacteristically, completely wrong.

Even more unbelievably, the stated reason that Microsoft gave for this behavior is an optimization to handle the case where the user’s computer has crashed, rebooted, and reconnected to the server before the SMB server’s TCP stack has noticed that the crashed SMB client’s TCP connection has gone away. Evidently, conserving server resources in the case of a client crash is more important than being compatible with NAT. One has to wonder how unstable the Microsoft SMB redirector must have been at the time that this “feature” was added to the SMB protocol to make anyone in their right mind consider such an absolutely ridiculous, mind-bogglingly bad tradeoff.

To date, we haven’t had a whole lot of luck in trying to get Microsoft to fix this “minor” problem in SMB. I’ll post an update if we ever succeed, but as of now, things are unfortunately not looking very promising on that front.

19 Responses to “Analysis of a networking problem: The case of the mysterious SMB connection resets (or “How to not design a network protocol”)”

  1. Michael Kohne says:

    I know little or nothing about this particular behaviour, but if it goes as far back as Windows for Workgroups, it might actually have been the right behaviour in the presence of a LAN-only environment, with limited or no internet connectivity and no NAT worth worrying about.

    I’m thinking that this (like many unfortunate problems in MS protocols) is a holdover from times when it made sense, and the universe has over-run it.

    Just a thought.

  2. dispensa says:

    Did we ever decide if this was a Terminal Server DoS waiting to happen? Seems like it is; all you’d need is a rogue smbclient to drop everyone from all server connections.

  3. […] Ken has a good write-up of a feature we found while working on a customer’s SMB networking issue: SMB is incompatible with (overload) NAT. […]

  4. igorsk says:

    Um, why do you get the notion that they’re *deliberately* breaking NAT? To me it seems to be just a leftover compatibility function from the days of LanMan. Back then probably no one even imagined such a situation.
    That aside, a nice detective story :)

  5. Holy macaroni! I agree that Microsoft usually does a good job of designing things well but this is just purely bad design.
    I doubt you’ll have any luck convincing them to fix the bug taking into consideration the issue has been known for a very long time…

    I’m particularly worried about the obvious DoS attack that results from sending a zeroed VcNumber field.

    Seriously, what were they thinking…

  6. Skywing says:

    Well, here’s the thing: As far as I can tell, this particular code path is only taken if extended security is enabled on the client. So, *old clients* typically tend to work fine through NATs (as do *old servers*). It’s just when you start throwing post-NT4 into the mix that NAT starts breaking down.

  7. Skywing says:

    As far as I know, the Terminal Server scenario isn’t a DoS. This is because there is only one local SMB redirector, and it is smart enough to not try and get the other sessions to the remote server booted off while establishing new sessions.

    There is certainly a DoS potential when you are behind a shared NAT with somebody else. At least in the SMB over TCP case, though, you do need to be able to send data from the same source IP (and with the right sequence number for TCP), which limits the DoS scope to people on your local LAN segment (if you’re on a hub), or otherwise just to compromised routers.

  8. dispensa says:

    The TS case is a DoS as well – all you have to do is compile a specially hacked smbclient from the Samba tree and run it as a good old-fashioned non-privileged user. Run it w/ a 0 Vc number and bang! everyone is dead.

  9. Skywing says:

    That is true. I was speaking more from a standpoint of just using drive shares to the same server with the default redirector won’t break other people on the same box.

  10. […] Connections suddenly quit working. While there are sometimes other explanations for this, a big cause of this problem is dropped datagrams due to MTU. TCP (usually) sets the DF bit on outgoing segments. If a router receives a segment that’s too big for the egress interface, and that segment has the DF bit set, it has no choice but to drop it. It then sends back its PMTU ICMP message, but the firewall drops that message, so the sender simply thinks that the receiver has dropped off the face of the earth. Retransmissions happen and eventually the sender gives up. This generally happens with the first big packets in a session, such as (for example) when a terminal services client is painting the desktop. […]

  11. Seán says:

    I’d be extremely interested if anyone has anything new to add to this issue. I am actually EXTREMELY interested in this as I’ve got a few users at an office in a different city and they’ll need to connect to our Win2003 server.

  12. Skywing says:

    If you don’t NAT your cross-site traffic, then you shouldn’t run into this issue. That is my recommendation for avoiding the problem (don’t NAT SMB traffic).

    From what we have heard from Microsoft, they are not planning on changing this “feature”.

  13. Tim says:

    It’s a nuisance that SMB doesn’t work through NAT, but hardly surprising. There are lots of protocols that don’t work through NAT – FTP being a prime example – NAT software has to be specially built to allow FTP to work properly through it. It’s only because FTP is so widely used on the Internet that most NAT implementations are built with special support for it. As with FTP, SMB was probably devised before NAT was even heard of.
    It seems to me that you went through a lot of un-necessary diagnosis phases here. Personally, I would immediately have suspected a NAT issue, and directed my attentions there.

  14. Skywing says:

    The reason we did not immediately suspect the NAT being the problem is that we (or rather our customers) had been using SMB through NAT for a long, long time (on the order of a year or more) without reporting any issues. IOW, it had been “just working” for a long time, presumably either with users just writing off occasional disconnects as Internet flakiness or concurrent access to the same file server via the NAT not being high enough for it to become a real problem.

    If we had recently introduced a NAT into the equation, that would have certainly been a higher thing on the list of suspects, but you have to understand that configurations like this had been in use for a very long time before we had a report of this issue which got to this stage of troubleshooting.

    We have run into plenty of other protocols that break with NAT, but most of these have historically broken right away with the NAT and not just appeared to have been limping along for a year or more before somebody complained about it. (In our case, this is probably also related to the fact that if there are multiple SMB servers being accessed, as long as users are using different SMB servers, they won’t run into the issue, whereas many other NAT-unfriendly protocols that we have run into typically had one box that everyone would use as opposed to several file share servers. As a result, with the configurations we typically saw, the SMB/NAT problem was often a bit less obvious than one might imagine at first glance.)

  15. Ilya says:

    I’ve been fighting with a similar problem for some time. We had two computers communicating over NAT for many months without any issues, but when we tried adding one more client we encountered the problem. Glad I’ve found this article! Now I understand what’s going on.

    So, the question to all the gurus: what are the workarounds? How can I access a shared drive from behind a NAT? I need a quick solution. And why do you say Microsoft’s proposed workaround (disabling direct hosting) does not work? I do understand why it’s not desirable, but does it solve the problem?

    Also, in the same 301673 KB Article Microsoft recommends WebDAV as a possible solution. Is it worth looking into? Thanks!

  16. FrancoCTX says:

    WebDAV is *extremely* slow and users will not accept it after using SMB shares!

  17. Eric says:

    I’ve been searching for a solution to this problem for a very long time. This article gave me an idea: I simply blocked port 445, the direct hosting port, in my NAT firewall.

    Now clients behind the NAT firewall automatically fall back to NetBIOS over TCP/IP which seems to work ok through NAT unlike Direct Hosting.

    From my initial tests it seems to be working and is not causing any additional issues.

  18. Phil C says:

    You can Disable Direct hosting on the Server or Client per KB301673 or use Pool NAT to provide distinguishable connections for the SMB server.

  19. spb_nick says:

    to FrancoCTX: WebDAV is not extremely slow, it is the Microsoft implementation that is extremely buggy.