Weblog entry #59 for dkg

TCP weirdness, IMAP, wireshark, and perdition
Posted by dkg on Thu 21 Jan 2010 at 19:37
This is the story of a weirdly unfriendly/non-compliant IMAP server, and some nice interactions that arose from a debugging session around it.

Over the holidays, i got to do some computer/network debugging for friends and family. One old friend (I'll call him Fred) had a series of problems i managed to help work through, but one ultimately stumped me: the weird behavior of an IMAP server. Here are the details (names of the innocent and guilty have been changed), just in case they help other folks at least diagnose similar situations.

the diagnosis

The initial symptom was that Fred's computer was "very slow". Sadly, this was a Windows™ machine, so my list of tricks for diagnosing sluggishness is limited. I went through a series of questions, uninstalling things, etc, until we figured it would be better to just have him do his usual work while i watched, kibitzing on what seemed acceptable and what seemed slow. Quite soon, we hit a very specific failure: Fred's Thunderbird installation (version 2, FWIW) was sometimes hanging for a very long period of time during message retrieval. This was not exhaustion of the CPU, disk, RAM, or other local resource. It was pure network delay, and it was a frequent (if unpredictable) frustrating hiccup in his workflow.

One thought i had was Thunderbird's per-server max_cached_connections setting, which can sometimes cause a TB instance to hang if a remote server thinks Thunderbird is being too aggressive. After sorting out why Thunderbird was resetting the values after we'd set them to 0 (grr, thanks for the confusing UI, folks!), we set it to 1, but still had the same occasional, lengthy (about 2 minutes) hang when transferring messages between folders (including the trash folder!), or when reading new messages. Sending mail was quite fast, except for occasional (similarly lengthy) hangs writing the copy to the sent folder. So IMAP was the problem (not SMTP), and the 2-minute timeouts smelled like an issue with the networking layer to me.

At this point, i busted out wireshark, the trusty packet sniffer, which fortunately works as well on Windows as it does on GNU/Linux. Since Fred was doing his IMAP traffic in the clear, i could actually see when and where in the IMAP session the hang was happening. (BTW, Fred's IMAP traffic is no longer in the clear: after all this happened, i switched him to IMAPS (IMAP wrapped in a TLS session), because although the IMAP server in question actually supports the STARTTLS directive, it fails to advertise it in response to the CAPABILITY query, so Thunderbird refuses to try it. arrgh.)

The basic sequence of Thunderbird's side of an initial IMAP conversation (using plain authentication, anyway) looks something like this:

1 capability
2 login "user" "pass"
3 lsub "" "*"
4 list "" "INBOX"
5 select "INBOX"
6 UID fetch 1:* (FLAGS)
What i found with this server was that if i issued commands 1 through 5, and then left the connection idle for over 5 minutes, then the next command (even if it was just a 6 NOOP or 6 LOGOUT) would cause the IMAP server to issue a TCP reset. No IMAP error message or anything, just a failure at the TCP level. But a nice, fast, responsive failure -- any IMAP client could recover nicely from that by just immediately opening a new connection. I don't mind busy servers killing inactive connections after a reasonable timeout. If it was just this, though, Thunderbird should have continued to be responsive.
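The clean-RST case described above can be sketched as a small shell probe. This is a hedged reconstruction, not the tooling from the original session: the rst_probe helper name is mine, and the server name and credentials are the same placeholders used later in this post.

```shell
# rst_probe: send IMAP commands 1-5, idle past the ~300-second window,
# then send one more command; on this server, that final write came
# back as an immediate TCP RST.  The idle time is a parameter so the
# function can also be exercised quickly.
rst_probe() {
  printf '1 capability\r\n'
  printf '2 login "user" "pass"\r\n'
  printf '3 lsub "" "*"\r\n'
  printf '4 list "" "INBOX"\r\n'
  printf '5 select "INBOX"\r\n'
  sleep "${1:-310}"     # idle just past the cutoff by default
  printf '6 NOOP\r\n'   # the write that elicits the reset
}
# Usage against a live server (not run here):
#   rst_probe | socat STDIO TCP4:imap.fubar.example.net:143
```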

the deep weirdness

But if i issued commands 1 through 6 in rapid succession (the only difference is that extra 6 UID fetch 1:* (FLAGS) command), and then let the connection idle for 5 minutes, then sent the next command: no response of any kind would come from the remote server (not even a TCP ACK or TCP RST). In this circumstance, my client OS's TCP stack would re-send the data repeatedly (staggered at appropriate intervals), until finally the client-side TCP timeout would trigger, and the OS would report the failure to the app, which could turn around and do a simple connection restart to finish up the desired operation. This was the underlying situation causing Fred's Thunderbird client to hang.

In both cases above (with or without the 6th command), the magic window for the idle cutoff was a little more than 300 seconds (5 minutes) of idleness. If the client issued a NOOP at 4 minutes, 45 seconds from the last NOOP, it could keep a connection active indefinitely.
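Given that cutoff, a client-side keepalive is easy to sketch in shell. The function name, tag numbering, and parameters below are illustrative (they were not part of the original debugging session); the 270-second interval just leaves some margin inside the ~300-second window.

```shell
# keepalive: emit $1 NOOP commands, one every $2 seconds, with
# incrementing IMAP tags, to keep an otherwise-idle session open.
keepalive() {
  count=$1 interval=$2 tag=10
  while [ "$count" -gt 0 ]; do
    printf '%d NOOP\r\n' "$tag"
    tag=$((tag + 1)) count=$((count - 1))
    sleep "$interval"
  done
}
# Usage against a live server (not run here):
#   keepalive 100 270 | socat STDIO TCP4:imap.fubar.example.net:143
```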

Furthermore, i could replicate the exact same behavior when i used IMAPS -- the state of the IMAP session itself was somehow modifying the TCP session behavior characteristics, whether it was wrapped in a TLS tunnel or not.

One interesting thing about this set of data is that it rules out most common problems in the network connectivity between the two machines. Since none of the hops between the two endpoints know anything about the IMAP state (especially under TLS), and some of the failures are reported properly (e.g. the TCP RST in the 5-command scenario), it's probably safe to say that the various routers, NAT devices, and such were not themselves responsible for the failures.

So what's going on on that IMAP server? The service itself does not announce the flavor of IMAP server, though it does respond to a successful login with You are so in, and to a logout with IMAP server logging out, mate. A bit of digging on the 'net suggests that they are running a perdition IMAP proxy. (clearly written by an Aussie, mate!) But why does it not advertise its STARTTLS capability, even though it is capable? And why do some idle connections end up timing out without so much as an RST, when other idle connections give at least a clean break at the TCP level?

Is there something about issuing the UID command that causes perdition to hand off the connection to some other service, which in turn doesn't do proper TCP error handling? I don't really know anything about the internals of perdition, so i'm just guessing here.

the workaround

I ultimately recommended to Fred that he keep the number of cached connections at 1, and set Thunderbird's interval for checking new mail down to 4 minutes. Hopefully, this will keep his one connection active enough that nothing will time out, and will keep the interference with his workflow to a minimum.
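For reference, both knobs also live in Thunderbird's prefs.js. A sketch of what the workaround looks like there, assuming the account in question is server1 (the serverN index varies per profile, so treat these lines as illustrative rather than copy-paste material):

```js
user_pref("mail.server.server1.max_cached_connections", 1);
user_pref("mail.server.server1.check_time", 4);
```

check_time is in minutes; both settings are also reachable through the account settings UI, which is how we actually changed them.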

It's an unsatisfactory solution to me, because the behavior of the remote server still seems so non-standard. However, i don't have any sort of control over the remote server, so there's not too much i can do to provide a real fix (other than point the server admins (and perdition developers?) at this writeup).

I don't even know what type of backend servers their perdition proxy is balancing between, so i'm pretty lost even for better diagnostics, let alone a real resolution.

some notes

I couldn't have figured out the exact details listed above just using Thunderbird on Windows. Fortunately, i had a machine with a decent OS available, and was able to cobble together a fake IMAP client from a couple files (imapstart contained the lines above, and imapfinish contained 8 LOGOUT), bash, and socat.

Here's the bash snippet i used as a fake IMAP client:

spoolout() {
    # feed each line of input to the server, CRLF-terminated,
    # pausing one second between lines
    while read foo; do sleep 1 && printf "%s\r\n" "$foo"; done
}

( sleep 2 && spoolout < imapstart && sleep 4 && spoolout < imapfinish && sleep 500 ) | socat STDIO TCP4:imap.fubar.example.net:143
To do the test under IMAPS, i just replaced TCP4:imap.fubar.example.net:143 with OPENSSL:imap.fubar.example.net:993.

And of course, i had wireshark handy on the GNU/Linux machine as well, so i could analyze the generated packets over there.

One thing to note about user empowerment: Fred isn't a tech geek, but he can be curious about the technology he relies on if the situation is right. He was with me through the whole process, didn't get antsy, and never tried to get me to "just fix it" while he did something else. I like that, and wish i got to have that kind of interaction more (though i certainly don't begrudge people the time if they do need to get other things done). I was nervous about breaking out wireshark and scaring him off with it, but it turned out it actually was a good conversation starter about what was actually happening on the network, and how IP and TCP traffic worked.

Giving a crash course like that in a quarter of an hour, i can't expect him to retain any concrete specifics, of course. But i think the process was useful in somewhat de-mystifying how computers talk to each other. It's not magic; there are just a lot of finicky pieces that need to fit together a certain way. And Wireshark turned out to be a really nice window into that process, especially when it displays packets during a real-time capture. I usually prefer to do packet captures with tcpdump and analyze them as a non-privileged user afterward for security reasons. But in this case, i felt the positives of user engagement (how often do you get to show someone how their machine actually works?) far outweighed the risks.

As an added bonus, it also helped Fred really understand what i meant when i said that it was a bad idea to use IMAP in the clear. He could actually see his username and password in the network traffic!
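That point is easy to demonstrate with a toy example. The traffic below is fabricated (the username, password, and sample_dump variable are made up for illustration), but in any ASCII rendering of a cleartext IMAP session -- wireshark's "Follow TCP Stream", or tcpdump -A -- the login line is one grep away.

```shell
# Fabricated sample of what a cleartext IMAP login looks like on the
# wire; the credentials, password included, are trivially extracted.
sample_dump='1 capability
2 login "fred" "hunter2"
3 lsub "" "*"'
printf '%s\n' "$sample_dump" | grep -i '^[0-9]* login'
# prints: 2 login "fred" "hunter2"
```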

This might be worth keeping in mind as an idea for a demonstration for workshops or hacklabs for folks who are curious about networking -- do a live packet capture of the local network, project it, and just start asking questions about it. Wireshark contains such a wealth of obscure packet dissectors (and today's heterogeneous public/open networks are so remarkably chatty and filled with weird stuff) that you're bound to run into things that most (or all!) people in the room don't know about, so it could be a good learning activity for groups of all skill levels.

 

Comments on this Entry

Posted by Anonymous (88.96.xx.xx) on Fri 22 Jan 2010 at 00:38
It sounds to me like there's a connection-tracking firewall in the middle. That doesn't really explain how the state of the IMAP session could affect TCP behaviour, but the timing of packets sent in each direction could be significant.

[ Parent | Reply to this comment ]

Posted by dkg (216.254.xx.xx) on Fri 22 Jan 2010 at 16:22
[ Send Message | View dkg's Scratchpad | View Weblogs ]
The bizarre thing is that i don't think it was the timing. It actually was the IMAP state.

If i'm remembering the debugging session correctly, I found i could string along a series of a half-dozen NOOPs and then still get the TCP RST on the first command after letting the session idle for 5 minutes, as long as i didn't do the UID fetch command. Once the UID fetch command was issued, then commands after the 5 minute idle window elapsed would just hang and never get a RST response, forcing the client's TCP stack to decide that the connection was dropped only after the local TCP retry timeout kicked in.

Maybe UID fetch caused the IMAP proxy to somehow transform or hand off the TCP session itself in ways that tickle some bugginess at that layer of the communications stack?

[ Parent | Reply to this comment ]

Posted by Anonymous (87.254.xx.xx) on Fri 22 Jan 2010 at 02:49
That's a perfectly correct client IMAP implementation...

[ Parent | Reply to this comment ]

Posted by stsimb (193.92.xx.xx) on Fri 22 Jan 2010 at 07:37
[ Send Message ]
From perdition(8):
If perdition is listening for TLS connections then the capability STLS for POP3 or STARTTLS for IMAP4 will be appended to the list of capabilities if it is not already present. Similarly these capabilities will be removed from the list of capabilities if they are present and perdition is not listening for TLS connections.

[ Parent | Reply to this comment ]

Posted by dkg (216.254.xx.xx) on Fri 22 Jan 2010 at 16:11
[ Send Message | View dkg's Scratchpad | View Weblogs ]
I'm glad to hear it. nonetheless, what i observed was an IMAP server which did not announce STARTTLS in response to a CAPABILITY query, but when i issue STARTTLS in an IMAP session (via netcat, for example), i see:
0 dkg@pip:~$ nc imap.fubar.example.com 143
* OK IMAP4 Ready empp1 00021888
1 capability
* CAPABILITY IMAP4 IMAP4REV1
1 OK CAPABILITY
2 starttls
2 OK Begin TLS negotiation now
0 dkg@pip:~$ 
Hrm. when i try directly with OpenSSL's s_client, though, i get a failure:
0 dkg@pip:~$ openssl s_client -starttls imap -connect imap.fubar.example.com:143
CONNECTED(00000003)
didn't found STARTTLS in server response, try anyway...
3309:error:140790E5:SSL routines:SSL23_WRITE:ssl handshake failure:s23_lib.c:188:
0 dkg@pip:~$ 
So maybe the server doesn't really support STARTTLS, even though it does respond to the command with an OK response.

Thanks for prompting me to look into that further. It looks like i was fooled by the OK Begin TLS negotiation now, and the server is actually correct in not offering STARTTLS as a capability.

[ Parent | Reply to this comment ]

Posted by Anonymous (212.121.xx.xx) on Mon 1 Feb 2010 at 17:14
You supply perdition's config with a capability string that it will report; the default is just IMAP4 IMAP4REV1. Why? You can request CAPABILITY from IMAP before you've logged in, yet the main use of perdition is that it chooses the real backend server based on the username when you log in. As you don't have to be logged in to issue CAPABILITY, it has no way of knowing the capabilities of the backend server, because it doesn't know which backend server is about to be chosen. Instead the admin should set up the capability string in perdition, presumably using the lowest common denominator if the backend IMAP servers differ in capabilities.

As for this IMAP state weirdness, thanks for the heads up, I will be testing this before we put perdition into production.

[ Parent | Reply to this comment ]

Posted by dkg (216.254.xx.xx) on Tue 2 Feb 2010 at 15:37
[ Send Message | View dkg's Scratchpad | View Weblogs ]
This bit actually appears to be a bug in older versions of perdition, which have since been resolved (in two commits). So at the moment, i can conclude that the remote server is running perdition version 1.17 or earlier.

[ Parent | Reply to this comment ]

Posted by Anonymous (80.101.xx.xx) on Fri 22 Jan 2010 at 12:47
perdition wouldn't be the first IMAP proxy, server or client to deliver a flawed IMAP implementation. We could yell at programmers to build better apps, but we really need a simpler protocol.

[ Parent | Reply to this comment ]