We were having an issue where some TLS connections were failing with "SSL_accept error from". There were a couple domains but all microsoft was one of the larger legitimate ones we were having a problem with.
quick answer -> The Answer
SSL_accept error from smtp.microsoft.com[22.214.171.124]: -1
The problem only started occurring after an upgrade from debian etch to debian Lenny. One server had not been upgraded yet and could successfully handle all mails that the upgrdae servers were getting the "SSL_accept error from". This meant something had changed during the upgrade process that was causing this error. To give you an idea of the scope of the issue we get about 7500 TLS e-mails per day and 7000 were working fine. Only about 500 were failing on the upgraded mail servers and then working on the older etch server.
Here are more details on the different systems. The new servers were running debian Lenny with postfix 2.5.5-1.1 and openssl 0.9.8g-15+Lenny6. The system that hadn't been upgraded and what all the other were running before the upgrade is running debian etch with 2.3.8-2+etch1 0.9.8c-4etch9.
The first thing I did was to check that my certs were good even though 7000 messages a day were working fine I wanted to double check. Using openssl and the directions at:
I confirmed the certs and the fact that ssl was working.
One thing to note while testing TLS from openssl with the following command:
openssl s_client -starttls smtp -crlf -connect mail2.xxx.com:25 -CApath ~/.cert/mail2.xxx.com -cipher RC4-MD5
I found out about a bug running the s_client within openssl. It is not a perfect client and has some limitations. I have always used all caps when doing trouble shooting in a SMTP conversation. This turned out to be a problem with openssl s_client. Specificly when doing RCPT TO: it kept RENEGOTIATING. You must use "rcpt to:" and NOT "RCPT TO:", the first note of this I could find was at http://archives.neohapsis.com/archives/postfix/2007-01/1334.html
. Well it is still a problem.
Why did I force the RC4-MD5 cipher? I had noticed most of the failures were using this cipher, just turned out to be a coincidence.
Ok back to what else I did to try and trouble shot this issue. I made the mistake of trying to turn up TLS logging passed level 2 which wasn't showing me anything. I had done this on a spamtrap server and the logs directory filled in under 2 hours. So this was not a good idea. I forced a log rotate and looked to peer logging. I turned on peer logging for several of the domains that were having the issue. The main.cf entries for postfix are below:
debug_peer_list = microsoft.com, XXX.com, XXX.com
debug_peer_level = 3
After I had turned logging way up for just those domains I did not really see anything on the servers that were having the problems. It still looked like they were just connecting and then dropping off.
At this point I decided to try a brute force approach and upgraded the TLS and postfix packages to squeeze on one of the spamtrap servers. However in order to test it I moved around the MX weights. When I came in the next day it took about 2 minutes to notice it was still having a problem. On the central log server I was monitoring the mail log for anything with TLS or SSL in the line. This gave a good picture of the problem. I quickly saw someone connect to the spamtrap(now with a normal MX record) server get dropped and then switch to the secondary server running etch and have the transaction run with no problems.
tail -f /var/log/mail.log | egrep 'TLS|SSL'
primary1 postfix/smtpd: setting up TLS connection from mailserver.xxx.com[xx.217.202.16]
primary1 postfix/smtpd: SSL_accept error from mailserver.xxx.com[xx.217.202.16]: -1
primary1 postfix/smtpd: lost connection after STARTTLS from mailserver.xxx.com[xx.217.202.16]
primary2 postfix/smtpd: setting up TLS connection from mailserver.xxx.com[xx.217.202.16]
primary2 postfix/smtpd: SSL_accept error from mailserver.xxx.com[xx.217.202.16]: -1
primary2 postfix/smtpd: lost connection after STARTTLS from mailserver.xxx.com[xx.217.202.16]
secondary setting up TLS connection from mailserver.xxx.com[xx.217.202.16]
secondary postfix/smtpd: TLS connection established from mailserver.xxx.com[xx.217.202.16]: TLSv1 with cipher RC4-MD5 (128/128 bits)
Ok back to the drawing board. After some more searching http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=573748
didn't exactly match but I figured I would try compiling from source and include the extra library call. I was running out of ideas and I was getting desperate. That failed as quickly as the idea of upgrading. At this point I looked at how hard it would be to downgrade postfix and openssl. Well that didn't look fun and didn't provide the answer which was almost as important as fixing the issue.
During this time with the increased logging I was also using ssldump to try and get more information regarding the problem. Both on the working server and the servers that were having a problem.
ssldump -i eth0 -AadnkxX -k /etc/postfix/tls/key.pem
ssldump -i eth0 -AadnkxX -k /etc/postfix/tls/key.pem > /tmp/ssldump.txt
Again the connections dropped off without any additional information in the logs on the primary servers running lenny and the secondary server reported nothing special.
I didn't include that I had gone to IRC #postfix and #openssl with little luck. One person had suggested it could be a library missmatch. When I first got this answer there was only one domain having the problem, even if it was a very large 3rd party e-mail handler. However as I continued to monitor the problem and noticed more legitimate e-mail servers having the same problem that answer was no longer sufficient. I keep using microsoft as an example as they are very large, legitimate and posting about them having a problem won't hurt anyones feelings.
After revisiting IRC on freenode I failed to get any additional information regarding my problem. After not getting anywhere on freenode I thought I would ask my local friends on irc-debian.org. I asked "anyone want to help trouble shoot postfix openssl tls issue? Its starting to get to me :)"
Lucky for me dkg volunteered to help look into the situation and we discussed some information back and forth. dkg had another idea "weight 20(secondary etch server) has a smaller handshake than the other ones." The secondary server running etch has a 17KB handshake and the others have a 21KB handshake. I go ahead and do a dpkg-reconfigure ca-certificates to reduce the size of the handshake. I figure the etch server is working so I remove any of the ca roots that are not on the etch server.
Low and behold I tell dkg I owe him a beer as this fixed the problem.
Some interesting notes
The root Authorities removed did not include Microsoft's root Authority and the TLS connections with them were Trusted once the issue was resolved. An other interesting problem that also got resolved was that when I started writing the ssldumps to files I noticed that I was getting handshake errors about the length of the handshake being to short. After getting the root certificates file down below 20k, ~17k that error went away as well.
I wrote all of this in the hopes that I could save someone else 2 weeks of losing their wits and having to find a very obscure solution to a difficult problem.
Please feel free to leave any comments, suggestions, questions or any corrections :)