Background
When performing automation with OpenSSH and cron, you will inevitably run into concurrency problems. Recently, we had a problem where one machine was receiving 21 ssh connections within one second, because the standard cron daemon only has a granularity of one minute. In this article, I am going to quickly elaborate on how we spotted the problem, how we set up a verification test rig in our lab, and how we solved it. Quickly!
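For example, if every data-collection host carries a crontab entry like the hypothetical one below, cron fires them all at exactly 01:10:00, and each job opens its ssh connection to the central server within the same second or two (the script name is invented for illustration):
10 1 * * * /usr/local/bin/push-data.sh    # ssh/scp the nightly export to the import server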
The Find
For several years we had noticed intermittent ssh connection failures, but they were never frequent enough to send us down the path of really tracking them down. Honestly, we attributed them to glitches and assumed they were just the cost of doing business. It turns out, it was too many ssh connections arriving at once.
The specific log entry that was the last straw was a connection error noticed on a box which receives and imports data. The following was received in an email from a failed cron job.
ssh_exchange_identification: Connection closed by remote host
This led to some basic log analysis with Petit on the remote host. Notice the "Postponed publickey for" entries, and notice that there are 21 accepted connections but only 20 Postponed entries. This is because the first connection made it through with no problem, while the other 20 connections were delayed slightly.
zcat /var/log/secure.1.gz | grep "Dec 23 01:10" | petit --hash
21: sshd[#]: Accepted publickey for #
21: sshd[#]: pam_unix(sshd:session): session closed for #
21: sshd[#]: pam_unix(sshd:session): session opened for #
20: sshd[#]: Postponed publickey for #
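Petit makes this easy, but the same tally can be approximated with standard tools; the sketch below assumes the stock syslog layout of /var/log/secure and normalizes the PID and user fields by hand before counting:
zcat /var/log/secure.1.gz | grep "Dec 23 01:10" | cut -d' ' -f5- | sed -e 's/sshd\[[0-9]*\]/sshd[#]/' -e 's/ for .*/ for #/' | sort | uniq -c | sort -rn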
Finally, notice that the entries all land in the first few seconds of the minute. The minute before had no ssh connections at all, which helps verify that the unusual count is not coincidental.
zcat /var/log/secure.1.gz | grep "Dec 23 01:10" | petit --sgraph
[petit --sgraph output: a second-by-second bar graph of the 01:10 minute, with all of the ssh connections clustered in the first few seconds and no activity for the rest of the minute]
Start Time: 2010-12-23 01:10:02 Minimum Value: 0
End Time: 2010-12-23 01:11:01 Maximum Value: 10
Duration: 60 seconds
The Fix
OpenSSH has many security protections in place, one of them being the MaxStartups directive. The purpose of this directive is to prevent too many unauthenticated connections. This helps fight brute force attacks and also prevents an attacker from denying legitimate users access to the server. The sshd_config man page describes the directive as follows:
Specifies the maximum number of concurrent unauthenticated connections to the sshd daemon. Additional connections will be dropped until authentication succeeds…
Cutting to the chase, bumping the setting up to 25 from the default of 10 solved the problem. The verification process is elaborated upon in the next section.
Change:
#MaxStartups 10
To:
MaxStartups 25
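For reference, applying and checking the change on a Red Hat style system looks roughly like the sketch below (adjust the service command and paths for your distribution):
grep -i maxstartups /etc/ssh/sshd_config    # confirm the edit
sshd -t                                     # syntax-check the configuration
service sshd reload                         # pick up the change without dropping existing sessions
Newer OpenSSH releases also accept a colon separated three-value form such as MaxStartups 10:30:100, which begins randomly dropping unauthenticated connections once 10 are pending and refuses all of them at 100, rather than enforcing a single hard cutoff.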
Experiment
The following shell code was run against an OpenSSH test server to verify the problem state (user@testserver below is a placeholder for the real test host). It produces 100 connections within milliseconds, effectively concurrent. As expected, it instantly reproduced the disconnection problem along with all of the associated log entries.
for i in {1..100}; do (ssh user@testserver uname &) 2>&1; done | grep ssh_exchange | wc -l
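A slightly more explicit variant of the same burst test is sketched below (again with a placeholder host); it backgrounds 100 ssh sessions, waits for them all, and counts how many were refused:
for i in {1..100}; do
  ssh -o BatchMode=yes user@testserver uname >/dev/null 2>"/tmp/ssh-err.$i" &   # capture each session's stderr
done
wait                                                                            # let all 100 attempts finish
grep -l ssh_exchange_identification /tmp/ssh-err.* 2>/dev/null | wc -l          # number of refused connections
rm -f /tmp/ssh-err.*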
Conclusion
Sometimes it is very difficult, especially with intermittent problems, to verify that a configuration change actually solved the problem. Often the problem does not surface again for a long time simply because the set of conditions that triggers it is so narrow. This can be very frustrating and very time consuming. Luckily, OpenSSH has a directive to help in this scenario.
In this particular case, the evidence and test rigging were delightfully perfect:
- Disconnection problem at: Dec 23 01:10
- 21 correlated connections at: Dec 23 01:10
- Perfect isolation, no ssh connections in minute before or after
- Excellent lab tools: Petit, Bash, Grep, Admin Server, Test Server
- Experiment which produced the same problem and logging
The issue was tracked down, patched, and verified quicker than the problem could be documented. Having a good lab environment in place is critical for this kind of work. The problem has been gone ever since the change, and we are confident that we nailed it!
If you suspect that this might be affecting you, you can set the sshd LogLevel to DEBUG. Doing so will show you syslog messages along the lines of "dropped connection #10" for each connection dropped because of the MaxStartups check.
Check your syslog.conf to find where syslog has been configured to log your ssh messages.
Both LogLevel and MaxStartups are set in the sshd_config file, usually located in /etc/ssh.
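A rough sketch of that diagnostic, assuming a Red Hat style layout where authpriv messages land in /var/log/secure (the exact wording of the drop message varies between OpenSSH versions):
# /etc/ssh/sshd_config
LogLevel DEBUG
# reload sshd, reproduce the connection burst, then search the log
service sshd reload
grep -i "drop.*connection" /var/log/secure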