Navigation Menu

Background

When performing automation using OpenSSH/Cron you will inevitably run into concurrency problems. Recently, we had a problem where one machine was receiving 21 ssh connection within one second. This is because the standard cron daemon only has a granularity of one minute. In this article, I am going to quickly elaborate on how we spotted the problem, how we setup a verification test rig in our Lab and how we solved it. Quickly!

 

The Find

For several years we noticed intermittent ssh connection failures, but they were never enough to send us down the path to really tracking it down. Honestly, it was attributed to glitches and assumed it was just the cost of doing business. It turns out, it was too many ssh connections loading up at once.

The specific log entry that was the last straw was a connection error noticed on a box which receives and imports data. The following was received in an email from a failed cron job.

 

This led to some basic log analysis with Petit on the remote host. Notice the Postponed publickey for. Also, notice that there are 21 connections and only 20 Postponed entries. This is because the first connection made it through with no problem, while the other 20 connections were delayed slightly.

 

Finally, notice that the entries all begin in the first few seconds of the minute. The minute before did not have any ssh connections which helps verify that the odd number is not coincidental.

 

The Fix

OpenSSH has a many security protections in place, one of them being the MaxStartups directive. The purpose of this directive is to prevent too many unauthenticated connections. This helps fight brute force attacks and also prevents a hacker from denying a legitimate user access to the server. The sshd_config man page describes the directive as:

Specifies the maximum number of concurrent unauthenticated connections to the sshd daemon. Additional connections will be dropped until authentication succeeds…

Cutting to the chase, bumping the setting up to 25 from the default of 10 solved the problem. The verification process is elaborated upon in the next section.

Change:

 

To:

 

Experiment

The following shell code was ran against an OpenSSH test server to verify the problem state. This produces 100 connections in milliseconds, effectively concurrent. As would be expected, it instantly produced the disconnection problem along with all of the associated log entries.

 

Conclusion

Sometimes it is very difficult, especially with intermittent problems, to verify that a configuration change actually solved a problem. Often, the problem doesn’t surface again because the specific set of criteria that causes the problem is so specific. This can be very frustrating and very time consuming. Luckily, OpenSSH had a directive to help in this scenario.

In this particular case, the evidence and test rigging was delightfully perfect.

  • Disconnection problem at: Dec 23 01:10
  • 21 Correlated connection at: Dec 23 01:10
  • Perfect isolation, no ssh connections in minute before or after
  • Excellent lab tools: Petit, Bash, Grep, Admin Server, Test Server
  • Experiment which produced the same problem and logging

The issue was tracked down, patched, and verified quicker than the problem could be documented. Having a good lab environment in place is critical for this kind of work. It has been gone ever since the change and we have confidence that we nailed it!

    5 Comments

  1. I really like the website. I also have a website and I am searching for a nice wordpress theme.

  2. If you suspect that this might be affecting you, you can set the sshd LogLevel to DEBUG. Doing so will show you syslog messages along the lines of, “dropped connection #10” for each occurrence of a connection dropped because of the MaxStartups check.

    Check your syslog.conf to find where syslog has been configured to log your ssh messages.

    Both LogLevel and MaxStartups can be found in the sshd_config file usually found in /etc/ssh.

  3. What can be max/largest value can be possible to set for variable MaxStartups in sshd_config..?

  4. YES!!! Thank you! This blog entry helped me track down an “intermittent ssh connection problem” that we were experiencing. To complicate matters, the main consumer of ssh was Mercurial, so it was initially reported as an hg-problem… anyway, this has been very frustrating for our developers.

    Happy now. 🙂

    • I know the feeling, glad the article helped somebody else!

Trackbacks/Pingbacks

  1. OpenSSH and Keychain for Systems Administrators | Crunch Tools - [...] Sometimes when controlling jobs from cron, several or even hundreds will execute at the exact same second. This can…

Post a Reply

Your email address will not be published.