Designing a Robust Monitoring System

Reading Ted Dziuba’s article Monitoring Theory, I was reminded of several conventions that I have developed over the years to help with monitoring servers, network devices, software services, batch processes, etc. First, break down your data points into levels so that you can decide how to route them. Second, avoid interrupt-driven technology like email; it lowers your productivity and gets in the way of good analysis techniques.
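As a rough illustration of the first convention, here is a minimal Python sketch of routing data points by level. The three level names, the `DataPoint` fields, and the destination names in `ROUTES` are all assumptions made up for this example; they are not taken from Dziuba's article or from any particular monitoring tool.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical severity levels; pick whatever scale fits your site."""
    INFO = 1      # keep for later analysis, nobody needs to look right away
    WARNING = 2   # review at the next scheduled check
    CRITICAL = 3  # someone has to act on it


@dataclass
class DataPoint:
    source: str
    message: str
    level: Level


# Hypothetical destination names, one per level.
ROUTES = {
    Level.INFO: "archive",
    Level.WARNING: "dashboard",
    Level.CRITICAL: "oncall_queue",
}


def route(point: DataPoint) -> str:
    """Return the destination name for a data point based on its level."""
    return ROUTES[point.level]


def group_by_destination(points):
    """Bucket data points by destination so each bucket can be reviewed in batch."""
    buckets = {}
    for point in points:
        buckets.setdefault(route(point), []).append(point)
    return buckets
```

Grouping by destination, rather than sending one message per data point, is one way to keep the lower levels out of an inbox so they can be reviewed together.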

The Logs Are an Approximation of Reality

The logs are an approximation of reality, and they cannot be taken as canonical or gospel. This is true in several senses. Logs can give insight into the standard investigative questions of who, what, when, where, and why, but they almost always require other information to truly answer all of these questions.

Today, Postfix reiterated this lesson for me. I had a problem where our gateway mail server couldn’t deliver mail to a peer. The receiving mail server kept rejecting the recipient address with a 550 even though the mailbox being delivered to was real and active. Gmail, Yahoo, and MSN would all accept email from our gateway, but this one provider would not. Of course, it wasn’t a simple problem. We had a web server running Apache/PHP delivering to the local Sendmail server, which forwarded to a Postfix gateway server, which then tried to deliver to an Exim server that received mail for the destination email address.

I am not going to dig into all of the details, but of course, the first thing I did was go to the logs. The problem is, the logs were wrong! In the following examples, the users and domains in the logs have been changed to be anonymous, but the logs are real.

Decade of Storage: Analysis of Data Costs

Yesterday, I noticed this interesting tidbit from Rackspace calculating the cost of data over the last Decade of Storage. Of course, there are a few bumps in the road that made me chuckle. Interestingly, in the last couple of years it plots the cost falling from $0.40/GB to $0.06/GB. This ties together a whole bunch of things that I have thought about over the last couple of years. First, now is a wonderful time to be a user buying storage for personal audio and video. Second, regular people are going to have to start learning data management strategies. Finally, this cost isn’t even close to what it is for me in my data center. It is easy for us to celebrate the cheap cost of raw storage while losing track of the total cost of ownership for data. I will elaborate.

Passion for the Science of Computing

I recently read an article called “Computer. Science. Paradox?” by Ben Rockwood, which pointed me to a phenomenal project called “Great Principles of Computing.” The project’s founding principle is that Computing, not Computers, is the center of our study and that the Science of Computing is, indeed, a natural science. This project touches on so many issues in the teaching of Computer Science and how we index our knowledge. It also provides solutions to so many frustrations I felt while working my way through the undergraduate Computer Science curriculum at the University of Akron.

Systems Administrator’s Lab: OpenSSH MaxStartups

Background: When performing automation using OpenSSH/Cron, you will inevitably run into concurrency problems. Recently, we had a problem where one machine was receiving 21 SSH connections within one second. This is because the standard cron daemon only has a granularity of one minute. In this article, I am going to quickly elaborate on how we

Amazon EC2 & Rackspace Cloud Servers

Background: Recently, I had the chance to work on a couple of projects that took me into the cloud. The first project had me setting up Eucalyptus on KVM. The second had me building out an infrastructure in Rackspace Cloud Servers. This has given me some hands-on insight into the problems that are facing

OpenSSH and Keychain for Systems Administrators

This tutorial provides guidance on best practices and configuration of OpenSSH/Keychain, but also includes some important troubleshooting techniques for which documentation is somewhat lacking. These techniques took me several years to develop and I have tried to compile them here in one concise post so that others do not have to suffer through the arduous learning process