Designing a Robust Monitoring System

Reading Ted Dziuba’s Monitoring Theory article, I was reminded of several conventions that I have developed over the years to help with monitoring servers, network devices, software services, batch processes, etc. First, break down your data points into levels so that you can decide how to route them. Second, avoid interrupt-driven technology like email; it lowers your productivity and gets in the way of good analysis techniques.
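
As a rough illustration of the first convention, here is a minimal Python sketch of routing data points by level. The level names (INFO, WARNING, CRITICAL) and the destinations (archive, dashboard, ticket_queue) are assumptions made up for this example, not something taken from the article; substitute whatever levels and channels fit your environment.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical severity levels for monitoring data points."""
    INFO = 1
    WARNING = 2
    CRITICAL = 3


# Hypothetical destinations; none of them interrupts a person by email.
ROUTES = {
    Level.INFO: "archive",
    Level.WARNING: "dashboard",
    Level.CRITICAL: "ticket_queue",
}


def route(level):
    """Return the destination name configured for a level."""
    return ROUTES[level]


def dispatch(events):
    """Group (level, message) pairs into one list per destination."""
    queues = {destination: [] for destination in ROUTES.values()}
    for level, message in events:
        queues[route(level)].append(message)
    return queues


if __name__ == "__main__":
    sample = [
        (Level.INFO, "nightly backup finished"),
        (Level.CRITICAL, "disk full on gateway"),
    ]
    print(dispatch(sample))
```

The point of the table is that the routing decision lives in one place, so changing where a level goes does not require touching the code that produces the data points.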

The Logs Are an Approximation of Reality

The logs are an approximation of reality, and they cannot be taken as canonical or gospel. This is true in several senses. Logs can give insight into the standard investigative questions of who, what, when, where, and why, but answering all of these questions almost always requires other information.

Today, Postfix reiterated this lesson for me. I had a problem where our gateway mail server couldn’t deliver mail to a peer. The receiving mail server kept rejecting the recipient address with a 550 even though the mailbox being delivered to was real and active. Gmail, Yahoo, and MSN would all accept email from our gateway, but this one provider would not. Of course, it wasn’t a simple problem. We had a web server running Apache/PHP delivering to the local Sendmail server, which forwarded to a Postfix gateway server, which then tried to deliver to an Exim server that received mail for the destination address.

I am not going to dig into all of the details, but of course, the first thing I did was go to the logs. The problem is, the logs were wrong! In the following examples, the users and domains have been anonymized, but the logs are real.