Designing a Robust Monitoring System


Reading Ted Dziuba’s article Monitoring Theory, I was reminded of several conventions that I have developed over the years to help with monitoring servers, network devices, software services, batch processes, etc. First, break down your data points into levels so that you can decide how to route them. Second, avoid interrupt-driven technology like email; it lowers your productivity and prohibits good analysis techniques.

I like Ted’s breakdown into three main categories: “Neither Informative nor Actionable”, “Informative, but not Actionable”, and “Informative and Actionable”. I have made a few modifications to Ted’s breakdown, but his gist is spot on.

I also break things down into three categories: Logged, Non-Critical Action, and Critical Action. This allows me to quickly operationalize any data point that I can conceive. First, the data point should either go to a log or it should alert. Second, if it is indeed an alert, then it should be prioritized as either non-critical (8-5) or critical (24×7).

Personally, I also like to have the non-critical alerts go to a dashboard instead of paging; for critical alerts, you really have no choice but to page. There are things that need to be taken care of and there are things that can and should wait. If you make everything critical, you will burn your operations people out. Surprisingly, a lot of smart people get this wrong.
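The two-step decision above (log or alert, then non-critical or critical) can be sketched in a few lines of Python. This is only an illustration: the names `Level`, `ROUTES`, `classify`, and `route` are made up for this example, and the destinations are plain strings standing in for whatever log, dashboard, and pager you actually use.

```python
from enum import Enum


class Level(Enum):
    LOGGED = "logged"
    NON_CRITICAL = "non-critical"
    CRITICAL = "critical"


# Hypothetical destinations; swap in your own log, dashboard, and pager.
ROUTES = {
    Level.LOGGED: "log",
    Level.NON_CRITICAL: "dashboard",
    Level.CRITICAL: "pager",
}


def classify(is_alert: bool, needs_24x7: bool) -> Level:
    """First decide log vs. alert, then prioritize the alert."""
    if not is_alert:
        return Level.LOGGED
    return Level.CRITICAL if needs_24x7 else Level.NON_CRITICAL


def route(is_alert: bool, needs_24x7: bool) -> str:
    """Return the destination for a data point."""
    return ROUTES[classify(is_alert, needs_24x7)]
```

For example, `route(True, False)` sends an 8-5 alert to the dashboard, while anything that is not an alert goes to the log no matter how important the underlying job is.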



Logged

This is anything that I really don’t need to know about unless there is some other problem. This is how a web server transaction works: if there is a problem you receive an error; otherwise, just log it or graph it. Also, I suggest staying away from email for this role. You usually end up pestering the operations people to the point that they filter the email to a folder and never look at it. It is better to use a centralized logging system of some kind and then do proper log analysis on it. I use Petit to write reports for daily, weekly, and monthly analysis. The reports give an approximation of reality; a folder in your email gives you nothing.

There are two main types of data here: numbers and letters. If it is some kind of numeric data that you are capturing, use a graph. If the data you are capturing is made up of discrete entries, use a log, then graph the log if it makes sense. Again, don’t email yourself the graphs; keep them in a central system. Something like Cacti is good.

Examples of logged data points include successful processes of all kinds:

  • Data Import/Exports: In many environments there are data import/export jobs which are critical to applications. That is fine, but you don’t want to pollute your email with successful job output. Instead, put it into a log where it can be properly analyzed.
  • Granular Job Tracking: When you put job tracking data into a log instead of email, you can start to take advantage of very granular job tracking. Instead of logging success/failure for the whole job, you can track individual parts of a job. This helps track down partial failures such as slowdowns.
  • Load Average, CPU, Memory: Many people make the mistake of thinking these are good fault monitors. They are not; graph them instead. They will help tell you when a problem started after you receive a fault from some other check.
  • BGP Route View Checks: In our environment, we check the ATT IX BGP looking glass every four hours to get a list of routes back to us through our two upstream providers. The Internet is dynamic, so sometimes the number of routes through each provider changes. That is fine, but I might want to have it logged in case I have a problem later.
  • Trace Routes: We log trace routes between our data centers to have a historic view of what path things are taking. This can help when troubleshooting flaky VPN connections, but I don’t want to know about it unless the VPN gets flaky.
  • Config File Generation: In many environments, configuration files are built and deployed. Many pieces of this process should be logged. The log provides a paper trail for what has been deployed and may be used later if a bug is found in your build process. This will give you a starting point for repairing/rebuilding systems that were deployed during the time the bug was in the wild. It may also help track a start/stop time for when things occurred.
  • Backup Processes: Parts of the backup process, for example MySQL dumps, SVN dumps, and special DVD copies, can be orchestrated and logged from a central script. Your operations team won’t need to know this in an email or on a daily basis, but it is a blessing when trying to reverse engineer the system when there is a problem.

Now, you shouldn’t forget that there is granularity here. If a particular piece of a process, check, import, etc. fails, you can escalate it to a non-critical or critical action. Sometimes it is even necessary to cause a cascading failure. For example, when collecting data points for a geographically redundant web application, if MySQL replication fails, you may also want to stop synchronization of a docroot until an operations person can investigate what went wrong.
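A minimal sketch of that idea, assuming each job is a list of named steps: every step outcome is logged, and the first failure is escalated and deliberately stops the remaining steps. The function name `run_job`, the `escalate` callback, and the logger name are all hypothetical, and the standard `logging` module stands in for a centralized log.

```python
import logging

log = logging.getLogger("jobs")


def run_job(job_name, steps, escalate):
    """Run each (step_name, func) pair, logging every outcome.

    Successes are only logged. The first failure is logged, passed to
    `escalate`, and stops the remaining steps (a deliberate cascade).
    Returns the list of step names that completed.
    """
    completed = []
    for step_name, func in steps:
        try:
            func()
        except Exception as exc:
            log.error("%s/%s failed: %s", job_name, step_name, exc)
            escalate(job_name, step_name, exc)
            break
        log.info("%s/%s ok", job_name, step_name)
        completed.append(step_name)
    return completed
```

With steps such as a replication check followed by a docroot sync, a failed replication check means the sync never runs, and the `escalate` callback decides whether that lands on the dashboard or pages someone.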


Non-Critical Action

I prefer piping this kind of data point to a dashboard instead of paging. I expect our operations people to look at the dashboard first thing in the morning and periodically throughout the day. These kinds of data points do not need tending in the middle of the night.

You can be a bit more liberal with non-critical alerts, especially if they are fed to a dashboard, but don’t get carried away. You don’t want to pester your operations people so badly during the day that they can’t get project work done.

Examples of non-critical actions include failed processes of all kinds:

  • Software Vulnerabilities: These are published on public RSS feeds, but your operations team isn’t going to care about this at 2AM when they are sleeping. They can look at it during business hours. Even if it is a critical vulnerability, no one is going to look at this at 2AM. Many people get this wrong.
  • SLA in Danger: These kinds of checks can be very difficult to tune. Four-second page load times may be OK at 4AM, but not at 11AM. Sometimes time-based metrics are necessary, but this can require significant scripting or configuration in your monitoring system. I prefer embedding this in some kind of scripting framework because it gives you granular access by check. Often this is difficult to do in something like Nagios.
  • Tape Cleaning: Good to know, but don’t wake me up.
  • Captured Command Output: When scripting, use a utility to capture the non-zero return values and output of important commands. If something goes wrong, alert. This provides very granular reporting of failed portions of a script. I use something called scriptlog to pipe command return values and output to syslog. Then I have a Nagios check which captures failed commands and displays a warning in the dashboard. This allows the team to fix problems in the morning and also gives a good place to start when fixing scripting problems.
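The captured command output idea can be approximated in Python. This is not scriptlog (its interface is not shown here); it is a rough stand-in built on the standard `subprocess` and `logging` modules, and the function name `run_logged` and the logger name are made up for the example. To send the records to syslog, you could attach a `logging.handlers.SysLogHandler` to the logger.

```python
import logging
import subprocess

log = logging.getLogger("cmdlog")


def run_logged(argv):
    """Run a command, log its return code and output, return the code."""
    result = subprocess.run(argv, capture_output=True, text=True)
    output = (result.stdout + result.stderr).strip()
    if result.returncode != 0:
        # Failures carry the return code and output so a later check
        # (for example, one feeding a dashboard) can pick them up.
        log.warning("FAILED rc=%d cmd=%r output=%r",
                    result.returncode, argv, output)
    else:
        log.info("ok cmd=%r", argv)
    return result.returncode
```

A script would wrap each important command, for example `run_logged(["rsync", "-a", "src/", "dst/"])`, and carry on or stop based on the returned code.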


Critical Action

Do not underestimate how difficult it is to determine these kinds of checks. It is really easy to put too many things in this category. If a switch, router, firewall, server, Apache, or the application is in a fault state, someone must be alerted and the service must be restored, but be careful when specifying an SLA for slow service. Make sure such SLAs are well defined and manageable. For example, many people will ping network or server gear once a minute and page when it is slow for two minutes. This is impossible to manage; it will page you at 2AM every day. This is completely useless, and when you have a real problem, people will end up waiting for a recovery notice and won’t even start investigating until the problem seems real.

  • Network Down: It is preferable to monitor fault state with a true/false check of some kind, such as ping. Routers can slow down when updating BGP, and a slow response does not necessarily indicate an actionable alert unless the slow state persists for long enough.
  • Server Down: Server down state can be even trickier to determine. Often a server will stop reporting for a while because of heavy processing. If there is enough deviation in your systems, you may even need different checks for different groups of servers.
  • Service Down: Services can be difficult to check. These checks can become very sophisticated, which makes them impractical for one-minute intervals. In a blog, adding and deleting a post can be tested; in a shopping cart, an entire transaction can be tested. These checks can really stretch the boundary between testing and fault monitoring.
  • SLA Not Met: As mentioned above, be very careful when setting these; they can and will drive your operations team insane. Use them sparingly and look for more deterministic ways of finding faults. Also, alert only when the SLA is not met over some number of checks spanning some number of minutes, or you WILL have false positives.
  • Open Sockets, Pipes, Files: Monitoring open sockets, pipes, and files can be a very good indication that something is going to go wrong if you don’t look into it. If the threshold is high enough, it can even indicate impending doom. In production, alerting on these numbers can give you five minutes to respond and fix before a meltdown.

Do not limit your imagination when determining what can be checked, but be realistic when determining what it means if a check fails for one, five, or ten minutes. When determining an SLA for a sophisticated check, I suggest setting your targets long, especially during off hours. These numbers may sound wild, but I suggest you set the bar low and tune up. If you desensitize your operations people, your return to service times will be worse than these numbers anyway.

Network Gear: 3-4 failed checks at one minute intervals
Servers: 7-8 failed checks at one minute intervals
Services: 7-8 failed checks at one minute intervals
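
One way to apply the thresholds listed above is to count consecutive failed checks and page only when the count reaches the threshold. The sketch below assumes one check per minute and uses the lower end of each suggested range; the class name, the `THRESHOLDS` table, and its keys are illustrative, not taken from any particular monitoring system.

```python
# Lower ends of the suggested ranges; adjust to your environment.
THRESHOLDS = {"network": 3, "server": 7, "service": 7}


class ConsecutiveFailureCheck:
    """Page only after enough failed checks in a row."""

    def __init__(self, kind):
        self.threshold = THRESHOLDS[kind]
        self.failures = 0

    def record(self, ok):
        """Record one check result; return True when it is time to page."""
        if ok:
            # Any successful check resets the streak.
            self.failures = 0
            return False
        self.failures += 1
        # True exactly once, when the streak first hits the threshold,
        # so an ongoing outage does not page again every minute.
        return self.failures == self.threshold
```

A single slow or dropped check never pages, and a recovery in the middle of a streak starts the count over, which is what keeps a router that is briefly busy with a BGP update from waking anyone up.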



Obviously, this system could be expanded upon to include more granular time windows or 24×7 network operation centers, but hopefully this gives you the gist of how to break things down. The goal is to provide your operations team with real actionable items when their time is the most expensive (off hours) while at the same time providing reasonable coverage of problems.

