Centralized Logging System, Analysis, and Troubleshooting


Jun 22, 2010

 

Background

Building a feature-complete centralized logging system that provided the ability to troubleshoot problems and proactively find new issues before they became service outages was a top priority when I first started at www.eyemg.com. I call it feature complete because it has successfully done both for us without spending too much time on false positives.

In this post I will review the design decisions that went into creating our, more or less, feature-complete centralized logging system. Generally, the words “log analysis” and “logging system” are thrown around as buzzwords in an attempt to sell a product, but this post is a genuine attempt to identify best practices that any systems administrator should be aware of. Another goal here is to identify systems and features other administrators have in place to achieve a feature-complete logging system, so comments are welcomed and encouraged.

 

Goals & Workflows

For us, feature complete meant that we could leverage our centralized logging system to assist in the completion of two main workflows.

  • Daily & Monthly Log Analysis: Provides proactive searching for issues which might not be identified by our service monitoring system (Nagios)
  • Point Problem Research: Used when a service outage or issue has already been identified through another mechanism, usually Nagios or a human being that noticed the problem.

 

Assumptions & Philosophy

The creation of this system was based on some critical conclusions about entropy, reality, and the universe in general.

  • Approximately Right: Is better than absolutely wrong. No analysis is perfect; by its very nature it is qualitative and imperfect, but genuine value can nonetheless be gained from it
  • The 80/20 Rule: Log analysis has the classic economic problem of unlimited wants and limited resources. We must find a level of investment we are comfortable with. In general, 80% of our goals can be achieved with 20% of the work, so don’t invest too much
  • Approximation of an Approximation of Reality: Logs by their very nature are an approximation of what the programmer thinks is happening in the program. It is important to remember this because some of the log analysis techniques used here will give an approximation of what is happening in the logs

 

Policy

Policy will have a profound effect on what must be implemented in a feature-complete logging system. Laws and industry standards may dictate parts of your retention policy, but one of the goals of this article is to fill in common-sense best practices even when no external requirement applies.

For more information on how to create a log review policy, especially in a PCI environment, Anton Chuvakin has excellent documentation including books, blog entries and a website.

 

Technology

This is just a quick overview of the different pieces of software we use in an all Red Hat Linux environment. It could also be used for Windows systems if their event logs are shipped to syslog, which can be done.

  • syslog-ng: Used on central receiving server
  • syslog: Default install on all Red Hat Linux systems
  • petit: log analysis tool used to create reports and do spot analysis
  • Log Formats: apache access, apache error, linux syslog, secure logs

 

Infrastructure

Syslog-NG

On our syslog collector we installed syslog-ng. This fits our philosophy of limited investment because it is not necessary to install it on every machine, only on the centralized collector.

First download it
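A sketch of the download step; the URLs and version numbers below are placeholders, not the originals, so substitute the current ones from the vendor’s site:

```shell
# Fetch the eventlog and syslog-ng source tarballs.
# URLs and versions are placeholders; get the current ones from the vendor site.
cd /usr/local/src
wget http://example.org/eventlog-0.2.12.tar.gz
wget http://example.org/syslog-ng-3.1.1.tar.gz
tar xzf eventlog-0.2.12.tar.gz
tar xzf syslog-ng-3.1.1.tar.gz
```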

 

Compile & Install

Eventlog is a requirement for syslog-ng. Install it in /usr/local/eventlog.
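A minimal sketch of the eventlog build, assuming the placeholder version and the prefix mentioned above:

```shell
# Build and install the eventlog library into its own prefix
cd /usr/local/src/eventlog-0.2.12
./configure --prefix=/usr/local/eventlog
make
make install
```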

 

Syslog-ng
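Then syslog-ng itself; pointing pkg-config at the eventlog prefix is the one non-obvious step (version is a placeholder):

```shell
# Tell configure where to find the eventlog library installed above
cd /usr/local/src/syslog-ng-3.1.1
PKG_CONFIG_PATH=/usr/local/eventlog/lib/pkgconfig ./configure
make
make install
```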

 

Init Scripts

Modify Red Hat’s basic syslog init script and save it

 

Then find/replace all instances of syslog with syslog-ng; this should get you close to an init script that works
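A sketch of those two steps; hand-check the result afterward, since a blind find/replace can miss daemon-specific options:

```shell
# Copy the stock init script and rename the daemon throughout.
# The word-boundary match catches both "syslog" and "syslogd" without
# re-matching the "syslog-ng" it produces.
cp /etc/init.d/syslog /etc/init.d/syslog-ng
sed -i 's/\bsyslogd\?\b/syslog-ng/g' /etc/init.d/syslog-ng
```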

 

Log Rotation

Remember to modify logrotate to include any new logs you have added, and to restart syslog-ng instead of syslogd
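A hedged sketch of what the logrotate entry might look like; the log list, rotation count, and reload command are examples, not our literal file:

```
# /etc/logrotate.d/syslog-ng (example)
/var/log/messages /var/log/secure /var/log/maillog {
    daily
    rotate 30
    compress
    sharedscripts
    postrotate
        /etc/init.d/syslog-ng reload > /dev/null 2>&1 || true
    endscript
}
```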

 

Clean Up & Clarification

Make sure people are not starting or trying to modify syslogd
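For example, using Red Hat’s service tools (commands are illustrative):

```shell
# Disable the stock daemon and enable the replacement
service syslog stop
chkconfig syslog off
chkconfig --add syslog-ng
chkconfig syslog-ng on
```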

 

Configuration

Since our Syslog-NG daemon captures logs from all of our servers, routers, switches, and firewalls, the configuration file is somewhat long and verbose. There are four main types of entries to understand: Sources, Destinations, Filters, and Logs. All four must be specified to have a working Syslog-NG daemon.

Sources allow syslog-ng to receive entries from the local log file and the network. Destination entries connect a label to a physical file, while filters connect a matching pattern (regex) to a label. The Log entries connect it all together. It is a little confusing at first, but allows for a lot of flexibility. Use the example configuration file as a guideline.

Notice that the “filter default” section has many entries that look similar to the following. These are logically equivalent to “grep -v”; they remove the pattern that is matched. If you don’t want to duplicate data, it is necessary to do this for entries which are split off into separate files.
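As an illustration (not our literal configuration), such an exclusion looks something like this in syslog-ng syntax:

```
# Drop postfix entries from the default filter so they land only in maillog
filter f_default { not match("postfix"); };
```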

 

For example, if you want postfix entries to go to /var/log/maillog and you don’t want them duplicated in /var/log/messages, you need to apply a filter similar to the one above. I won’t dig into the details of each entry in our configuration file, but complex rule sets can be constructed for your needs. Separating logs by type is critical for more advanced log analysis; it could be done later with grep, but it is convenient to have the work done for you by syslog-ng. Again, since it is central, it is a finite amount of work to set up and does not require linear scaling, i.e. it does not have to be installed on every server of every kind of operating system.

 

Example Configuration File
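Our literal file is too long to reproduce here, but an abbreviated sketch showing all four entry types (names, ports, and paths are examples) looks like this:

```
options { use_dns(yes); keep_hostname(yes); };

# Sources: the local log stream plus UDP syslog from the network
source s_local   { unix-stream("/dev/log"); internal(); };
source s_network { udp(ip(0.0.0.0) port(514)); };

# Destinations: connect a label to a physical file
destination d_messages { file("/var/log/messages"); };
destination d_mail     { file("/var/log/maillog"); };

# Filters: connect a matching pattern to a label
filter f_mail    { match("postfix"); };
filter f_default { not match("postfix"); };

# Logs: wire it all together
log { source(s_local); source(s_network); filter(f_mail);    destination(d_mail); };
log { source(s_local); source(s_network); filter(f_default); destination(d_messages); };
```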

 

 

 

Analysis

Daily Log Analysis

This is an example of how we do our daily log analysis. The report is received in an email and takes 3-5 minutes to read every morning. Clearly, the goal is not perfection, but to catch gross anomalies in the system before they cause major problems. In other environments, this system has scaled to 1,500 Linux servers with similar success. The number of unique entries does not grow linearly with the number of servers, so it is a useful tool even if you have thousands of servers. Monthly log analysis has similar scaling properties.
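The report generator itself is not reproduced here (we use petit), but the underlying technique can be sketched in plain shell: strip the variable parts of each line, then count the unique patterns that remain so new or rare entries float to the top. The field positions and digit-masking below are assumptions for illustration, not our production script.

```shell
# summarize LOGFILE: collapse each syslog line to a pattern and count it.
# Assumes the first four fields are timestamp + hostname; digits are
# masked with '#' so PIDs and addresses collapse into one pattern.
summarize() {
  awk '{ $1=$2=$3=$4=""; sub(/^ +/, ""); print }' "$1" \
    | sed 's/[0-9][0-9]*/#/g' \
    | sort | uniq -c | sort -rn
}
```

Patterns with a count of 1 at the bottom of the report are usually the interesting ones.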

Code

Output

Analysis

Individual entries which are found to be interesting can be inspected like this:
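The exact command is not shown here; a minimal stand-in (pattern and path are placeholders) is simply grep with line numbers:

```shell
# inspect PATTERN LOGFILE: show raw matching lines with line numbers
inspect() { grep -n -- "$1" "$2"; }
```

For example: inspect 'Accepted password' /var/log/secure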

 

Monthly Log Analysis

While this code is not as elegant as I would like, I thought it was important to show a work in progress. It gets the major pieces done, but it could be written in a way that makes it easier to expand and add new log files. On the flip side, this script has been used in production for over a year and has required only minimal changes. It is a testament to the fact that infrastructure does not change quite as often as one might think.

Some things to note: first, there is a check at the beginning of the script to ensure that it is running on our main analysis box, which prevents it from running on the wrong machine. Second, there are notes to the analyst about each log file, which are output to the screen at run time. These tell the analyst how much time to spend analyzing each log file and point out important aspects of the analysis. It is a simple solution, but quite elegant.

Finally, when an interesting line is found, it can be analyzed with the script itself by passing a match string and a log file to analyze. The script is smart enough to pull in historic data and grep through everything. See “Example Analysis” below.

Code

 

Output

The analysis script displays a report similar to the following output for each set of log files analyzed.

 

Analysis

Individual entries can be inspected with the following command:
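The exact command is not reproduced here; a sketch that also pulls in the rotated, compressed history (paths are examples; zgrep reads .gz archives directly) might look like:

```shell
# inspect_history PATTERN BASELOG: search the live log plus rotated .gz copies
inspect_history() {
  grep -h -- "$1" "$2" 2>/dev/null
  zgrep -h "$1" "$2"-*.gz 2>/dev/null
}
```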
