CentOS Post Mortem & Analysis

Background

I manage the crunchtools lab and the infrastructure for this blog much like a development data center. I have a rigorous weekly checklist, which includes an optional step of applying operating system patches as they become available. I do not perform the updates every week because of time constraints, but when I do, I patch all of the systems. Most of my infrastructure is built on Red Hat Enterprise Linux, but I run this blog on Linode, which does not offer a Red Hat Enterprise Linux image. Linode does allow custom images, but I have continued to use the CentOS build, partly to better understand the differences from a hands-on perspective.

Several weeks ago, I patched both the CentOS system on which this blog runs and the Red Hat Enterprise Linux systems in the crunchtools environment. Applying the latest available patches caused an Apache web server outage, but only on the CentOS system. Clients could not connect, so I sifted through the system logs, but didn’t see any AVC denials or messages that indicated the cause of the problem. After some experimentation, I realized that SELinux was blocking access to the web server. The SELinux booleans looked fine, so to return to service as quickly as possible, I temporarily disabled SELinux.
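For reference, the checks and the temporary workaround boiled down to something like the following. This is only a sketch of typical CentOS 6 commands, not a transcript of the exact session; switching to permissive mode is the quickest way to take SELinux out of the picture without a reboot.

getenforce                                    # confirm SELinux is in enforcing mode
ausearch -m avc -ts recent                    # look for recent AVC denials (none appeared)
grep denied /var/log/audit/audit.log | tail   # double check the raw audit log
setenforce 0                                  # temporarily drop to permissive mode
curl -I http://localhost/                     # Apache started answering again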

A couple of weeks later, I applied the new patches that had become available. To my relief, after applying these patches and re-testing with SELinux enabled, Apache was able to accept connections again.

Post Mortem

Environment Details


Timeline

On February 27th, updates were applied and the outage began. The SELinux policy was among the packages updated, though it may not have been the cause.

...Feb 27 08:37:49 Updated: selinux-policy-3.7.19-155.el6_3.14.noarch
Feb 27 08:54:45 Updated: selinux-policy-targeted-3.7.19-155.el6_3.14.noarch
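In hindsight, reading the RPM changelog of the updated policy package is a quick way to see what changed between builds. A small sketch; the output depends on the installed version:

rpm -q --changelog selinux-policy-targeted | head -n 20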


At this point, I spent some time digging through the logs and did not see anything to indicate what was blocking the web server. I restarted the web server and eventually decided to reboot the system. When the system came back up, the web server still couldn’t be accessed. Even though there was no indication that SELinux was blocking access to the web server, I disabled it, and as soon as SELinux was disabled, the web server was able to accept connections again.

Feb 27 08:54:10 lance kernel: __ratelimit: 22683 callbacks suppressed
Feb 27 08:54:12 lance nagios: SERVICE ALERT: lance;Basic Web Services;OK;HARD;4;HTTP OK: HTTP/1.1 200 OK - 211 bytes in 9.074 second response time
Feb 27 08:54:12 lance nagios: SERVICE NOTIFICATION: nagiosadmin;lance;Basic Web Services;OK;notify-service-by-email;HTTP OK: HTTP/1.1 200 OK - 211 bytes in 9.074 second response time
Feb 27 08:54:28 lance kernel: __ratelimit: 10011 callbacks suppressed
Feb 27 08:54:45 lance yum[4619]: Updated: selinux-policy-targeted-3.7.19-155.el6_3.14.noarch
Feb 27 08:54:55 lance yum[4619]: Updated: epel-release-6-8.noarch
Feb 27 08:55:04 lance kernel: __ratelimit: 14232 callbacks suppressed
Feb 27 08:55:42 lance nagios: SERVICE ALERT: learn.fatherlinux.com;String Check: learn.fatherlinux.com;CRITICAL;SOFT;1;HTTP CRITICAL - No data received from host
Feb 27 08:55:42 lance nagios: SERVICE ALERT: rt.fatherlinux.com;String Check: rt.fatherlinux.com;CRITICAL;SOFT;1;Connection refused
Feb 27 08:55:56 lance kernel: __ratelimit: 9972 callbacks suppressed
Feb 27 08:56:43 lance nagios: SERVICE ALERT: learn.fatherlinux.com;String Check: learn.fatherlinux.com;CRITICAL;SOFT;2;CRITICAL - Socket timeout after 10 seconds
Feb 27 08:56:52 lance nagios: SERVICE ALERT: rt.fatherlinux.com;String Check: rt.fatherlinux.com;CRITICAL;SOFT;2;CRITICAL - Socket timeout after 10 seconds
Feb 27 08:56:52 lance kernel: __ratelimit: 4161 callbacks suppressed
Feb 27 08:57:04 lance kernel: __ratelimit: 46248 callbacks suppressed
Feb 27 08:57:19 lance kernel: __ratelimit: 8328 callbacks suppressed
Feb 27 08:57:32 lance nagios: SERVICE ALERT: learn.fatherlinux.com;String Check: learn.fatherlinux.com;CRITICAL;SOFT;3;Connection refused
Feb 27 08:57:42 lance nagios: SERVICE ALERT: rt.fatherlinux.com;String Check: rt.fatherlinux.com;CRITICAL;SOFT;3;Connection refused
Feb 27 08:58:26 lance kernel: __ratelimit: 3675 callbacks suppressed
Feb 27 08:58:32 lance nagios: SERVICE ALERT: learn.fatherlinux.com;String Check: learn.fatherlinux.com;CRITICAL;HARD;4;Connection refused
Feb 27 08:58:32 lance nagios: SERVICE NOTIFICATION: nagiosadmin;learn.fatherlinux.com;String Check: learn.fatherlinux.com;CRITICAL;notify-service-by-email;Connection refused
Feb 27 08:58:37 lance init: serial (hvc0) main process (1355) killed by TERM signal
Feb 27 08:58:38 lance nagios: Caught SIGTERM, shutting down...


A couple of weeks later, new patches were applied, which included a new SELinux policy. Once the new policy was installed and I re-enabled SELinux, the web server was able to accept connections while in enforcing mode.

Mar 12 22:50:23 Updated: selinux-policy-3.7.19-195.el6_4.3.noarch
Mar 12 22:52:34 Updated: selinux-policy-targeted-3.7.19-195.el6_4.3.noarch...
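Re-enabling SELinux and re-testing looked roughly like this. It is a sketch that assumes SELinux had only been switched to permissive; if it had been fully disabled in /etc/selinux/config, the setting would need to be changed back and the filesystem relabeled on reboot. The hostname is one of the vhosts from the Nagios checks above.

setenforce 1                            # back to enforcing mode
getenforce                              # should print "Enforcing"
curl -I http://learn.fatherlinux.com/   # the vhost answers while enforcing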


Analysis

After this strange outage, I decided to analyze some of the core variables that correlated with it.

Updates

CentOS and RHEL have significant differences in the way they receive and consume updates. This can be seen clearly in the log snippets below: both the number of patches and the dates on which updates arrive are different. Notice that CentOS receives 251 patches when it is updated to CentOS 6.4, while RHEL receives 674 patches.

The number of patches is a key difference. When RHEL 6.4 was released, all of the patches were released together, because they were all tested together and sent through a quality assurance (QA) cycle together. When CentOS 6.4 was released, a subset of the RHEL patches was released. This means that, on any given patch cycle, the permutation of patches applied to a CentOS system is different from the permutation applied to RHEL.

A CentOS user does not inherit all of the testing and QA for a given RHEL release or patch set. Since the CentOS repository provides a different permutation of patches, the CentOS team may have to wait until users report a problem before fixing issues such as the web server outage described here. This kind of regression could have been caught with automated testing.

CentOS

cat /var/log/yum.log | grep "Feb 27" | wc -l
37

cat /var/log/yum.log | grep "Mar 12" | wc -l
251

cat /var/log/yum.log | grep centos-release
Mar 12 22:49:16 Updated: centos-release-6-4.el6.centos.10.x86_64


RHEL

cat /var/log/yum.log | grep "Feb 21" | wc -l
674

cat /var/log/yum.log | grep "Mar 18" | wc -l
24

cat /var/log/yum.log | grep redhat-release
Feb 21 10:30:18 Updated: redhat-release-server-6Server-6.4.0.4.el6.x86_64
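As a side note, the per-day counts above can be tallied in one pass over yum.log. A small sketch, assuming the default yum.log timestamp format (month and day in the first two fields):

awk '{print $1, $2}' /var/log/yum.log | uniq -c   # packages changed per date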


Tests

Given that the kind of regression described above could have been caught and fixed with automated testing, I decided to analyze the testing process used for CentOS. As I analyzed the pieces of the CentOS test suite which cover Apache and SELinux, I discovered that one of the contributors, Athmane Madjoudj, contributes not only to CentOS, but also to the Fedora test framework. For this discussion, I will focus on the CentOS version of the test framework. The Fedora test components are mentioned only to provide guidance for possible future integration of SELinux coverage into other tests; integrated SELinux coverage could have prevented the outage described here.

A quick analysis of the tests around Apache and SELinux shows some shortcomings in the test coverage, though the infrastructure necessary to test for the behavior described in this post mortem is fairly complicated to set up and maintain.

Since CentOS is a community supported operating system, the CentOS test suite focuses heavily on testing installation. CentOS does not have much high-level application testing or post-installation testing, such as verifying configuration options or kernel variable tuning.

In particular, the CentOS test suite contains httpd_php.sh, which is a good place to start this analysis. This script does some basic testing, but it has limitations. Notice that the web server is only tested from localhost. This is less than optimal, because user space utility bugs, kernel networking or firewall issues, or other configuration problems can sometimes leave a web server reachable from localhost while blocking access from a remote machine.

Similarly, the SELinux test only checks for AVC entries in the log and whether enforcing mode is enabled. This is a very high level test which does not provide much coverage.
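A test along the following lines, run from a machine other than the web server while the server stays in enforcing mode, would have caught this class of regression. This is only a sketch; the hostname is a placeholder and the use of ssh for the server-side checks is my own assumption, not part of the CentOS test suite.

#!/bin/bash
# Sketch: exercise Apache from a remote client with SELinux enforcing,
# then look for AVC denials on the server.
HOST="www.example.com"   # placeholder for the system under test

# 1. The server must be enforcing for the test to mean anything
ssh root@$HOST getenforce | grep -q Enforcing || { echo "FAIL: not enforcing"; exit 1; }

# 2. Fetch a page from a remote machine, not from localhost
curl -sf --max-time 10 "http://$HOST/" > /dev/null || { echo "FAIL: remote HTTP check"; exit 1; }

# 3. Any AVC denials logged during the run are a failure
ssh root@$HOST "ausearch -m avc -ts recent" | grep -q denied && { echo "FAIL: AVC denials found"; exit 1; }

echo "PASS"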

In contrast, a couple of Fedora tests do integrate SELinux testing. For example, the Tuned test script does some SELinux testing.


CentOS Tests


Fedora Tests


Bug Tracker

As a final step, I thought it would be prudent to check the bug trackers for CentOS and Red Hat Enterprise Linux for reported problems with SELinux. I searched for http, 155, and 195, but could not find a bug that correlated with the SELinux behavior described in this post mortem.

For a point of reference, I went on to look at the number of open SELinux bugs. As of March 26th, 2013 at 10:47 PM EST, there were 301 open SELinux bugs for Red Hat Enterprise Linux and 50 for CentOS.

The difference in bug counts could be interpreted in any number of ways. It may indicate that more RHEL users are testing and finding SELinux bugs (http://news.cnet.com/8301-13505_3-10312978-16.html), it might indicate that RHEL has more SELinux bugs, or it might indicate that CentOS patches SELinux bugs faster than RHEL, but that is doubtful.

  • Red Hat Enterprise Linux: 301 SELinux bugs
  • CentOS: 50 SELinux bugs

Red Hat Enterprise Linux Bugzilla

CentOS Bug Tracker


Conclusions

In this particular case, the CentOS system experienced an outage and I could not quickly determine why. After a quick analysis of the syslog messages and the audit log turned up nothing, I attempted a reboot. Apache still would not respond to connections. At that point, I took a guess and simply disabled SELinux, which allowed Apache to accept connections again. I then saved /var/log/yum.log and /var/log/messages so that I could document the outage here.

Several weeks later, I applied the available patches, bringing the CentOS system up to date, and SELinux and Apache began working together again. During this time, the RHEL servers in the crunchtools environment did not experience the same outage. This is in no way meant to imply that the inverse couldn’t happen. I believe it is possible for a bug to cause an outage on RHEL and not affect CentOS, because RHEL receives updates first and, on any given day, the two bit streams are not the same.

Additional Notes: Development and Production

While RHEL and other enterprise Linux rebuilds share upstream source code, the binary builds are not completely the same. At any point in time, each distribution contains different packages and versions. That is to say, on any given day, hour, and minute, a rebuild cannot be the same as RHEL. Although RHEL and rebuilds such as CentOS share major and minor versions, for example 6.4, each contains a different permutation of patch sets. This makes them different in critical ways; rebuild distributions do not inherit all of the benefits of the testing and quality assurance (QA) provided by Red Hat. This is also the reason that rebuilds do not inherit the hardware and software certifications from third party vendors such as HP or SAP.

Architects and engineers will sometimes investigate a mixed environment that uses a rebuild distribution in development and RHEL in production. Typically, this architecture is investigated to save on subscription costs in the development environment, but it carries several important caveats. Since rebuilds are downstream distributions of RHEL, they can contain bugs that do not exist in RHEL, and using a downstream distribution to build an upstream test environment creates some interesting challenges.

If a bug is found in the test environment, it could be specific to the rebuild. If it is, Red Hat support cannot be called in to work on a fix; the operations team will have to work with the community to develop a patch, or develop one themselves. Even if the bug can be reproduced on a RHEL system, the development environment can’t be patched until Red Hat develops and publishes a fix and the downstream community rebuilds and distributes it. This could take months, and there is no guarantee that the patch will ever be released. Over the years, this workflow can create significant technical debt for engineers, and technical debt leads to increased cost.

Increased technical debt between testing and production environments can cut both ways; it can lead to problems in production that were never found or tested in the development environment. It is better to have a homogeneous environment built from either all CentOS or all RHEL. Having a mix sets the environment up to accrue at least some technical debt, which offsets the savings of not running RHEL in development.

Specific to the crunchtools environment, the challenge with managing RHEL and a rebuild like CentOS has been determining whether they are at the same revision at any given point in time. The way RHEL repositories are exposed via channels, versus the base, extras, and updates mechanism in CentOS, makes it difficult to maintain a core build and associated update repositories for both over time. It becomes tempting to apply RHEL updates from my Satellite server to a CentOS box, but that would clearly defeat the purpose for which most users deploy a rebuild.
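For what it is worth, a rough way to see how far a RHEL box and a CentOS box have drifted apart is to diff their installed package lists. A sketch, with rhel-box and centos-box as placeholder hostnames:

ssh rhel-box   'rpm -qa --qf "%{NAME}-%{VERSION}-%{RELEASE}\n"' | sort > rhel.txt
ssh centos-box 'rpm -qa --qf "%{NAME}-%{VERSION}-%{RELEASE}\n"' | sort > centos.txt

comm -3 rhel.txt centos.txt   # package/version combinations unique to one side

This only shows which package and version combinations exist on a single system at that moment; it says nothing about which versions each repository actually makes available, which is the harder problem.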

As this post mortem attempts to demonstrate, there are differences between RHEL and any rebuild. Installing and supporting multiple distributions between development and production environments creates technical debt. Specific to the crunchtools environment, CentOS and RHEL have different build, test, and release mechanisms. Each distribution supports and manages its own bug tracker and completely different support mechanisms. Architects and business analysts should consider these variables when calculating ROI.

6 comments on “CentOS Post Mortem & Analysis”

  1. I might be missing something, but in that level of detail, you completely forgot to mention what the problem really was.

    Can you share some details on that, ideally via a reproducer script.

    Finally, contributing to the CentOS test suite is a great way to ensure these sorts of problems don’t occur in the future, look forward to seeing that merge request from you :)

    – KB

    ( I contribute time towards the CentOS Project )

    1. I am going to modify the article to clarify a few things. I wish I could reproduce it, but regretfully, I couldn’t even really get to the bottom of what was happening. There was no information in /var/log/messages or the audit log. When I initially started writing this article, my intention was not to criticize the way CentOS does testing; it was purely to demonstrate that they are different, because an outage affected CentOS and not RHEL even though they receive updates at the same time each week.

      For full disclosure, I have seen this happen several times over the years with RHEL and CentOS usage. In this particular case, I finally had a chance to document it in some detail. The fact that a problem would affect CentOS and not RHEL is really ancillary. What I really want to demonstrate is that a bug can affect one and not the other.

      I did it in a post mortem fashion because I wanted to keep most of it to technical analysis of how I returned to service. I will be promptly adding some more detail to clarify.

  2. You provided counts of updated packages on two systems, a CentOS system and a RHEL system. Are both systems running the same exact packages? If the RHEL system has significantly more packages, it will see significantly more updates when going to RHEL 6.4.

    I don’t agree with your assertion that CentOS 6.4 and RHEL 6.4 use different package versions. CentOS 6.4 was released on 08 March 2013. The selinux-policy package you installed was not part of the 6.4 release. (That particular package was released on 02 January.) The CentOS 6.4 release included the same version of the selinux-policy package as the RHEL 6.4 release, selinux-policy-3.7.19-195.el6_4.1.

    Also, did you check the audit logs when the webserver stopped working? Anything blocked by SELinux should be reported there. You may have better luck searching on the actual error than searching on the package version numbers.

    1. Chris, so this demonstrates my point. It is not as easy as you might think to get RHEL and a rebuild to have the exact same set of package names. In fact, it may be pretty close to impossible. Even for the package names that overlap, matching versions on any given day is also nearly impossible, because RHEL and any rebuild distribution, by their very upstream/downstream relationship, release versions at different times.

      I spent hours trying to analyze this. I eventually developed a method to build a subset of packages I could find on both systems. I then tried to get all available versions of those packages using “yum -q list available --showduplicates”. The problem is, this doesn’t work on CentOS. The way the repositories are set up doesn’t allow you to show all revisions of a package. It also prevents “yum downgrade” from functioning. Again, this demonstrates my point: they are never the same in real life.

      CentOS 6 in particular has base, update, and extras repositories. RHEL 6 has a base repository that can contain any number of channels including, but not limited to base server channel, optional, supplementary, cluster, xfs, openshift, openstack, etc, etc.

      While it is theoretically possible to match package names and versions, I have not found any sane way to do this. This article was more a documentation of what I saw in a real life side by side environment. This was not a scientific experiment where I set up controlled CentOS and RHEL boxes that were engineered to have identical packages.

      As I stated initially, in real life with standard repositories and tools, this is quite difficult to achieve.

  3. i had a server outage last year around 6.3 which was caused by yum update, partial or out of order, where selinux made syslog block, and any process that logged, hung, so that includes sshd

    other servers where i did yum update of all packages, were fine

    the solution was to reboot and disable selinux via grub edit, and apply all yum updates before re-enabling selinux
