I manage the crunchtools lab and the infrastructure for this blog much like a development data center. I have a rigorous weekly checklist, which includes optionally applying operating system patches as they become available. I do not perform the updates every week because of time constraints, but when I do, I patch all of the systems. Most of my infrastructure is built on Red Hat Enterprise Linux, but I run this blog on Linode, which doesn’t have an image for Red Hat Enterprise Linux. Linode does offer the ability to create a custom image, but I have continued to use the CentOS build, partially to better understand the differences from a hands-on perspective.
Several weeks ago, I patched both the CentOS system on which this blog runs and the other Red Hat Enterprise Linux systems in the crunchtools environment. Applying the latest available patches caused an Apache web server outage, but only on the CentOS system. Clients could not connect, so I sifted through the system logs, but didn’t see any AVC denials or messages that indicated the cause of the problem. After some experimentation, I realized that SELinux was blocking access to the web server. The SELinux booleans looked fine, so to return to service as quickly as possible, I temporarily disabled SELinux.
A couple of weeks later, I applied some new patches that were available. To my relief, after applying these new patches and testing SELinux, Apache was able to accept connections again.
- Operating System: CentOS 6.4
- Installation Date: Tue 12 Jul 2011 11:24:06 AM EDT
- Provider: Linode
- Hardware: XEN Hypervisor
- Repositories: CentOS-6 – Base, CentOS-6 – Extras, CentOS-6 – Updates
- Optional Repositories: Extra Packages for Enterprise Linux 6 – x86_64
On February 27th, updates were applied and the outage followed. The SELinux policy was among the updated packages, though it may not have been the cause.
...Feb 27 08:37:49 Updated: selinux-policy-3.7.19-155.el6_3.14.noarch
Feb 27 08:54:45 Updated: selinux-policy-targeted-3.7.19-155.el6_3.14.noarch
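Entries like the ones above can be pulled out of the yum log with a simple filter. The following is a minimal sketch, assuming the default log location /var/log/yum.log; the function name policy_updates is made up for illustration:

```shell
# Hypothetical helper: list every recorded install or update of the
# SELinux policy packages.
# Usage: policy_updates [path-to-yum-log]   (defaults to /var/log/yum.log)
policy_updates() {
    grep -E '(Installed|Updated): selinux-policy' "${1:-/var/log/yum.log}"
}
```

To see which policy versions are currently installed, rpm -q selinux-policy selinux-policy-targeted can be run alongside it.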
At this point, I spent some time digging through the logs and did not see anything to indicate what was blocking the web server. I restarted the web server and eventually decided to reboot the system. When the system came back up, the web server still couldn’t be accessed. Even though there was no indication that SELinux was blocking access to the web server, I then disabled it. As soon as SELinux was disabled, the web server was able to accept connections again.
Feb 27 08:54:10 lance kernel: __ratelimit: 22683 callbacks suppressed
Feb 27 08:54:12 lance nagios: SERVICE ALERT: lance;Basic Web Services;OK;HARD;4;HTTP OK: HTTP/1.1 200 OK - 211 bytes in 9.074 second response time
Feb 27 08:54:12 lance nagios: SERVICE NOTIFICATION: nagiosadmin;lance;Basic Web Services;OK;notify-service-by-email;HTTP OK: HTTP/1.1 200 OK - 211 bytes in 9.074 second response time
Feb 27 08:54:28 lance kernel: __ratelimit: 10011 callbacks suppressed
Feb 27 08:54:45 lance yum: Updated: selinux-policy-targeted-3.7.19-155.el6_3.14.noarch
Feb 27 08:54:55 lance yum: Updated: epel-release-6-8.noarch
Feb 27 08:55:04 lance kernel: __ratelimit: 14232 callbacks suppressed
Feb 27 08:55:42 lance nagios: SERVICE ALERT: learn.fatherlinux.com;String Check: learn.fatherlinux.com;CRITICAL;SOFT;1;HTTP CRITICAL - No data received from host
Feb 27 08:55:42 lance nagios: SERVICE ALERT: rt.fatherlinux.com;String Check: rt.fatherlinux.com;CRITICAL;SOFT;1;Connection refused
Feb 27 08:55:56 lance kernel: __ratelimit: 9972 callbacks suppressed
Feb 27 08:56:43 lance nagios: SERVICE ALERT: learn.fatherlinux.com;String Check: learn.fatherlinux.com;CRITICAL;SOFT;2;CRITICAL - Socket timeout after 10 seconds
Feb 27 08:56:52 lance nagios: SERVICE ALERT: rt.fatherlinux.com;String Check: rt.fatherlinux.com;CRITICAL;SOFT;2;CRITICAL - Socket timeout after 10 seconds
Feb 27 08:56:52 lance kernel: __ratelimit: 4161 callbacks suppressed
Feb 27 08:57:04 lance kernel: __ratelimit: 46248 callbacks suppressed
Feb 27 08:57:19 lance kernel: __ratelimit: 8328 callbacks suppressed
Feb 27 08:57:32 lance nagios: SERVICE ALERT: learn.fatherlinux.com;String Check: learn.fatherlinux.com;CRITICAL;SOFT;3;Connection refused
Feb 27 08:57:42 lance nagios: SERVICE ALERT: rt.fatherlinux.com;String Check: rt.fatherlinux.com;CRITICAL;SOFT;3;Connection refused
Feb 27 08:58:26 lance kernel: __ratelimit: 3675 callbacks suppressed
Feb 27 08:58:32 lance nagios: SERVICE ALERT: learn.fatherlinux.com;String Check: learn.fatherlinux.com;CRITICAL;HARD;4;Connection refused
Feb 27 08:58:32 lance nagios: SERVICE NOTIFICATION: nagiosadmin;learn.fatherlinux.com;String Check: learn.fatherlinux.com;CRITICAL;notify-service-by-email;Connection refused
Feb 27 08:58:37 lance init: serial (hvc0) main process (1355) killed by TERM signal
Feb 27 08:58:38 lance nagios: Caught SIGTERM, shutting down...
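The log above is dominated by kernel rate-limit messages, which makes the sequence of package updates and monitoring alerts harder to follow. One way to reduce it to a timeline is sketched below. It assumes syslog writes to /var/log/messages and that yum and nagios log under those program names, as they do in the excerpt; the function name outage_timeline is hypothetical:

```shell
# Hypothetical helper: keep only yum package events and nagios service
# alerts from a syslog file, dropping everything else (including the
# kernel __ratelimit lines and nagios notification lines).
# Usage: outage_timeline [path-to-messages]   (defaults to /var/log/messages)
outage_timeline() {
    grep -E ' (yum: |nagios: SERVICE ALERT: )' "${1:-/var/log/messages}"
}
```

Read this way, the excerpt shows the selinux-policy-targeted update at 08:54:45 followed by the first failed checks at 08:55:42, which is a correlation worth investigating rather than proof of cause.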
A couple of weeks later, new patches were applied, which included a new SELinux policy. Once the new policy was installed and I enabled SELinux, the web server was able to accept connections while in enforcing mode.
Mar 12 22:50:23 Updated: selinux-policy-3.7.19-195.el6_4.3.noarch
Mar 12 22:52:34 Updated: selinux-policy-targeted-3.7.19-195.el6_4.3.noarch...
During this strange outage, I decided to analyze some of the core variables that correlated with it.
CentOS and RHEL differ significantly in the way they receive and consume updates, as the log snippets below show. Both the number of patches and the dates on which updates are received are different. Notice that CentOS receives 251 patches when it is updated to CentOS 6.4, while RHEL receives 674 patches.
The number of patches is a key difference. When RHEL 6.4 was released, all of the patches were released together. This is because they were all tested together and sent through a quality assurance (QA) cycle together. When CentOS 6.4 was released, a subset of the RHEL patches were released. This means that, on any given patch cycle, the permutation of patches applied to a CentOS operating system is different from the one applied to RHEL.
A CentOS user does not inherit all of the testing and QA for a given RHEL release or patch set. Since the CentOS repository provides a different permutation of patches, the CentOS team may have to wait until users report a problem to start fixing issues, such as the web server outage described here. This kind of regression bug could have been caught with some kind of automated testing.
# On the CentOS system
grep -c "Feb 27" /var/log/yum.log
grep -c "Mar 12" /var/log/yum.log
grep centos-release /var/log/yum.log
Mar 12 22:49:16 Updated: centos-release-6-4.el6.centos.10.x86_64
# On a RHEL system
grep -c "Feb 21" /var/log/yum.log
grep -c "Mar 18" /var/log/yum.log
grep redhat-release /var/log/yum.log
Feb 21 10:30:18 Updated: redhat-release-server-6Server-188.8.131.52.el6.x86_64
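Rather than counting one date at a time, the same log can be summarized per day. This is only a sketch, assuming the standard yum.log line format shown above (month, day, time, action, package); the function name updates_per_day is hypothetical. Note that yum.log does not record the year, so entries from the same calendar day in different years would be counted together.

```shell
# Hypothetical helper: count how many packages yum installed or updated
# on each day recorded in the log.
# Usage: updates_per_day [path-to-yum-log]   (defaults to /var/log/yum.log)
updates_per_day() {
    grep -E ' (Installed|Updated): ' "${1:-/var/log/yum.log}" |
        awk '{ count[$1 " " $2]++ } END { for (day in count) print day, count[day] }' |
        sort
}
```

Running it on both a CentOS and a RHEL system gives a side-by-side view of when each one received its updates and how many arrived at once.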
Given that the kind of regression described above could have been caught and fixed with automated testing, I decided to analyze the testing process used for CentOS. As I analyzed the pieces of the CentOS test suite which cover Apache and SELinux, I discovered that one of the contributors, Athmane Madjoudj, contributes not only to CentOS, but also to the Fedora test framework. For this discussion, I will focus on the CentOS versions of the test framework. The test components described from the Fedora framework are only mentioned to provide guidance for possible future integration of SELinux coverage into other tests. Integrated SELinux coverage could have prevented the outage described here.
A quick analysis of the tests around Apache and SELinux shows some shortcomings in the test coverage, but the infrastructure necessary to test for the behavior described in this post mortem is fairly complicated to set up and maintain.
Since CentOS is a community supported operating system, the CentOS test suite focuses heavily on testing installation. CentOS does not have a large number of high-level application tests or post-installation tests, such as verifying configuration options or kernel variable tuning.
In particular, the CentOS test suite contains httpd_php.sh which is a good place to start this analysis. This script does some basic testing, but has some limitations. Notice that the web server is only tested from localhost. This is less than optimal because sometimes user space utility bugs, kernel network, kernel firewall, or other configuration problems can cause a web server to be available to localhost, but prevent access from a remote machine.
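A check that complements the localhost test would request the page from a second machine. The sketch below is not part of the CentOS test suite; it assumes curl is installed on the remote machine, and the function names and the example hostname are placeholders:

```shell
# Hypothetical helper: fetch a URL and print only the HTTP status code.
# curl prints 000 when no HTTP response was received at all.
# Usage: http_status <url>
http_status() {
    curl --silent --output /dev/null --max-time 10 --write-out '%{http_code}' "$1"
}

# Hypothetical helper: succeed only for 2xx or 3xx status codes.
# Usage: status_ok <http-status-code>
status_ok() {
    case "$1" in
        2??|3??) return 0 ;;
        *)       return 1 ;;
    esac
}

# Example, run from a machine other than the web server:
#   status_ok "$(http_status http://www.example.com/)" || echo "web server unreachable"
```

Keeping the status classification separate from the network call makes the pass or fail logic easy to exercise without a live server.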
Similarly, the SELinux test checks for AVC entries in the log and whether Enforcing mode is enabled. This is only a very high level test which does not provide much coverage.
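The two checks described above can be approximated in a few lines. This is an illustrative sketch rather than the actual CentOS test; it assumes auditd writes to /var/log/audit/audit.log, and both function names are hypothetical:

```shell
# Hypothetical helper: succeed only if the reported SELinux mode is Enforcing.
# Usage: is_enforcing "$(getenforce)"
is_enforcing() {
    [ "$1" = "Enforcing" ]
}

# Hypothetical helper: count AVC denial records that mention a given
# process name. Prints 0 when there are none.
# Usage: avc_denials_for <comm> [audit-log]
avc_denials_for() {
    grep 'avc:.*denied' "${2:-/var/log/audit/audit.log}" | grep -c "comm=\"$1\""
}
```

A count of zero does not prove that SELinux is uninvolved, which is consistent with the outage described here. One possible explanation, not confirmed in this case, is that dontaudit rules in the policy suppress some denials from the log; semodule -DB rebuilds the policy with those rules disabled and semodule -B restores them. Switching to permissive mode with setenforce 0 is also a less drastic experiment than disabling SELinux, because denials continue to be logged.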
In contrast, a couple of Fedora tests do have integrated testing of SELinux. For example, the Tuned test script does some SELinux testing.
As a final step, I thought it would be prudent to check the Bug Trackers for CentOS and Red Hat Enterprise Linux for reported problems with SELinux. I searched for http, 155, and 195; I could not find a bug which correlated with SELinux behavior described in this post mortem.
For a point of reference, I went on to investigate the number of open SELinux bugs. As of March 26th, 2013 at 10:47 PM EST, there were 301 open SELinux bugs for Red Hat Enterprise Linux and 50 for CentOS.
The difference in bugs could be interpreted in any number of ways. It may be indicative of more RHEL users testing and finding SELinux bugs (http://news.cnet.com/8301-13505_3-10312978-16.html), it might indicate that RHEL has more SELinux bugs, or it might indicate that CentOS patches SELinux bugs faster than RHEL, but that is doubtful.
- Red Hat Enterprise Linux: 301 SELinux bugs
- CentOS: 50 SELinux bugs
In this particular case, CentOS experienced an outage and I could not quickly determine why. After a quick analysis of syslog messages and the audit log turned up nothing, I attempted a reboot. Apache would still not respond to connections. I then took a guess and disabled SELinux, which succeeded in allowing Apache to accept connections. Finally, I saved /var/log/yum.log and /var/log/messages so that I could document the outage here.
Several weeks later, I applied available patches, bringing the CentOS system up to date, and SELinux and Apache began working together again. During this time the RHEL servers in the crunchtools environment did not experience the same outage. This is in no way meant to imply that the inverse couldn’t have happened. I believe it is possible that a bug may cause an outage on RHEL but not affect CentOS, because RHEL receives updates first and, on any given day, the two bit streams are not the same.
Additional Notes: Development and Production
While RHEL and other enterprise Linux rebuilds share upstream source code, the binary builds are not completely the same. Temporally, each distribution contains different packages and versions. That is to say, on any given day, hour and minute, a rebuild cannot be the same as RHEL. Although RHEL and rebuilds such as CentOS share major and minor versions, for example 6.4, each contains different permutations of patch sets. This makes them different in critical ways; rebuild distributions do not inherit all of the benefits from the testing and quality assurance (QA) provided by Red Hat. This is also the reason that rebuilds do not inherit the same hardware and software certifications from third party vendors such as HP or SAP.
Architects and engineers will sometimes investigate a mixed environment of a rebuild distribution in development, while using RHEL in production. Typically, this architecture is investigated to save on subscription costs in the development environment. This architecture contains several important caveats. Since rebuilds are downstream distributions from RHEL, they can contain bugs that do not exist in RHEL. Using a downstream distribution to build an upstream test environment creates some interesting challenges.
If a bug is found in the test environment, it could be specific to the rebuild. If it is specific to the rebuild, Red Hat support cannot be called to work on a fix. At this point, the operations team will have to work with the community to develop a patch, or even develop a patch themselves. Even if the bug can be reproduced on a RHEL system, the development environment can’t be patched until Red Hat develops and publishes a patch, and the downstream community rebuilds and distributes it. This could take months and there is no guarantee that the patch will ever be released. Over years, this workflow can create significant technical debt for engineers. Technical debt leads to increased cost.
Increased technical debt between testing and production environments can work both ways. This could lead to problems in production which were never found or tested in the development environment. It is better to have a homogeneous environment built from either all CentOS or all RHEL. Having a mix sets the environment up to accrue at least some technical debt, which offsets the subscription savings in the development environment.
Specific to the crunchtools environment, the challenge with managing RHEL and a rebuild like CentOS has been determining if they are at the same revision at any given point in time. The way the RHEL repositories are exposed via channels, versus the mechanism of base, extras, and updates in CentOS, makes it difficult to have a core build and associated update repositories for management of both over time. It becomes tempting to apply RHEL updates from my Satellite server to a CentOS box, but that would clearly defeat the purpose for which most users deploy a rebuild.
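One rough way to see whether two systems are at the same revision is to compare their installed package lists. The sketch below is an assumption about how this could be done, not a description of the crunchtools tooling; the function names and file names are hypothetical. It is deliberately simplified: packages installed in several versions or architectures (such as the kernel) are reduced to one entry, and rebuilt packages can carry a modified release tag, so a textual difference does not always mean a different patch level.

```shell
# Hypothetical helper: dump "name version-release" for every installed
# package, sorted by name. Run on each host and save the output,
# for example: list_packages > rhel.txt
list_packages() {
    rpm -qa --queryformat '%{NAME} %{VERSION}-%{RELEASE}\n' | sort
}

# Hypothetical helper: print packages present in both lists whose
# version-release strings differ, as "name version-in-a version-in-b".
# Usage: version_drift <list-a> <list-b>
version_drift() {
    awk 'NR == FNR { seen[$1] = $2; next }
         ($1 in seen) && seen[$1] != $2 { print $1, seen[$1], $2 }' "$1" "$2"
}
```

For the outage described in this post, a comparison like this would have listed selinux-policy among the packages at different versions on the two systems.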
As this post mortem attempts to demonstrate, there are differences between RHEL and any rebuild. Installing and supporting multiple distributions between development and production environments creates technical debt. Specific to the crunchtools environment, CentOS and RHEL have different build, test, and release mechanisms. Each distribution supports and manages individual Bug Trackers and completely different support mechanisms. Architects and business analysts should consider these variables when calculating ROI.