Linux Container Crisis Tools

I want to highlight a great post by Brendan Gregg: Linux Crisis Tools. He walks through a scenario that strikes me as very realistic, and it brings back memories from my 15 years managing Linux servers. You can tell he has real-world sysadmin experience. I think that’s key for being a good thought leader in any specialized field.

Basically, he uses a real-world example to show that if you don’t have debugging tools already installed, you might not even be able to analyze the root cause of the problem. You’ll likely also lengthen the outage as you try to find workarounds to get to the root cause. Worse, you’re going to cause yourself a bunch of emotional stress in the process of getting nowhere. And all for what? To save a little disk space? Because of common fallacies about attack surface?

I love a smaller attack surface, but getting there requires thinking about the entire software perimeter, not just individual container images. That’s why I’ve argued for Standard Operating Environments, even with containers: Containers need standard operating environments too! I still think smaller images are useful in certain use cases, which is why I led the release of UBI Micro, but I think it’s easy to get overly obsessed with smaller container images, clinging to false gods like Distroless.

Brendan’s example involves troubleshooting disk problems, which you wouldn’t normally do from within a container (this would typically be done from the container host), but the same failure pattern is common when deploying and operating containers. If you have to rebuild the container image with the troubleshooting tools you need, redeploy a new running container, and then hope you can reproduce the issue you found, there’s a good chance you’ll never figure out the root cause. Transient problems like this can nag administrators for months. This gets even worse at hyperscale, with thousands of containers running in Kubernetes, where partial and intermittent failure states are more likely.
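
One way to find out before an outage, rather than during one, is to check whether the tools you’d reach for are actually present in a running container. Here’s a minimal sketch in Python, assuming the image ships a Python interpreter. The tool list is only illustrative (it is not Brendan’s list), and `missing_tools` is a hypothetical helper name, so swap in whatever your team actually uses in a crisis.

```python
import shutil

# Illustrative list only (an assumption, not taken from the original post).
# Replace it with the tools your team relies on during an incident.
CRISIS_TOOLS = ["ps", "top", "vmstat", "iostat", "ss", "tcpdump", "strace"]


def missing_tools(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(CRISIS_TOOLS)
    if missing:
        print("Missing troubleshooting tools: " + ", ".join(missing))
    else:
        print("All listed troubleshooting tools are installed.")
```

Run inside a container built from the image, something like this gives a quick read on whether you could debug a live problem in place, or whether you’d be stuck rebuilding and redeploying first.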

Either way, check out his original post! I think it highlights an often overlooked problem.

