What is sVirt and How Does it Isolate Linux Containers?

Background

What is sVirt, and why does it matter for your containers? The short answer: sVirt is another layer of security, and defense in depth is a good approach to security. The longer answer: sVirt dynamically generates an SELinux label for every single one of your containers, which makes it harder for them to break into each other or into the host, and harder for the host to break into their data. It’s like a Mexican standoff with containers where everybody has a gun pointed at everybody – I like it.

But, how does it actually work? I mean, technically?

If you don’t know, don’t feel bad. I didn’t either. I learned what it was a long time ago, and like most technical people, stopped there because I had the warm and fuzzy feeling. Basically, I felt like I knew enough – that is – until last month when I taught the Linux Container Internals Lab at OpenStack Summit in Vancouver (May 2018) and somebody asked me to go deeper.

They asked – what is sVirt, and how does it work? I realized that, over time, my answer had grown hand wavy and vague. I explained what it did conceptually, but didn’t really explain how it works. I didn’t know where the labels got generated, and this agitated the architect side of my brain. I had to go hunt it down. If this agitates you as well, then this blog entry is for you…

Before we get started, if like me, you have grown rusty on SELinux and can’t quite remember the difference between enforcement based on Type, Multi-Category Security (MCS) and Multi-Level Security (MLS), then check out The SELinux Coloring Book by Dan Walsh and Máirín Duffy. It’s a great refresher and quite easy to follow.

Also, make sure to brush up on the difference between a Container Engine (CRI-O/Docker) and a Container Runtime (runc). We often use these terms interchangeably in conversation – I am guilty of it – but the distinction will be important for understanding the rest of this article.

Now, let’s go deeper into the goods…

Regular Processes and Containers

First, let’s take a look at a regular Linux process in RHEL. Notice that a regular cat command runs as Type unconfined. In terminal 1:

cat /dev/zero

In terminal 2:

ps -efZ | grep zero
unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 root 1670 1611 16 11:52 pts/0 00:00:02 cat /dev/zero

Now, let’s do it in a container. In terminal 1:

docker run -it rhel7 cat /dev/zero

In terminal 2:

ps -efZ | grep zero
system_u:system_r:svirt_lxc_net_t:s0:c227,c979 root 25053 25042 7 12:49 pts/1 00:00:01 cat /dev/zero

Notice that in the container, the full SELinux label is totally different. Let’s break it down. By default regular Linux processes start with:
Type: unconfined_t
MCS: s0-s0:c0.c1023

But, in the container example above, they ran with:

Type: svirt_lxc_net_t
MCS: s0:c227,c979 (dynamically generated)

Now, run a few more containers and see how the MCS label changes every time:
system_u:system_r:svirt_lxc_net_t:s0:c565,c886 root 25826 25814 7 12:55 pts/1 00:00:00 cat /dev/zero
system_u:system_r:svirt_lxc_net_t:s0:c246,c723 root 25944 25934 8 12:55 pts/1 00:00:00 cat /dev/zero
system_u:system_r:svirt_lxc_net_t:s0:c35,c426 root 26048 26038 6 12:55 pts/1 00:00:00 cat /dev/zero

Pretty cool, right? Every container gets a different MCS label, so there is an extra layer of defense between every one of them. This technology was developed for virtual machines on RHEL, but works quite well with containers because they are just fancy processes, and processes can have MCS labels.
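To make the label anatomy concrete, here is a minimal Go sketch that splits a label from the ps -Z output above into its fields and checks whether two containers share an MCS pair. The Context type and the ParseContext and SameMCS names are made up for illustration and are not part of any SELinux library. It only handles the simple five-field form and does not expand ranges like c0.c1023.

```go
package main

import (
	"fmt"
	"strings"
)

// Context holds the fields of an SELinux label as printed by `ps -Z`.
// Hypothetical type for illustration only.
type Context struct {
	User, Role, Type, Level string
	Categories              string // e.g. "c227,c979"; ranges are kept as-is
}

// ParseContext splits "user:role:type:level:categories" into its parts.
func ParseContext(label string) (Context, error) {
	parts := strings.SplitN(label, ":", 5)
	if len(parts) != 5 {
		return Context{}, fmt.Errorf("expected 5 fields in %q, got %d", label, len(parts))
	}
	return Context{
		User:       parts[0],
		Role:       parts[1],
		Type:       parts[2],
		Level:      parts[3],
		Categories: parts[4],
	}, nil
}

// SameMCS reports whether two contexts carry the same level and categories.
func SameMCS(a, b Context) bool {
	return a.Level == b.Level && a.Categories == b.Categories
}

func main() {
	a, _ := ParseContext("system_u:system_r:svirt_lxc_net_t:s0:c227,c979")
	b, _ := ParseContext("system_u:system_r:svirt_lxc_net_t:s0:c565,c886")
	fmt.Println(a.Type == b.Type) // true: both containers run as the same Type
	fmt.Println(SameMCS(a, b))    // false: the MCS pair is what separates them
}
```

The point of the comparison is that the Type is identical for every container, so it is the category pair that keeps one container away from another.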

Cool, But How Does The Label Get Generated?

Well, that’s the piece we call sVirt and it’s done by the container engine. If you have forgotten, or more likely never fully known what a container engine does, here are the three main things (see also: So, What Does a Container Engine Really Do Anyway? or Competition Heats Up Between CRI-O and containerd – Actually, That’s Not a Thing):

  1. Providing an API/User Interface
  2. Pulling/Expanding images to disk
  3. Building a config.json

We are going to dig into #3. This is a file which contains the set of directives which are handed to runc dynamically before each container is started. These options come from three main places:

  1. The container image (example: physical architecture – x86_64, ARM, Windows x86_64, etc)
  2. The container engine (example: seccomp rules, sVirt generated SELinux labels, default command line options)
  3. User input (example: command line options passed to the container engine)

The sVirt labels fall into bucket #2. The container engine decides whether sVirt is on or off by default. The engine can also handle command line options to override the auto-generated label, which would fall into bucket #3. The container engine has to know the MCS label because when you stop and start a container, it has to restart it with the same label. The engine is also responsible for relabeling any bind mounted volumes, should you select that option. Coincidentally, the Docker Engine, CRI-O and Podman all share the same selinux library, which is part of the opencontainers project.
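To picture where the label ends up, here is a rough Go sketch of the SELinux-related slice of a config.json. The selinuxLabel and mountLabel field names follow the OCI runtime spec as I understand it. The struct names, the BuildSpec helper and the svirt_sandbox_file_t file type are assumptions made for illustration. A real engine fills in far more than this.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Process and Linux are a tiny, hand-written subset of the runtime config,
// not the real spec types.
type Process struct {
	Args         []string `json:"args"`
	SelinuxLabel string   `json:"selinuxLabel,omitempty"`
}

type Linux struct {
	MountLabel string `json:"mountLabel,omitempty"`
}

type Spec struct {
	Process Process `json:"process"`
	Linux   Linux   `json:"linux"`
}

// BuildSpec (hypothetical helper) attaches one MCS pair to both the process
// label and the mount label, so the container's files match its process.
func BuildSpec(args []string, mcs string) Spec {
	return Spec{
		Process: Process{
			Args:         args,
			SelinuxLabel: "system_u:system_r:svirt_lxc_net_t:" + mcs,
		},
		Linux: Linux{
			// Assumed file type; it may differ on your distribution.
			MountLabel: "system_u:object_r:svirt_sandbox_file_t:" + mcs,
		},
	}
}

func main() {
	spec := BuildSpec([]string{"cat", "/dev/zero"}, "s0:c227,c979")
	out, err := json.MarshalIndent(spec, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Because the engine stores the MCS pair and writes it into the config on every start, a restarted container comes back with the same label.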

CRI-O and Podman have similar function call stacks. They look roughly like this:

cmd/podman/create.go: label.InitLabels()
vendor/github.com/opencontainers/runc/libcontainer/label/label_selinux.go: selinux.GetLxcContexts()
vendor/github.com/opencontainers/runc/libcontainer/selinux/selinux.go: uniqMcs()
mcs = fmt.Sprintf("s0:c%d,c%d", c1, c2)

In the Docker engine, the “start” and “create” operations both go through the following call stack. It looks roughly like this:
daemon/create.go: daemon.setSecurityOptions()
daemon/start.go: daemon.setSecurityOptions()
daemon/container.go: daemon.parseSecurityOpt()
daemon_unix.go: label.InitLabels()
vendor/github.com/opencontainers/runc/libcontainer/label/label_selinux.go: selinux.GetLxcContexts()
vendor/github.com/opencontainers/runc/libcontainer/selinux/selinux.go: uniqMcs()
mcs = fmt.Sprintf("s0:c%d,c%d", c1, c2)

At the end of the day, the MCS label gets generated with a simple fmt.Sprintf() call. It’s not rocket science, but it offers a lot of extra protection to your containers.
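To show the idea end to end, here is a simplified Go sketch of generating a unique MCS pair. It is not the library’s code: formatMCS, uniqueMCS and the used map are hypothetical names, and it uses math/rand for brevity. The only piece taken from the call stacks above is the Sprintf format string.

```go
package main

import (
	"fmt"
	"math/rand"
)

// used remembers labels already handed out so no two containers share one.
var used = map[string]bool{}

// formatMCS orders the two categories and formats them with the same
// format string as the Sprintf call in the call stacks above.
func formatMCS(c1, c2 int) string {
	if c1 > c2 {
		c1, c2 = c2, c1
	}
	return fmt.Sprintf("s0:c%d,c%d", c1, c2)
}

// uniqueMCS picks two distinct categories in [0, catRange) and retries
// until the resulting label has not been used before.
// catRange must be at least 2, and the pool of pairs must not be exhausted,
// or this loop will never finish.
func uniqueMCS(catRange int) string {
	for {
		c1, c2 := rand.Intn(catRange), rand.Intn(catRange)
		if c1 == c2 {
			continue // a pair needs two different categories
		}
		mcs := formatMCS(c1, c2)
		if used[mcs] {
			continue // already assigned to another container
		}
		used[mcs] = true
		return mcs
	}
}

func main() {
	// Each call prints a different label, much like starting three containers.
	for i := 0; i < 3; i++ {
		fmt.Println(uniqueMCS(1024))
	}
}
```

With 1024 categories there are over half a million possible pairs, so collisions are rare, but tracking the labels already in use is what guarantees that two running containers never get the same one.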

Conclusion

Thanks to open source code, and de facto standards, the industry is using sVirt to secure containers no matter which container engine you are using. This offers an extra layer of defense to anyone using it. All container engines (docker, CRI-O, and Podman) in Fedora, RHEL, and CentOS have this on by default.

Now you understand how the MCS label gets created and even have a little better understanding of what a container engine does under the covers.

P.S. Remember you can call these options from higher up the stack in Kubernetes – docs here.

P.P.S. In the spirit of sharing with others so that they can get started more easily, I wanted to dig into this code and show how it works. It was a chance to share my own journey and maybe help some others find their legs. I am still very new to Golang, but have been hacking on code for 20+ years, so I figured I could wing it, but I actually failed the first few times I tried this. This little side project gave me the opportunity to get more comfortable digging into a large Golang code base. My hacker brain is satisfied for a while. Thanks to Dan Walsh for pointing me to that magic Sprintf() call. Without that pointer, I wouldn’t have been able to trace this all down.
