Canonical Voices

Christian Brauner

Runtimes And the Curse of the Privileged Container

Introduction (CVE-2019-5736)

Today, Monday, 2019-02-11, 14:00:00 CET CVE-2019-5736 was released:

The vulnerability allows a malicious container to (with minimal user interaction) overwrite the host runc binary and thus gain root-level code execution on the host. The level of user interaction is being able to run any command (it doesn’t matter if the command is not attacker-controlled) as root within a container in either of these contexts:

  • Creating a new container using an attacker-controlled image.
  • Attaching (docker exec) into an existing container which the attacker had previous write access to.

I’ve been working on a fix for this issue over the last couple of weeks together with Aleksa a friend of mine and maintainer of runC. When he notified me about the issue in runC we tried to come up with an exploit for LXC as well and though harder it is doable. I was interested in the issue for technical reasons and figuring out how to reliably fix it was quite fun (with a proper dose of pure hatred). It also caused me to finally write down some personal thoughts I had for a long time about how we are running containers.

What are Privileged Containers?

At a first glance this is a question that is probably trivial to anyone who has a decent low-level understanding of containers. Maybe even most users by now will know what a privileged container is. A first pass at defining it would be to say that a privileged container is a container that is owned by root. Looking closer this seems an insufficient definition. What about containers using user namespaces that are started as root? It seems we need to distinguish between what ids a container is running with. So we could say a privileged container is a container that is running as root. However, this is still wrong. Because “running as root” can either be seen as meaning “running as root as seen from the outside” or “running as root from the inside” where “outside” means “as seen from a task outside the container” and “inside” means “as seen from a task inside the container”.

What we really mean by a privileged container is a container where the semantics for id 0 are the same inside and outside of the container ceteris paribus. I say “ceteris paribus” because using LSMs, seccomp or any other security mechanism will not cause a change in the meaning of id 0 inside and outside the container. For example, a breakout caused by a bug in the runtime implementation will give you root access on the host.

An unprivileged container then simply is any container in which the semantics for id 0 inside the container are different from id 0 outside the container. For example, a breakout caused by a bug in the runtime implementation will not give you root access on the host by default. This should only be possible if the kernel’s user namespace implementation has a bug.

The reason why I like to define privileged containers this way is that it also lets us handle edge cases. Specifically, the case where a container is using a user namespace but a hole is punched into the idmapping at id 0 aka where id 0 is mapped through. Consider a container that uses the following idmappings:

id: 0 100000 100000

This instructs the kernel to setup the following mapping:

id: container_id(0) -> host_id(100000)
id: container_id(1) -> host_id(100001)
id: container_id(2) -> host_id(100002)
.
.
.

container_id(100000) -> host_id(200000)

With this mapping it’s evident that container_id(0) != host_id(0). But now consider the following mapping:

id: 0 0 1
id: 1 100001 99999

This instructs the kernel to setup the following mapping:

id: container_id(0) -> host_id(0)
id: container_id(1) -> host_id(100001)
id: container_id(2) -> host_id(100002)
.
.
.

container_id(99999) -> host_id(199999)

In contrast to the first example this has the consequence that container_id(0) == host_id(0). I would argue that any container that at least punches a hole for id 0 into its idmapping up to specifying an identity mapping is to be considered a privileged container.

As a sidenote, Docker containers run as privileged containers by default. There is usually some confusion where people think because they do not use the --privileged flag that Docker containers run unprivileged. This is wrong. What the --privileged flag does is to give you even more permissions by e.g. not dropping (specific or even any) capabilities. One could say that such containers are almost “super-privileged”.

The Trouble with Privileged Containers

The problem I see with privileged containers is essentially captured by LXC’s and LXD’s upstream security position which we have held since at least 2015 but probably even earlier. I’m quoting from our notes about privileged containers:

Privileged containers are defined as any container where the container uid 0 is mapped to the host’s uid 0. In such containers, protection of the host and prevention of escape is entirely done through Mandatory Access Control (apparmor, selinux), seccomp filters, dropping of capabilities and namespaces.

Those technologies combined will typically prevent any accidental damage of the host, where damage is defined as things like reconfiguring host hardware, reconfiguring the host kernel or accessing the host filesystem.

LXC upstream’s position is that those containers aren’t and cannot be root-safe.

They are still valuable in an environment where you are running trusted workloads or where no untrusted task is running as root in the container.

We are aware of a number of exploits which will let you escape such containers and get full root privileges on the host. Some of those exploits can be trivially blocked and so we do update our different policies once made aware of them. Some others aren’t blockable as they would require blocking so many core features that the average container would become completely unusable.

[…]

As privileged containers are considered unsafe, we typically will not consider new container escape exploits to be security issues worthy of a CVE and quick fix. We will however try to mitigate those issues so that accidental damage to the host is prevented.

LXC’s upstream position for a long time has been that privileged containers are not and cannot be root safe. For something to be considered root safe it should be safe to hand root access to third parties or tasks.

Running Untrusted Workloads in Privileged Containers

is insane. That’s about everything that this paragraph should contain. The fact that the semantics for id 0 inside and outside the container are identical entails that any meaningful container escape will have the attacker gain root on the host.

CVE-2019-5736 Is a Very Very Very Bad Privilege Escalation to Host Root

CVE-2019-5736 is an excellent illustration of such an attack. Think about it: a process running inside a privileged container can rather trivially corrupt the binary that is used to attach to the container. This allows an attacker to create a custom ELF binary on the host. That binary could do anything it wants:

  • could just be a binary that calls poweroff
  • could be a binary that spawns a root shell
  • could be a binary that kills other containers when called again to attach
  • could be suid cat
  • .
  • .
  • .

The attack vector is actually slightly worse for runC due to its architecture. Since runC exits after spawning the container it can also be attacked through a malicious container image. Which is super bad given that a lot of container workload workflows rely on downloading images from the web.

LXC cannot be attacked through a malicious image since the monitor process (a singleton per-container) never exits during the containers life cycle. Since the kernel does not allow modifications to running binaries it is not possible for the attacker to corrupt it. When the container is shutdown or killed the attacking task will be killed before it can do any harm. Only when the last process running inside the container has exited will the monitor itself exit. This has the consequence, that if you run privileged OCI containers via our oci template with LXC your are not vulnerable to malicious images. Only the vector through the attaching binary still applies.

The Lie that Privileged Containers can be safe

Aside from mostly working on the Kernel I’m also a maintainer of LXC and LXD alongside Stéphane Graber. We are responsible for LXC - the low-level container runtime - and LXD - the container management daemon using LXC. We have made a very conscious decision to consider privileged containers not root safe. Two main corollaries follow from this:

  1. Privileged containers should never be used to run untrusted workloads.
  2. Breakouts from privileged containers are not considered CVEs by our security policy. It still seems a common belief that if we all just try hard enough using privileged containers for untrusted workloads is safe. This is not a promise that can be made good upon. A privileged container is not a security boundary. The reason for this is simply what we looked at above: container_id(0) == host_id(0). It is therefore deeply troubling that this industry is happy to let users believe that they are safe and secure using privileged containers.

Unprivileged Containers as Default

As upstream for LXC and LXD we have been advocating the use of unprivileged containers by default for years. Way ahead before anyone else did. Our low-level library LXC has supported unprivileged containers since 2013 when user namespaces were merged into the kernel. With LXD we have taken it one step further and made unprivileged containers the default and privileged containers opt-in for that very matter: privileged containers aren’t safe. We even allow you to have per-container idmappings to make sure that not just each container is isolated from the host but also all containers from each other.

For years we have been advocating for unprivileged containers on conferences, in blogposts, and whenever we have spoken to people but somehow this whole industry has chosen to rely on privileged containers.

The good news is that we are seeing changes as people become more familiar with the perils of privileged containers. Let this recent CVE be another reminder that unprivileged containers need to be the default.

Are LXC and LXD affected?

I have seen this question asked all over the place so I guess I should add a section about this too:

  • Unprivileged LXC and LXD containers are not affected.

  • Any privileged LXC and LXD container running on a read-only rootfs is not affected.

  • Privileged LXC containers in the definition provided above are affected. Though the attack is more difficult than for runC. The reason for this is that the lxc-attach binary does not exit before the program in the container has finished executing. This means an attacker would need to open an O_PATH file descriptor to /proc/self/exe, fork() itself into the background and re-open the O_PATH file descriptor through /proc/self/fd/<O_PATH-nr> in a loop as O_WRONLY and keep trying to write to the binary until such time as lxc-attach exits. Before that it will not succeed since the kernel will not allow modification of a running binary.

  • Privileged LXD containers are only affected if the daemon is restarted other than for upgrade reasons. This should basically never happen. The LXD daemon never exits so any write will fail because the kernel does not allow modification of a running binary. If the LXD daemon is restarted because of an upgrade the binary will be swapped out and the file descriptor used for the attack will write to the old in-memory binary and not to the new binary.

Chromebooks with Crostini using LXD are not affected

Chromebooks use LXD as their default container runtime are not affected. First of all, all binaries reside on a read-only filesystem and second, LXD does not allow running privileged containers on Chromebooks through the LXD_UNPRIVILEGED_ONLY flag. For more details see this link.

Fixing CVE-2019-5736

To prevent this attack, LXC has been patched to create a temporary copy of the calling binary itself when it attaches to containers (cf. 6400238d08cdf1ca20d49bafb85f4e224348bf9d). To do this LXC can be instructed to create an anonymous, in-memory file using the memfd_create() system call and to copy itself into the temporary in-memory file, which is then sealed to prevent further modifications. LXC then executes this sealed, in-memory file instead of the original on-disk binary. Any compromising write operations from a privileged container to the host LXC binary will then write to the temporary in-memory binary and not to the host binary on-disk, preserving the integrity of the host LXC binary. Also as the temporary, in-memory LXC binary is sealed, writes to this will also fail. To not break downstream users of the shared library this is opt-in by setting LXC_MEMFD_REXEC in the environment. For our lxc-attach binary which is the only attack vector this is now done by default.

Workloads that place the LXC binaries on a read-only filesystem or prevent running privileged containers can disable this feature by passing --disable-memfd-rexec during the configure stage when compiling LXC.

Read more

Snapcraft 3.1

snapcraft 3.1 is now available on the stable channel of the Snap Store. This is a new minor release building on top of the foundations laid out from the snapcraft 3.1 release. If you are already on the stable channel for snapcraft then all you need to do is wait for the snap to be refreshed. The full release notes are replicated here below Build Environments It is now possible, when using the base keyword, to once again clean parts.

Read more
jdstrand

Some time ago we started alerting publishers when their stage-packages received a security update since the last time they built a snap. We wanted to create the right balance for the alerts and so the service currently will only alert you when there are new security updates against your stage-packages. In this manner, you can choose not to rebuild your snap (eg, since it doesn’t use the affected functionality of the vulnerable package) and not be nagged every day that you are out of date.

As nice as that is, sometimes you want to check these things yourself or perhaps hook the alerts into some form of automation or tool. While the review-tools had all of the pieces so you could do this, it wasn’t as straightforward as it could be. Now with the latest stable revision of the review-tools, this is easy:

$ sudo snap install review-tools
$ review-tools.check-notices \
  ~/snap/review-tools/common/review-tools_656.snap
{'review-tools': {'656': {'libapt-inst2.0': ['3863-1'],
                          'libapt-pkg5.0': ['3863-1'],
                          'libssl1.0.0': ['3840-1'],
                          'openssl': ['3840-1'],
                          'python3-lxml': ['3841-1']}}}

The review-tools are a strict mode snap and while it plugs the home interface, that is only for convenience, so I typically disconnect the interface and put things in its SNAP_USER_COMMON directory, like I did above.

Since now it is super easy to check a snap on disk, with a little scripting and a cron job, you can generate a machine readable report whenever you want. Eg, can do something like the following:

$ cat ~/bin/check-snaps
#!/bin/sh
set -e

snaps="review-tools/stable rsync-jdstrand/edge"

tmpdir=$(mktemp -d -p "$HOME/snap/review-tools/common")
cleanup() {
    rm -fr "$tmpdir"
}
trap cleanup EXIT HUP INT QUIT TERM

cd "$tmpdir" || exit 1
for i in $snaps ; do
    snap=$(echo "$i" | cut -d '/' -f 1)
    channel=$(echo "$i" | cut -d '/' -f 2)
    snap download "$snap" "--$channel" >/dev/null
done
cd - >/dev/null || exit 1

/snap/bin/review-tools.check-notices "$tmpdir"/*.snap

or if  you already have the snaps on disk somewhere, just do:

$ /snap/bin/review-tools.check-notices /path/to/snaps/*.snap

Now can add the above to cron or some automation tool as a reminder of what needs updates. Enjoy!

Read more
K. Tsakalozos

No.

No.

Read more
Colin Watson

Git per-branch permissions

We’ve had Git hosting support in Launchpad for a few years now. One thing that some users asked for, particularly larger users such as the Ubuntu kernel team, was the ability to set up per-branch push permissions for their repositories. Today we rolled out the last piece of this work.

Launchpad’s default behaviour is that repository owners may push anything to their own repositories, including creating new branches, force-pushing (rewriting history), and deleting branches, while nobody else may push anything. Repository owners can now also choose to protect branches or tags, either individually or using wildcard rules. If a branch is protected, then by default repository owners can only create or push it but cannot force-push or delete; if a tag is protected, then by default repository owners can create it but cannot move or delete it.

You can also allow selected contributors to push to protected branches or tags, so if you’re collaborating with somebody on a branch and just want to be able to quickly pair-program via git push, or you want a merge robot to be able to land merge proposals in your repository without having to add it to the team that owns the repository and thus give it privileges it doesn’t need, then this feature may be for you.

There’s some initial documentation on our help site, and here’s a screenshot of a repository that’s been set up to give a contributor push access to a single branch:

Read more
Christian Brauner

Android Binderfs

asciicast

Introduction

Android Binder is an inter-process communication (IPC) mechanism. It is heavily used in all Android devices. The binder kernel driver has been present in the upstream Linux kernel for quite a while now.

Binder has been a controversial patchset (see this lwn article as an example). Its design was considered wrong and to violate certain core kernel design principles (e.g. a task should never touch another tasks file descriptor table). Most kernel developers were not a fan of binder.

Recently, the upstream binder code has fortunately been reworked significantly (e.g. it does not touch another tasks file descriptor table anymore, the locking is very fine-grained now, etc.).

With Android being one of the major operating systems (OS) for a vast number of devices there is simply no way around binder.

The Android Service Manager

The binder IPC mechanism is accessible from userspace through device nodes located at /dev. A modern Android system will allocate three device nodes:

  • /dev/binder
  • /dev/hwbinder
  • /dev/vndbinder

serving different purposes. However, the logic is the same for all three of them. A process can call open(2) on those device nodes to receive an fd which it can then use to issue requests via ioctl(2)s. Android has a service manager which is used to translate addresses to bus names and only the address of the service manager itself is well-known. The service manager is registered through an ioctl(2) and there can only be a single service manager. This means once a service manager has grabbed hold of binder devices they cannot be (easily) reused by a second service manager.

Running Android in Containers

This matters as soon as multiple instances of Android are supposed to be run. Since they will all need their own private binder devices. This is a use-case that arises pretty naturally when running Android in system containers. People have been doing this for a long time with LXC. A project that has set out to make running Android in LXC containers very easy is Anbox. Anbox makes it possible to run hundreds of Android containers.

To properly run Android in a container it is necessary that each container has a set of private binder devices.

Statically Allocating binder Devices

Binder devices are currently statically allocated at compile time. Before compiling a kernel the CONFIG_ANDROID_BINDER_DEVICES option needs to bet set in the kernel config (Kconfig) containing the names of the binder devices to allocate at boot. By default it is set as:

CONFIG_ANDROID_BINDER_DEVICES="binder,hwbinder,vndbinder"

To allocate additional binder devices the user needs to specify them with this Kconfig option. This is problematic since users need to know how many containers they will run at maximum and then to calculate the number of devices they need so they can specify them in the Kconfig. When the maximum number of needed binder devices changes after kernel compilation the only way to get additional devices is to recompile the kernel.

Problem 1: Using the misc major Device Number

This situation is aggravated by the fact that binder devices use the misc major number in the kernel. Each device node in the Linux kernel is identified by a major and minor number. A device can request its own major number. If it does it will have an exclusive range of minor numbers it doesn’t share with anything else and is free to hand out. Or it can use the misc major number. The misc major number is shared amongst different devices. However, that also means the number of minor devices that can be handed out is limited by all users of misc major. So if a user requests a very large number of binder devices in their Kconfig they might make it impossible for anyone else to allocate minor numbers. Or there simply might not be enough to allocate for itself.

Problem 2: Containers and IPC namespaces

All of those binder devices requested in the Kconfig via CONFIG_ANDROID_BINDER_DEVICES will be allocated at boot and be placed in the hosts devtmpfs mount usually located at /dev or - depending on the udev(7) implementation - will be created via mknod(2) - by udev(7) at boot. That means all of those devices initially belong to the host IPC namespace. However, containers usually run in their own IPC namespace separate from the host’s. But when binder devices located in /dev are handed to containers (e.g. with a bind-mount) the kernel driver will not know that these devices are now used in a different IPC namespace since the driver is not IPC namespace aware. This is not a serious technical issue but a serious conceptual one. There should be a way to have per-IPC namespace binder devices.

Enter binderfs

To solve both problems we came up with a solution that I presented at the Linux Plumbers Conference in Vancouver this year. There’s a video of that presentation available on Youtube:

Android binderfs is a tiny filesystem that allows users to dynamically allocate binder devices, i.e. it allows to add and remove binder devices at runtime. Which means it solves problem 1. Additionally, binder devices located in a new binderfs instance are independent of binder devices located in another binderfs instance. All binder devices in binderfs instances are also independent of the binder devices allocated during boot specified in CONFIG_ANDROID_BINDER_DEVICES. This means, binderfs solves problem 2.

Android binderfs can be mounted via:

mount -t binder binder /dev/binderfs

at which point a new instance of binderfs will show up at /dev/binderfs. In a fresh instance of binderfs no binder devices will be present. There will only be a binder-control device which serves as the request handler for binderfs:

root@edfu:~# ls -al /dev/binderfs/
total 0
drwxr-xr-x  2 root root      0 Jan 10 15:07 .
drwxr-xr-x 20 root root   4260 Jan 10 15:07 ..
crw-------  1 root root 242, 6 Jan 10 15:07 binder-control

binderfs: Dynamically Allocating a New binder Device

To allocate a new binder device in a binderfs instance a request needs to be sent through the binder-control device node. A request is sent in the form of an ioctl(2). Here’s an example program:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/android/binder.h>
#include <linux/android/binderfs.h>

int main(int argc, char *argv[])
{
        int fd, ret, saved_errno;
        size_t len;
        struct binderfs_device device = { 0 };

        if (argc != 3)
                exit(EXIT_FAILURE);

        len = strlen(argv[2]);
        if (len > BINDERFS_MAX_NAME)
                exit(EXIT_FAILURE);

        memcpy(device.name, argv[2], len);

        fd = open(argv[1], O_RDONLY | O_CLOEXEC);
        if (fd < 0) {
                printf("%s - Failed to open binder-control device\n",
                       strerror(errno));
                exit(EXIT_FAILURE);
        }

        ret = ioctl(fd, BINDER_CTL_ADD, &device);
        saved_errno = errno;
        close(fd);
        errno = saved_errno;
        if (ret < 0) {
                printf("%s - Failed to allocate new binder device\n",
                       strerror(errno));
                exit(EXIT_FAILURE);
        }

        printf("Allocated new binder device with major %d, minor %d, and "
               "name %s\n", device.major, device.minor,
               device.name);

        exit(EXIT_SUCCESS);
}

What this program simply does is to open the binder-control device node and sending a BINDER_CTL_ADD request to the kernel. Users of binderfs need to tell the kernel which name the new binder device should get. By default a name can only contain up to 256 chars including the terminating zero byte. The struct which is used is:

/**
 * struct binderfs_device - retrieve information about a new binder device
 * @name:   the name to use for the new binderfs binder device
 * @major:  major number allocated for binderfs binder devices
 * @minor:  minor number allocated for the new binderfs binder device
 *
 */
struct binderfs_device {
       char name[BINDERFS_MAX_NAME + 1];
       __u32 major;
       __u32 minor;
};

and is defined in linux/android/binderfs.h. Once the request is made via an ioctl(2) passing a struct binder_device with the name to the kernel it will allocate a new binder device and return the major and minor number of the new device in the struct (This is necessary because binderfs allocated a major device number dynamically at boot.). After the ioctl(2) returns there will be a new binder device located under /dev/binderfs with the chosen name:

root@edfu:~# ls -al /dev/binderfs/
total 0
drwxr-xr-x  2 root root      0 Jan 10 15:19 .
drwxr-xr-x 20 root root   4260 Jan 10 15:07 ..
crw-------  1 root root 242, 0 Jan 10 15:19 binder-control
crw-------  1 root root 242, 1 Jan 10 15:19 my-binder
crw-------  1 root root 242, 2 Jan 10 15:19 my-binder1

binderfs: Deleting a binder Device

Deleting binder devices does not involve issuing another ioctl(2) request through binder-control. They can be deleted via unlink(2). This means that the rm(1) tool can be used to delete them:

root@edfu:~# rm /dev/binderfs/my-binder1
root@edfu:~# ls -al /dev/binderfs/
total 0
drwxr-xr-x  2 root root      0 Jan 10 15:19 .
drwxr-xr-x 20 root root   4260 Jan 10 15:07 ..
crw-------  1 root root 242, 0 Jan 10 15:19 binder-control
crw-------  1 root root 242, 1 Jan 10 15:19 my-binder

Note that the binder-control device cannot be deleted since this would make the binderfs instance unuseable. The binder-control device will be deleted when the binderfs instance is unmounted and all references to it have been dropped.

binderfs: Mounting Multiple Instances

Mounting another binderfs instance at a different location will create a new and separate instance from all other binderfs mounts. This is identical to the behavior of devpts, tmpfs, and also - even though never merged in the kernel - kdbusfs:

root@edfu:~# mkdir binderfs1
root@edfu:~# mount -t binder binder binderfs1
root@edfu:~# ls -al binderfs1/
total 4
drwxr-xr-x  2 root   root        0 Jan 10 15:23 .
drwxr-xr-x 72 ubuntu ubuntu   4096 Jan 10 15:23 ..
crw-------  1 root   root   242, 2 Jan 10 15:23 binder-control

There is no my-binder device in this new binderfs instance since its devices are not related to those in the binderfs instance at /dev/binderfs. This means users can easily get their private set of binder devices.

binderfs: Mounting binderfs in User Namespaces

The Android binderfs filesystem can be mounted and used to allocate new binder devices in user namespaces. This has the advantage that binderfs can be used in unprivileged containers or any user-namespace-based sandboxing solution:

ubuntu@edfu:~$ unshare --user --map-root --mount
root@edfu:~# mkdir binderfs-userns
root@edfu:~# mount -t binder binder binderfs-userns/
root@edfu:~# The "bfs" binary used here is the compiled program from above
root@edfu:~# ./bfs binderfs-userns/binder-control my-user-binder
Allocated new binder device with major 242, minor 4, and name my-user-binder
root@edfu:~# ls -al binderfs-userns/
total 4
drwxr-xr-x  2 root root      0 Jan 10 15:34 .
drwxr-xr-x 73 root root   4096 Jan 10 15:32 ..
crw-------  1 root root 242, 3 Jan 10 15:34 binder-control
crw-------  1 root root 242, 4 Jan 10 15:36 my-user-binder

Kernel Patchsets

The binderfs patchset is merged upstream and will be available when Linux 5.0 gets released. There are a few outstanding patches that are currently waiting in Greg’s tree (cf. binderfs: remove wrong kern_mount() call and binderfs: make each binderfs mount a new instancechar-misc-linus) and some others are queued for the 5.1 merge window. But overall it seems to be in decent shape.

Read more
Christian Brauner

Android Binderfs

asciicast

Introduction

Android Binder is an inter-process communication (IPC) mechanism. It is heavily used in all Android devices. The binder kernel driver has been present in the upstream Linux kernel for quite a while now.

Binder has been a controversial patchset (see this lwn article as an example). Its design was considered wrong and to violate certain core kernel design principles (e.g. a task should never touch another tasks file descriptor table). Most kernel developers were not a fan of binder.

Recently, the upstream binder code has fortunately been reworked significantly (e.g. it does not touch another tasks file descriptor table anymore, the locking is very fine-grained now, etc.).

With Android being one of the major operating systems (OS) for a vast number of devices there is simply no way around binder.

The Android Service Manager

The binder IPC mechanism is accessible from userspace through device nodes located at /dev. A modern Android system will allocate three device nodes:

  • /dev/binder
  • /dev/hwbinder
  • /dev/vndbinder

serving different purposes. However, the logic is the same for all three of them. A process can call open(2) on those device nodes to receive an fd which it can then use to issue requests via ioctl(2)s. Android has a service manager which is used to translate addresses to bus names and only the address of the service manager itself is well-known. The service manager is registered through an ioctl(2) and there can only be a single service manager. This means once a service manager has grabbed hold of binder devices they cannot be (easily) reused by a second service manager.

Running Android in Containers

This matters as soon as multiple instances of Android are supposed to be run. Since they will all need their own private binder devices. This is a use-case that arises pretty naturally when running Android in system containers. People have been doing this for a long time with LXC. A project that has set out to make running Android in LXC containers very easy is Anbox. Anbox makes it possible to run hundreds of Android containers.

To properly run Android in a container it is necessary that each container has a set of private binder devices.

Statically Allocating binder Devices

Binder devices are currently statically allocated at compile time. Before compiling a kernel the CONFIG_ANDROID_BINDER_DEVICES option needs to bet set in the kernel config (Kconfig) containing the names of the binder devices to allocate at boot. By default it is set as:

CONFIG_ANDROID_BINDER_DEVICES="binder,hwbinder,vndbinder"

To allocate additional binder devices the user needs to specify them with this Kconfig option. This is problematic since users need to know how many containers they will run at maximum and then to calculate the number of devices they need so they can specify them in the Kconfig. When the maximum number of needed binder devices changes after kernel compilation the only way to get additional devices is to recompile the kernel.

Problem 1: Using the misc major Device Number

This situation is aggravated by the fact that binder devices use the misc major number in the kernel. Each device node in the Linux kernel is identified by a major and minor number. A device can request its own major number. If it does it will have an exclusive range of minor numbers it doesn’t share with anything else and is free to hand out. Or it can use the misc major number. The misc major number is shared amongst different devices. However, that also means the number of minor devices that can be handed out is limited by all users of misc major. So if a user requests a very large number of binder devices in their Kconfig they might make it impossible for anyone else to allocate minor numbers. Or there simply might not be enough to allocate for itself.

Problem 2: Containers and IPC namespaces

All of those binder devices requested in the Kconfig via CONFIG_ANDROID_BINDER_DEVICES will be allocated at boot and be placed in the hosts devtmpfs mount usually located at /dev or - depending on the udev(7) implementation - will be created via mknod(2) - by udev(7) at boot. That means all of those devices initially belong to the host IPC namespace. However, containers usually run in their own IPC namespace separate from the host’s. But when binder devices located in /dev are handed to containers (e.g. with a bind-mount) the kernel driver will not know that these devices are now used in a different IPC namespace since the driver is not IPC namespace aware. This is not a serious technical issue but a serious conceptual one. There should be a way to have per-IPC namespace binder devices.

Enter binderfs

To solve both problems we came up with a solution that I presented at the Linux Plumbers Conference in Vancouver this year. There’s a video of that presentation available on Youtube:

Android binderfs is a tiny filesystem that allows users to dynamically allocate binder devices, i.e. it allows to add and remove binder devices at runtime. Which means it solves problem 1. Additionally, binder devices located in a new binderfs instance are independent of binder devices located in another binderfs instance. All binder devices in binderfs instances are also independent of the binder devices allocated during boot specified in CONFIG_ANDROID_BINDER_DEVICES. This means, binderfs solves problem 2.

Android binderfs can be mounted via:

mount -t binder binder /dev/binderfs

at which point a new instance of binderfs will show up at /dev/binderfs. In a fresh instance of binderfs no binder devices will be present. There will only be a binder-control device which serves as the request handler for binderfs:

root@edfu:~# ls -al /dev/binderfs/
total 0
drwxr-xr-x  2 root root      0 Jan 10 15:07 .
drwxr-xr-x 20 root root   4260 Jan 10 15:07 ..
crw-------  1 root root 242, 6 Jan 10 15:07 binder-control

binderfs: Dynamically Allocating a New binder Device

To allocate a new binder device in a binderfs instance a request needs to be sent through the binder-control device node. A request is sent in the form of an ioctl(2). Here’s an example program:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/android/binder.h>
#include <linux/android/binderfs.h>

int main(int argc, char *argv[])
{
        int fd, ret, saved_errno;
        size_t len;
        struct binderfs_device device = { 0 };

        if (argc != 3)
                exit(EXIT_FAILURE);

        len = strlen(argv[2]);
        if (len > BINDERFS_MAX_NAME)
                exit(EXIT_FAILURE);

        memcpy(device.name, argv[2], len);

        fd = open(argv[1], O_RDONLY | O_CLOEXEC);
        if (fd < 0) {
                printf("%s - Failed to open binder-control device\n",
                       strerror(errno));
                exit(EXIT_FAILURE);
        }

        ret = ioctl(fd, BINDER_CTL_ADD, &device);
        saved_errno = errno;
        close(fd);
        errno = saved_errno;
        if (ret < 0) {
                printf("%s - Failed to allocate new binder device\n",
                       strerror(errno));
                exit(EXIT_FAILURE);
        }

        printf("Allocated new binder device with major %d, minor %d, and "
               "name %s\n", device.major, device.minor,
               device.name);

        exit(EXIT_SUCCESS);
}

What this program simply does is to open the binder-control device node and sending a BINDER_CTL_ADD request to the kernel. Users of binderfs need to tell the kernel which name the new binder device should get. By default a name can only contain up to 256 chars including the terminating zero byte. The struct which is used is:

/**
 * struct binderfs_device - retrieve information about a new binder device
 * @name:   the name to use for the new binderfs binder device
 * @major:  major number allocated for binderfs binder devices
 * @minor:  minor number allocated for the new binderfs binder device
 *
 */
struct binderfs_device {
       char name[BINDERFS_MAX_NAME + 1];
       __u32 major;
       __u32 minor;
};

and is defined in linux/android/binderfs.h. Once the request is made via an ioctl(2) passing a struct binder_device with the name to the kernel it will allocate a new binder device and return the major and minor number of the new device in the struct (This is necessary because binderfs allocated a major device number dynamically at boot.). After the ioctl(2) returns there will be a new binder device located under /dev/binderfs with the chosen name:

root@edfu:~# ls -al /dev/binderfs/
total 0
drwxr-xr-x  2 root root      0 Jan 10 15:19 .
drwxr-xr-x 20 root root   4260 Jan 10 15:07 ..
crw-------  1 root root 242, 0 Jan 10 15:19 binder-control
crw-------  1 root root 242, 1 Jan 10 15:19 my-binder
crw-------  1 root root 242, 2 Jan 10 15:19 my-binder1

binderfs: Deleting a binder Device

Deleting binder devices does not involve issuing another ioctl(2) request through binder-control. They can be deleted via unlink(2). This means that the rm(1) tool can be used to delete them:

root@edfu:~# rm /dev/binderfs/my-binder1
root@edfu:~# ls -al /dev/binderfs/
total 0
drwxr-xr-x  2 root root      0 Jan 10 15:19 .
drwxr-xr-x 20 root root   4260 Jan 10 15:07 ..
crw-------  1 root root 242, 0 Jan 10 15:19 binder-control
crw-------  1 root root 242, 1 Jan 10 15:19 my-binder

Note that the binder-control device cannot be deleted since this would make the binderfs instance unuseable. The binder-control device will be deleted when the binderfs instance is unmounted and all references to it have been dropped.

binderfs: Mounting Multiple Instances

Mounting another binderfs instance at a different location will create a new and separate instance from all other binderfs mounts. This is identical to the behavior of devpts, tmpfs, and also - even though never merged in the kernel - kdbusfs:

root@edfu:~# mkdir binderfs1
root@edfu:~# mount -t binder binder binderfs1
root@edfu:~# ls -al binderfs1/
total 4
drwxr-xr-x  2 root   root        0 Jan 10 15:23 .
drwxr-xr-x 72 ubuntu ubuntu   4096 Jan 10 15:23 ..
crw-------  1 root   root   242, 2 Jan 10 15:23 binder-control

There is no my-binder device in this new binderfs instance since its devices are not related to those in the binderfs instance at /dev/binderfs. This means users can easily get their private set of binder devices.

binderfs: Mounting binderfs in User Namespaces

The Android binderfs filesystem can be mounted and used to allocate new binder devices in user namespaces. This has the advantage that binderfs can be used in unprivileged containers or any user-namespace-based sandboxing solution:

ubuntu@edfu:~$ unshare --user --map-root --mount
root@edfu:~# mkdir binderfs-userns
root@edfu:~# mount -t binder binder binderfs-userns/
root@edfu:~# The "bfs" binary used here is the compiled program from above
root@edfu:~# ./bfs binderfs-userns/binder-control my-user-binder
Allocated new binder device with major 242, minor 4, and name my-user-binder
root@edfu:~# ls -al binderfs-userns/
total 4
drwxr-xr-x  2 root root      0 Jan 10 15:34 .
drwxr-xr-x 73 root root   4096 Jan 10 15:32 ..
crw-------  1 root root 242, 3 Jan 10 15:34 binder-control
crw-------  1 root root 242, 4 Jan 10 15:36 my-user-binder

Kernel Patchsets

The binderfs patchset is merged upstream and will be available when Linux 5.0 gets released. There are a few outstanding patches that are currently waiting in Greg’s tree (cf. binderfs: remove wrong kern_mount() call and binderfs: make each binderfs mount a new instancechar-misc-linus) and some others are queued for the 5.1 merge window. But overall it seems to be in decent shape.

Read more
Colin Ian King

Last year I wrote about kernel commits that are tagged with the "Fixes" tag. Kernel developers use the "Fixes" tag on a bug fix commit to reference an older commit that originally introduced the bug.   The adoption of the tag has been steadily increasing since v3.12 of the kernel:

The red line shows the number of commits per release of the kernel, and the blue line shows the number of commits that contain a "Fixes" tag.

In terms of % of commits that contain the "Fixes" tag, one can see it has been steadily increasing since v3.12 and almost 12.5% of kernel commits in v4.20 are tagged this way.

The fixes tag contains the commit SHA of the commit that was fixed, so one can look up the date of the fix and of the commit being fixed and determine the time taken to fix a bug.

As one can see, a lot of issues get fixed on the first few hundred days, and some bugs take years to get fixed.  Zooming into the first hundred days of fixes the distribution looks like:


..the modal point is at day 4, I suspect these are issues that get found quickly when commits land in linux-next and are found in early testing, integration builds and static analysis.

Out of the thousands of "Fixes" tagged commits and the time to fix an issue one can determine how long it takes to fix a specific percentage of the bugs:


In the graph above, 50% of fixes are made within 151 days of the original commit, ~69% of fixes are made within a year of the original commit and ~82% of fixes are made within 2 years.  The long tail indicates that there are some bugs that take a while to be found and fixed,  the final 10% of bugs take more than 3.5 years to be found and fixed.

Comparing the time to fix issues for kernel versions v4.0, v4.10 and v4.20 for bugs that are fixed in less than 50 days we have:


... the trends are similar, however it is worth noting that more bugs are getting found and fixed a little faster in v4.10 and v4.20 than v4.0.  It will be interesting to see how these trends develop over the next few years.

Read more
Colin Ian King

Analysis of Phoronix Test Suite Benchmarks

I've been recently investigating a wide range of bench marking tests to find suitable candidates to track down performance regressions in Ubuntu.  Over the past 3 weeks I have attempted to run the entire set of tests in the Phoronix Test Suite on a low-end Xeon server to determine a subset of reliable tests.

For benchmarks to be reliable they must show little variation in results when running the tests multiple times.  The tests also need to run in a timely manner too; waiting several days for a set of results is not very timely.

Linked here is a PDF of my set of results.  Some of the Phoronix tests are not listed as they either took way too long to complete or just didn't run successfully on the server.  Tests that have low variability in the results (that is, the standard deviation of the test runs is low compared to the average) are marked in green, high variability are marked in red.

My testing shows that a large proportion of tests have quite large variability > 5% (% standard deviation), so probably are not that trustworthy when comparing results of machines that have benchmark results that show little difference.   For regression testing, I'm going to only consider tests that have a variability of less than 2.5% as this seems like a good way to filter out the more jittery test results.

The bottom line is that some tests are just too variable to be deemed a solid benchmark.  It's always worth sanity checking tests before using them as a gold standard.

Read more
Colin Ian King

Linux I/O Schedulers

The Linux kernel I/O schedulers attempt to balance the need to get the best possible I/O performance while also trying to ensure the I/O requests are "fairly" shared among the I/O consumers.  There are several I/O schedulers in Linux, each try to solve the I/O scheduling issues using different mechanisms/heuristics and each has their own set of strengths and weaknesses.

For traditional spinning media it makes sense to try and order I/O operations so that they are close together to reduce read/write head movement and hence decrease latency.  However, this reordering means that some I/O requests may get delayed, and the usual solution is to schedule these delayed requests after a specific time.   Faster non-volatile memory devices can generally handle random I/O requests very easily and hence do not require reordering.

Balancing the fairness is also an interesting issue.  A greedy I/O consumer should not block other I/O consumers and there are various heuristics used to determine the fair sharing of I/O.  Generally, the more complex and "fairer" the solution the more compute is required, so selecting a very fair I/O scheduler with a fast I/O device and a slow CPU may not necessarily perform as well as a simpler I/O scheduler.

Finally, the types of I/O patterns on the I/O devices influence the I/O scheduler choice, for example, mixed random read/writes vs mainly sequential reads and occasional random writes.

Because of the mix of requirements, there is no such thing as a perfect all round I/O scheduler.  The defaults being used are chosen to be a good best choice for the general user, however, this may not match everyone's needs.   To clarify the choices, the Ubuntu Kernel Team has provided a Wiki page describing the choices and how to select and tune the various I/O schedulers.  Caveat emptor applies, these are just guidelines and should be used as a starting point to finding the best I/O scheduler for your particular need.

Read more
K. Tsakalozos

MicroK8s is a local deployment of Kubernetes. Let’s skip all the technical details and just accept that Kubernetes does not run natively on MacOS or Windows. You may be thinking “I have seen Kubernetes running on a MacOS laptop, what kind of sorcery was that?” It’s simple, Kubernetes is running inside a VM. You might not see the VM or it might not even be a full blown virtual system but some level of virtualisation is there. This is exactly what we will show here. We will setup a VM and inside there we will install MicroK8s. After the installation we will discuss how to use the in-VM-Kubernetes.

A multipass VM on MacOS

Arguably the easiest way to get an Ubuntu VM on MacOS is with multipass. Head to the releases page and grab the latest package. Installing it is as simple as double-clicking on the .pkg file.

To start a VM with MicroK8s we:

multipass launch --name microk8s-vm --mem 4G --disk 40G
multipass exec microk8s-vm -- sudo snap install microk8s --classic
multipass exec microk8s-vm -- sudo iptables -P FORWARD ACCEPT

Make sure you reserve enough resources to host your deployments; above, we got 4GB of RAM and 40GB of hard disk. We also make sure packets to/from the pod network interface can be forwarded to/from the default interface.

Our VM has an IP that you can check with:

> multipass list
Name State IPv4 Release
microk8s-vm RUNNING 10.72.145.216 Ubuntu 18.04 LTS

Take a note of this IP since our services will become available there.

Other multipass commands you may find handy:

  • Get a shell inside the VM:
multipass shell microk8s-vm
  • Shutdown the VM:
multipass stop microk8s-vm
  • Delete and cleanup the VM:
multipass delete microk8s-vm 
multipass purge

Using MicroK8s

To run a command in the VM we can get a multipass shell with:

multipass shell microk8s-vm

To execute a command without getting a shell we can use multipass exec like so:

multipass exec microk8s-vm -- /snap/bin/microk8s.status

A third way to interact with MicroK8s is via the Kubernetes API server listening on port 8080 of the VM. We can use microk8s’ kubeconfig file with a local installation of kubectl to access the in-VM-kubernetes. Here is how:

multipass exec microk8s-vm -- /snap/bin/microk8s.config > kubeconfig

Install kubectl on the host machine and then use the kubeconfig:

kubectl --kubeconfig=kubeconfig get all --all-namespaces

Accessing in-VM services — Enabling addons

Let’s first enable dns and the dashboard. In the rest of this blog we will be showing different methods of accessing Grafana:

multipass exec microk8s-vm -- /snap/bin/microk8s.enable dns dashboard

We check the deployment progress with:

> multipass exec microk8s-vm -- /snap/bin/microk8s.kubectl get all --all-namespaces

After all services are running we can proceed into looking how to access the dashboard.

The Grafana of our dashboard

Accessing in-VM services — Use the Kubernetes API proxy

The API server is on port 8080 of our VM. Let’s see how the proxy path looks like:

> multipass exec microk8s-vm -- /snap/bin/microk8s.kubectl cluster-info
...
Grafana is running at http://127.0.0.1:8080/api/v1/namespaces/kube-system/services/monitoring-grafana/proxy
...

By replacing 127.0.0.1 with the VM’s IP, 10.72.145.216 in this case, we can reach our service at:

http://10.72.145.216:8080/api/v1/namespaces/kube-system/services/monitoring-grafana/proxy

Accessing in-VM services — Setup a proxy

In a very similar fashion to what we just did above, we can ask Kubernetes to create a proxy for us. We need to request the proxy to be available to all interfaces and to accept connections from everywhere so that the host can reach it.

> multipass exec microk8s-vm -- /snap/bin/microk8s.kubectl proxy --address='0.0.0.0' --accept-hosts='.*'
Starting to serve on [::]:8001

Leave the terminal with the proxy open. Again, replacing 127.0.0.1 with the VMs IP we reach the dashboard through:

http://10.72.145.216:8080/api/v1/namespaces/kube-system/services/monitoring-grafana/proxy

Make sure you go through the official docs on constructing the proxy paths.

Accessing in-VM services — Use a NodePort service

We can expose our service in a port on the VM and access it from there. This approach is using the NodePort service type. We start by spotting the deployment we want to expose:

> multipass exec microk8s-vm -- /snap/bin/microk8s.kubectl get deployment -n kube-system  | grep grafana
monitoring-influxdb-grafana-v4 1 1 1 1 22h

Then we create the NodePort service:

multipass exec microk8s-vm -- /snap/bin/microk8s.kubectl expose deployment.apps/monitoring-influxdb-grafana-v4 -n kube-system --type=NodePort

We have now a port for the Grafana service:

> multipass exec microk8s-vm -- /snap/bin/microk8s.kubectl get services -n kube-system  | grep NodePort
monitoring-influxdb-grafana-v4 NodePort 10.152.183.188 <none> 8083:32580/TCP,8086:32152/TCP,3000:32720/TCP 13m

Grafana is on port 3000 mapped here to 32720. This port is randomly selected so it my vary for you. In our case, the service is available on 10.72.145.216:32720.

Conclusions

MicroK8s on MacOS (or Windows) will need a VM to run. This is no different that any other local Kubernetes solution and it comes with some nice benefits. The VM gives you an extra layer of isolation. Instead of using your host and potentially exposing the Kubernetes services to the outside world you have full control of what others can see. Of course, this isolation comes with some extra administrative overhead that may not be applicable for a dev environment. Give it a try and tell us what you think!

Links

CanonicalLtd/multipass


MicroK8s on MacOS was originally published in ITNEXT on Medium, where people are continuing the conversation by highlighting and responding to this story.

Read more
Colin Ian King

New features in Forkstat

Forkstat is a simple utility I wrote a while ago that can trace process activity using the rather useful Linux NETLINK_CONNECTOR API.   Recently I have added two extra features that may be of interest:

1.  Improved output using some UTF-8 glyphs.  These are used to show process parent/child relationships and various process events, such as termination, core dumping and renaming.   Use the new -g (glyph) option to enable this mode. For example:


In the above example, the program "wobble" was started and forks off a child process.  The parent then renames itself to wibble (indicated by a turning arrow). The child then segfaults and generates a core dump (indicated by a skull and crossbones), triggering apport to investigate the crash.  After this, we observe NetworkManager creating a thread that runs for a very short period of time.   This kind of activity is normally impossible to spot while running conventions tools such as ps or top.

2. By default, forkstat will show the process name using the contents of /proc/$PID/cmdline.  The new -c option allows one to instead use the 16 character task "comm" field, and this can be helpful for spotting process name changes on PROC_EVENT_COMM events.

These are small changes, but I think they make forkstat more useful.  The updated forkstat will be available in Ubuntu 19.04 "Disco Dingo".

Read more

Snapcraft 3.0

The release notes for snapcraft 3.0 have been long overdue. For convenience I will reproduce them here too. Presenting snapcraft 3.0 The arrival of snapcraft 3.0 brings fresh air into how snap development takes place! We took the learnings from the main pain points you had when creating snaps in the past several years, and we we introduced those lessons into a brand new release - snapcraft 3.0! Build Environments As the cornerstone for behavioral change, we are introducing the concept of build environments.

Read more
Colin Ian King

High-level tracing with bpftrace

Bpftrace is a new high-level tracing language for Linux using the extended Berkeley packet filter (eBPF).  It is a very powerful and flexible tracing front-end that enables systems to be analyzed much like DTrace.

The bpftrace tool is now installable as a snap. From the command line one can install it and enable it to use system tracing as follows:

 sudo snap install bpftrace  
sudo snap connect bpftrace:system-trace

To illustrate the power of bpftrace, here are some simple one-liners:

 # trace openat() system calls
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%d %s %s\n", pid, comm, str(args->filename)); }'
Attaching 1 probe...
1080 irqbalance /proc/interrupts
1080 irqbalance /proc/stat
2255 dmesg /etc/ld.so.cache
2255 dmesg /lib/x86_64-linux-gnu/libtinfo.so.5
2255 dmesg /lib/x86_64-linux-gnu/librt.so.1
2255 dmesg /lib/x86_64-linux-gnu/libc.so.6
2255 dmesg /lib/x86_64-linux-gnu/libpthread.so.0
2255 dmesg /usr/lib/locale/locale-archive
2255 dmesg /lib/terminfo/l/linux
2255 dmesg /home/king/.config/terminal-colors.d
2255 dmesg /etc/terminal-colors.d
2255 dmesg /dev/kmsg
2255 dmesg /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache

 # count system calls using tracepoints:  
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }'
@[tracepoint:syscalls:sys_enter_getsockname]: 1
@[tracepoint:syscalls:sys_enter_kill]: 1
@[tracepoint:syscalls:sys_enter_prctl]: 1
@[tracepoint:syscalls:sys_enter_epoll_wait]: 1
@[tracepoint:syscalls:sys_enter_signalfd4]: 2
@[tracepoint:syscalls:sys_enter_utimensat]: 2
@[tracepoint:syscalls:sys_enter_set_robust_list]: 2
@[tracepoint:syscalls:sys_enter_poll]: 2
@[tracepoint:syscalls:sys_enter_socket]: 3
@[tracepoint:syscalls:sys_enter_getrandom]: 3
@[tracepoint:syscalls:sys_enter_setsockopt]: 3
...

Note that it is recommended to use bpftrace with Linux 4.9 or higher.

The bpftrace github project page has an excellent README guide with some worked examples and is a very good place to start.  There is also a very useful reference guide and one-liner tutorial too.

If you have any useful btftrace one-liners, it would be great to share them. This is an amazingly powerful tool, and it would be interesting to see how it will be used.

Read more
Tim McNamara

Hello world!

Welcome to Canonical Voices. This is your first post. Edit or delete it, then start blogging!

Read more
K. Tsakalozos

Looking at the configuration of a Kubernetes node sounds like a simple thing, yet it not so obvious.

The arguments kubelet takes come either as command line parameters or from a configuration file you pass with --config. Seems, straight forward to do a ps -ex | grep kubelet and look in the file you see after the --config parameter. Simple, right? But… are you sure you got all the arguments right? What if Kubernetes defaulted to a value you did not want? What if you do not have shell access to a node?

There is a way to query the Kubernetes API for the configuration a node is running with: api/vi/nodes/<node_name>/proxy/cofigz. Lets see this in a real deployment.

Deploy a Kubernetes Cluster

I am using the Canonical Distribution of Kubernetes (CDK) on AWS here but you can use whichever cloud and Kubernetes installation method you like.

juju bootstrap aws
juju deploy canonical-kubernetes

..and wait for the deployment to finish

watch juju status 

Change a Configuration

CDK allows for configuring both the command line arguments and the extra arguments of the config file. Here we add arguments to the config file:

juju config kubernetes-worker kubelet-extra-config='{imageGCHighThresholdPercent: 60, imageGCLowThresholdPercent: 39}'

A great question is how we got the imageGCHighThreshholdPercent literal. At the time of this writing the official upstream docs point you to the type definitions; a rather ugly approach. There is an EvictionHard property in the type definitions, however, if you look at the example of the upstream docs you see the same property is with lowercase.

Check the Configuration

We will need two shells. On the first one we will start the API proxy and on the second we will query the API. On the first shell:

juju ssh kubernetes-master/0
kubectl proxy

Now that we have the proxy at 127.0.0.1:8001 on the kubernetes-master, use a second shell to get a node name and we query the API:

juju ssh kubernetes-master/0
kubectl get no
curl -sSL "http://localhost:8001/api/v1/nodes/<node_name>/proxy/configz" | python3 -m json.tool

Here is a full run:

juju ssh kubernetes-master/0
Welcome to Ubuntu 18.04.1 LTS (GNU/Linux 4.15.0-1023-aws x86_64)
* Documentation:  https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
System information as of Mon Oct 22 10:40:40 UTC 2018
System load:  0.11               Processes:              115
Usage of /: 13.7% of 15.45GB Users logged in: 1
Memory usage: 20% IP address for ens5: 172.31.0.48
Swap usage: 0% IP address for fan-252: 252.0.48.1
Get cloud support with Ubuntu Advantage Cloud Guest:
http://www.ubuntu.com/business/services/cloud
0 packages can be updated.
0 updates are security updates.
Last login: Mon Oct 22 10:38:14 2018 from 2.86.54.15
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
ubuntu@ip-172-31-0-48:~$ kubectl get no
NAME STATUS ROLES AGE VERSION
ip-172-31-14-174 Ready <none> 41m v1.12.1
ip-172-31-24-80 Ready <none> 41m v1.12.1
ip-172-31-63-34 Ready <none> 41m v1.12.1
ubuntu@ip-172-31-0-48:~$ curl -sSL "http://localhost:8001/api/v1/nodes/ip-172-31-14-174/proxy/configz" | python3 -m json.tool
{
"kubeletconfig": {
"syncFrequency": "1m0s",
"fileCheckFrequency": "20s",
"httpCheckFrequency": "20s",
"address": "0.0.0.0",
"port": 10250,
"tlsCertFile": "/root/cdk/server.crt",
"tlsPrivateKeyFile": "/root/cdk/server.key",
"authentication": {
"x509": {
"clientCAFile": "/root/cdk/ca.crt"
},
"webhook": {
"enabled": true,
"cacheTTL": "2m0s"
},
"anonymous": {
"enabled": false
}
},
"authorization": {
"mode": "Webhook",
"webhook": {
"cacheAuthorizedTTL": "5m0s",
"cacheUnauthorizedTTL": "30s"
}
},
"registryPullQPS": 5,
"registryBurst": 10,
"eventRecordQPS": 5,
"eventBurst": 10,
"enableDebuggingHandlers": true,
"healthzPort": 10248,
"healthzBindAddress": "127.0.0.1",
"oomScoreAdj": -999,
"clusterDomain": "cluster.local",
"clusterDNS": [
"10.152.183.93"
],
"streamingConnectionIdleTimeout": "4h0m0s",
"nodeStatusUpdateFrequency": "10s",
"nodeLeaseDurationSeconds": 40,
"imageMinimumGCAge": "2m0s",
"imageGCHighThresholdPercent": 60,
"imageGCLowThresholdPercent": 39,
"volumeStatsAggPeriod": "1m0s",
"cgroupsPerQOS": true,
"cgroupDriver": "cgroupfs",
"cpuManagerPolicy": "none",
"cpuManagerReconcilePeriod": "10s",
"runtimeRequestTimeout": "2m0s",
"hairpinMode": "promiscuous-bridge",
"maxPods": 110,
"podPidsLimit": -1,
"resolvConf": "/run/systemd/resolve/resolv.conf",
"cpuCFSQuota": true,
"cpuCFSQuotaPeriod": "100ms",
"maxOpenFiles": 1000000,
"contentType": "application/vnd.kubernetes.protobuf",
"kubeAPIQPS": 5,
"kubeAPIBurst": 10,
"serializeImagePulls": true,
"evictionHard": {
"imagefs.available": "15%",
"memory.available": "100Mi",
"nodefs.available": "10%",
"nodefs.inodesFree": "5%"
},
"evictionPressureTransitionPeriod": "5m0s",
"enableControllerAttachDetach": true,
"makeIPTablesUtilChains": true,
"iptablesMasqueradeBit": 14,
"iptablesDropBit": 15,
"failSwapOn": false,
"containerLogMaxSize": "10Mi",
"containerLogMaxFiles": 5,
"configMapAndSecretChangeDetectionStrategy": "Watch",
"enforceNodeAllocatable": [
"pods"
]
}
}

Summing Up

There is a way to get the configuration of an online Kubernetes node through the Kubernetes API (api/v1/nodes/<node_name>/proxy/configz). This might be handy if you want to code against Kubernetes or you do not want to get into the intricacies of your particular cluster setup.

References


How to Inspect the Configuration of a Kubernetes Node was originally published in ITNEXT on Medium, where people are continuing the conversation by highlighting and responding to this story.

Read more