Canonical Voices

abeato

Ubuntu Core (UC) is Canonical's take on the IoT space. There are pre-built images for officially supported devices, like the Raspberry Pi or Intel NUC, but if we have something else and there is no community port, we need to create the UC image ourselves. High-level instructions on how to do this can be found in the official docs. The process is straightforward once we have two critical components: the kernel and the gadget snap.

Creating these snaps is not necessarily complex, but there can be bumps in the road if you are new to the task. In this post I explain how I created them for the Jetson TX1 developer kit board, and how they were used to create a UC image for that device, hoping this will provide new tricks to hackers working on ports for other devices. All the sources for the snaps and the build scripts are available on GitHub:
https://github.com/alfonsosanchezbeato/jetson-kernel-snap
https://github.com/alfonsosanchezbeato/jetson-gadget-snap
https://github.com/alfonsosanchezbeato/jetson-ubuntu-core

So, let’s start with…

The kernel snap

The Linux kernel that we will use needs some kernel configuration options to be activated, and it is also especially important that it has a modern version of apparmor so snaps can be properly confined. The official Jetson kernel is the 4.4 release, which is quite old, but fortunately Canonical has a reference 4.4 kernel with all the needed patches for snaps backported. Knowing this, we are a git format-patch command away from obtaining the patches we will apply on top of the nvidia kernel. The patches also include files with the configuration options that we need for snaps, plus some changes so the snap can be successfully compiled on an Ubuntu 18.04 desktop.
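
Purely as an illustration of that step, regenerating such a patch series boils down to running git format-patch over the range of snap-related commits in Canonical's reference 4.4 tree; the repository URL and base revision below are placeholders, not the exact ones used for this port:

# sketch only: fill in the reference tree and the base revision it was forked from
git clone "$CANONICAL_REFERENCE_4_4_TREE" reference-4.4
cd reference-4.4
git format-patch -o ../patch "$BASE_REVISION"..HEAD   # one patch file per commit on top of the base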

Once we have the sources, we need, of course, to create a snapcraft.yaml file that describes how to build the kernel snap. We will walk through it, highlighting the parts that are more specific to the Jetson device.

Starting with the kernel part, it turns out that we cannot easily use the kernel plugin, due to the special way in which the kernel needs to be built: nvidia distributes part of the needed drivers in repositories separate from the one used by the main kernel tree. Therefore, I resorted to using the nil plugin so I could hand-write the commands to do the build.

The resulting pull stage is:

override-pull: |
  snapcraftctl pull
  # Get kernel sources, which are distributed across different repos
  ./source_sync.sh -k tegra-l4t-r28.2.1
  # Apply canonical patches - apparmor stuff essentially
  cd sources/kernel/display
  git am ../../../patch-display/*
  cd -
  cd sources/kernel/kernel-4.4
  git am ../../../patch/*

which runs a script to retrieve the sources (I pulled this script from nvidia's Linux for Tegra -L4T- distribution) and applies the Canonical patches.

The build stage is a few more lines, so I decided to implement it in an external script. We will now analyze parts of it. For the kernel configuration, we add all the necessary Ubuntu bits:

make "$JETSON_KERNEL_CONFIG" \
    snappy/containers.config \
    snappy/generic.config \
    snappy/security.config \
    snappy/snappy.config \
    snappy/systemd.config

Then, to do the build, we run:

make -j"$num_cpu" Image modules dtbs

An interesting catch here is that zImage files are not supported due to the lack of a decompressor implementation in the arm64 kernel, so we have to build an uncompressed Image instead.

After some code that stages the built files so they are included in the snap later, we retrieve the initramfs from the core snap. This step is usually hidden from us by the kernel plugin, but this time we have to code it ourselves:

# Get initramfs from core snap, which we need to download
core_url=$(curl -s -H "X-Ubuntu-Series: 16" -H "X-Ubuntu-Architecture: arm64" \
                "https://search.apps.ubuntu.com/api/v1/snaps/details/core?channel=stable" \
               | jq -r ".anon_download_url")
curl -L "$core_url" > core.snap
# Glob so we get both link and regular file
unsquashfs core.snap "boot/initrd.img-core*"
cp squashfs-root/boot/initrd.img-core "$SNAPCRAFT_PART_INSTALL"/initrd.img
ln "$SNAPCRAFT_PART_INSTALL"/initrd.img "$SNAPCRAFT_PART_INSTALL"/initrd-"$KERNEL_RELEASE".img

Moving back to the snapcraft recipe, we also have an initramfs part, which takes care of making some changes to the default initramfs shipped by UC:

initramfs:
  after: [ kernel ]
  plugin: nil
  source: ../initramfs
  override-build: |
    find . | cpio --quiet -o -H newc | lzma >> "$SNAPCRAFT_STAGE"/initrd.img

Here we are taking advantage of the fact that the initramfs can be built as a concatenation of compressed cpio archives. When the kernel decompresses it, the files included in the later archives overwrite the files from the earlier ones, which allows us to easily modify files in the initramfs without having to change the one shipped with core. The change we are making here is a modification to the resize script that allows UC to claim all the free space on the disk on first boot. The modification makes sure this happens in the case where the partition has already taken all the available space but the filesystem has not. We will be able to remove this modification once these changes reach the core snap, which will happen eventually.

The last part of this snap is the firmware part:

firmware:
  plugin: nil
  override-build: |
    set -xe
    wget https://developer.nvidia.com/embedded/dlc/l4t-jetson-tx1-driver-package-28-2-ga -O Tegra210_Linux_R28.2.0_aarch64.tbz2
    tar xf Tegra210_Linux_R28.2.0_aarch64.tbz2 Linux_for_Tegra/nv_tegra/nvidia_drivers.tbz2
    tar xf Linux_for_Tegra/nv_tegra/nvidia_drivers.tbz2 lib/firmware/
    cd lib; cp -r firmware/ "$SNAPCRAFT_PART_INSTALL"
    mkdir -p "$SNAPCRAFT_PART_INSTALL"/firmware/gm20b
    cd "$SNAPCRAFT_PART_INSTALL"/firmware/gm20b
    ln -sf "../tegra21x/acr_ucode.bin" "acr_ucode.bin"
    ln -sf "../tegra21x/gpmu_ucode.bin" "gpmu_ucode.bin"
    ln -sf "../tegra21x/gpmu_ucode_desc.bin" "gpmu_ucode_desc.bin"
    ln -sf "../tegra21x/gpmu_ucode_image.bin" "gpmu_ucode_image.bin"
    ln -sf "../tegra21x/gpu2cde.bin" "gpu2cde.bin"
    ln -sf "../tegra21x/NETB_img.bin" "NETB_img.bin"
    ln -sf "../tegra21x/fecs_sig.bin" "fecs_sig.bin"
    ln -sf "../tegra21x/pmu_sig.bin" "pmu_sig.bin"
    ln -sf "../tegra21x/pmu_bl.bin" "pmu_bl.bin"
    ln -sf "../tegra21x/fecs.bin" "fecs.bin"
    ln -sf "../tegra21x/gpccs.bin" "gpccs.bin"

Here we download some files so we can add firmware blobs to the snap. These files are distributed separately from the nvidia kernel sources.

That is it for the kernel snap; now you just need to follow the instructions to get it built.

The gadget snap

Time now to take a look at the gadget snap. First, I recommend reading ogra's great post on gadget snaps for devices with a u-boot bootloader before going through this section. Then, as with the kernel snap, we will go through the different parts defined in the snapcraft.yaml file. The first one builds the u-boot binary:

uboot:
  plugin: nil
  source: git://nv-tegra.nvidia.com/3rdparty/u-boot.git
  source-type: git
  source-tag: tegra-l4t-r28.2
  override-pull: |
    snapcraftctl pull
    # Apply UC patches + bug fixes
    git am ../../../uboot-patch/*.patch
  override-build: |
    export ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-
    make p2371-2180_defconfig
    nice make -j$(nproc)
    cp "$SNAPCRAFT_PART_BUILD"/u-boot.bin $SNAPCRAFT_PART_INSTALL"/

We again used the nil plugin, as we need some special handling. The sources are pulled from nvidia's u-boot repository, and we apply some patches on top. These patches, along with the u-boot environment, provide:

  • Support for loading the UC kernel and initramfs from disk
  • Support for the revert functionality in case a core or kernel snap installation goes wrong
  • Bug fixes for u-boot's ext4 subsystem – required because the just-mentioned revert functionality needs to call u-boot's saveenv command, which happened to be broken for ext4 filesystems in tegra's u-boot

More information on the specifics of u-boot patches for UC can be found in this great blog post.

The only other part that the snap has is uboot-env:

uboot-env:
  plugin: nil
  source: uboot-env
  override-build: |
    mkenvimage -r -s 131072 -o uboot.env uboot.env.in
    cp "$SNAPCRAFT_PART_BUILD"/uboot.env "$SNAPCRAFT_PART_INSTALL"/
    # Link needed for ubuntu-image to work properly
    cd "$SNAPCRAFT_PART_INSTALL"/; ln -s uboot.env uboot.conf
  build-packages:
    - u-boot-tools

This simply encodes the uboot.env.in file into a format that is readable by u-boot. The resulting file, uboot.env, is included in the snap.
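
As a quick sanity check (just a suggestion, not part of the official build), you can verify that the generated image has the size passed via -s and still contains the variables defined in uboot.env.in:

stat -c %s uboot.env             # should match the -s value given to mkenvimage
strings uboot.env | grep snappy  # the boot script variables should show up here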

This environment is where most of the support for UC is encoded. I will not delve too much into the details, but just want to mention the variables that usually need to be edited for new devices (an illustrative fragment follows the list):

  • devnum, partition, and devtype to set the system boot partition, from which we load the kernel and initramfs
  • fdtfile, fdt_addr_r, and fdt_high to determine the name of the device tree and where in memory it should be loaded
  • ramdisk_addr_r and initrd_high to set the loading location for the initramfs
  • kernel_addr_r to set where the kernel needs to be loaded
  • args contains kernel arguments and needs to be adapted to the device specifics
  • Finally, for this device, snappy_boot was changed so it uses booti instead of bootz, since, as explained above, we cannot use a compressed kernel
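
As an illustration only, a fragment of an uboot.env.in for a hypothetical arm64 board could look like the lines below; the addresses and names are placeholders, not the real Jetson TX1 values, which live in the gadget repository:

# placeholder values for a hypothetical board -- do not copy verbatim
devtype=mmc
devnum=0
partition=1
fdtfile=my-board.dtb
kernel_addr_r=0x80080000
fdt_addr_r=0x83000000
ramdisk_addr_r=0x84000000
initrd_high=0xffffffffffffffff
fdt_high=0xffffffffffffffff
args=setenv bootargs console=ttyS0,115200n8 rw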

Besides the snapcraft recipe, the other mandatory file when defining a gadget snap is the gadget.yaml file. This file defines, among other things, the image partitioning layout. There is more to it, but in this case, partitioning is the only thing we have defined:

volumes:
  jetson:
    bootloader: u-boot
    schema: gpt
    structure:
      - name: system-boot
        role: system-boot
        type: 0FC63DAF-8483-4772-8E79-3D69D8477DE4
        filesystem: ext4
        filesystem-label: system-boot
        offset: 17408
        size: 67108864
      - name: TBC
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 2097152
      - name: EBT
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 4194304
      - name: BPF
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 2097152
      - name: WB0
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 6291456
      - name: RP1
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 4194304
      - name: TOS
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 6291456
      - name: EKS
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 2097152
      - name: FX
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 2097152
      - name: BMP
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 134217728
      - name: SOS
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 20971520
      - name: EXI
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 67108864
      - name: LNX
        type: 0FC63DAF-8483-4772-8E79-3D69D8477DE4
        size: 67108864
        content:
          - image: u-boot.bin
      - name: DTB
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 4194304
      - name: NXT
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 2097152
      - name: MXB
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 6291456
      - name: MXP
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 6291456
      - name: USP
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 2097152

The Jetson TX1 has a complex partitioning layout, with many partitions allocated for the first-stage bootloader and many others that are undocumented. So, to minimize the risk of touching a critical partition, I preferred to keep most of them untouched and make only the minimal changes needed to fit UC onto the device. Therefore, the gadget.yaml volumes entry mainly describes the TX1 defaults, the main differences compared to the original being:

  1. The APP partition is renamed to system-boot and reduced to only 64MB. It will contain the u-boot environment file plus the kernel and initramfs, as is usual in UC systems with a u-boot bootloader.
  2. The LNX partition will contain our u-boot binary.
  3. If a partition with role: system-data is not defined explicitly (which is the case here), a partition with that role and with the label "writable" is implicitly defined at the end of the volume. It will take all the space freed by shrinking the APP partition and will contain the UC root filesystem, replacing the UDA partition, which is the last one in nvidia's partitioning scheme.

Now, it is time to build the gadget snap by following the repository instructions.

Building & flashing the image

Now that we have the snaps, it is time to build the image. There is not much to it: you just need an Ubuntu One account and to follow the instructions to create a key so you can sign a model assertion. With that in place, just follow the README.md file in the jetson-ubuntu-core repository. You can also download the latest tarball from the repository if you prefer.
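
Roughly, and only as a sketch, the flow described in that README looks like the following; the key name, file names, and ubuntu-image flags here are assumptions, so check the official image-building docs and the repository for the exact invocation:

# create a key once (and register it with the store as per the official docs),
# then sign the model assertion (names are illustrative)
snap create-key jetson-key
snap sign -k jetson-key jetson-model.json > jetson.model
# build the image, feeding in the locally built kernel and gadget snaps
ubuntu-image snap --extra-snaps ./jetson-kernel*.snap \
                  --extra-snaps ./jetson-gadget*.snap \
                  jetson.model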

The build script will generate not only a full image file, but also a tarball containing separate files for each partition that needs to be flashed on the device. This is needed because, unfortunately, there is no way to fully flash the Jetson device with a GPT image; instead, we can only flash individual partitions with the tools nvidia provides.

Once the build finishes, we can take the resulting tarball and follow the instructions to get the necessary partitions flashed. As can be read there, we have to download the nvidia L4T package. Also, note that to be able to change the partition sizes and files to flash, a couple of patches have to be applied on top of the L4T scripts.

Summary

After this, you should have a working Ubuntu Core 18 device. You can use the serial port or an external monitor to configure it with your Launchpad account so you can ssh into it. Enjoy!

Read more
Colin Watson

Here’s a brief changelog for this month.

Build farm

  • Allow dispatching builds with base images selected based on the pocket and/or using LXD images instead of chroot tarballs where appropriate (#1811677)

Code

  • Store bzr-svn's cache in the import data store
  • Allow project owners to use the Bazaar branch rescan view
  • Canonicalise expected rule ordering in GitRepository.setRules (#1815431)
  • Upgrade to pygit2 0.27.4 (#1815517)

Infrastructure

  • Use full gpg key fingerprints in rocketfuel-setup (contributed by Andy Brody; #1814206)

Snappy

Soyuz (package management)

  • Allow source .changes files to omit the Binary field (#1813037)

Read more
Colin Watson

Here’s a brief changelog of what we’ve been up to since our last general update.

Bugs

  • Parse a few more possible Savane URL formats (#197250)
  • Compare Bugzilla versions properly when checking whether they support the Bugzilla API (part of #1802798)

Build farm

  • Configure snap proxy settings for Subversion (#1668358)
  • Support passing IMAGE_TARGETS, REPO_SNAPSHOT_STAMP, and COHORT_KEY variables into live filesystem builds
  • Set SNAPCRAFT_BUILD_ENVIRONMENT=host when building snaps (#1791201)
  • Prevent gathering results of large builds from blocking responses to XML-RPC requests (#1795877)
  • Add missing indexes on LiveFSFile(libraryfile) and SnapFile(libraryfile)
  • Direct build failure support to Launchpad Answers rather than to the launchpad-buildd-admins team (#1810001)

Code

  • Allow proposing merges between different branches of the same personal Git repository
  • Fix OOPS when trying to look up ~user/project:branch as a unique Git repository name (#1771118)
  • Optimise GitRepository.fetchRefCommits if there are no commits to fetch
  • Handle the case where a Bazaar branch and a Git repository have the same identity URL when creating a recipe (#1623924)
  • Push code imports over bzr+ssh rather than sftp (#1779572)
  • Fix crash when emailing inline comments on a diff with non-ASCII characters in hunk headers (#1787587)
  • Percent-encode reference names in GitRef URLs (#1787965)
  • Add active reviews link to Git-based project pages (#1777102)
  • Fix handling of non-ASCII ref names
  • Include the appropriate username in git+ssh:// URLs in the UI
  • Add instructions on creating personal Git repositories to people’s “View Git repositories” pages (#1590560)
  • Add available review targets and proposals to the Git repository overview page (#1789847)
  • Fix incorrect visibility check that broke code imports targeted at private Git repositories (#1789424)
  • Allow anonymous users to view votes for public merge proposals (#1786474)
  • Make Git ref scan jobs for repositories with large numbers of refs take much less memory
  • Tolerate backend timeouts while fetching commit information for GitRef:+index (#1804395)
  • Add Git per-branch permissions (#1517559)
  • Add rescan buttons when various kinds of code scanning jobs fail (#1808320)

Infrastructure

  • Convert all remaining code to use explicit proxy configuration settings rather than picking up a proxy from the environment, making the effective production settings easier to understand
  • Add support for ECDSA SSH keys (#907675)

Libraries

Registry

  • Add a suspend-bot-account.py script to suspend an account by email address
  • Weaken type of key_text in Person.deleteSSHKeysFromSSO so that more existing keys can be deleted (#1780411)
  • Fix SSHKey.getFullKeyText to not crash on some corrupt keys (#1798046)
  • Various improvements to the close-account script

Snappy

  • Extract initial Snap.store_name from snapcraft.yaml for Bazaar as well as Git
  • Add support for Snapcraft’s architectures keyword (#1770400)
  • Bump SnapStoreUploadJob.max_retries to 30 to allow for longer store scan times
  • Include the registered store package name for a snap recipe in its builds’ titles if it exists and differs from the snap recipe name
  • Move some metadata from SnapStoreUploadJob to SnapBuild, to prevent store upload jobs getting into states that cannot be retried

Soyuz (package management)

  • Add Archive.getSigningKeyData, currently just proxying through to the keyserver (#1667725)
  • Add extendedKeyUsage information to kmod signing keys so that they can only be used to sign modules, not boot loaders or kernels (#1774746)

Read more
K. Tsakalozos

We have been quiet for a few months just because we have been busy. We were working mainly on two features that we intend to ship in the v1.14 release: the transition from dockerd to containerd, and a hardening of MicroK8s security.

These changes will affect the backwards compatibility and user experience of MicroK8s, which is why we are timing them with the upcoming upstream Kubernetes release. Here we will provide a) a short description of these features, b) a way for you to test drive the new MicroK8s, and c) the steps to hold back on the release in case this is a major show-stopper for you.

The transition to Containerd

We are replacing dockerd with containerd mainly for two reasons.

  • The setup of having two dockerd instances on the same host has proven problematic. MicroK8s brings its own dockerd, which may clash with a local dockerd users may want to have. By moving to containerd, users can apt-get install docker.io without affecting MicroK8s. This switch also means that microk8s.docker will not be available anymore; you will have to use a docker client shipped with your distribution (see the short example after this list).
  • Performance. There is a demonstrated performance benefit from using containerd. This should not be a surprise, since dockerd itself uses containerd internally; by switching to containerd we are essentially removing a layer that is docker-specific.
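
To make the first point concrete, here is a minimal sketch of what coexistence looks like after the switch (the channel is the test one mentioned later in this post):

# a local docker daemon, now untouched by MicroK8s
sudo apt-get install docker.io
# MicroK8s with containerd under the hood
sudo snap install microk8s --classic --channel=1.13/edge/secure-containerd
# the cluster is still driven through kubectl as before
microk8s.kubectl get nodes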

Hardening MicroK8s security

MicroK8s is a developer's tool. It is not meant to be deployed in production or in hostile environments. Having said that, we have tried to make MicroK8s more secure by:

  • Exposing as few services as we can, and restricting access to the ones we left open.
  • A CA and certificates are created once at deployment time.

Test drive the upcoming patches

We have prepared a temporary branch you could use to evaluate the above changes:

snap install microk8s --classic --channel=1.13/edge/secure-containerd

If you have MicroK8s already installed you can switch the channel your MicroK8s is following:

snap refresh --channel=1.13/edge/secure-containerd microk8s

Try it out and let us know if we missed anything.

“Thanks, I’ll pass”

None of the release series up until now will be affected by this change. This means you can have your MicroK8s deployment follow the 1.13 track:

snap refresh --channel=1.13/stable microk8s

Summing up

An important update is coming. Make sure you give it a try with:

snap install microk8s --classic --channel=1.13/edge/secure-containerd

If you do not like what you see tell us what breaks by filing an issue and keep using the 1.13 track.


Read more
K. Tsakalozos

MicroK8s in the Wild

As the popularity of MicroK8s grows I would like to take the time to mention some projects that use this micro Kubernetes distribution. But before that, let me do some introductions. For those unfamiliar with it, Kubernetes is an open source container orchestrator: it takes care of deploying, upgrading, and provisioning your applications. This is one of the rare occasions where all the major players (Google, Microsoft, IBM, Amazon, etc.) have flocked around a single framework, making it an unofficial standard.

MicroK8s is a distribution of Kubernetes. It is a snap package that sets up a Kubernetes cluster on your machine. You can have a Kubernetes cluster for local development, CI/CD or just for getting to know Kubernetes with just a:

sudo snap install microk8s --classic

If you are on a Mac or Windows you will need a Linux VM.

In what follows you will find some examples on how people are using MicroK8s. Note that this is not a complete list of MicroK8s usages, it is just some efforts I happen to be aware of.

Spring Cloud Kubernetes

This project is using CircleCI for CI/CD. MicroK8s provides a local Kubernetes cluster where integration tests are run. The addons enabled are dns, the docker registry and Istio. The integration tests need to plug into the Kubernetes cluster using the kubeconfig file and the socket to dockerd. This work was introduced in this Pull Request (thanks George) and it gave us the incentive to add a microk8s.status command that would wait for the cluster to come online. For example, we can wait up to 5 minutes for MicroK8s to come up with:

microk8s.status --wait-ready --timeout=300
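
In a CI job this typically ends up looking like the short sketch below (addon names taken from the list above; adapt to your pipeline):

sudo snap install microk8s --classic
microk8s.enable dns registry istio
microk8s.status --wait-ready --timeout=300   # block until the cluster is up
microk8s.kubectl get all --all-namespaces    # then run the integration tests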

OpenFaaS on MicroK8s

It was this year's Config Management Camp where I met Joe McCobe, the author of "Deploy OpenFaaS with MicroK8s". I will just repeat his words: "was blown away by the speed and ease with which I could get a basic lab environment up and running".

What about Kubeless?

It seems the ease of deploying MicroK8s goes well with the ease of software development of serverless frameworks. Users of Kubeless are also kicking the tires on MicroK8s. Have a look at “Files upload from Kubeless on MicroK8s to Minio” and “Serverless MicroK8s Kubernetes.”

SUSE Cloud Application Platform (CAP) on Microk8s

In his blog post Dimitris describes in detail all the configuration he had to do to get the software from SUSE to run on MicroK8s. The most interesting part is the motivation behind this effort. As he says “… MicroK8s… use your machine’s resources without you having to decide on a VM size beforehand.” As he explained to me his application puts significant memory pressure only during bootstrap. MicroK8s enabled him to reclaim the unused memory after the initialization phase.

Kubeflow

Kubeflow is the missing link between Kubernetes and AI/ML. Canonical is actively involved in this so…. you should definitely check it out. Sure, I am biased, but let me tell you a true story. I have a friend who was given three machines to deploy Tensorflow and run some experiments. She did not have any prior experience at the time so… none of the three machines was set up in exactly the same way. There was always something off. This head-scratching situation is just one reason to use Kubeflow.

Transcrobes

Transcrobes comes from an active member of the MicroK8s community. It serves as a language learning aid. “The system knows what you know, so can give you just the right amount of help to be able to understand the words you don’t know but gets out of the way for the stuff you do know.” Here MicroK8s is used for quick prototyping. We wish you all the best Anton, good luck!

Summing Up

We have seen a number of interesting use cases that include CI/CD, Serverless programming, lab setup, rapid prototyping and application development. If you have a MicroK8s use case do let us know. Come and say hi at #microk8s on the Kubernetes slack and/or issue a Pull Request against our MicroK8s In The Wild page.


Read more
Christian Brauner

Runtimes And the Curse of the Privileged Container

Introduction (CVE-2019-5736)

Today, Monday, 2019-02-11, 14:00:00 CET CVE-2019-5736 was released:

The vulnerability allows a malicious container to (with minimal user interaction) overwrite the host runc binary and thus gain root-level code execution on the host. The level of user interaction is being able to run any command (it doesn’t matter if the command is not attacker-controlled) as root within a container in either of these contexts:

  • Creating a new container using an attacker-controlled image.
  • Attaching (docker exec) into an existing container which the attacker had previous write access to.

I've been working on a fix for this issue over the last couple of weeks together with Aleksa, a friend of mine and maintainer of runC. When he notified me about the issue in runC, we tried to come up with an exploit for LXC as well and, though harder, it is doable. I was interested in the issue for technical reasons, and figuring out how to reliably fix it was quite fun (with a proper dose of pure hatred). It also caused me to finally write down some personal thoughts I have had for a long time about how we are running containers.

What are Privileged Containers?

At a first glance this is a question that is probably trivial to anyone who has a decent low-level understanding of containers. Maybe even most users by now will know what a privileged container is. A first pass at defining it would be to say that a privileged container is a container that is owned by root. Looking closer this seems an insufficient definition. What about containers using user namespaces that are started as root? It seems we need to distinguish between what ids a container is running with. So we could say a privileged container is a container that is running as root. However, this is still wrong. Because “running as root” can either be seen as meaning “running as root as seen from the outside” or “running as root from the inside” where “outside” means “as seen from a task outside the container” and “inside” means “as seen from a task inside the container”.

What we really mean by a privileged container is a container where the semantics for id 0 are the same inside and outside of the container ceteris paribus. I say “ceteris paribus” because using LSMs, seccomp or any other security mechanism will not cause a change in the meaning of id 0 inside and outside the container. For example, a breakout caused by a bug in the runtime implementation will give you root access on the host.

An unprivileged container then simply is any container in which the semantics for id 0 inside the container are different from id 0 outside the container. For example, a breakout caused by a bug in the runtime implementation will not give you root access on the host by default. This should only be possible if the kernel’s user namespace implementation has a bug.

The reason why I like to define privileged containers this way is that it also lets us handle edge cases. Specifically, the case where a container is using a user namespace but a hole is punched into the idmapping at id 0 aka where id 0 is mapped through. Consider a container that uses the following idmappings:

id: 0 100000 100000

This instructs the kernel to setup the following mapping:

id: container_id(0) -> host_id(100000)
id: container_id(1) -> host_id(100001)
id: container_id(2) -> host_id(100002)
.
.
.

container_id(99999) -> host_id(199999)

With this mapping it’s evident that container_id(0) != host_id(0). But now consider the following mapping:

id: 0 0 1
id: 1 100001 99999

This instructs the kernel to setup the following mapping:

id: container_id(0) -> host_id(0)
id: container_id(1) -> host_id(100001)
id: container_id(2) -> host_id(100002)
.
.
.

container_id(99999) -> host_id(199999)

In contrast to the first example, this has the consequence that container_id(0) == host_id(0). I would argue that any container that punches a hole for id 0 into its idmapping, up to and including specifying a full identity mapping, is to be considered a privileged container.
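
A simple way to see which case you are in is to read the idmapping the kernel has applied to a running container from the host; the container name below is hypothetical and lxc-info is just one way to find a suitable pid:

PID=$(lxc-info -n mycontainer -p -H)   # any pid inside the container works
cat /proc/"$PID"/uid_map
# 0  100000  100000   -> unprivileged: container id 0 != host id 0
# 0       0       1   -> a hole punched for id 0: privileged by the definition above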

As a sidenote, Docker containers run as privileged containers by default. There is usually some confusion here: people think that because they do not use the --privileged flag, their Docker containers run unprivileged. This is wrong. What the --privileged flag does is give you even more permissions, e.g. by not dropping (specific or even any) capabilities. One could say that such containers are almost "super-privileged".

The Trouble with Privileged Containers

The problem I see with privileged containers is essentially captured by LXC’s and LXD’s upstream security position which we have held since at least 2015 but probably even earlier. I’m quoting from our notes about privileged containers:

Privileged containers are defined as any container where the container uid 0 is mapped to the host’s uid 0. In such containers, protection of the host and prevention of escape is entirely done through Mandatory Access Control (apparmor, selinux), seccomp filters, dropping of capabilities and namespaces.

Those technologies combined will typically prevent any accidental damage of the host, where damage is defined as things like reconfiguring host hardware, reconfiguring the host kernel or accessing the host filesystem.

LXC upstream’s position is that those containers aren’t and cannot be root-safe.

They are still valuable in an environment where you are running trusted workloads or where no untrusted task is running as root in the container.

We are aware of a number of exploits which will let you escape such containers and get full root privileges on the host. Some of those exploits can be trivially blocked and so we do update our different policies once made aware of them. Some others aren’t blockable as they would require blocking so many core features that the average container would become completely unusable.

[…]

As privileged containers are considered unsafe, we typically will not consider new container escape exploits to be security issues worthy of a CVE and quick fix. We will however try to mitigate those issues so that accidental damage to the host is prevented.

LXC’s upstream position for a long time has been that privileged containers are not and cannot be root safe. For something to be considered root safe it should be safe to hand root access to third parties or tasks.

Running Untrusted Workloads in Privileged Containers

is insane. That’s about everything that this paragraph should contain. The fact that the semantics for id 0 inside and outside the container are identical entails that any meaningful container escape will have the attacker gain root on the host.

CVE-2019-5736 Is a Very Very Very Bad Privilege Escalation to Host Root

CVE-2019-5736 is an excellent illustration of such an attack. Think about it: a process running inside a privileged container can rather trivially corrupt the binary that is used to attach to the container. This allows an attacker to create a custom ELF binary on the host. That binary could do anything it wants:

  • could just be a binary that calls poweroff
  • could be a binary that spawns a root shell
  • could be a binary that kills other containers when called again to attach
  • could be suid cat
  • .
  • .
  • .

The attack vector is actually slightly worse for runC due to its architecture. Since runC exits after spawning the container, it can also be attacked through a malicious container image, which is super bad given that a lot of container workflows rely on downloading images from the web.

LXC cannot be attacked through a malicious image since the monitor process (a per-container singleton) never exits during the container's life cycle. Since the kernel does not allow modifications to running binaries, it is not possible for the attacker to corrupt it. When the container is shut down or killed, the attacking task will be killed before it can do any harm. Only when the last process running inside the container has exited will the monitor itself exit. This has the consequence that if you run privileged OCI containers via our oci template with LXC, you are not vulnerable to malicious images. Only the vector through the attaching binary still applies.

The Lie that Privileged Containers can be safe

Aside from mostly working on the Kernel I’m also a maintainer of LXC and LXD alongside Stéphane Graber. We are responsible for LXC - the low-level container runtime - and LXD - the container management daemon using LXC. We have made a very conscious decision to consider privileged containers not root safe. Two main corollaries follow from this:

  1. Privileged containers should never be used to run untrusted workloads.
  2. Breakouts from privileged containers are not considered CVEs by our security policy.

It still seems a common belief that if we all just try hard enough, using privileged containers for untrusted workloads is safe. This is not a promise that can be made good upon. A privileged container is not a security boundary. The reason for this is simply what we looked at above: container_id(0) == host_id(0). It is therefore deeply troubling that this industry is happy to let users believe that they are safe and secure using privileged containers.

Unprivileged Containers as Default

As upstream for LXC and LXD we have been advocating the use of unprivileged containers by default for years, way before anyone else did. Our low-level library LXC has supported unprivileged containers since 2013, when user namespaces were merged into the kernel. With LXD we have taken it one step further and made unprivileged containers the default and privileged containers opt-in, for that very reason: privileged containers aren't safe. We even allow you to have per-container idmappings to make sure that not just each container is isolated from the host, but also all containers from each other.

For years we have been advocating for unprivileged containers at conferences, in blog posts, and whenever we have spoken to people, but somehow this whole industry has chosen to rely on privileged containers.

The good news is that we are seeing changes as people become more familiar with the perils of privileged containers. Let this recent CVE be another reminder that unprivileged containers need to be the default.

Are LXC and LXD affected?

I have seen this question asked all over the place so I guess I should add a section about this too:

  • Unprivileged LXC and LXD containers are not affected.

  • Any privileged LXC and LXD container running on a read-only rootfs is not affected.

  • Privileged LXC containers, in the definition provided above, are affected, though the attack is more difficult than for runC. The reason for this is that the lxc-attach binary does not exit before the program in the container has finished executing. This means an attacker would need to open an O_PATH file descriptor to /proc/self/exe, fork() itself into the background, re-open the O_PATH file descriptor through /proc/self/fd/<O_PATH-nr> in a loop as O_WRONLY, and keep trying to write to the binary until lxc-attach exits. Before that it will not succeed, since the kernel will not allow modification of a running binary.

  • Privileged LXD containers are only affected if the daemon is restarted other than for upgrade reasons. This should basically never happen. The LXD daemon never exits so any write will fail because the kernel does not allow modification of a running binary. If the LXD daemon is restarted because of an upgrade the binary will be swapped out and the file descriptor used for the attack will write to the old in-memory binary and not to the new binary.

Chromebooks with Crostini using LXD are not affected

Chromebooks, which use LXD as their default container runtime, are not affected. First of all, all binaries reside on a read-only filesystem and, second, LXD does not allow running privileged containers on Chromebooks through the LXD_UNPRIVILEGED_ONLY flag. For more details see this link.

Fixing CVE-2019-5736

To prevent this attack, LXC has been patched to create a temporary copy of the calling binary itself when it attaches to containers (cf. 6400238d08cdf1ca20d49bafb85f4e224348bf9d). To do this, LXC can be instructed to create an anonymous, in-memory file using the memfd_create() system call and to copy itself into the temporary in-memory file, which is then sealed to prevent further modifications. LXC then executes this sealed, in-memory file instead of the original on-disk binary. Any compromising write operations from a privileged container to the host LXC binary will then write to the temporary in-memory binary and not to the host binary on-disk, preserving the integrity of the host LXC binary. And as the temporary, in-memory LXC binary is sealed, writes to it will also fail. To not break downstream users of the shared library, this is opt-in by setting LXC_MEMFD_REXEC in the environment. For our lxc-attach binary, which is the only attack vector, this is now done by default.

Workloads that place the LXC binaries on a read-only filesystem or prevent running privileged containers can disable this feature by passing --disable-memfd-rexec during the configure stage when compiling LXC.
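
In practice that gives library consumers and distributors two knobs, both taken from the description above (the tool name below is hypothetical):

# opt in to the re-exec protection for a program linking against liblxc
LXC_MEMFD_REXEC=1 my-lxc-based-tool attach-to-container

# or compile the feature out on hosts that keep the LXC binaries read-only
./configure --disable-memfd-rexec
make && sudo make install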

Read more

Snapcraft 3.1

snapcraft 3.1 is now available on the stable channel of the Snap Store. This is a new minor release building on top of the foundations laid out by the snapcraft 3.0 release. If you are already on the stable channel for snapcraft, then all you need to do is wait for the snap to be refreshed. The full release notes are replicated below.

Build Environments

It is now possible, when using the base keyword, to once again clean parts.

Read more
jdstrand

Some time ago we started alerting publishers when their stage-packages received a security update since the last time they built a snap. We wanted to create the right balance for the alerts, so the service currently will only alert you when there are new security updates against your stage-packages. In this manner, you can choose not to rebuild your snap (e.g., since it doesn't use the affected functionality of the vulnerable package) and not be nagged every day that you are out of date.

As nice as that is, sometimes you want to check these things yourself or perhaps hook the alerts into some form of automation or tool. While the review-tools had all of the pieces so you could do this, it wasn’t as straightforward as it could be. Now with the latest stable revision of the review-tools, this is easy:

$ sudo snap install review-tools
$ review-tools.check-notices \
  ~/snap/review-tools/common/review-tools_656.snap
{'review-tools': {'656': {'libapt-inst2.0': ['3863-1'],
                          'libapt-pkg5.0': ['3863-1'],
                          'libssl1.0.0': ['3840-1'],
                          'openssl': ['3840-1'],
                          'python3-lxml': ['3841-1']}}}

The review-tools is a strict-mode snap, and while it plugs the home interface, that is only for convenience, so I typically disconnect the interface and put things in its SNAP_USER_COMMON directory, like I did above.
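
Concretely, that looks something like this (the snap file name is just an example):

snap interfaces review-tools                           # see what is connected
sudo snap disconnect review-tools:home                 # drop the convenience home plug
cp my-app_1.0_amd64.snap ~/snap/review-tools/common/   # work from SNAP_USER_COMMON instead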

Since it is now super easy to check a snap on disk, with a little scripting and a cron job you can generate a machine-readable report whenever you want. E.g., you can do something like the following:

$ cat ~/bin/check-snaps
#!/bin/sh
set -e

snaps="review-tools/stable rsync-jdstrand/edge"

tmpdir=$(mktemp -d -p "$HOME/snap/review-tools/common")
cleanup() {
    rm -fr "$tmpdir"
}
trap cleanup EXIT HUP INT QUIT TERM

cd "$tmpdir" || exit 1
for i in $snaps ; do
    snap=$(echo "$i" | cut -d '/' -f 1)
    channel=$(echo "$i" | cut -d '/' -f 2)
    snap download "$snap" "--$channel" >/dev/null
done
cd - >/dev/null || exit 1

/snap/bin/review-tools.check-notices "$tmpdir"/*.snap

Or, if you already have the snaps on disk somewhere, just do:

$ /snap/bin/review-tools.check-notices /path/to/snaps/*.snap

Now you can add the above to cron or some automation tool as a reminder of what needs updates; a hypothetical crontab entry is shown below. Enjoy!
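
# illustrative weekly crontab entry: run every Monday at 06:00 and keep the report around
0 6 * * 1  "$HOME/bin/check-snaps" > "$HOME/check-snaps-report.txt" 2>&1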

Read more
K. Tsakalozos

No.

No.

Read more
Ghost

Welcome to Ghost

Welcome to Ghost

Read more
Ghost

Writing posts with Ghost ✍️

Ghost has a powerful visual editor with familiar formatting options, as well as the ability to seamlessly add dynamic content.

Select the text to add formatting, headers or create links, or use Markdown shortcuts to do the work for you - if that's your thing.

Writing posts with Ghost ✍️

Rich editing at your fingertips

The editor can also handle rich media objects, called cards.

You can insert a card either by clicking the  +  button on a new line, or typing  /  on a new line to search for a particular card. This allows you to efficiently insert images, markdown, html and embeds.

For Example:

  • Insert a video from YouTube directly into your content by pasting the URL
  • Create unique content like a button or content opt-in using the HTML card
  • Need to share some code? Embed code blocks directly
<header class="site-header outer">
    <div class="inner">
        {{> "site-nav"}}
    </div>
</header>

Working with images in posts

You can add images to your posts in many ways:

  • Upload from your computer
  • Click and drag an image into the browser
  • Paste directly into the editor from your clipboard
  • Insert using a URL

Once inserted you can blend images beautifully into your content at different sizes and add captions wherever needed.

Writing posts with Ghost ✍️

The post settings menu and publishing options can be found in the top right hand corner. For more advanced tips on post settings check out the publishing options post!

Read more
Ghost

Publishing options

Publishing options

The Ghost editor has everything you need to fully optimise your content. This is where you can add tags and authors, feature a post, or turn a post into a page.

Access the post settings menu in the top right hand corner of the editor.

Post feature image

Insert your post feature image from the very top of the post settings menu. Consider resizing or optimising your image first to ensure it's an appropriate size.

Structured data & SEO

Customise your social media sharing cards for Facebook and Twitter, enabling you to add custom images, titles and descriptions for social media.

There’s no need to hard code your meta data. You can set your meta title and description using the post settings tool, which has a handy character guide and SERP preview.

Ghost will automatically implement structured data for your publication using JSON-LD to further optimise your content.

{
    "@context": "https://schema.org",
    "@type": "Article",
    "publisher": {
        "@type": "Organization",
        "name": "Publishing options",
        "logo": "https://static.ghost.org/ghost-logo.svg"
    },
    "author": {
        "@type": "Person",
        "name": "Ghost",
        "url": "http://demo.ghost.io/author/ghost/",
        "sameAs": []
    },
    "headline": "Publishing options",
    "url": "http://demo.ghost.io/publishing-options",
    "datePublished": "2018-08-08T11:44:00.000Z",
    "dateModified": "2018-08-09T12:06:21.000Z",
    "keywords": "Getting Started",
    "description": "The Ghost editor has everything you need to fully optimise your content. This is where you can add tags and authors, feature a post, or turn a post into a page.",
    }
}
    

You can test that the structured data schema on your site is working as it should using Google’s structured data tool.

Code Injection

This tool allows you to inject code on a per post or page basis, or across your entire site. This means you can modify CSS, add unique tracking codes, or add other scripts to the head or foot of your publication without making edits to your theme files.

To add code site-wide, use the code injection tool in the main admin menu. This is useful for adding a Facebook Pixel, a Google Analytics tracking code, or to start tracking with any other analytics tool.

To add code to a post or page, use the code injection tool within the post settings menu. This is useful if you want to add art direction, scripts or styles that are only applicable to one post or page.

From here, you might be interested in managing some more specific admin settings!

Read more
Ghost

Managing admin settings

Managing admin settings

There are a couple of things to do next while you're getting set up:

Make your site private

If you've got a publication that you don't want the world to see yet because it's not ready to launch, you can hide your Ghost site behind a basic shared pass-phrase.

You can toggle this preference on at the bottom of Ghost's General Settings:

Managing admin settings

Ghost will give you a short, randomly generated pass-phrase which you can share with anyone who needs access to the site while you're working on it. While this setting is enabled, all search engine optimisation features will be switched off to help keep your site under the radar.

Do remember though, this is not secure authentication. You shouldn't rely on this feature for protecting important private data. It's just a simple, shared pass-phrase for some very basic privacy.


Invite your team

Ghost has a number of different user roles for your team:

Contributors
This is the base user level in Ghost. Contributors can create and edit their own draft posts, but they are unable to edit drafts of others or publish posts. Contributors are untrusted users with the most basic access to your publication.

Authors
Authors are the 2nd user level in Ghost. Authors can write, edit  and publish their own posts. Authors are trusted users. If you don't trust users to be allowed to publish their own posts, they should be set as Contributors.

Editors
Editors are the 3rd user level in Ghost. Editors can do everything that an Author can do, but they can also edit and publish the posts of others - as well as their own. Editors can also invite new Contributors+Authors to the site.

Administrators
The top user level in Ghost is Administrator. Again, administrators can do everything that Authors and Editors can do, but they can also edit all site settings and data, not just content. Additionally, administrators have full access to invite, manage or remove any other user of the site.

The Owner
There is only ever one owner of a Ghost site. The owner is a special user which has all the same permissions as an Administrator, but with two exceptions: The Owner can never be deleted. And in some circumstances the owner will have access to additional special settings if applicable. For example: billing details, if using Ghost(Pro).

It's a good idea to ask all of your users to fill out their user profiles, including bio and social links. These will populate rich structured data for posts and generally create more opportunities for themes to fully populate their design.

Next up: Organising your content

Read more
Ghost

Organising your content

Organising your content

Ghost has a flexible organisational taxonomy called tags which can be used to configure your site structure using dynamic routing.

Basic Tagging

You can think of tags like Gmail labels. By tagging posts with one or more keyword, you can organise articles into buckets of related content.

When you create content for your publication you can assign tags to help differentiate between categories of content.

For example you may tag some content with News and other content with Podcast, which would create two distinct categories of content listed on /tag/news/ and /tag/podcast/, respectively.

If you tag a post with both News and Weather - then it appears in both sections. Tag archives are like dedicated home-pages for each category of content that you have. They have their own pages, their own RSS feeds, and can support their own cover images and meta data.

The primary tag

Inside the Ghost editor, you can drag and drop tags into a specific order. The first tag in the list is always given the most importance, and some themes will only display the primary tag (the first tag in the list) by default.

News, Technology, Startup

So you can add the most important tag which you want to show up in your theme, but also add related tags which are less important.

Private tags

Sometimes you may want to assign a post a specific tag, but you don't necessarily want that tag appearing in the theme or creating an archive page. In Ghost, hashtags are private and can be used for special styling.

For example, if you sometimes publish posts with video content - you might want your theme to adapt and get rid of the sidebar for these posts, to give more space for an embedded video to fill the screen. In this case, you could use private tags to tell your theme what to do.

News, #video

Here, the theme would assign the post publicly displayed tags of News - but it would also keep a private record of the post being tagged with #video. In your theme, you could then look for private tags conditionally and give them special formatting.

You can find documentation for theme development techniques like this and many more over on Ghost's extensive theme documentation.

Dynamic Routing

Dynamic routing gives you the ultimate freedom to build a custom publication to suit your needs. Routes are rules that map URL patterns to your content and templates.

For example, you may not want content tagged with News to exist on: example.com/tag/news. Instead, you want it to exist on example.com/news .

In this case you can use dynamic routes to create customised collections of content on your site. It's also possible to use multiple templates in your theme to render each content type differently.

There are lots of use cases for dynamic routing with Ghost, here are a few common examples:

  • Setting a custom home page with its own template
  • Having separate content hubs for blog and podcast, that render differently, and have custom RSS feeds to support two types of content
  • Creating a founders column as a unique view, by filtering content created by specific authors
  • Including dates in permalinks for your posts
  • Setting posts to have a URL relative to their primary tag like example.com/europe/story-title/
Dynamic routing can be configured in Ghost using YAML files. Read our dynamic routing documentation for further details.

You can further customise your site using Apps & Integrations.

Read more
Ghost

Apps & integrations

Apps & integrations

There are three primary ways to work with third-party services in Ghost: using Zapier, editing your theme, or using the Ghost API.

Zapier

You can connect your Ghost site to over 1,000 external services using the official integration with Zapier.

Zapier sets up automations with Triggers and Actions, which allows you to create and customise a wide range of connected applications.

Example: When someone new subscribes to a newsletter on a Ghost site (Trigger) then the contact information is automatically pushed into MailChimp (Action).

Here are the most popular Ghost<>Zapier automation templates:

Editing your theme

One of the biggest advantages of using Ghost over centralised platforms is that you have total control over the front end of your site. Either customise your existing theme, or create a new theme from scratch with our Theme SDK.

You can integrate any front end code into a Ghost theme without restriction, and it will work just fine. No restrictions!

Here are some common examples:

  • Include comments on a Ghost blog with Disqus or Discourse
  • Implement MathJAX with a little bit of JavaScript
  • Add syntax highlighting to your code snippets using Prism.js
  • Integrate any dynamic forms from Google or Typeform to capture data
  • Just about anything which uses JavaScript, APIs and Markup.

Using the Public API

Ghost itself is driven by a set of core APIs, and so you can access the Public Ghost JSON API from external webpages or applications in order to pull data and display it in other places.

The Ghost API is thoroughly documented and straightforward to work with for developers of almost any level.

Alright, the last post in our welcome-series! If you're curious about creating your own Ghost theme from scratch, here are some more details on how that works.

Read more
Ghost

Creating a custom theme

Creating a custom theme

Ghost comes with a beautiful default theme called Casper, which is designed to be a clean, readable publication layout and can be adapted for most purposes. However, Ghost can also be completely themed to suit your needs. Rather than just giving you a few basic settings which act as a poor proxy for code, we just let you write code.

There are a huge range of both free and premium pre-built themes which you can get from the Ghost Theme Marketplace, or you can create your own from scratch.

Creating a custom theme
Anyone can write a completely custom Ghost theme with some solid knowledge of HTML and CSS

Ghost themes are written with a templating language called handlebars, which has a set of dynamic helpers to insert your data into template files. For example: {{author.name}} outputs the name of the current author.

The best way to learn how to write your own Ghost theme is to have a look at the source code for Casper, which is heavily commented and should give you a sense of how everything fits together.

  • default.hbs is the main template file, all contexts will load inside this file unless specifically told to use a different template.
  • post.hbs is the file used in the context of viewing a post.
  • index.hbs is the file used in the context of viewing the home page.
  • and so on

We've got full and extensive theme documentation which outlines every template file, context and helper that you can use.

If you want to chat with other people making Ghost themes to get any advice or help, there's also a themes section on our public Ghost forum.

Read more
Colin Watson

Git per-branch permissions

We’ve had Git hosting support in Launchpad for a few years now. One thing that some users asked for, particularly larger users such as the Ubuntu kernel team, was the ability to set up per-branch push permissions for their repositories. Today we rolled out the last piece of this work.

Launchpad’s default behaviour is that repository owners may push anything to their own repositories, including creating new branches, force-pushing (rewriting history), and deleting branches, while nobody else may push anything. Repository owners can now also choose to protect branches or tags, either individually or using wildcard rules. If a branch is protected, then by default repository owners can only create or push it but cannot force-push or delete; if a tag is protected, then by default repository owners can create it but cannot move or delete it.

You can also allow selected contributors to push to protected branches or tags, so if you’re collaborating with somebody on a branch and just want to be able to quickly pair-program via git push, or you want a merge robot to be able to land merge proposals in your repository without having to add it to the team that owns the repository and thus give it privileges it doesn’t need, then this feature may be for you.

There’s some initial documentation on our help site, and here’s a screenshot of a repository that’s been set up to give a contributor push access to a single branch:

Read more
Christian Brauner

Android Binderfs

[asciicast demo]

Introduction

Android Binder is an inter-process communication (IPC) mechanism. It is heavily used in all Android devices. The binder kernel driver has been present in the upstream Linux kernel for quite a while now.

Binder has been a controversial patchset (see this lwn article as an example). Its design was considered wrong and to violate certain core kernel design principles (e.g. a task should never touch another task’s file descriptor table). Most kernel developers were not fans of binder.

Recently, the upstream binder code has fortunately been reworked significantly (e.g. it no longer touches another task’s file descriptor table, the locking is now very fine-grained, etc.).

With Android being one of the major operating systems (OS) for a vast number of devices there is simply no way around binder.

The Android Service Manager

The binder IPC mechanism is accessible from userspace through device nodes located at /dev. A modern Android system will allocate three device nodes:

  • /dev/binder
  • /dev/hwbinder
  • /dev/vndbinder

serving different purposes. However, the logic is the same for all three of them. A process can call open(2) on those device nodes to receive an fd which it can then use to issue requests via ioctl(2)s. Android has a service manager which is used to translate addresses to bus names and only the address of the service manager itself is well-known. The service manager is registered through an ioctl(2) and there can only be a single service manager. This means once a service manager has grabbed hold of binder devices they cannot be (easily) reused by a second service manager.
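
To get a feel for this interface, here is a minimal sketch (my own illustration, not code from Android) that opens one of these device nodes and issues the simplest possible request, the BINDER_VERSION ioctl(2) from linux/android/binder.h, to query the binder protocol version (run as root on a system with the binder driver):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/android/binder.h>

int main(void)
{
        struct binder_version version = { 0 };
        int fd;

        /* Any of the binder device nodes would do here. */
        fd = open("/dev/binder", O_RDWR | O_CLOEXEC);
        if (fd < 0) {
                printf("%s - Failed to open /dev/binder\n", strerror(errno));
                exit(EXIT_FAILURE);
        }

        if (ioctl(fd, BINDER_VERSION, &version) < 0) {
                printf("%s - BINDER_VERSION ioctl failed\n", strerror(errno));
                close(fd);
                exit(EXIT_FAILURE);
        }

        printf("binder protocol version: %d\n", version.protocol_version);
        close(fd);
        exit(EXIT_SUCCESS);
}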

Running Android in Containers

This matters as soon as multiple instances of Android are supposed to be run, since they will all need their own private binder devices. This is a use-case that arises pretty naturally when running Android in system containers. People have been doing this for a long time with LXC. A project that has set out to make running Android in LXC containers very easy is Anbox. Anbox makes it possible to run hundreds of Android containers.

To properly run Android in a container it is necessary that each container has a set of private binder devices.

Statically Allocating binder Devices

Binder devices are currently statically allocated at compile time. Before compiling a kernel, the CONFIG_ANDROID_BINDER_DEVICES option needs to be set in the kernel config (Kconfig) to the names of the binder devices to allocate at boot. By default it is set as:

CONFIG_ANDROID_BINDER_DEVICES="binder,hwbinder,vndbinder"

To allocate additional binder devices, the user needs to specify them with this Kconfig option. This is problematic since users need to know in advance the maximum number of containers they will run, calculate the number of binder devices they need, and specify them in the Kconfig. When the number of needed binder devices changes after kernel compilation, the only way to get additional devices is to recompile the kernel.
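
For example, provisioning binder devices for two extra containers up front would mean rebuilding the kernel with something along these lines (the extra device names here are arbitrary choices of mine):

CONFIG_ANDROID_BINDER_DEVICES="binder,hwbinder,vndbinder,container0binder,container1binder"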

Problem 1: Using the misc major Device Number

This situation is aggravated by the fact that binder devices use the misc major number in the kernel. Each device node in the Linux kernel is identified by a major and a minor number. A device can request its own major number. If it does, it will have an exclusive range of minor numbers it doesn’t share with anything else and is free to hand out. Or it can use the misc major number. The misc major number is shared amongst different devices. However, that also means the number of minor devices that can be handed out is limited by all users of the misc major. So if a user requests a very large number of binder devices in their Kconfig, they might make it impossible for anyone else to allocate minor numbers, or there simply might not be enough left for binder itself.
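
As an aside, the major and minor numbers of any device node can be inspected with stat(2); a minimal sketch (my own illustration, the devnum name is made up):

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

int main(int argc, char *argv[])
{
        struct stat st;

        /* Usage: ./devnum /dev/binder */
        if (argc != 2 || stat(argv[1], &st) < 0)
                exit(EXIT_FAILURE);

        printf("%s: major %u, minor %u\n", argv[1],
               major(st.st_rdev), minor(st.st_rdev));
        exit(EXIT_SUCCESS);
}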

Problem 2: Containers and IPC namespaces

All of those binder devices requested in the Kconfig via CONFIG_ANDROID_BINDER_DEVICES will be allocated at boot and placed in the host’s devtmpfs mount, usually located at /dev, or, depending on the udev(7) implementation, created via mknod(2) by udev(7) at boot. That means all of those devices initially belong to the host IPC namespace. However, containers usually run in their own IPC namespace, separate from the host’s. But when binder devices located in /dev are handed to containers (e.g. with a bind-mount), the kernel driver will not know that these devices are now used in a different IPC namespace, since the driver is not IPC namespace aware. This is not a serious technical issue but a serious conceptual one. There should be a way to have per-IPC-namespace binder devices.
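
For illustration, a container manager would typically hand one of the host’s binder nodes to a container with a bind-mount along these lines (the container rootfs path here is made up):

mount --bind /dev/binder /var/lib/lxc/android0/rootfs/dev/binder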

Enter binderfs

To solve both problems we came up with a solution that I presented at the Linux Plumbers Conference in Vancouver this year. There’s a video of that presentation available on YouTube:

Android binderfs is a tiny filesystem that allows users to dynamically allocate binder devices, i.e. it allows adding and removing binder devices at runtime, which means it solves problem 1. Additionally, binder devices located in a new binderfs instance are independent of binder devices located in another binderfs instance. All binder devices in binderfs instances are also independent of the binder devices allocated during boot specified in CONFIG_ANDROID_BINDER_DEVICES. This means binderfs solves problem 2.

Android binderfs can be mounted via:

mount -t binder binder /dev/binderfs

at which point a new instance of binderfs will show up at /dev/binderfs. In a fresh instance of binderfs no binder devices will be present. There will only be a binder-control device which serves as the request handler for binderfs:

root@edfu:~# ls -al /dev/binderfs/
total 0
drwxr-xr-x  2 root root      0 Jan 10 15:07 .
drwxr-xr-x 20 root root   4260 Jan 10 15:07 ..
crw-------  1 root root 242, 6 Jan 10 15:07 binder-control

binderfs: Dynamically Allocating a New binder Device

To allocate a new binder device in a binderfs instance a request needs to be sent through the binder-control device node. A request is sent in the form of an ioctl(2). Here’s an example program:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/android/binder.h>
#include <linux/android/binderfs.h>

/*
 * Usage: ./bfs <path-to-binder-control> <new-device-name>
 * Allocates a new binder device in the binderfs instance that
 * contains the given binder-control node.
 */
int main(int argc, char *argv[])
{
        int fd, ret, saved_errno;
        size_t len;
        struct binderfs_device device = { 0 };

        /* Expect exactly two arguments: binder-control path and device name. */
        if (argc != 3)
                exit(EXIT_FAILURE);

        len = strlen(argv[2]);
        if (len > BINDERFS_MAX_NAME)
                exit(EXIT_FAILURE);

        memcpy(device.name, argv[2], len);

        fd = open(argv[1], O_RDONLY | O_CLOEXEC);
        if (fd < 0) {
                printf("%s - Failed to open binder-control device\n",
                       strerror(errno));
                exit(EXIT_FAILURE);
        }

        ret = ioctl(fd, BINDER_CTL_ADD, &device);
        saved_errno = errno;
        close(fd);
        errno = saved_errno;
        if (ret < 0) {
                printf("%s - Failed to allocate new binder device\n",
                       strerror(errno));
                exit(EXIT_FAILURE);
        }

        printf("Allocated new binder device with major %d, minor %d, and "
               "name %s\n", device.major, device.minor,
               device.name);

        exit(EXIT_SUCCESS);
}

What this program does is open the binder-control device node and send a BINDER_CTL_ADD request to the kernel. Users of binderfs need to tell the kernel which name the new binder device should get. By default a name can contain up to 256 chars, including the terminating zero byte. The struct which is used is:

/**
 * struct binderfs_device - retrieve information about a new binder device
 * @name:   the name to use for the new binderfs binder device
 * @major:  major number allocated for binderfs binder devices
 * @minor:  minor number allocated for the new binderfs binder device
 *
 */
struct binderfs_device {
       char name[BINDERFS_MAX_NAME + 1];
       __u32 major;
       __u32 minor;
};

and is defined in linux/android/binderfs.h. Once the request is made via an ioctl(2) passing a struct binderfs_device with the name to the kernel, it will allocate a new binder device and return the major and minor number of the new device in the struct (this is necessary because binderfs allocates its major device number dynamically at boot). After the ioctl(2) returns there will be a new binder device located under /dev/binderfs with the chosen name:

root@edfu:~# ls -al /dev/binderfs/
total 0
drwxr-xr-x  2 root root      0 Jan 10 15:19 .
drwxr-xr-x 20 root root   4260 Jan 10 15:07 ..
crw-------  1 root root 242, 0 Jan 10 15:19 binder-control
crw-------  1 root root 242, 1 Jan 10 15:19 my-binder
crw-------  1 root root 242, 2 Jan 10 15:19 my-binder1
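
For reference, a sketch of how the example program can be built and used to create devices like the ones listed above (the file name bfs.c is my own choice, the resulting bfs binary also shows up in the user-namespace example further down, and recent kernel UAPI headers providing linux/android/binderfs.h are assumed):

gcc -o bfs bfs.c
./bfs /dev/binderfs/binder-control my-binder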

binderfs: Deleting a binder Device

Deleting binder devices does not involve issuing another ioctl(2) request through binder-control. They can be deleted via unlink(2). This means that the rm(1) tool can be used to delete them:

root@edfu:~# rm /dev/binderfs/my-binder1
root@edfu:~# ls -al /dev/binderfs/
total 0
drwxr-xr-x  2 root root      0 Jan 10 15:19 .
drwxr-xr-x 20 root root   4260 Jan 10 15:07 ..
crw-------  1 root root 242, 0 Jan 10 15:19 binder-control
crw-------  1 root root 242, 1 Jan 10 15:19 my-binder

Note that the binder-control device cannot be deleted since this would make the binderfs instance unusable. The binder-control device will be deleted when the binderfs instance is unmounted and all references to it have been dropped.
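
The same can of course be done programmatically, for example from a container manager; a minimal sketch using unlink(2) (my own illustration):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        if (argc != 2)
                exit(EXIT_FAILURE);

        /* No ioctl(2) needed: removing the device node is enough. */
        if (unlink(argv[1]) < 0) {
                printf("%s - Failed to delete binder device\n",
                       strerror(errno));
                exit(EXIT_FAILURE);
        }

        exit(EXIT_SUCCESS);
}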

binderfs: Mounting Multiple Instances

Mounting another binderfs instance at a different location will create a new instance, separate from all other binderfs mounts. This is identical to the behavior of devpts, tmpfs, and also, even though it was never merged into the kernel, kdbusfs:

root@edfu:~# mkdir binderfs1
root@edfu:~# mount -t binder binder binderfs1
root@edfu:~# ls -al binderfs1/
total 4
drwxr-xr-x  2 root   root        0 Jan 10 15:23 .
drwxr-xr-x 72 ubuntu ubuntu   4096 Jan 10 15:23 ..
crw-------  1 root   root   242, 2 Jan 10 15:23 binder-control

There is no my-binder device in this new binderfs instance since its devices are not related to those in the binderfs instance at /dev/binderfs. This means users can easily get their private set of binder devices.

binderfs: Mounting binderfs in User Namespaces

The Android binderfs filesystem can be mounted and used to allocate new binder devices in user namespaces. This has the advantage that binderfs can be used in unprivileged containers or any user-namespace-based sandboxing solution:

ubuntu@edfu:~$ unshare --user --map-root --mount
root@edfu:~# mkdir binderfs-userns
root@edfu:~# mount -t binder binder binderfs-userns/
root@edfu:~# # The "bfs" binary used here is the compiled program from above
root@edfu:~# ./bfs binderfs-userns/binder-control my-user-binder
Allocated new binder device with major 242, minor 4, and name my-user-binder
root@edfu:~# ls -al binderfs-userns/
total 4
drwxr-xr-x  2 root root      0 Jan 10 15:34 .
drwxr-xr-x 73 root root   4096 Jan 10 15:32 ..
crw-------  1 root root 242, 3 Jan 10 15:34 binder-control
crw-------  1 root root 242, 4 Jan 10 15:36 my-user-binder

Kernel Patchsets

The binderfs patchset is merged upstream and will be available when Linux 5.0 gets released. There are a few outstanding patches currently waiting in Greg’s tree (cf. "binderfs: remove wrong kern_mount() call" and "binderfs: make each binderfs mount a new instance" in char-misc-linus), and some others are queued for the 5.1 merge window. But overall it seems to be in decent shape.

Read more
Colin Ian King

Last year I wrote about kernel commits that are tagged with the "Fixes" tag. Kernel developers use the "Fixes" tag on a bug fix commit to reference an older commit that originally introduced the bug.   The adoption of the tag has been steadily increasing since v3.12 of the kernel:

The red line shows the number of commits per release of the kernel, and the blue line shows the number of commits that contain a "Fixes" tag.

In terms of the percentage of commits that contain the "Fixes" tag, one can see it has been steadily increasing since v3.12; almost 12.5% of kernel commits in v4.20 are tagged this way.

The "Fixes" tag contains the commit SHA of the commit that was fixed, so one can look up the dates of both the fix and the commit being fixed and determine the time taken to fix a bug.

As one can see, a lot of issues get fixed in the first few hundred days, and some bugs take years to get fixed. Zooming into the first hundred days of fixes, the distribution looks like:


...the modal point is at day 4; I suspect these are issues that get found quickly when commits land in linux-next and are caught in early testing, integration builds and static analysis.

From the thousands of "Fixes"-tagged commits and the time taken to fix each issue, one can determine how long it takes to fix a specific percentage of the bugs:


In the graph above, 50% of fixes are made within 151 days of the original commit, ~69% of fixes are made within a year of the original commit and ~82% of fixes are made within 2 years. The long tail indicates that some bugs take a while to be found and fixed; the final 10% of bugs take more than 3.5 years to be found and fixed.

Comparing the time to fix issues for kernel versions v4.0, v4.10 and v4.20, for bugs that are fixed in less than 50 days, we have:


... the trends are similar; however, it is worth noting that more bugs are getting found and fixed a little faster in v4.10 and v4.20 than in v4.0. It will be interesting to see how these trends develop over the next few years.
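
The exact scripts behind these numbers aren’t shown here; as a rough sketch (my own, ignoring plenty of edge cases such as malformed tags), time-to-fix figures of this kind can be pulled out of a kernel git tree with something along these lines:

# Print, for each "Fixes"-tagged commit between v3.12 and v4.20, the number
# of days between the commit that introduced the bug and the commit fixing it.
git log --no-merges --grep='Fixes:' --format='%H' v3.12..v4.20 |
while read -r fix; do
        # Take the SHA referenced by the first Fixes: line of the commit.
        orig=$(git show -s --format='%B' "$fix" |
               sed -n 's/^Fixes: \([0-9a-f]\{7,\}\).*/\1/p' | head -n1)
        [ -z "$orig" ] && continue
        t_fix=$(git show -s --format='%ct' "$fix")
        t_orig=$(git show -s --format='%ct' "$orig" 2>/dev/null) || continue
        echo $(( (t_fix - t_orig) / 86400 ))
done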

Read more