Canonical Voices

K. Tsakalozos

If you take a look at MicroK8s’ channel information with snap info microk8s you will see all available Kubernetes releases:

channels:
stable: v1.14.1 2019-04-18 (522) 214MB classic
candidate: v1.14.1 2019-04-15 (522) 214MB classic
beta: v1.14.1 2019-04-15 (522) 214MB classic
edge: v1.14.1 2019-05-10 (587) 217MB classic
1.15/stable: –
1.15/candidate: –
1.15/beta: –
1.15/edge: v1.15.0-alpha.3 2019-05-08 (578) 215MB classic
1.14/stable: v1.14.1 2019-04-18 (521) 214MB classic
1.14/candidate: v1.14.1 2019-04-15 (521) 214MB classic
1.14/beta: v1.14.1 2019-04-15 (521) 214MB classic
1.14/edge: v1.14.1 2019-05-11 (590) 217MB classic
1.13/stable: v1.13.5 2019-04-22 (526) 237MB classic
1.13/candidate: v1.13.6 2019-05-09 (581) 237MB classic
1.13/beta: v1.13.6 2019-05-09 (581) 237MB classic
1.13/edge: v1.13.6 2019-05-08 (581) 237MB classic
1.12/stable: v1.12.8 2019-05-02 (547) 259MB classic
1.12/candidate: v1.12.8 2019-05-01 (547) 259MB classic
1.12/beta: v1.12.8 2019-05-01 (547) 259MB classic
1.12/edge: v1.12.8 2019-04-24 (547) 259MB classic
1.11/stable: v1.11.10 2019-05-10 (557) 258MB classic
1.11/candidate: v1.11.10 2019-05-02 (557) 258MB classic
1.11/beta: v1.11.10 2019-05-02 (557) 258MB classic
1.11/edge: v1.11.10 2019-05-01 (557) 258MB classic
1.10/stable: v1.10.13 2019-04-22 (546) 222MB classic
1.10/candidate: v1.10.13 2019-04-22 (546) 222MB classic
1.10/beta: v1.10.13 2019-04-22 (546) 222MB classic
1.10/edge: v1.10.13 2019-04-22 (546) 222MB classic

If you want to follow the v1.14 Kubernetes releases you would:

sudo snap install microk8s --classic --channel=1.14/stable

Whereas if you always want to be on the latest stable release you would:

sudo snap install microk8s --classic

What is new in the channels list above is the pre-stable releases found under the 1.15 track (at the time of this writing the latest stable release is v1.14).

Following the pre-stable releases

We are committed to shipping MicroK8s with pre-stable releases under the following scheme.

  • The edge channel (eg 1.15/edge) holds the alpha upstream releases.
  • The beta channel (eg 1.15/beta) holds the beta upstream releases.
  • The candidate channel (eg 1.15/candidate) holds the release candidate of upstream releases.

Pre-stable releases will be available the same day they are released upstream.

If you want to test your work against the alpha 1.15 release simply do:

sudo snap install microk8s --classic --channel=1.15/edge

However, be aware that pre-stable releases may change before the stable release. Be sure to test any work against the stable release once it becomes available.

Tracks with stable releases

Tracks are meant to serve specific Kubernetes releases. For example the 1.15 track with its four channels, 1.15/edge, 1.15/beta, 1.15/candidate, 1.15/stable, serves the v1.15 K8s release. As soon as a new K8s stable release is made, all channels of the corresponding track are updated. In our example, as soon as v1.15 stable is released the corresponding track channels are updated in the following way:

  • The 1.15/edge channel is updated on every commit merged on the MicroK8s repository paired with the v1.15 stable K8s release.
  • The 1.15/beta and 1.15/candidate channels are updated on every upstream patch release. They hold whatever the 1.15/edge channel has at the time of the patch release.
  • The 1.15/stable channel gets updated with what 1.15/candidate holds a week after a new revision is put into 1.15/candidate.

I am confused. Which channel is right for me?

The single question you need to answer is what to put in the channel argument below:

sudo snap install microk8s --classic --channel=<What_to_use_here?>

Here are some suggestions for the channel to use based on your needs:

  • I want to always be on the latest stable Kubernetes.
    Use --channel=latest
  • I want to always be on the latest release in a specific upstream K8s release.
    Use --channel=<release>/stable eg --channel=1.14/stable.
  • I want to test-drive a pre-stable release.
    Use --channel=<next_release>/edge for alpha releases
    Use --channel=<next_release>/beta for beta releases
    Use --channel=<next_release>/candidate for candidate releases
  • I am waiting for a bug fix on MicroK8s:
    Use --channel=<release>/edge
  • I am waiting for a bug fix on upstream Kubernetes:
    Use --channel=<release>/candidate
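The suggestions above can be read as a small lookup table. The following is a minimal sketch, not part of MicroK8s: the function name pick_channel and the need keywords are made up for illustration, and only the channel values come from the list above.

```shell
# pick_channel NEED [RELEASE]
# Hypothetical helper: prints the --channel value suggested above for a need.
pick_channel() {
  case "$1" in
    latest-stable)      echo "latest" ;;
    release-stable)     echo "$2/stable" ;;
    alpha|microk8s-fix) echo "$2/edge" ;;
    beta)               echo "$2/beta" ;;
    rc|upstream-fix)    echo "$2/candidate" ;;
    *) echo "unknown need: $1" >&2; return 1 ;;
  esac
}

# Example: follow the v1.14 stable releases
#   sudo snap install microk8s --classic --channel="$(pick_channel release-stable 1.14)"
```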

Developing K8s core services with MicroK8s

One of the purposes of pre-stable releases is to assist K8s core service developers in their task. Let’s see how we can hook a local build of kubelet to a MicroK8s deployment.

Following the build instructions for Kubernetes we:

git clone https://github.com/kubernetes/kubernetes
cd kubernetes
build/run.sh make kubelet

The kubelet binary should be available under:

_output/dockerized/bin/linux/amd64/kubelet

Let’s grab a MicroK8s deployment:

sudo snap install microk8s --classic --channel=1.15/edge

To see what arguments the kubelet is running with we:

> ps -ef | grep kubelet
root 24184 1 2 17:28 ? 00:00:54 /snap/microk8s/578/kubelet
--kubeconfig=/snap/microk8s/578/configs/kubelet.config
--cert-dir=/var/snap/microk8s/578/certs
--client-ca-file=/var/snap/microk8s/578/certs/ca.crt
--anonymous-auth=false
--network-plugin=kubenet
--root-dir=/var/snap/microk8s/common/var/lib/kubelet
--fail-swap-on=false
--pod-cidr=10.1.1.0/24
--non-masquerade-cidr=10.152.183.0/24
--cni-bin-dir=/snap/microk8s/578/opt/cni/bin/
--feature-gates=DevicePlugins=true
--eviction-hard=memory.available<100Mi,nodefs.available<1Gi,imagefs.available<1Gi
--container-runtime=remote
--container-runtime-endpoint=/var/snap/microk8s/common/run/containerd.sock
--node-labels=microk8s.io/cluster=true

We now need to stop the kubelet that comes with MicroK8s and start our own build:

sudo systemctl stop snap.microk8s.daemon-kubelet.service
sudo _output/dockerized/bin/linux/amd64/kubelet \
  --kubeconfig=/snap/microk8s/578/configs/kubelet.config \
  --cert-dir=/var/snap/microk8s/578/certs \
  --client-ca-file=/var/snap/microk8s/578/certs/ca.crt \
  --anonymous-auth=false --network-plugin=kubenet \
  --root-dir=/var/snap/microk8s/common/var/lib/kubelet \
  --fail-swap-on=false --pod-cidr=10.1.1.0/24 \
  --container-runtime=remote \
  --container-runtime-endpoint=/var/snap/microk8s/common/run/containerd.sock \
  --node-labels=microk8s.io/cluster=true \
  --eviction-hard='memory.available<100Mi,nodefs.available<1Gi,imagefs.available<1Gi'

That’s it! Your kubelet now runs in place of the one in MicroK8s! You have to admit it is as simple as it gets.

What you should be aware of is that some microk8s commands restart services through systemd. For example, microk8s.enable dns will trigger a restart of services, including the kubelet shipped with MicroK8s.
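Copying the flags by hand from the ps output is error-prone. One possible shortcut, sketched here under assumptions, is to capture the command line of the running kubelet before stopping it and strip the binary path. The helper name kubelet_flags is hypothetical, and this approach passes every flag the snap's kubelet was started with.

```shell
# kubelet_flags CMDLINE
# Hypothetical helper: drops the first word (the binary path) of a command
# line and prints the remaining flags.
kubelet_flags() {
  case "$1" in
    *" "*) printf '%s\n' "${1#* }" ;;
    *)     printf '\n' ;;
  esac
}

# Possible use (assumes a single kubelet process, the one started by MicroK8s):
#   args="$(ps -o args= -p "$(pgrep -o kubelet)")"
#   sudo systemctl stop snap.microk8s.daemon-kubelet.service
#   # word splitting of the flags is intentional here
#   sudo _output/dockerized/bin/linux/amd64/kubelet $(kubelet_flags "$args")
```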

Happy coding!

Kubernetes pre-stable releases now available with MicroK8s was originally published in ITNEXT on Medium.

Read more
K. Tsakalozos

Kubernetes manages containerised applications. The container images are found either locally, or fetched from a remote registry. We recently released MicroK8s and noticed that some of our users were not comfortable with configuring containerd with image registries. In this blog we go through a few workflows most people are following.

What you should know already

We discuss how to consume local images, or images fetched from public and private registries in Kubernetes configured with containerd.

To get one such cluster simply:

sudo snap install microk8s --classic

Familiarity with building, pushing and tagging container images will be helpful. The examples that follow use Docker but you can use your preferred container tool chain.

To install Docker on Ubuntu 18.04:

sudo apt-get install docker.io

Add the user to the docker group:

sudo usermod -aG docker ${USER}

Open a new shell for the user, with updated group membership:

su - ${USER}

The Dockerfile we will be using is:

FROM nginx:alpine

To build the image tagged with mynginx:local, go to the directory where the Dockerfile is and run:

docker build . -t mynginx:local

Working with locally built images without a registry

When an image is built it is cached on the Docker daemon used during the build. Having run the docker build . -t mynginx:local command, you can see the newly built image by running:

docker images

This will list the images currently known to Docker, for example:

REPOSITORY          TAG                 IMAGE ID            SIZE
mynginx             local               0be75340bd9b        16.1MB

The image we created is known to Docker. Kubernetes is not aware of the newly built image as your local Docker daemon is not part of the MicroK8s Kubernetes cluster. We can export the built image from the local Docker daemon and “inject” it into the MicroK8s image cache like this:

docker save mynginx > myimage.tar
microk8s.ctr -n k8s.io image import myimage.tar

Note that when we import the image to MicroK8s we do so under the k8s.io namespace (the -n k8s.io argument).

Now we can list the images present in MicroK8s:

microk8s.ctr -n k8s.io images ls
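The listing can be long, so it may help to check for one reference only. Below is a minimal sketch: the function name image_in_listing is hypothetical, and it only assumes that the image reference is the first column of each listing line.

```shell
# image_in_listing REF
# Reads an image listing on stdin (for instance the output of
# `microk8s.ctr -n k8s.io images ls`) and succeeds if the first column of
# any line equals REF.
image_in_listing() {
  awk -v ref="$1" '$1 == ref { found = 1 } END { exit !found }'
}

# Possible use; the docker.io/library/ prefix is an assumption about how the
# imported image is named, so look at the full listing if the check fails:
#   microk8s.ctr -n k8s.io images ls | image_in_listing docker.io/library/mynginx:local
```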

At this point we are ready to microk8s.kubectl apply -f a deployment with this image:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: mynginx:local
        ports:
        - containerPort: 80

We reference the image with image: mynginx:local. Kubernetes will behave as if there is an image in docker.io (the Docker Hub registry) for which it already has a cached copy. Note that containerd will not cache images with the latest tag, so make sure you do not use it.
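Because of the caveat about the latest tag, it can be worth checking an image reference before writing it into a deployment. This is a minimal sketch with a hypothetical function name, has_pinned_tag. It treats a missing tag the same as latest, since an untagged reference defaults to latest.

```shell
# has_pinned_tag IMAGE
# Hypothetical check: succeeds only if IMAGE carries an explicit tag other
# than "latest". Only the last path component is inspected, so a registry
# port such as localhost:32000 is not mistaken for a tag.
has_pinned_tag() {
  name="${1##*/}"
  case "$name" in
    *:latest) return 1 ;;
    *:?*)     return 0 ;;
    *)        return 1 ;;
  esac
}

# Possible use:
#   has_pinned_tag mynginx:local || echo "use an explicit, non-latest tag"
```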

Working with public registries

After building an image with docker build . -t mynginx:local, it can be pushed to one of the mainstream public registries. You will need to create an account and register a username with the registry provider. For this example, we created an account with https://hub.docker.com/ and we log in as kjackal.

First we run the login command:

docker login

Docker will ask for a Docker ID and password to complete the login.

Login with your Docker ID to push and pull images from Docker Hub. If you don’t have a Docker ID, head over to https://hub.docker.com to create one.
Username: kjackal
Password: *******

Pushing to the registry requires that the image is tagged with your-hub-username/image-name:tag. We can either add proper tagging during build:

docker build . -t kjackal/mynginx:public

Or tag an already existing image using the image ID. Obtain the ID by running:

docker images

The ID is listed in the output:

REPOSITORY          TAG                 IMAGE ID            SIZE
mynginx             local               0be75340bd9b        16.1MB

Then use the tag command:

docker tag 0be75340bd9b kjackal/mynginx:public

Now that the image is tagged correctly, it can be pushed to the registry:

docker push kjackal/mynginx

At this point we are ready to microk8s.kubectl apply -f a deployment with our image:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: kjackal/mynginx:public
        ports:
        - containerPort: 80

We refer to the image as image: kjackal/mynginx:public. Kubernetes will search for the image in its default registry, docker.io.

Working with MicroK8s’ registry add-on

Having a private Docker registry can significantly improve your productivity by reducing the time spent in uploading and downloading images. The registry shipped with MicroK8s is hosted within the Kubernetes cluster and is exposed as a NodePort service on port 32000 of the localhost. Note that this is an insecure registry and you may need to take extra steps to limit access to it.

You can install the registry with:

microk8s.enable registry

The add-on registry is backed by a 20Gi persistent volume claim for storing images. To satisfy this claim the storage add-on is also enabled along with the registry.

The containerd daemon used by MicroK8s is configured to trust this insecure registry. To upload images we have to tag them with localhost:32000/your-image before pushing them.

We can either add proper tagging during build:

docker build . -t localhost:32000/mynginx:registry

Or tag an already existing image using the image ID. Obtain the ID by running:

docker images

The ID is listed in the output:

REPOSITORY                TAG             IMAGE ID       SIZE
localhost:32000/mynginx   registry        0be75340bd9b   16.1MB

Then use the tag command:

docker tag 0be75340bd9b localhost:32000/mynginx:registry

Now that the image is tagged correctly, it can be pushed to the registry:

docker push localhost:32000/mynginx

Pushing to this insecure registry may fail in some versions of Docker unless the daemon is explicitly configured to trust it. To address this we need to edit /etc/docker/daemon.json and add:

{
"insecure-registries" : ["localhost:32000"]
}

The new configuration should be loaded with a Docker daemon restart:

sudo systemctl restart docker
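The same fragment is needed again later for other registry endpoints, so it can be generated rather than typed. The sketch below uses a hypothetical function name, insecure_registry_json, and prints a complete file rather than merging into an existing one. It is only safe to write out as-is when /etc/docker/daemon.json does not yet hold other settings.

```shell
# insecure_registry_json ENDPOINT
# Prints a daemon.json document that marks ENDPOINT as an insecure registry.
# Caution: this is a whole file, not a merge; existing settings in
# /etc/docker/daemon.json would need to be combined by hand.
insecure_registry_json() {
  printf '{\n  "insecure-registries": ["%s"]\n}\n' "$1"
}

# Possible use on a machine with no existing daemon.json:
#   insecure_registry_json localhost:32000 | sudo tee /etc/docker/daemon.json
#   sudo systemctl restart docker
```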

At this point we are ready to microk8s.kubectl apply -f a deployment with our image:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: localhost:32000/mynginx:registry
        ports:
        - containerPort: 80

What if MicroK8s runs inside a VM?
Often MicroK8s is placed in a VM while the development process takes place on the host machine. In this setup pushing container images to the in-VM registry requires some extra configuration.

Let’s assume the IP of the VM running MicroK8s is 10.141.241.175. When we are on the host the Docker registry is not on localhost:32000 but on 10.141.241.175:32000. As a result the first thing we need to do is to tag the image we are building on the host with the right registry endpoint:

docker build . -t 10.141.241.175:32000/mynginx:registry

If we immediately try to push the mynginx image we will fail because the local Docker does not trust the in-VM registry. Here is what happens if we try a push:

> docker push 10.141.241.175:32000/mynginx
The push refers to repository [10.141.241.175:32000/mynginx]
Get https://10.141.241.175:32000/v2/: http: server gave HTTP response to HTTPS client

We need to be explicit and configure the Docker daemon running on the host to trust the in-VM insecure registry. Add the registry endpoint in /etc/docker/daemon.json:

{
"insecure-registries" : ["10.141.241.175:32000"]
}

Then restart the docker daemon on the host to load the new configuration:

sudo systemctl restart docker

We can now docker push 10.141.241.175:32000/mynginx and see the image getting uploaded. During the push our Docker client instructs the in-host Docker daemon to upload the newly built image to the 10.141.241.175:32000 endpoint as marked by the tag on the image. The Docker daemon sees (in /etc/docker/daemon.json) that it trusts the registry and proceeds with uploading the image.

Consuming the image from inside the VM involves no changes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: localhost:32000/mynginx:registry
        ports:
        - containerPort: 80

We reference the image with localhost:32000/mynginx:registry since the registry runs inside the VM so it is on localhost:32000.

Working with a private registry

Often organisations have their own private registry to assist collaboration and accelerate development. Kubernetes (and thus MicroK8s) needs to be aware of the registry endpoints before being able to pull container images.

Insecure registry
Let’s assume the private insecure registry is at 10.141.241.175 on port 32000. The images we build need to be tagged with the registry endpoint:

docker build . -t 10.141.241.175:32000/mynginx:registry

Pushing the mynginx image at this point will fail because the local Docker does not trust the private insecure registry. The docker daemon used for building images should be configured to trust the private insecure registry. This is done by marking the registry endpoint in /etc/docker/daemon.json:

{
"insecure-registries" : ["10.141.241.175:32000"]
}

Restart the Docker daemon on the host to load the new configuration:

sudo systemctl restart docker

Now running

docker push 10.141.241.175:32000/mynginx

…should succeed in uploading the image to the registry.

Attempting to pull an image in MicroK8s at this point will result in an error like this:

Warning Failed 1s (x2 over 16s) kubelet, jackal-vgn-fz11m Failed to pull image "10.141.241.175:32000/mynginx:registry": rpc error: code = Unknown desc = failed to resolve image "10.141.241.175:32000/mynginx:registry": no available registry endpoint: failed to do request: Head https://10.141.241.175:32000/v2/mynginx/manifests/registry: http: server gave HTTP response to HTTPS client

We need to edit /var/snap/microk8s/current/args/containerd-template.toml and add the following under [plugins] -> [plugins.cri.registry] -> [plugins.cri.registry.mirrors]:

  [plugins.cri.registry.mirrors."10.141.241.175:32000"]
    endpoint = ["http://10.141.241.175:32000"]

See the full file here.

Restart MicroK8s to have the new configuration loaded:

microk8s.stop
microk8s.start
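When typing the mirror entry by hand, the quoting is easy to get wrong: TOML needs plain ASCII double quotes, and typographic quotes pasted from a web page will not parse. As a minimal sketch, the stanza can be printed by a small helper and pasted into containerd-template.toml. The function name mirror_stanza is hypothetical.

```shell
# mirror_stanza HOST:PORT
# Prints the TOML lines for an HTTP (insecure) registry mirror, to be placed
# under [plugins.cri.registry.mirrors] in containerd-template.toml.
mirror_stanza() {
  printf '[plugins.cri.registry.mirrors."%s"]\n' "$1"
  printf '  endpoint = ["http://%s"]\n' "$1"
}

# Possible use:
#   mirror_stanza 10.141.241.175:32000
```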

The image can now be deployed with:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: 10.141.241.175:32000/mynginx:registry
        ports:
        - containerPort: 80

Note that the image is referenced with 10.141.241.175:32000/mynginx:registry.

Secure registry
There are a lot of ways to set up a private secure registry that may slightly change the way you interact with it. Instead of diving into the specifics of each setup, we provide two pointers on how you can approach the integration with Kubernetes.

  • The official Kubernetes documentation describes how to create a secret from the Docker login credentials and use it to access the secure registry. To achieve this, imagePullSecrets is used as part of the container spec.
  • MicroK8s v1.14 and onwards uses containerd. As described here configuring containerd involves editing /var/snap/microk8s/current/args/containerd-template.toml and reloading the new configuration via a microk8s.stop, microk8s.start cycle.

Working with image registries and containerd in Kubernetes was originally published in ITNEXT on Medium.

Read more
jdstrand

For testing, it is useful to work with official cloud images as local VMs. Eg, when I work on snapd, I like to have different images available to work with its spread tests.

The autopkgtest package makes working with Ubuntu images quite easy:

$ sudo apt-get install qemu-kvm autopkgtest
$ autopkgtest-buildvm-ubuntu-cloud -r bionic # -a i386
Downloading https://cloud-images.ubuntu.com/bionic/current/bionic-server-cloudimg-amd64.img...
# and to integrate into spread
$ mkdir -p ~/.spread/qemu
$ mv ./autopkgtest-bionic-amd64.img ~/.spread/qemu/ubuntu-18.04-64.img
# now can run any test from 'spread -list' starting with
# 'qemu:ubuntu-18.04-64:'

This post isn’t really about autopkgtest, snapd or spread specifically though….

I found myself wanting an official Debian unstable cloud image so I could use it in spread while testing snapd. I learned it is easy enough to create the images yourself, but then I found that Debian had started providing raw and qcow2 cloud images for use in OpenStack. So I started exploring how to use them, and how to generalize the approach to arbitrary cloud images.

General procedure

The basic steps are:

  1. obtain a cloud image
  2. make a copy of the cloud image for safekeeping
  3. resize the copy
  4. create a seed.img with cloud-init to set the username/password
  5. boot with networking and the seed file
  6. login, update, etc
  7. cleanly shutdown
  8. use normally (ie, without seed file)

In this case, I grabbed the ‘debian-testing-openstack-amd64.qcow2’ image from http://cdimage.debian.org/cdimage/openstack/testing/ and verified it. Since this is based on Debian ‘testing’ (current stable images are also available), when I copied it I named it accordingly. Eg, I knew for spread it needed to be ‘debian-sid-64.img’ so I did:

$ cp ./debian-testing-openstack-amd64.qcow2 ./debian-sid-64.img

I then resized it. I picked 20G since I recalled that is what autopkgtest uses:

$ qemu-img resize debian-sid-64.img 20G

These are already set up for cloud-init, so I created a cloud-init data file (note, the ‘#cloud-config’ comment at the top is important):

$ cat ./debian-data
#cloud-config
password: debian
chpasswd: { expire: false }
ssh_pwauth: true

and a cloud-init meta-data file:

$ cat ./debian-meta-data
instance-id: i-debian-sid-64
local-hostname: debian-sid-64

and fed that into cloud-localds to create a seed file:

$ cloud-localds -v ./debian-seed.img ./debian-data ./debian-meta-data

Then start the image with:

$ kvm -M pc -m 1024 -smp 1 -monitor pty -nographic -hda ./debian-sid-64.img -drive "file=./debian-seed.img,if=virtio,format=raw" -net nic -net user,hostfwd=tcp:127.0.0.1:59355-:22

(I’m using the invocation that is reminiscent of how spread invokes it; feel free to use a virtio invocation as described by Scott Moser if that better suits your environment.)

Here, the “59355” can be any unused high port. The idea is after the image boots, you can login with ssh using:

$ ssh -p 59355 debian@127.0.0.1

Once logged in, perform any updates, etc that you want in place when tests are run, then disable cloud-init for the next boot and cleanly shutdown with:

$ sudo touch /etc/cloud/cloud-init.disabled
$ sudo shutdown -h now

The above is the generalized procedure which can hopefully be adapted for other distros that provide cloud images, etc.
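To adapt the procedure to another distro, the copy, resize and seed steps can be wrapped in one function. This is a minimal sketch: the names run and prepare_cloud_image and the DRY_RUN switch are made up for illustration, and the commands are the ones used above.

```shell
# run CMD...: execute CMD, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# prepare_cloud_image SRC DEST SIZE USER_DATA META_DATA SEED
# Hypothetical wrapper around steps 2-4: copy the downloaded image, resize
# the copy, and build the cloud-init seed image.
prepare_cloud_image() {
  run cp "$1" "$2" &&
  run qemu-img resize "$2" "$3" &&
  run cloud-localds -v "$6" "$4" "$5"
}

# Dry run with the file names used in this post:
#   DRY_RUN=1 prepare_cloud_image ./debian-testing-openstack-amd64.qcow2 \
#     ./debian-sid-64.img 20G ./debian-data ./debian-meta-data ./debian-seed.img
```

The dry run prints the three commands without touching any files, which makes it easy to check the file names before running it for real.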

For integrating into spread, just copy the image to ‘~/.spread/qemu’, naming it how spread expects. spread will use ‘-snapshot’ with the VM as part of its tests, so if you want to update the images later since they might be out of date, omit the seed file (and optionally ‘-net nic -net user,hostfwd=tcp:127.0.0.1:59355-:22’ if you don’t need port forwarding), and use:

$ kvm -M pc -m 1024 -smp 1 -monitor pty -nographic -hda ./debian-sid-64.img

UPDATE 2019-04-23: the above is confirmed to work with Fedora 28 and 29 (though, if using the resulting image to test snapd, be sure to configure the password as ‘fedora’ and then be sure to ‘yum update ; yum install kernel-modules nc strace’ in the image).

UPDATE 2019-04-22: the above is confirmed to work with CentOS 7 (though, if using the resulting image to test snapd, be sure to configure the password as ‘centos’ and then be sure to ‘yum update ; yum install epel-release ; yum install golang nc strace’ in the image).

Extra steps for Debian cloud images without default e1000 networking

Unfortunately, for the Debian cloud images, there were additional steps because spread doesn’t use virtio but instead the default e1000 driver, and the Debian cloud kernel doesn’t include this:

$ grep E1000 /boot/config-4.19.0-4-cloud-amd64
# CONFIG_E1000 is not set
# CONFIG_E1000E is not set
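The same check can be scripted for other images. A minimal sketch, with a hypothetical function name kernel_option_enabled, that treats an option as available when it is built in (=y) or built as a module (=m):

```shell
# kernel_option_enabled OPTION CONFIG_FILE
# Succeeds if OPTION is set to y or m in CONFIG_FILE; a line such as
# "# CONFIG_E1000 is not set" does not count.
kernel_option_enabled() {
  grep -Eq "^$1=(y|m)\$" "$2"
}

# Possible use inside the guest:
#   kernel_option_enabled CONFIG_E1000 "/boot/config-$(uname -r)" || echo "no e1000 driver"
```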

So… when the machine booted, there was no networking. To adjust for this, I blew away the image, copied from the safely kept downloaded image, resized then started it with:

$ kvm -M pc -m 1024 -smp 1 -monitor pty -nographic -hda $HOME/.spread/qemu/debian-sid-64.img -drive "file=$HOME/.spread/qemu/debian-seed.img,if=virtio,format=raw" -device virtio-net-pci,netdev=eth0 -netdev type=user,id=eth0

This allowed the VM to start with networking, at which point I adjusted /etc/apt/sources.list to refer to ‘sid’ instead of ‘buster’ then ran apt-get update then apt-get dist-upgrade to upgrade to sid. I then installed the Debian distro kernel with:

$ sudo apt-get install linux-image-amd64

Then uninstalled the currently running kernel with:

$ sudo apt-get remove --purge linux-image-cloud-amd64 linux-image-4.19.0-4-cloud-amd64

(I used ‘dpkg -l | grep linux-image’ to see the cloud kernels I wanted to remove). Removing the package that provides the currently running kernel is a dangerous operation for most systems, so there is a scary message to abort the operation. In our case, it isn’t so scary (we can just try again ;) and this is exactly what we want to do.

Next I cleanly shutdown the VM with:

$ sudo shutdown -h now

and try to start it again like with the ‘general procedures’, above (I’m keeping the seed file here because I want cloud-init to be re-run with the e1000 driver):

$ kvm -M pc -m 1024 -smp 1 -monitor pty -nographic -hda ./debian-sid-64.img -drive "file=./debian-seed.img,if=virtio,format=raw" -net nic -net user,hostfwd=tcp:127.0.0.1:59355-:22

Now I try to login via ssh:
$ ssh -p 59355 debian@127.0.0.1
...
debian@127.0.0.1's password:
...
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Tue Apr 16 16:13:15 2019
debian@debian:~$ sudo touch /etc/cloud/cloud-init.disabled
debian@debian:~$ sudo shutdown -h now
Connection to 127.0.0.1 closed.

While this VM is no longer the official cloud image, it is still using the Debian distro kernel and Debian archive, which is good enough for my purposes and at this point I’m ready to use this VM in my testing (eg, for me, copy ‘debian-sid-64.img’ to ‘~/.spread/qemu’).

Read more

Snapcraft 3.3

snapcraft 3.3 is now available on the stable channel of the Snap Store. This is a new minor release building on top of the foundations laid out from the snapcraft 3.0 release. If you are already on the stable channel for snapcraft then all you need to do is wait for the snap to be refreshed. The full release notes are replicated here below Core base: core In order to use the new features of snapcraft, introduced with 3.

Read more
James Henstridge

Last week I gave a talk at Perth Linux Users Group about building IoT projects using Ubuntu Core and Snapcraft. The video is now available online. Unfortunately there were some problems with the audio setup leading to some background noise in the video, but it is still intelligible:

The slides used in the talk can be found here.

The talk was focused on how Ubuntu Core could be used to help with the ongoing security and maintenance of IoT projects. While it might be easy to buy a Raspberry Pi, install Linux and your application, how do you make sure the device remains up to date with security updates? How do you push out updates to your application in a reliable fashion?

I outlined a way to deploy a project using Ubuntu Core, including:

  1. Packaging a simple web server app using the snapcraft tool.
  2. Configuring automatic builds from git, published to the edge channel on the Snap Store. This is also an easy way to get ARM builds for a package, rather than trying to deal with cross compilation tool chains.
  3. Using the ubuntu-image command to create an Ubuntu Core image with the application preinstalled.

I gave a demo booting such an image in a virtual machine. This showed the application up and running ready to use. I also demonstrated how promoting a build from the edge channel to stable on the store would make it available to the system wide automatic update mechanism on the device.

Read more
abeato

Ubuntu Core (UC) is Canonical’s take in the IoT space. There are pre-built images for officially supported devices, like Raspberry Pi or Intel NUCs, but if we have something else and there is no community port, we need to create the UC image ourselves. High level instructions on how to do this are found in the official docs. The process is straightforward once we have two critical components: the kernel and the gadget snap.

Creating these snaps is not necessarily complex, but there can be bumps in the road if you are new to the task. In this post I explain how I created them for the Jetson TX1 developer kit board, and how they were used to create a UC image for said device, hoping this will provide new tricks to hackers working on ports for other devices. All the sources for the snaps and the build scripts are available in github:
https://github.com/alfonsosanchezbeato/jetson-kernel-snap
https://github.com/alfonsosanchezbeato/jetson-gadget-snap
https://github.com/alfonsosanchezbeato/jetson-ubuntu-core

So, let’s start with…

The kernel snap

The Linux kernel that we will use needs some kernel configuration options to be activated, and it is also especially important that it has a modern version of apparmor so snaps can be properly confined. The official Jetson kernel is the 4.4 release, which is quite old, but fortunately Canonical has a reference 4.4 kernel with all the needed patches for snaps backported. Knowing this, we are a git format-patch command away from obtaining the patches we will use on top of the nvidia kernel. The patches also include files with the configuration options that we need for snaps, plus some changes so the snap can be successfully compiled on Ubuntu 18.04 desktop.

Once we have the sources, we need, of course, to create a snapcraft.yaml file that will describe how to build the kernel snap. We will walk through it, highlighting the parts more specific to the Jetson device.

Starting with the kernel part, it turns out that we cannot easily use the kernel plugin, due to the special way in which the kernel needs to be built: nvidia distributes part of the needed drivers in repositories separate from the main kernel tree. Therefore, I resorted to using the nil plugin so I could hand-write the commands to do the build.

The pull stage that resulted is

override-pull: |
  snapcraftctl pull
  # Get kernel sources, which are distributed across different repos
  ./source_sync.sh -k tegra-l4t-r28.2.1
  # Apply canonical patches - apparmor stuff essentially
  cd sources/kernel/display
  git am ../../../patch-display/*
  cd -
  cd sources/kernel/kernel-4.4
  git am ../../../patch/*

which runs a script to retrieve the sources (I pulled this script from nvidia Linux for Tegra -L4T- distribution) and applies Canonical patches.

The build stage is a few more lines, so I decided to use an external script to implement it. We will analyze now parts of it. For the kernel configuration we add all the necessary Ubuntu bits:

make "$JETSON_KERNEL_CONFIG" \
    snappy/containers.config \
    snappy/generic.config \
    snappy/security.config \
    snappy/snappy.config \
    snappy/systemd.config

Then, to do the build we run

make -j"$num_cpu" Image modules dtbs

An interesting catch here is that zImage files are not supported due to lack of a decompressor implementation in the arm64 kernel. So we have to build an uncompressed Image instead.

After some code that stages the built files so they are included in the snap later, we retrieve the initramfs from the core snap. This step is usually hidden from us by the kernel plugin, but this time we have to code it ourselves:

# Get initramfs from core snap, which we need to download
core_url=$(curl -s -H "X-Ubuntu-Series: 16" -H "X-Ubuntu-Architecture: arm64" \
                "https://search.apps.ubuntu.com/api/v1/snaps/details/core?channel=stable" \
               | jq -r ".anon_download_url")
curl -L "$core_url" > core.snap
# Glob so we get both link and regular file
unsquashfs core.snap "boot/initrd.img-core*"
cp squashfs-root/boot/initrd.img-core "$SNAPCRAFT_PART_INSTALL"/initrd.img
ln "$SNAPCRAFT_PART_INSTALL"/initrd.img "$SNAPCRAFT_PART_INSTALL"/initrd-"$KERNEL_RELEASE".img

Moving back to the snapcraft recipe we also have an initramfs part, which takes care of doing some changes to the default initramfs shipped by UC:

initramfs:
  after: [ kernel ]
  plugin: nil
  source: ../initramfs
  override-build: |
    find . | cpio --quiet -o -H newc | lzma >> "$SNAPCRAFT_STAGE"/initrd.img

Here we are taking advantage of the fact that the initramfs can be built as a concatenation of compressed cpio archives. When the kernel decompresses it, the files included in the later archives overwrite the files from the earlier ones, which lets us easily modify files in the initramfs without having to change the one shipped with core. The change we are making here is a modification to the resize script that allows UC to claim all the free space on the disk on first boot. The modification makes sure this also happens when the partition has already taken all the available space but the filesystem has not. We can remove this modification once these changes reach the core snap, which will happen eventually.

The last part of this snap is the firmware part:

firmware:
  plugin: nil
  override-build: |
    set -xe
    wget https://developer.nvidia.com/embedded/dlc/l4t-jetson-tx1-driver-package-28-2-ga -O Tegra210_Linux_R28.2.0_aarch64.tbz2
    tar xf Tegra210_Linux_R28.2.0_aarch64.tbz2 Linux_for_Tegra/nv_tegra/nvidia_drivers.tbz2
    tar xf Linux_for_Tegra/nv_tegra/nvidia_drivers.tbz2 lib/firmware/
    cd lib; cp -r firmware/ "$SNAPCRAFT_PART_INSTALL"
    mkdir -p "$SNAPCRAFT_PART_INSTALL"/firmware/gm20b
    cd "$SNAPCRAFT_PART_INSTALL"/firmware/gm20b
    ln -sf "../tegra21x/acr_ucode.bin" "acr_ucode.bin"
    ln -sf "../tegra21x/gpmu_ucode.bin" "gpmu_ucode.bin"
    ln -sf "../tegra21x/gpmu_ucode_desc.bin" "gpmu_ucode_desc.bin"
    ln -sf "../tegra21x/gpmu_ucode_image.bin" "gpmu_ucode_image.bin"
    ln -sf "../tegra21x/gpu2cde.bin" "gpu2cde.bin"
    ln -sf "../tegra21x/NETB_img.bin" "NETB_img.bin"
    ln -sf "../tegra21x/fecs_sig.bin" "fecs_sig.bin"
    ln -sf "../tegra21x/pmu_sig.bin" "pmu_sig.bin"
    ln -sf "../tegra21x/pmu_bl.bin" "pmu_bl.bin"
    ln -sf "../tegra21x/fecs.bin" "fecs.bin"
    ln -sf "../tegra21x/gpccs.bin" "gpccs.bin"

Here we download some files so we can add firmware blobs to the snap. These files come separate from nvidia kernel sources.

That is it for the kernel snap. Now you just need to follow the instructions to get it built.

The gadget snap

It is now time to take a look at the gadget snap. First, I recommend reading ogra’s great post on gadget snaps for devices with a u-boot bootloader before going through this section. As with the kernel snap, we will go through the different parts that are defined in the snapcraft.yaml file. The first one builds the u-boot binary:

uboot:
  plugin: nil
  source: git://nv-tegra.nvidia.com/3rdparty/u-boot.git
  source-type: git
  source-tag: tegra-l4t-r28.2
  override-pull: |
    snapcraftctl pull
    # Apply UC patches + bug fixes
    git am ../../../uboot-patch/*.patch
  override-build: |
    export ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-
    make p2371-2180_defconfig
    nice make -j$(nproc)
    cp "$SNAPCRAFT_PART_BUILD"/u-boot.bin "$SNAPCRAFT_PART_INSTALL"/

We decided again to use the nil plugin, as the build needs some special handling. The sources are pulled from nvidia’s u-boot repository, but we apply some patches on top. These patches, along with the uboot environment, provide:

  • Support for loading the UC kernel and initramfs from disk
  • Support for the revert functionality in case a core or kernel snap installation goes wrong
  • Bug fixes for u-boot’s ext4 subsystem – required because the revert functionality just mentioned needs to call u-boot’s saveenv command, which happened to be broken for ext4 filesystems in tegra’s u-boot

More information on the specifics of u-boot patches for UC can be found in this great blog post.

The only other part that the snap has is uboot-env:

uboot-env:
  plugin: nil
  source: uboot-env
  override-build: |
    mkenvimage -r -s 131072 -o uboot.env uboot.env.in
    cp "$SNAPCRAFT_PART_BUILD"/uboot.env "$SNAPCRAFT_PART_INSTALL"/
    # Link needed for ubuntu-image to work properly
    cd "$SNAPCRAFT_PART_INSTALL"/; ln -s uboot.env uboot.conf
  build-packages:
    - u-boot-tools

This simply encodes the uboot.env.in file into a format that is readable by u-boot. The resulting file, uboot.env, is included in the snap.
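To make that encoding step more concrete, here is a rough Python sketch of what an environment image of this kind looks like. It reflects my understanding of the format mkenvimage produces and is not a replacement for the tool: a little-endian CRC32 over the data area, a one-byte flag when the redundant format (-r) is used, then NUL-separated name=value entries ending in an extra NUL and padded with 0xFF up to the requested size. The function name make_env_image and the sample variable values are made up for illustration.

```python
import struct
import zlib


def make_env_image(text, size=131072, redundant=True, pad=b"\xff"):
    """Encode 'name=value' lines into a U-Boot style environment image.

    Simplified sketch: blank lines and lines starting with '#' are skipped,
    and the CRC is stored little-endian (an assumption about the target).
    """
    entries = [
        line for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    # Header is a 4-byte CRC, plus a 1-byte flag in the redundant format.
    header_len = 4 + (1 if redundant else 0)
    # Each entry is NUL-terminated; one more NUL marks the end of the list.
    data = b"".join(e.encode("ascii") + b"\0" for e in entries) + b"\0"
    room = size - header_len
    if len(data) > room:
        raise ValueError("environment does not fit in the requested size")
    data += pad * (room - len(data))
    crc = zlib.crc32(data) & 0xFFFFFFFF
    header = struct.pack("<I", crc) + (b"\x01" if redundant else b"")
    return header + data


# Illustrative input only; these values are not taken from the real uboot.env.in.
image = make_env_image("# sample\nfdtfile=example.dtb\nkernel_addr_r=0x1000\n")
print(len(image))
```

The size argument mirrors the -s 131072 option in the recipe above, so the resulting image is exactly 128 KiB regardless of how many variables are defined.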

This environment is where most of the support for UC is encoded. I will not delve too much into the details, but I want to mention that the variables that usually need to be edited for new devices are:

  • devnum, partition, and devtype to set the system boot partition, from which we load the kernel and initramfs
  • fdtfile, fdt_addr_r, and fdt_high to determine the name of the device tree and where in memory it should be loaded
  • ramdisk_addr_r and initrd_high to set the loading location for the initramfs
  • kernel_addr_r to set where the kernel needs to be loaded
  • args contains kernel arguments and needs to be adapted to the device specifics
  • Finally, for this device, snappy_boot was changed so it used booti instead of bootz, as we could not use a compressed kernel as explained above

Besides the snapcraft recipe, the other mandatory file when defining a gadget snap is the gadget.yaml file. This file defines, among other things, the image partitioning layout. There is more to it, but in this case, partitioning is the only thing we have defined:

volumes:
  jetson:
    bootloader: u-boot
    schema: gpt
    structure:
      - name: system-boot
        role: system-boot
        type: 0FC63DAF-8483-4772-8E79-3D69D8477DE4
        filesystem: ext4
        filesystem-label: system-boot
        offset: 17408
        size: 67108864
      - name: TBC
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 2097152
      - name: EBT
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 4194304
      - name: BPF
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 2097152
      - name: WB0
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 6291456
      - name: RP1
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 4194304
      - name: TOS
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 6291456
      - name: EKS
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 2097152
      - name: FX
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 2097152
      - name: BMP
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 134217728
      - name: SOS
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 20971520
      - name: EXI
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 67108864
      - name: LNX
        type: 0FC63DAF-8483-4772-8E79-3D69D8477DE4
        size: 67108864
        content:
          - image: u-boot.bin
      - name: DTB
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 4194304
      - name: NXT
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 2097152
      - name: MXB
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 6291456
      - name: MXP
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 6291456
      - name: USP
        type: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7
        size: 2097152

The Jetson TX1 has a complex partitioning layout, with many partitions allocated for the first stage bootloader and many others that are undocumented. So, to minimize the risk of touching a critical partition, I preferred to keep most of them untouched and make only the minimal changes needed to fit UC into the device. Therefore, the gadget.yaml volumes entry mainly describes the TX1 defaults, with the main differences compared to the original being:

  1. The APP partition is renamed to system-boot and reduced to only 64MB. It will contain the uboot environment file plus the kernel and initramfs, as usual in UC systems with u-boot bootloader.
  2. The LNX partition will contain our u-boot binary
  3. If a partition with role: system-data is not defined explicitly (which is the case here), a partition with that role and with the label “writable” is implicitly defined at the end of the volume. It will take all the space freed by the reduction of the APP partition, and will contain the UC root filesystem. It replaces the UDA partition, which is the last one in nvidia’s partitioning scheme.
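As a quick sanity check on the layout above, the following Python sketch adds up the structure sizes to see where the implicit “writable” partition would begin. It assumes that a structure without an explicit offset is placed directly after the previous one, which is my reading of how gadget.yaml is interpreted; the names STRUCTURES and layout are only for this illustration.

```python
MIB = 1024 * 1024

# (name, size in bytes), in the order they appear in gadget.yaml above.
STRUCTURES = [
    ("system-boot", 64 * MIB), ("TBC", 2 * MIB), ("EBT", 4 * MIB),
    ("BPF", 2 * MIB), ("WB0", 6 * MIB), ("RP1", 4 * MIB),
    ("TOS", 6 * MIB), ("EKS", 2 * MIB), ("FX", 2 * MIB),
    ("BMP", 128 * MIB), ("SOS", 20 * MIB), ("EXI", 64 * MIB),
    ("LNX", 64 * MIB), ("DTB", 4 * MIB), ("NXT", 2 * MIB),
    ("MXB", 6 * MIB), ("MXP", 6 * MIB), ("USP", 2 * MIB),
]


def layout(structures, first_offset):
    """Return ([(name, offset, size), ...], end_offset) for back-to-back structures."""
    offset = first_offset
    table = []
    for name, size in structures:
        table.append((name, offset, size))
        offset += size
    return table, offset


# 17408 is the explicit offset of system-boot in gadget.yaml.
table, end = layout(STRUCTURES, 17408)
for name, offset, size in table:
    print(f"{name:12} offset={offset:>10} size={size // MIB:>4} MiB")
print("first free byte:", end)
```

Under that assumption the explicitly defined structures occupy 388 MiB in total, and everything after them is left for the root filesystem.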

Now, it is time to build the gadget snap by following the repository instructions.

Building & flashing the image

Now that we have the snaps, it is time to build the image. There is not much to it: you just need an Ubuntu One account and to follow the instructions to create a key so that you can sign a model assertion. With that, just follow the README.md file in the jetson-ubuntu-core repository. You can also download the latest tarball from the repository if you prefer.

The build script will generate not only a full image file, but also a tarball that contains separate files for each partition that needs to be flashed to the device. This is needed because, unfortunately, there is no way to fully flash the Jetson device with a GPT image; instead, we can flash only individual partitions with the tools nvidia provides.

Once the build finishes, we can take the resulting tarball and follow the instructions to get the necessary partitions flashed. As can be read there, we have to download the nvidia L4T package. Also, note that to be able to change the partition sizes and files to flash, a couple of patches have to be applied on top of the L4T scripts.

Summary

After this, you should have a working Ubuntu Core 18 device. You can use the serial port or an external monitor to configure it with your Launchpad account so you can ssh into it. Enjoy!

Colin Watson

Here’s a brief changelog for this month.

Build farm

  • Allow dispatching builds with base images selected based on the pocket and/or using LXD images instead of chroot tarballs where appropriate (#1811677)

Code

  • Store bzr-svn’s cache in the import data store
  • Allow project owners to use the Bazaar branch rescan view
  • Canonicalise expected rule ordering in GitRepository.setRules (#1815431)
  • Upgrade to pygit2 0.27.4 (#1815517)

Infrastructure

  • Use full gpg key fingerprints in rocketfuel-setup (contributed by Andy Brody; #1814206)

Snappy

Soyuz (package management)

  • Allow source .changes files to omit the Binary field (#1813037)

Colin Watson

Here’s a brief changelog of what we’ve been up to since our last general update.

Bugs

  • Parse a few more possible Savane URL formats (#197250)
  • Compare Bugzilla versions properly when checking whether they support the Bugzilla API (part of #1802798)

Build farm

  • Configure snap proxy settings for Subversion (#1668358)
  • Support passing IMAGE_TARGETS, REPO_SNAPSHOT_STAMP, and COHORT_KEY variables into live filesystem builds
  • Set SNAPCRAFT_BUILD_ENVIRONMENT=host when building snaps (#1791201)
  • Prevent gathering results of large builds from blocking responses to XML-RPC requests (#1795877)
  • Add missing indexes on LiveFSFile(libraryfile) and SnapFile(libraryfile)
  • Direct build failure support to Launchpad Answers rather than to the launchpad-buildd-admins team (#1810001)

Code

  • Allow proposing merges between different branches of the same personal Git repository
  • Fix OOPS when trying to look up ~user/project:branch as a unique Git repository name (#1771118)
  • Optimise GitRepository.fetchRefCommits if there are no commits to fetch
  • Handle the case where a Bazaar branch and a Git repository have the same identity URL when creating a recipe (#1623924)
  • Push code imports over bzr+ssh rather than sftp (#1779572)
  • Fix crash when emailing inline comments on a diff with non-ASCII characters in hunk headers (#1787587)
  • Percent-encode reference names in GitRef URLs (#1787965)
  • Add active reviews link to Git-based project pages (#1777102)
  • Fix handling of non-ASCII ref names
  • Include the appropriate username in git+ssh:// URLs in the UI
  • Add instructions on creating personal Git repositories to people’s “View Git repositories” pages (#1590560)
  • Add available review targets and proposals to the Git repository overview page (#1789847)
  • Fix incorrect visibility check that broke code imports targeted at private Git repositories (#1789424)
  • Allow anonymous users to view votes for public merge proposals (#1786474)
  • Make Git ref scan jobs for repositories with large numbers of refs take much less memory
  • Tolerate backend timeouts while fetching commit information for GitRef:+index (#1804395)
  • Add Git per-branch permissions (#1517559)
  • Add rescan buttons when various kinds of code scanning jobs fail (#1808320)

Infrastructure

  • Convert all remaining code to use explicit proxy configuration settings rather than picking up a proxy from the environment, making the effective production settings easier to understand
  • Add support for ECDSA SSH keys (#907675)

Libraries

Registry

  • Add a suspend-bot-account.py script to suspend an account by email address
  • Weaken type of key_text in Person.deleteSSHKeysFromSSO so that more existing keys can be deleted (#1780411)
  • Fix SSHKey.getFullKeyText to not crash on some corrupt keys (#1798046)
  • Various improvements to the close-account script

Snappy

  • Extract initial Snap.store_name from snapcraft.yaml for Bazaar as well as Git
  • Add support for Snapcraft’s architectures keyword (#1770400)
  • Bump SnapStoreUploadJob.max_retries to 30 to allow for longer store scan times
  • Include the registered store package name for a snap recipe in its builds’ titles if it exists and differs from the snap recipe name
  • Move some metadata from SnapStoreUploadJob to SnapBuild, to prevent store upload jobs getting into states that cannot be retried

Soyuz (package management)

  • Add Archive.getSigningKeyData, currently just proxying through to the keyserver (#1667725)
  • Add extendedKeyUsage information to kmod signing keys so that they can only be used to sign modules, not boot loaders or kernels (#1774746)

K. Tsakalozos

We have been quiet for a few months simply because we have been busy. We were working mainly on two features that we intend to ship in the v1.14 release:

  • The transition to Containerd
  • Hardening MicroK8s security

The entailed changes will affect the backwards compatibility and user experience of MicroK8s and this is the reason we time them with the upcoming upstream Kubernetes release. Here we will provide a) a short description of these features, b) a way for you to test drive the new MicroK8s, and c) the steps on how to hold back on the release in case this is a major show stopper for you.

The transition to Containerd

We replace Dockerd with Containerd mainly for two reasons.

  • The setup of having two dockerd instances on the same host has proven problematic. MicroK8s brings its own dockerd, which may clash with a local dockerd users may want to have. By moving to containerd, users can apt-get install docker.io without affecting MicroK8s. This switch also means that microk8s.docker will no longer be available; you will have to use a docker client shipped with your distribution.
  • Performance. It has been shown that there is a performance benefit from using containerd. This should not be a surprise, since dockerd itself uses containerd internally. With the switch to containerd we are essentially removing a layer that is docker specific.

Hardening MicroK8s security

MicroK8s is a developer’s tool. It is not meant to be deployed in production or in hostile environments. Having said that, we tried to make MicroK8s more secure by:

  • Exposing as few services as we can. Here is a table with what we left open and the access restrictions involved:
https://medium.com/media/4dac105e741261ca58799b0b8d101dae/href
  • A CA and certificates are created once at deployment time.

Test drive the upcoming patches

We have prepared a temporary branch you could use to evaluate the above changes:

snap install microk8s --classic --channel=1.13/edge/secure-containerd

If you have MicroK8s already installed you can switch the channel your MicroK8s is following:

snap refresh --channel=1.13/edge/secure-containerd microk8s

Try it out and let us know if we missed anything.

“Thanks, I’ll pass”

All release series up until now will not be affected by this change. This means you can have your MicroK8s deployment follow the 1.13 track:

snap refresh --channel=1.13/stable microk8s

Summing up

An important update is coming. Make sure you give it a try with:

snap install microk8s --classic --channel=1.13/edge/secure-containerd

If you do not like what you see, tell us what breaks by filing an issue and keep using the 1.13 track.


Containerd on a more secure MicroK8s was originally published in ITNEXT on Medium, where people are continuing the conversation by highlighting and responding to this story.

K. Tsakalozos

MicroK8s in the Wild

As the popularity of MicroK8s grows, I would like to take the time to mention some projects that use this micro Kubernetes distribution. But before that, let me do some introductions. For those unfamiliar with Kubernetes, it is an open source container orchestrator. It lets you deploy, upgrade, and provision your application. This is one of the rare occasions where all the major players (Google, Microsoft, IBM, Amazon, etc.) have flocked around a single framework, making it an unofficial standard.

MicroK8s is a distribution of Kubernetes. It is a snap package that sets up a Kubernetes cluster on your machine. You can have a Kubernetes cluster for local development, CI/CD or just for getting to know Kubernetes with just a:

sudo snap install microk8s --classic

If you are on a Mac or Windows you will need a Linux VM.

In what follows you will find some examples of how people are using MicroK8s. Note that this is not a complete list of MicroK8s uses; it is just the efforts I happen to be aware of.

Spring Cloud Kubernetes

This project is using CircleCI for CI/CD. MicroK8s provides a local Kubernetes cluster where integration tests are run. The addons enabled are dns, the docker registry and Istio. The integration tests need to plug into the Kubernetes cluster using the kubeconfig file and the socket to dockerd. This work was introduced in this Pull Request (thanks George) and it gave us the incentive to add a microk8s.status command that would wait for the cluster to come online. For example, we can wait up to 5 minutes for MicroK8s to come up with:

microk8s.status --wait-ready --timeout=300

OpenFaaS on MicroK8s

It was at this year’s Config Management Camp that I met Joe McCobe, the author of “Deploy OpenFaaS with MicroK8s”. I will just repeat his words: he “was blown away by the speed and ease with which I could get a basic lab environment up and running”.

What about Kubeless?

It seems the ease of deploying MicroK8s goes well with the ease of software development of serverless frameworks. Users of Kubeless are also kicking the tires on MicroK8s. Have a look at “Files upload from Kubeless on MicroK8s to Minio” and “Serverless MicroK8s Kubernetes.”

SUSE Cloud Application Platform (CAP) on MicroK8s

In his blog post, Dimitris describes in detail all the configuration he had to do to get the software from SUSE to run on MicroK8s. The most interesting part is the motivation behind this effort. As he says, “… MicroK8s… use your machine’s resources without you having to decide on a VM size beforehand.” As he explained to me, his application puts significant memory pressure on the machine only during bootstrap. MicroK8s enabled him to reclaim the unused memory after the initialization phase.

Kubeflow

Kubeflow is the missing link between Kubernetes and AI/ML. Canonical is actively involved in this so… you should definitely check it out. Sure, I am biased, but let me tell you a true story. I have a friend who was given three machines to deploy Tensorflow and run some experiments. She did not have any prior experience at the time so… none of the three node clusters were set up in exactly the same way. There was always something off. This head-scratching situation is just one reason to use Kubeflow.

Transcrobes

Transcrobes comes from an active member of the MicroK8s community. It serves as a language learning aid. “The system knows what you know, so can give you just the right amount of help to be able to understand the words you don’t know but gets out of the way for the stuff you do know.” Here MicroK8s is used for quick prototyping. We wish you all the best Anton, good luck!

Summing Up

We have seen a number of interesting use cases that include CI/CD, serverless programming, lab setup, rapid prototyping and application development. If you have a MicroK8s use case, do let us know. Come and say hi at #microk8s on the Kubernetes Slack and/or issue a Pull Request against our MicroK8s In The Wild page.


MicroK8s in the Wild was originally published in ITNEXT on Medium, where people are continuing the conversation by highlighting and responding to this story.

Christian Brauner

Runtimes And the Curse of the Privileged Container

Introduction (CVE-2019-5736)

Today, Monday, 2019-02-11, 14:00:00 CET, CVE-2019-5736 was released:

The vulnerability allows a malicious container to (with minimal user interaction) overwrite the host runc binary and thus gain root-level code execution on the host. The level of user interaction is being able to run any command (it doesn’t matter if the command is not attacker-controlled) as root within a container in either of these contexts:

  • Creating a new container using an attacker-controlled image.
  • Attaching (docker exec) into an existing container which the attacker had previous write access to.

I’ve been working on a fix for this issue over the last couple of weeks together with Aleksa, a friend of mine and maintainer of runC. When he notified me about the issue in runC, we tried to come up with an exploit for LXC as well, and though it is harder, it is doable. I was interested in the issue for technical reasons, and figuring out how to reliably fix it was quite fun (with a proper dose of pure hatred). It also caused me to finally write down some personal thoughts I have had for a long time about how we are running containers.

What are Privileged Containers?

At first glance this is a question that is probably trivial to anyone who has a decent low-level understanding of containers. Maybe even most users by now will know what a privileged container is. A first pass at defining it would be to say that a privileged container is a container that is owned by root. Looking closer, this seems an insufficient definition. What about containers using user namespaces that are started as root? It seems we need to distinguish between what ids a container is running with. So we could say a privileged container is a container that is running as root. However, this is still wrong, because “running as root” can either be seen as meaning “running as root as seen from the outside” or “running as root from the inside”, where “outside” means “as seen from a task outside the container” and “inside” means “as seen from a task inside the container”.

What we really mean by a privileged container is a container where the semantics for id 0 are the same inside and outside of the container ceteris paribus. I say “ceteris paribus” because using LSMs, seccomp or any other security mechanism will not cause a change in the meaning of id 0 inside and outside the container. For example, a breakout caused by a bug in the runtime implementation will give you root access on the host.

An unprivileged container then simply is any container in which the semantics for id 0 inside the container are different from id 0 outside the container. For example, a breakout caused by a bug in the runtime implementation will not give you root access on the host by default. This should only be possible if the kernel’s user namespace implementation has a bug.

The reason why I like to define privileged containers this way is that it also lets us handle edge cases. Specifically, the case where a container is using a user namespace but a hole is punched into the idmapping at id 0 aka where id 0 is mapped through. Consider a container that uses the following idmappings:

id: 0 100000 100000

This instructs the kernel to set up the following mapping:

id: container_id(0) -> host_id(100000)
id: container_id(1) -> host_id(100001)
id: container_id(2) -> host_id(100002)
.
.
.

id: container_id(99999) -> host_id(199999)

With this mapping it’s evident that container_id(0) != host_id(0). But now consider the following mapping:

id: 0 0 1
id: 1 100001 99999

This instructs the kernel to set up the following mapping:

id: container_id(0) -> host_id(0)
id: container_id(1) -> host_id(100001)
id: container_id(2) -> host_id(100002)
.
.
.

container_id(99999) -> host_id(199999)

In contrast to the first example this has the consequence that container_id(0) == host_id(0). I would argue that any container that at least punches a hole for id 0 into its idmapping up to specifying an identity mapping is to be considered a privileged container.
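The definition above can be turned into a small check. The Python sketch below is only an illustration of the arithmetic behind the two idmappings: each entry is read as container start id, host start id, and range length, without the leading "id:" label. The helper names parse_idmap, to_host_id and is_privileged are made up for this example and do not come from LXC.

```python
def parse_idmap(lines):
    """Parse 'container_start host_start count' triples into tuples of ints."""
    entries = []
    for line in lines:
        container_start, host_start, count = (int(x) for x in line.split())
        entries.append((container_start, host_start, count))
    return entries


def to_host_id(entries, container_id):
    """Translate a container id to a host id, or None if it is not mapped."""
    for container_start, host_start, count in entries:
        if container_start <= container_id < container_start + count:
            return host_start + (container_id - container_start)
    return None


def is_privileged(entries):
    """A container is privileged in the sense above if container id 0 is host id 0."""
    return to_host_id(entries, 0) == 0


shifted = parse_idmap(["0 100000 100000"])
punched = parse_idmap(["0 0 1", "1 100001 99999"])

print(to_host_id(shifted, 0), is_privileged(shifted))
print(to_host_id(punched, 0), is_privileged(punched))
```

With the first mapping, id 0 inside the container lands on an unprivileged host id, so the check reports an unprivileged container; with the second, the hole punched at id 0 makes it privileged even though every other id is shifted.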

As a sidenote, Docker containers run as privileged containers by default. There is usually some confusion where people think because they do not use the --privileged flag that Docker containers run unprivileged. This is wrong. What the --privileged flag does is to give you even more permissions by e.g. not dropping (specific or even any) capabilities. One could say that such containers are almost “super-privileged”.

The Trouble with Privileged Containers

The problem I see with privileged containers is essentially captured by LXC’s and LXD’s upstream security position which we have held since at least 2015 but probably even earlier. I’m quoting from our notes about privileged containers:

Privileged containers are defined as any container where the container uid 0 is mapped to the host’s uid 0. In such containers, protection of the host and prevention of escape is entirely done through Mandatory Access Control (apparmor, selinux), seccomp filters, dropping of capabilities and namespaces.

Those technologies combined will typically prevent any accidental damage of the host, where damage is defined as things like reconfiguring host hardware, reconfiguring the host kernel or accessing the host filesystem.

LXC upstream’s position is that those containers aren’t and cannot be root-safe.

They are still valuable in an environment where you are running trusted workloads or where no untrusted task is running as root in the container.

We are aware of a number of exploits which will let you escape such containers and get full root privileges on the host. Some of those exploits can be trivially blocked and so we do update our different policies once made aware of them. Some others aren’t blockable as they would require blocking so many core features that the average container would become completely unusable.

[…]

As privileged containers are considered unsafe, we typically will not consider new container escape exploits to be security issues worthy of a CVE and quick fix. We will however try to mitigate those issues so that accidental damage to the host is prevented.

LXC’s upstream position for a long time has been that privileged containers are not and cannot be root safe. For something to be considered root safe it should be safe to hand root access to third parties or tasks.

Running Untrusted Workloads in Privileged Containers

is insane. That’s about everything that this paragraph should contain. The fact that the semantics for id 0 inside and outside the container are identical entails that any meaningful container escape will have the attacker gain root on the host.

CVE-2019-5736 Is a Very Very Very Bad Privilege Escalation to Host Root

CVE-2019-5736 is an excellent illustration of such an attack. Think about it: a process running inside a privileged container can rather trivially corrupt the binary that is used to attach to the container. This allows an attacker to create a custom ELF binary on the host. That binary could do anything it wants:

  • could just be a binary that calls poweroff
  • could be a binary that spawns a root shell
  • could be a binary that kills other containers when called again to attach
  • could be suid cat
  • .
  • .
  • .

The attack vector is actually slightly worse for runC due to its architecture. Since runC exits after spawning the container it can also be attacked through a malicious container image. Which is super bad given that a lot of container workload workflows rely on downloading images from the web.

LXC cannot be attacked through a malicious image, since the monitor process (a singleton per container) never exits during the container’s life cycle. Since the kernel does not allow modifications to running binaries, it is not possible for the attacker to corrupt it. When the container is shut down or killed, the attacking task will be killed before it can do any harm. Only when the last process running inside the container has exited will the monitor itself exit. As a consequence, if you run privileged OCI containers via our oci template with LXC, you are not vulnerable to malicious images. Only the vector through the attaching binary still applies.

The Lie that Privileged Containers can be safe

Aside from mostly working on the Kernel I’m also a maintainer of LXC and LXD alongside Stéphane Graber. We are responsible for LXC - the low-level container runtime - and LXD - the container management daemon using LXC. We have made a very conscious decision to consider privileged containers not root safe. Two main corollaries follow from this:

  1. Privileged containers should never be used to run untrusted workloads.
  2. Breakouts from privileged containers are not considered CVEs by our security policy.

It still seems a common belief that if we all just try hard enough, using privileged containers for untrusted workloads is safe. This is not a promise that can be made good upon. A privileged container is not a security boundary. The reason for this is simply what we looked at above: container_id(0) == host_id(0). It is therefore deeply troubling that this industry is happy to let users believe that they are safe and secure using privileged containers.

Unprivileged Containers as Default

As upstream for LXC and LXD we have been advocating the use of unprivileged containers by default for years, well before anyone else did. Our low-level library LXC has supported unprivileged containers since 2013, when user namespaces were merged into the kernel. With LXD we have taken it one step further and made unprivileged containers the default and privileged containers opt-in for that very reason: privileged containers aren’t safe. We even allow you to have per-container idmappings to make sure that not just each container is isolated from the host but also all containers from each other.

For years we have been advocating for unprivileged containers at conferences, in blog posts, and whenever we have spoken to people, but somehow this whole industry has chosen to rely on privileged containers.

The good news is that we are seeing changes as people become more familiar with the perils of privileged containers. Let this recent CVE be another reminder that unprivileged containers need to be the default.

Are LXC and LXD affected?

I have seen this question asked all over the place so I guess I should add a section about this too:

  • Unprivileged LXC and LXD containers are not affected.

  • Any privileged LXC and LXD container running on a read-only rootfs is not affected.

  • Privileged LXC containers in the definition provided above are affected. Though the attack is more difficult than for runC. The reason for this is that the lxc-attach binary does not exit before the program in the container has finished executing. This means an attacker would need to open an O_PATH file descriptor to /proc/self/exe, fork() itself into the background and re-open the O_PATH file descriptor through /proc/self/fd/<O_PATH-nr> in a loop as O_WRONLY and keep trying to write to the binary until such time as lxc-attach exits. Before that it will not succeed since the kernel will not allow modification of a running binary.

  • Privileged LXD containers are only affected if the daemon is restarted other than for upgrade reasons. This should basically never happen. The LXD daemon never exits so any write will fail because the kernel does not allow modification of a running binary. If the LXD daemon is restarted because of an upgrade the binary will be swapped out and the file descriptor used for the attack will write to the old in-memory binary and not to the new binary.

Chromebooks with Crostini using LXD are not affected

Chromebooks that use LXD as their default container runtime are not affected. First of all, all binaries reside on a read-only filesystem and second, LXD does not allow running privileged containers on Chromebooks through the LXD_UNPRIVILEGED_ONLY flag. For more details see this link.

Fixing CVE-2019-5736

To prevent this attack, LXC has been patched to create a temporary copy of the calling binary itself when it attaches to containers (cf. 6400238d08cdf1ca20d49bafb85f4e224348bf9d). To do this LXC can be instructed to create an anonymous, in-memory file using the memfd_create() system call and to copy itself into the temporary in-memory file, which is then sealed to prevent further modifications. LXC then executes this sealed, in-memory file instead of the original on-disk binary. Any compromising write operations from a privileged container to the host LXC binary will then write to the temporary in-memory binary and not to the host binary on-disk, preserving the integrity of the host LXC binary. Also as the temporary, in-memory LXC binary is sealed, writes to this will also fail. To not break downstream users of the shared library this is opt-in by setting LXC_MEMFD_REXEC in the environment. For our lxc-attach binary which is the only attack vector this is now done by default.

Workloads that place the LXC binaries on a read-only filesystem or prevent running privileged containers can disable this feature by passing --disable-memfd-rexec during the configure stage when compiling LXC.

Read more

Snapcraft 3.1

snapcraft 3.1 is now available on the stable channel of the Snap Store. This is a new minor release building on top of the foundations laid out by the snapcraft 3.0 release. If you are already on the stable channel for snapcraft then all you need to do is wait for the snap to be refreshed. The full release notes are replicated below.

Build Environments

It is now possible, when using the base keyword, to once again clean parts.

Read more
jdstrand

Some time ago we started alerting publishers when their stage-packages received a security update since the last time they built a snap. We wanted to create the right balance for the alerts and so the service currently will only alert you when there are new security updates against your stage-packages. In this manner, you can choose not to rebuild your snap (eg, since it doesn’t use the affected functionality of the vulnerable package) and not be nagged every day that you are out of date.

As nice as that is, sometimes you want to check these things yourself or perhaps hook the alerts into some form of automation or tool. While the review-tools had all of the pieces so you could do this, it wasn’t as straightforward as it could be. Now with the latest stable revision of the review-tools, this is easy:

$ sudo snap install review-tools
$ review-tools.check-notices \
  ~/snap/review-tools/common/review-tools_656.snap
{'review-tools': {'656': {'libapt-inst2.0': ['3863-1'],
                          'libapt-pkg5.0': ['3863-1'],
                          'libssl1.0.0': ['3840-1'],
                          'openssl': ['3840-1'],
                          'python3-lxml': ['3841-1']}}}

The review-tools are a strict mode snap and while it plugs the home interface, that is only for convenience, so I typically disconnect the interface and put things in its SNAP_USER_COMMON directory, like I did above.

Since it is now super easy to check a snap on disk, with a little scripting and a cron job, you can generate a machine-readable report whenever you want. E.g., you can do something like the following:

$ cat ~/bin/check-snaps
#!/bin/sh
set -e

snaps="review-tools/stable rsync-jdstrand/edge"

tmpdir=$(mktemp -d -p "$HOME/snap/review-tools/common")
cleanup() {
    rm -fr "$tmpdir"
}
trap cleanup EXIT HUP INT QUIT TERM

cd "$tmpdir" || exit 1
for i in $snaps ; do
    snap=$(echo "$i" | cut -d '/' -f 1)
    channel=$(echo "$i" | cut -d '/' -f 2)
    snap download "$snap" "--$channel" >/dev/null
done
cd - >/dev/null || exit 1

/snap/bin/review-tools.check-notices "$tmpdir"/*.snap

or, if you already have the snaps on disk somewhere, just do:

$ /snap/bin/review-tools.check-notices /path/to/snaps/*.snap

Now you can add the above to cron or some automation tool as a reminder of what needs updates. Enjoy!
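For automation, the report can be reduced to one list of notices per snap revision. This is a rough sketch that assumes the tool prints a Python-style literal exactly like the sample output above (other versions of review-tools may use a different format); the function name summarize_notices is invented for this example.

```python
import ast

# Sample report copied from the check-notices output shown earlier.
SAMPLE = """{'review-tools': {'656': {'libapt-inst2.0': ['3863-1'],
                          'libapt-pkg5.0': ['3863-1'],
                          'libssl1.0.0': ['3840-1'],
                          'openssl': ['3840-1'],
                          'python3-lxml': ['3841-1']}}}"""


def summarize_notices(report_text):
    """Return {(snap, revision): sorted unique notice ids} for a report."""
    # literal_eval only accepts literals, so it is safe on untrusted text.
    report = ast.literal_eval(report_text)
    summary = {}
    for snap, revisions in report.items():
        for revision, packages in revisions.items():
            notices = set()
            for ids in packages.values():
                notices.update(ids)
            summary[(snap, revision)] = sorted(notices)
    return summary
```

A wrapper script could then alert only when the summary for a given snap is non-empty.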

Read more
K. Tsakalozos

No.


Read more
Colin Watson

Git per-branch permissions

We’ve had Git hosting support in Launchpad for a few years now. One thing that some users asked for, particularly larger users such as the Ubuntu kernel team, was the ability to set up per-branch push permissions for their repositories. Today we rolled out the last piece of this work.

Launchpad’s default behaviour is that repository owners may push anything to their own repositories, including creating new branches, force-pushing (rewriting history), and deleting branches, while nobody else may push anything. Repository owners can now also choose to protect branches or tags, either individually or using wildcard rules. If a branch is protected, then by default repository owners can only create or push it but cannot force-push or delete; if a tag is protected, then by default repository owners can create it but cannot move or delete it.

You can also allow selected contributors to push to protected branches or tags, so if you’re collaborating with somebody on a branch and just want to be able to quickly pair-program via git push, or you want a merge robot to be able to land merge proposals in your repository without having to add it to the team that owns the repository and thus give it privileges it doesn’t need, then this feature may be for you.

There’s some initial documentation on our help site, and here’s a screenshot of a repository that’s been set up to give a contributor push access to a single branch:

Read more
Christian Brauner

Android Binderfs


Introduction

Android Binder is an inter-process communication (IPC) mechanism. It is heavily used in all Android devices. The binder kernel driver has been present in the upstream Linux kernel for quite a while now.

Binder has been a controversial patchset (see this lwn article as an example). Its design was considered wrong and to violate certain core kernel design principles (e.g. a task should never touch another task’s file descriptor table). Most kernel developers were not fans of binder.

Recently, the upstream binder code has fortunately been reworked significantly (e.g. it does not touch another task’s file descriptor table anymore, the locking is very fine-grained now, etc.).

With Android being one of the major operating systems (OS) for a vast number of devices there is simply no way around binder.

The Android Service Manager

The binder IPC mechanism is accessible from userspace through device nodes located at /dev. A modern Android system will allocate three device nodes:

  • /dev/binder
  • /dev/hwbinder
  • /dev/vndbinder

serving different purposes. However, the logic is the same for all three of them. A process can call open(2) on those device nodes to receive an fd which it can then use to issue requests via ioctl(2)s. Android has a service manager which is used to translate addresses to bus names and only the address of the service manager itself is well-known. The service manager is registered through an ioctl(2) and there can only be a single service manager. This means once a service manager has grabbed hold of binder devices they cannot be (easily) reused by a second service manager.

Running Android in Containers

This matters as soon as multiple instances of Android are supposed to be run, since they will all need their own private binder devices. This is a use-case that arises pretty naturally when running Android in system containers. People have been doing this for a long time with LXC. A project that has set out to make running Android in LXC containers very easy is Anbox. Anbox makes it possible to run hundreds of Android containers.

To properly run Android in a container it is necessary that each container has a set of private binder devices.

Statically Allocating binder Devices

Binder devices are currently statically allocated at compile time. Before compiling a kernel the CONFIG_ANDROID_BINDER_DEVICES option needs to be set in the kernel config (Kconfig) containing the names of the binder devices to allocate at boot. By default it is set as:

CONFIG_ANDROID_BINDER_DEVICES="binder,hwbinder,vndbinder"

To allocate additional binder devices the user needs to specify them with this Kconfig option. This is problematic since users need to know how many containers they will run at maximum and then to calculate the number of devices they need so they can specify them in the Kconfig. When the maximum number of needed binder devices changes after kernel compilation the only way to get additional devices is to recompile the kernel.

Problem 1: Using the misc major Device Number

This situation is aggravated by the fact that binder devices use the misc major number in the kernel. Each device node in the Linux kernel is identified by a major and minor number. A device can request its own major number. If it does it will have an exclusive range of minor numbers it doesn’t share with anything else and is free to hand out. Or it can use the misc major number. The misc major number is shared amongst different devices. However, that also means the number of minor devices that can be handed out is limited by all users of misc major. So if a user requests a very large number of binder devices in their Kconfig they might make it impossible for anyone else to allocate minor numbers. Or there simply might not be enough to allocate for itself.
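To make the major/minor split concrete, the numbers of any device node can be inspected from userspace. This is a small illustrative sketch using Python's standard os module on Linux; the helper names device_numbers and split_device_id are hypothetical.

```python
import os
import stat


def split_device_id(dev):
    """Split a raw device id into its (major, minor) pair."""
    return os.major(dev), os.minor(dev)


def device_numbers(path):
    """Return (major, minor) for the character or block device at path."""
    st = os.stat(path)
    if not (stat.S_ISCHR(st.st_mode) or stat.S_ISBLK(st.st_mode)):
        raise ValueError(f"{path} is not a device node")
    # st_rdev holds the device id of the node itself.
    return split_device_id(st.st_rdev)
```

For example, device_numbers("/dev/null") returns the pair that ls -l prints in place of a file size for that node.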

Problem 2: Containers and IPC namespaces

All of those binder devices requested in the Kconfig via CONFIG_ANDROID_BINDER_DEVICES will be allocated at boot and be placed in the host’s devtmpfs mount, usually located at /dev, or, depending on the udev(7) implementation, will be created via mknod(2) by udev(7) at boot. That means all of those devices initially belong to the host IPC namespace. However, containers usually run in their own IPC namespace separate from the host’s. But when binder devices located in /dev are handed to containers (e.g. with a bind-mount) the kernel driver will not know that these devices are now used in a different IPC namespace since the driver is not IPC namespace aware. This is not a serious technical issue but a serious conceptual one. There should be a way to have per-IPC namespace binder devices.

Enter binderfs

To solve both problems we came up with a solution that I presented at the Linux Plumbers Conference in Vancouver this year. There’s a video of that presentation available on YouTube:

Android binderfs is a tiny filesystem that allows users to dynamically allocate binder devices, i.e. it allows adding and removing binder devices at runtime, which means it solves problem 1. Additionally, binder devices located in a new binderfs instance are independent of binder devices located in another binderfs instance. All binder devices in binderfs instances are also independent of the binder devices allocated during boot specified in CONFIG_ANDROID_BINDER_DEVICES. This means binderfs solves problem 2.

Android binderfs can be mounted via:

mount -t binder binder /dev/binderfs

at which point a new instance of binderfs will show up at /dev/binderfs. In a fresh instance of binderfs no binder devices will be present. There will only be a binder-control device which serves as the request handler for binderfs:

root@edfu:~# ls -al /dev/binderfs/
total 0
drwxr-xr-x  2 root root      0 Jan 10 15:07 .
drwxr-xr-x 20 root root   4260 Jan 10 15:07 ..
crw-------  1 root root 242, 6 Jan 10 15:07 binder-control

binderfs: Dynamically Allocating a New binder Device

To allocate a new binder device in a binderfs instance a request needs to be sent through the binder-control device node. A request is sent in the form of an ioctl(2). Here’s an example program:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/android/binder.h>
#include <linux/android/binderfs.h>

int main(int argc, char *argv[])
{
        int fd, ret, saved_errno;
        size_t len;
        struct binderfs_device device = { 0 };

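        /* Expect the path to binder-control and the new device name. */
        if (argc != 3) {
                fprintf(stderr, "Usage: %s <binder-control> <name>\n", argv[0]);
                exit(EXIT_FAILURE);
        }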

        len = strlen(argv[2]);
        if (len > BINDERFS_MAX_NAME)
                exit(EXIT_FAILURE);

        memcpy(device.name, argv[2], len);

        fd = open(argv[1], O_RDONLY | O_CLOEXEC);
        if (fd < 0) {
                printf("%s - Failed to open binder-control device\n",
                       strerror(errno));
                exit(EXIT_FAILURE);
        }

        ret = ioctl(fd, BINDER_CTL_ADD, &device);
        saved_errno = errno;
        close(fd);
        errno = saved_errno;
        if (ret < 0) {
                printf("%s - Failed to allocate new binder device\n",
                       strerror(errno));
                exit(EXIT_FAILURE);
        }

        printf("Allocated new binder device with major %d, minor %d, and "
               "name %s\n", device.major, device.minor,
               device.name);

        exit(EXIT_SUCCESS);
}

This program simply opens the binder-control device node and sends a BINDER_CTL_ADD request to the kernel. Users of binderfs need to tell the kernel which name the new binder device should get. By default a name can only contain up to 256 chars including the terminating zero byte. The struct which is used is:

/**
 * struct binderfs_device - retrieve information about a new binder device
 * @name:   the name to use for the new binderfs binder device
 * @major:  major number allocated for binderfs binder devices
 * @minor:  minor number allocated for the new binderfs binder device
 *
 */
struct binderfs_device {
       char name[BINDERFS_MAX_NAME + 1];
       __u32 major;
       __u32 minor;
};

and is defined in linux/android/binderfs.h. Once the request is made via an ioctl(2), passing a struct binderfs_device with the name to the kernel, it will allocate a new binder device and return the major and minor number of the new device in the struct. (This is necessary because binderfs allocates a major device number dynamically at boot.) After the ioctl(2) returns there will be a new binder device located under /dev/binderfs with the chosen name:

root@edfu:~# ls -al /dev/binderfs/
total 0
drwxr-xr-x  2 root root      0 Jan 10 15:19 .
drwxr-xr-x 20 root root   4260 Jan 10 15:07 ..
crw-------  1 root root 242, 0 Jan 10 15:19 binder-control
crw-------  1 root root 242, 1 Jan 10 15:19 my-binder
crw-------  1 root root 242, 2 Jan 10 15:19 my-binder1

binderfs: Deleting a binder Device

Deleting binder devices does not involve issuing another ioctl(2) request through binder-control. They can be deleted via unlink(2). This means that the rm(1) tool can be used to delete them:

root@edfu:~# rm /dev/binderfs/my-binder1
root@edfu:~# ls -al /dev/binderfs/
total 0
drwxr-xr-x  2 root root      0 Jan 10 15:19 .
drwxr-xr-x 20 root root   4260 Jan 10 15:07 ..
crw-------  1 root root 242, 0 Jan 10 15:19 binder-control
crw-------  1 root root 242, 1 Jan 10 15:19 my-binder

Note that the binder-control device cannot be deleted since this would make the binderfs instance unusable. The binder-control device will be deleted when the binderfs instance is unmounted and all references to it have been dropped.

binderfs: Mounting Multiple Instances

Mounting another binderfs instance at a different location will create a new and separate instance from all other binderfs mounts. This is identical to the behavior of devpts, tmpfs, and also - even though never merged in the kernel - kdbusfs:

root@edfu:~# mkdir binderfs1
root@edfu:~# mount -t binder binder binderfs1
root@edfu:~# ls -al binderfs1/
total 4
drwxr-xr-x  2 root   root        0 Jan 10 15:23 .
drwxr-xr-x 72 ubuntu ubuntu   4096 Jan 10 15:23 ..
crw-------  1 root   root   242, 2 Jan 10 15:23 binder-control

There is no my-binder device in this new binderfs instance since its devices are not related to those in the binderfs instance at /dev/binderfs. This means users can easily get their private set of binder devices.

binderfs: Mounting binderfs in User Namespaces

The Android binderfs filesystem can be mounted and used to allocate new binder devices in user namespaces. This has the advantage that binderfs can be used in unprivileged containers or any user-namespace-based sandboxing solution:

ubuntu@edfu:~$ unshare --user --map-root --mount
root@edfu:~# mkdir binderfs-userns
root@edfu:~# mount -t binder binder binderfs-userns/
root@edfu:~# # The "bfs" binary used here is the compiled program from above
root@edfu:~# ./bfs binderfs-userns/binder-control my-user-binder
Allocated new binder device with major 242, minor 4, and name my-user-binder
root@edfu:~# ls -al binderfs-userns/
total 4
drwxr-xr-x  2 root root      0 Jan 10 15:34 .
drwxr-xr-x 73 root root   4096 Jan 10 15:32 ..
crw-------  1 root root 242, 3 Jan 10 15:34 binder-control
crw-------  1 root root 242, 4 Jan 10 15:36 my-user-binder

Kernel Patchsets

The binderfs patchset is merged upstream and will be available when Linux 5.0 gets released. There are a few outstanding patches that are currently waiting in Greg’s tree (cf. "binderfs: remove wrong kern_mount() call" and "binderfs: make each binderfs mount a new instance" in char-misc-linus) and some others are queued for the 5.1 merge window. But overall it seems to be in decent shape.

Read more
Colin Ian King

Last year I wrote about kernel commits that are tagged with the "Fixes" tag. Kernel developers use the "Fixes" tag on a bug fix commit to reference an older commit that originally introduced the bug. The adoption of the tag has been steadily increasing since v3.12 of the kernel:

The red line shows the number of commits per release of the kernel, and the blue line shows the number of commits that contain a "Fixes" tag.

In terms of % of commits that contain the "Fixes" tag, one can see it has been steadily increasing since v3.12 and almost 12.5% of kernel commits in v4.20 are tagged this way.

The fixes tag contains the commit SHA of the commit that was fixed, so one can look up the date of the fix and of the commit being fixed and determine the time taken to fix a bug.
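The lookup described above can be sketched roughly as follows. This is not the script used for the graphs in this post: the function names are invented, the commit message in the usage note is made up, and the dates are assumed to be ISO 8601 strings such as those printed by git show -s --format=%cI <sha>.

```python
import re
from datetime import datetime

# Matches lines such as: Fixes: 0123456789ab ("subsystem: some subject")
FIXES_RE = re.compile(r"^Fixes:\s*([0-9a-f]{7,40})\b", re.MULTILINE)


def fixed_shas(commit_message):
    """Return the abbreviated SHAs referenced by "Fixes" tags in a message."""
    return FIXES_RE.findall(commit_message)


def days_to_fix(introduced_iso, fixed_iso):
    """Whole days between the buggy commit and the commit that fixes it."""
    introduced = datetime.fromisoformat(introduced_iso)
    fixed = datetime.fromisoformat(fixed_iso)
    return (fixed - introduced).days
```

For instance, a fix committed on 2018-01-05 for a bug introduced on 2018-01-01 gives days_to_fix("2018-01-01T12:00:00+00:00", "2018-01-05T12:00:00+00:00") == 4.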

As one can see, a lot of issues get fixed in the first few hundred days, and some bugs take years to get fixed. Zooming into the first hundred days of fixes the distribution looks like:


The modal point is at day 4. I suspect these are issues that get found quickly when commits land in linux-next and are caught by early testing, integration builds and static analysis.

Out of the thousands of "Fixes" tagged commits and the time to fix an issue one can determine how long it takes to fix a specific percentage of the bugs:


In the graph above, 50% of fixes are made within 151 days of the original commit, ~69% of fixes are made within a year of the original commit and ~82% of fixes are made within 2 years. The long tail indicates that there are some bugs that take a while to be found and fixed; the final 10% of bugs take more than 3.5 years to be found and fixed.
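Percentages like these come from a cumulative distribution over the time-to-fix values. A minimal sketch, using a small invented data set rather than the real kernel data:

```python
import statistics


def fraction_fixed_within(days_to_fix, limit):
    """Fraction of bugs whose fix landed within limit days."""
    return sum(1 for d in days_to_fix if d <= limit) / len(days_to_fix)


# Hypothetical time-to-fix values in days, for illustration only.
SAMPLE_DAYS = [1, 4, 4, 30, 151, 200, 400, 800, 1500, 3000]

median_days = statistics.median(SAMPLE_DAYS)
within_a_year = fraction_fixed_within(SAMPLE_DAYS, 365)
```

The median corresponds to the 50% point quoted above, and evaluating fraction_fixed_within() at 365 and 730 days gives the one-year and two-year figures.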

Comparing the time to fix issues for kernel versions v4.0, v4.10 and v4.20 for bugs that are fixed in less than 50 days we have:


The trends are similar; however, it is worth noting that bugs are getting found and fixed a little faster in v4.10 and v4.20 than in v4.0. It will be interesting to see how these trends develop over the next few years.

Read more
Colin Ian King

Analysis of Phoronix Test Suite Benchmarks

I've recently been investigating a wide range of benchmarking tests to find suitable candidates to track down performance regressions in Ubuntu. Over the past 3 weeks I have attempted to run the entire set of tests in the Phoronix Test Suite on a low-end Xeon server to determine a subset of reliable tests.

For benchmarks to be reliable they must show little variation in results when running the tests multiple times. The tests also need to run in a timely manner; waiting several days for a set of results is not very timely.

Linked here is a PDF of my set of results.  Some of the Phoronix tests are not listed as they either took way too long to complete or just didn't run successfully on the server.  Tests that have low variability in the results (that is, the standard deviation of the test runs is low compared to the average) are marked in green, high variability are marked in red.

My testing shows that a large proportion of tests have quite large variability (a % standard deviation above 5%), so they are probably not that trustworthy when comparing machines whose benchmark results show little difference. For regression testing, I'm going to only consider tests that have a variability of less than 2.5% as this seems like a good way to filter out the more jittery test results.
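The variability figure is the standard deviation expressed as a percentage of the mean. A small sketch of that filter follows; it assumes the sample standard deviation (the PDF may have used the population form), and the function names and run values are invented for illustration.

```python
import statistics


def percent_stddev(results):
    """Standard deviation of the runs as a percentage of their mean."""
    mean = statistics.mean(results)
    # statistics.stdev is the sample standard deviation (n - 1 divisor).
    return 100.0 * statistics.stdev(results) / abs(mean)


def is_stable(results, threshold=2.5):
    """True if the runs vary by less than threshold percent."""
    return percent_stddev(results) < threshold
```

Five runs scoring 100, 102, 98, 101 and 99 vary by roughly 1.6% and pass the 2.5% filter, while three runs scoring 100, 120 and 80 vary by 20% and do not.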

The bottom line is that some tests are just too variable to be deemed a solid benchmark.  It's always worth sanity checking tests before using them as a gold standard.

Read more
Colin Ian King

Linux I/O Schedulers

The Linux kernel I/O schedulers attempt to balance the need to get the best possible I/O performance while also trying to ensure the I/O requests are "fairly" shared among the I/O consumers. There are several I/O schedulers in Linux; each tries to solve the I/O scheduling issues using different mechanisms/heuristics and each has its own set of strengths and weaknesses.

For traditional spinning media it makes sense to try and order I/O operations so that they are close together to reduce read/write head movement and hence decrease latency.  However, this reordering means that some I/O requests may get delayed, and the usual solution is to schedule these delayed requests after a specific time.   Faster non-volatile memory devices can generally handle random I/O requests very easily and hence do not require reordering.

Balancing the fairness is also an interesting issue.  A greedy I/O consumer should not block other I/O consumers and there are various heuristics used to determine the fair sharing of I/O.  Generally, the more complex and "fairer" the solution the more compute is required, so selecting a very fair I/O scheduler with a fast I/O device and a slow CPU may not necessarily perform as well as a simpler I/O scheduler.

Finally, the types of I/O patterns on the I/O devices influence the I/O scheduler choice, for example, mixed random read/writes vs mainly sequential reads and occasional random writes.

Because of the mix of requirements, there is no such thing as a perfect all-round I/O scheduler. The defaults are chosen to be a good choice for the general user; however, this may not match everyone's needs. To clarify the choices, the Ubuntu Kernel Team has provided a Wiki page describing the choices and how to select and tune the various I/O schedulers. Caveat emptor applies: these are just guidelines and should be used as a starting point for finding the best I/O scheduler for your particular need.
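Before tuning anything it helps to see which scheduler a device is currently using. The kernel exposes this in /sys/block/<device>/queue/scheduler, with the active scheduler shown in square brackets. The following is a minimal sketch for reading it; the function names are invented and the device name is whatever block device you are interested in.

```python
import re


def parse_scheduler(text):
    """Split a sysfs scheduler line into (active, available)."""
    active = None
    available = []
    for token in text.split():
        match = re.fullmatch(r"\[(.+)\]", token)
        if match:
            # The bracketed entry is the scheduler currently in use.
            active = match.group(1)
            available.append(active)
        else:
            available.append(token)
    return active, available


def read_scheduler(device):
    """Read and parse the scheduler line for a block device such as "sda"."""
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        return parse_scheduler(f.read())
```

A line reading "noop deadline [cfq]" parses to an active scheduler of "cfq" with three schedulers available.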

Read more