Canonical Voices

Colin Ian King

Linux I/O Schedulers

The Linux kernel I/O schedulers attempt to balance the need to get the best possible I/O performance with the need to share I/O "fairly" among the I/O consumers.  There are several I/O schedulers in Linux; each tries to solve the I/O scheduling problem using different mechanisms and heuristics, and each has its own set of strengths and weaknesses.

For traditional spinning media it makes sense to try and order I/O operations so that they are close together to reduce read/write head movement and hence decrease latency.  However, this reordering means that some I/O requests may get delayed, and the usual solution is to schedule these delayed requests after a specific time.   Faster non-volatile memory devices can generally handle random I/O requests very easily and hence do not require reordering.

Balancing the fairness is also an interesting issue.  A greedy I/O consumer should not block other I/O consumers and there are various heuristics used to determine the fair sharing of I/O.  Generally, the more complex and "fairer" the solution the more compute is required, so selecting a very fair I/O scheduler with a fast I/O device and a slow CPU may not necessarily perform as well as a simpler I/O scheduler.

Finally, the types of I/O patterns on the I/O devices influence the I/O scheduler choice, for example, mixed random read/writes vs mainly sequential reads and occasional random writes.

Because of the mix of requirements, there is no such thing as a perfect all-round I/O scheduler.  The default is chosen to be a good choice for the general user; however, it may not match everyone's needs.   To clarify the choices, the Ubuntu Kernel Team has provided a Wiki page describing the choices and how to select and tune the various I/O schedulers.  Caveat emptor applies: these are just guidelines and should be used as a starting point to finding the best I/O scheduler for your particular need.
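The active scheduler for each disk can be inspected and changed at runtime through sysfs; a minimal sketch (the device name sda and the kyber scheduler are assumptions, and the change does not persist across reboots):

```shell
# Show the available schedulers for each block device;
# the active one is printed in square brackets.
for f in /sys/block/*/queue/scheduler; do
    if [ -e "$f" ]; then
        printf '%s: %s\n' "$f" "$(cat "$f")"
    fi
done

# To switch sda to another scheduler (e.g. kyber), as root:
# echo kyber | sudo tee /sys/block/sda/queue/scheduler
```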

Read more
K. Tsakalozos

MicroK8s is a local deployment of Kubernetes. Let’s skip all the technical details and just accept that Kubernetes does not run natively on MacOS or Windows. You may be thinking “I have seen Kubernetes running on a MacOS laptop, what kind of sorcery was that?” It’s simple: Kubernetes was running inside a VM. You might not see the VM, or it might not even be a full-blown virtual system, but some level of virtualisation is there. This is exactly what we will show here. We will set up a VM and install MicroK8s inside it. After the installation we will discuss how to use the in-VM Kubernetes.

A multipass VM on MacOS

Arguably the easiest way to get an Ubuntu VM on MacOS is with multipass. Head to the releases page and grab the latest package. Installing it is as simple as double-clicking on the .pkg file.

To start a VM with MicroK8s we run:

multipass launch --name microk8s-vm --mem 4G --disk 40G
multipass exec microk8s-vm -- sudo snap install microk8s --classic
multipass exec microk8s-vm -- sudo iptables -P FORWARD ACCEPT

Make sure you reserve enough resources to host your deployments; above, we got 4GB of RAM and 40GB of hard disk. We also make sure packets to/from the pod network interface can be forwarded to/from the default interface.

Our VM has an IP that you can check with:

> multipass list
Name          State     IPv4    Release
microk8s-vm   RUNNING           Ubuntu 18.04 LTS

Take a note of this IP since our services will become available there.
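If you want the IP in a script rather than by eye, you can parse it out of the multipass list output; a sketch using a hypothetical sample of that output (the 10.141.241.175 address is made up for illustration — in practice you would capture the real output with list_output=$(multipass list)):

```shell
# Hypothetical sample of `multipass list` output
list_output='Name                    State             IPv4             Release
microk8s-vm             RUNNING           10.141.241.175   Ubuntu 18.04 LTS'

# Pull the IPv4 column for our VM
vm_ip=$(printf '%s\n' "$list_output" | awk '$1 == "microk8s-vm" { print $3 }')
echo "$vm_ip"
```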

Other multipass commands you may find handy:

  • Get a shell inside the VM:
multipass shell microk8s-vm
  • Shutdown the VM:
multipass stop microk8s-vm
  • Delete and cleanup the VM:
multipass delete microk8s-vm 
multipass purge

Using MicroK8s

To run a command in the VM we can get a multipass shell with:

multipass shell microk8s-vm

To execute a command without getting a shell we can use multipass exec like so:

multipass exec microk8s-vm -- /snap/bin/microk8s.status

A third way to interact with MicroK8s is via the Kubernetes API server listening on port 8080 of the VM. We can use microk8s’ kubeconfig file with a local installation of kubectl to access the in-VM-kubernetes. Here is how:

multipass exec microk8s-vm -- /snap/bin/microk8s.config > kubeconfig

Install kubectl on the host machine and then use the kubeconfig:

kubectl --kubeconfig=kubeconfig get all --all-namespaces

Accessing in-VM services — Enabling addons

Let’s first enable dns and the dashboard. In the rest of this blog we will be showing different methods of accessing Grafana:

multipass exec microk8s-vm -- /snap/bin/microk8s.enable dns dashboard

We check the deployment progress with:

> multipass exec microk8s-vm -- /snap/bin/microk8s.kubectl get all --all-namespaces

After all services are running we can proceed to look at how to access the dashboard.

The Grafana of our dashboard

Accessing in-VM services — Use the Kubernetes API proxy

The API server is on port 8080 of our VM. Let’s see what the proxy path looks like:

> multipass exec microk8s-vm -- /snap/bin/microk8s.kubectl cluster-info
Grafana is running at

By replacing the host in this path with the VM’s IP, we can reach our service at:

Accessing in-VM services — Setup a proxy

In a very similar fashion to what we just did above, we can ask Kubernetes to create a proxy for us. We need to request the proxy to be available to all interfaces and to accept connections from everywhere so that the host can reach it.

> multipass exec microk8s-vm -- /snap/bin/microk8s.kubectl proxy --address='' --accept-hosts='.*'
Starting to serve on [::]:8001

Leave the terminal with the proxy open. Again, replacing the host with the VM’s IP, we reach the dashboard through:

Make sure you go through the official docs on constructing the proxy paths.

Accessing in-VM services — Use a NodePort service

We can expose our service on a port on the VM and access it from there. This approach uses the NodePort service type. We start by spotting the deployment we want to expose:

> multipass exec microk8s-vm -- /snap/bin/microk8s.kubectl get deployment -n kube-system  | grep grafana
monitoring-influxdb-grafana-v4 1 1 1 1 22h

Then we create the NodePort service:

multipass exec microk8s-vm -- /snap/bin/microk8s.kubectl expose deployment.apps/monitoring-influxdb-grafana-v4 -n kube-system --type=NodePort

We now have a port for the Grafana service:

> multipass exec microk8s-vm -- /snap/bin/microk8s.kubectl get services -n kube-system  | grep NodePort
monitoring-influxdb-grafana-v4 NodePort <none> 8083:32580/TCP,8086:32152/TCP,3000:32720/TCP 13m

Grafana is on port 3000, mapped here to 32720. This port is randomly selected, so it may vary for you. In our case, the service is available on
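Since the node port is random, scripting against it means extracting the mapping; a small sketch that parses the PORT(S) column from the listing above with awk (the port string is copied from that output):

```shell
# PORT(S) column for the Grafana service, as shown by kubectl above
ports='8083:32580/TCP,8086:32152/TCP,3000:32720/TCP'

# Split on commas, then on ':' and '/', and pick the node port
# that container port 3000 (Grafana) is mapped to
node_port=$(printf '%s\n' "$ports" | tr ',' '\n' | awk -F'[:/]' '$1 == 3000 { print $2 }')
echo "$node_port"
```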


MicroK8s on MacOS (or Windows) will need a VM to run. This is no different from any other local Kubernetes solution, and it comes with some nice benefits. The VM gives you an extra layer of isolation. Instead of using your host and potentially exposing the Kubernetes services to the outside world, you have full control of what others can see. Of course, this isolation comes with some extra administrative overhead that may not be worth it for a dev environment. Give it a try and tell us what you think!



MicroK8s on MacOS was originally published in ITNEXT on Medium, where people are continuing the conversation by highlighting and responding to this story.

Read more
Colin Ian King

New features in Forkstat

Forkstat is a simple utility I wrote a while ago that can trace process activity using the rather useful Linux NETLINK_CONNECTOR API.   Recently I have added two extra features that may be of interest:

1.  Improved output using some UTF-8 glyphs.  These are used to show process parent/child relationships and various process events, such as termination, core dumping and renaming.   Use the new -g (glyph) option to enable this mode. For example:

In the above example, the program "wobble" was started and forks off a child process.  The parent then renames itself to wibble (indicated by a turning arrow). The child then segfaults and generates a core dump (indicated by a skull and crossbones), triggering apport to investigate the crash.  After this, we observe NetworkManager creating a thread that runs for a very short period of time.   This kind of activity is normally impossible to spot while running conventional tools such as ps or top.

2. By default, forkstat will show the process name using the contents of /proc/$PID/cmdline.  The new -c option allows one to instead use the 16 character task "comm" field, and this can be helpful for spotting process name changes on PROC_EVENT_COMM events.

These are small changes, but I think they make forkstat more useful.  The updated forkstat will be available in Ubuntu 19.04 "Disco Dingo".

Read more

Snapcraft 3.0

The release notes for snapcraft 3.0 have been long overdue, so for convenience I will reproduce them here too. Presenting snapcraft 3.0: the arrival of snapcraft 3.0 brings fresh air into how snap development takes place! We took the learnings from the main pain points you had when creating snaps over the past several years, and we introduced those lessons into a brand new release: snapcraft 3.0! Build environments: as the cornerstone for behavioral change, we are introducing the concept of build environments.

Read more
Colin Ian King

High-level tracing with bpftrace

Bpftrace is a new high-level tracing language for Linux using the extended Berkeley packet filter (eBPF).  It is a very powerful and flexible tracing front-end that enables systems to be analyzed much like DTrace.

The bpftrace tool is now installable as a snap. From the command line one can install it and enable it to use system tracing as follows:

sudo snap install bpftrace
sudo snap connect bpftrace:system-trace

To illustrate the power of bpftrace, here are some simple one-liners:

 # trace openat() system calls
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%d %s %s\n", pid, comm, str(args->filename)); }'
Attaching 1 probe...
1080 irqbalance /proc/interrupts
1080 irqbalance /proc/stat
2255 dmesg /etc/
2255 dmesg /lib/x86_64-linux-gnu/
2255 dmesg /lib/x86_64-linux-gnu/
2255 dmesg /lib/x86_64-linux-gnu/
2255 dmesg /lib/x86_64-linux-gnu/
2255 dmesg /usr/lib/locale/locale-archive
2255 dmesg /lib/terminfo/l/linux
2255 dmesg /home/king/.config/terminal-colors.d
2255 dmesg /etc/terminal-colors.d
2255 dmesg /dev/kmsg
2255 dmesg /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache

 # count system calls using tracepoints:  
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }'
@[tracepoint:syscalls:sys_enter_getsockname]: 1
@[tracepoint:syscalls:sys_enter_kill]: 1
@[tracepoint:syscalls:sys_enter_prctl]: 1
@[tracepoint:syscalls:sys_enter_epoll_wait]: 1
@[tracepoint:syscalls:sys_enter_signalfd4]: 2
@[tracepoint:syscalls:sys_enter_utimensat]: 2
@[tracepoint:syscalls:sys_enter_set_robust_list]: 2
@[tracepoint:syscalls:sys_enter_poll]: 2
@[tracepoint:syscalls:sys_enter_socket]: 3
@[tracepoint:syscalls:sys_enter_getrandom]: 3
@[tracepoint:syscalls:sys_enter_setsockopt]: 3

Note that it is recommended to use bpftrace with Linux 4.9 or higher.

The bpftrace github project page has an excellent README guide with some worked examples and is a very good place to start.  There is also a very useful reference guide and one-liner tutorial too.

If you have any useful bpftrace one-liners, it would be great to share them. This is an amazingly powerful tool, and it would be interesting to see how it will be used.

Read more
Tim McNamara

Hello world!

Welcome to Canonical Voices. This is your first post. Edit or delete it, then start blogging!

Read more
K. Tsakalozos

Looking at the configuration of a Kubernetes node sounds like a simple thing, yet it is not so obvious.

The arguments kubelet takes come either as command-line parameters or from a configuration file you pass with --config. It seems straightforward to do a ps -ex | grep kubelet and look in the file you see after the --config parameter. Simple, right? But… are you sure you got all the arguments right? What if Kubernetes defaulted to a value you did not want? What if you do not have shell access to a node?

There is a way to query the Kubernetes API for the configuration a node is running with: api/v1/nodes/<node_name>/proxy/configz. Let’s see this in a real deployment.

Deploy a Kubernetes Cluster

I am using the Canonical Distribution of Kubernetes (CDK) on AWS here but you can use whichever cloud and Kubernetes installation method you like.

juju bootstrap aws
juju deploy canonical-kubernetes

..and wait for the deployment to finish

watch juju status 

Change a Configuration

CDK allows for configuring both the command line arguments and the extra arguments of the config file. Here we add arguments to the config file:

juju config kubernetes-worker kubelet-extra-config='{imageGCHighThresholdPercent: 60, imageGCLowThresholdPercent: 39}'

A great question is how we got the imageGCHighThresholdPercent literal. At the time of this writing the official upstream docs point you to the type definitions; a rather ugly approach. There is an EvictionHard property in the type definitions; however, if you look at the example in the upstream docs you see the same property written in lowercase.
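For reference, here is a sketch of how those fields sit in a kubelet configuration file (the apiVersion/kind header is the standard KubeletConfiguration one; the values are the ones used in this post, and the fragment is illustrative rather than a complete config):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imageGCHighThresholdPercent: 60
imageGCLowThresholdPercent: 39
evictionHard:
  memory.available: "100Mi"
```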

Check the Configuration

We will need two shells. On the first one we will start the API proxy and on the second we will query the API. On the first shell:

juju ssh kubernetes-master/0
kubectl proxy

Now that we have the proxy running on the kubernetes-master, use a second shell to get a node name and query the API:

juju ssh kubernetes-master/0
kubectl get no
curl -sSL "http://localhost:8001/api/v1/nodes/<node_name>/proxy/configz" | python3 -m json.tool

Here is a full run:

juju ssh kubernetes-master/0
Welcome to Ubuntu 18.04.1 LTS (GNU/Linux 4.15.0-1023-aws x86_64)
ubuntu@ip-172-31-0-48:~$ kubectl get no
ip-172-31-14-174 Ready <none> 41m v1.12.1
ip-172-31-24-80 Ready <none> 41m v1.12.1
ip-172-31-63-34 Ready <none> 41m v1.12.1
ubuntu@ip-172-31-0-48:~$ curl -sSL "http://localhost:8001/api/v1/nodes/ip-172-31-14-174/proxy/configz" | python3 -m json.tool
"kubeletconfig": {
    "syncFrequency": "1m0s",
    "fileCheckFrequency": "20s",
    "httpCheckFrequency": "20s",
    "address": "",
    "port": 10250,
    "tlsCertFile": "/root/cdk/server.crt",
    "tlsPrivateKeyFile": "/root/cdk/server.key",
    "authentication": {
        "x509": {
            "clientCAFile": "/root/cdk/ca.crt"
        },
        "webhook": {
            "enabled": true,
            "cacheTTL": "2m0s"
        },
        "anonymous": {
            "enabled": false
        }
    },
    "authorization": {
        "mode": "Webhook",
        "webhook": {
            "cacheAuthorizedTTL": "5m0s",
            "cacheUnauthorizedTTL": "30s"
        }
    },
    "registryPullQPS": 5,
    "registryBurst": 10,
    "eventRecordQPS": 5,
    "eventBurst": 10,
    "enableDebuggingHandlers": true,
    "healthzPort": 10248,
    "healthzBindAddress": "",
    "oomScoreAdj": -999,
    "clusterDomain": "cluster.local",
    "clusterDNS": [
    "streamingConnectionIdleTimeout": "4h0m0s",
    "nodeStatusUpdateFrequency": "10s",
    "nodeLeaseDurationSeconds": 40,
    "imageMinimumGCAge": "2m0s",
    "imageGCHighThresholdPercent": 60,
    "imageGCLowThresholdPercent": 39,
    "volumeStatsAggPeriod": "1m0s",
    "cgroupsPerQOS": true,
    "cgroupDriver": "cgroupfs",
    "cpuManagerPolicy": "none",
    "cpuManagerReconcilePeriod": "10s",
    "runtimeRequestTimeout": "2m0s",
    "hairpinMode": "promiscuous-bridge",
    "maxPods": 110,
    "podPidsLimit": -1,
    "resolvConf": "/run/systemd/resolve/resolv.conf",
    "cpuCFSQuota": true,
    "cpuCFSQuotaPeriod": "100ms",
    "maxOpenFiles": 1000000,
    "contentType": "application/vnd.kubernetes.protobuf",
    "kubeAPIQPS": 5,
    "kubeAPIBurst": 10,
    "serializeImagePulls": true,
    "evictionHard": {
        "imagefs.available": "15%",
        "memory.available": "100Mi",
        "nodefs.available": "10%",
        "nodefs.inodesFree": "5%"
    },
    "evictionPressureTransitionPeriod": "5m0s",
    "enableControllerAttachDetach": true,
    "makeIPTablesUtilChains": true,
    "iptablesMasqueradeBit": 14,
    "iptablesDropBit": 15,
    "failSwapOn": false,
    "containerLogMaxSize": "10Mi",
    "containerLogMaxFiles": 5,
    "configMapAndSecretChangeDetectionStrategy": "Watch",
    "enforceNodeAllocatable": [

Summing Up

There is a way to get the configuration of an online Kubernetes node through the Kubernetes API (api/v1/nodes/<node_name>/proxy/configz). This might be handy if you want to code against Kubernetes or you do not want to get into the intricacies of your particular cluster setup.


How to Inspect the Configuration of a Kubernetes Node was originally published in ITNEXT on Medium, where people are continuing the conversation by highlighting and responding to this story.

Read more
K. Tsakalozos

Istio almost immediately strikes you as enterprise grade software. Not so much because of the complexity it introduces, but more because of the features it adds to your service mesh. Must-have features packaged together in a coherent framework:

  • Traffic Management
  • Security Policies
  • Telemetry
  • Performance Tuning

Since microk8s positions itself as the local Kubernetes cluster developers prototype on, it is no surprise that deployment of Istio is made dead simple. Let’s start with the microk8s deployment itself:

> sudo snap install microk8s --classic

Istio deployment is available with:

> microk8s.enable istio

There is a single question that we need to answer at this point: do we want to enforce mutual TLS authentication among sidecars? Istio places a proxy in front of your services so as to take control over routing, security, etc. If we know we have a mixed deployment with non-Istio and Istio-enabled services, we would rather not enforce mutual TLS:

> microk8s.enable istio
Enabling Istio
Enabling DNS
Applying manifest
service/kube-dns created
serviceaccount/kube-dns created
configmap/kube-dns created
deployment.extensions/kube-dns created
Restarting kubelet
DNS is enabled
Enforce mutual TLS authentication between sidecars? If unsure, choose N. (y/N): y

Believe it or not, we are done. Istio v1.0 services are being set up; you can check the deployment progress with:

> watch microk8s.kubectl get all --all-namespaces

We have packaged istioctl in microk8s for your convenience:

> microk8s.istioctl get all --all-namespaces
DESTINATION-RULE NAME         HOST                                             SUBSETS   NAMESPACE      AGE
grafana-ports-mtls-disabled                                                              istio-system   2m
istio-policy                  istio-policy.istio-system.svc.cluster.local                istio-system   3m
istio-telemetry               istio-telemetry.istio-system.svc.cluster.local             istio-system   3m

GATEWAY NAME                      HOSTS     NAMESPACE      AGE
istio-autogenerated-k8s-ingress   *         istio-system   3m

Do not get scared by the number of services and deployments; everything is under the istio-system namespace. We are ready to start exploring!

Demo Time!

Istio needs to inject sidecars into the pods of your deployment. In microk8s auto-injection is supported, so the only thing you have to do is label the namespace you will be using with istio-injection=enabled:

> microk8s.kubectl label namespace default istio-injection=enabled

Let’s now grab the bookinfo example from the v1.0 Istio release and apply it:

> wget
> microk8s.kubectl create -f bookinfo.yaml

The following services should be available soon:

> microk8s.kubectl get svc
NAME          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)
details       ClusterIP                <none>        9080/TCP
kubernetes    ClusterIP                <none>        443/TCP
productpage   ClusterIP                <none>        9080/TCP
ratings       ClusterIP                <none>        9080/TCP
reviews       ClusterIP                <none>        9080/TCP

We can reach the services using the ClusterIP they have; we can, for example, get to the productpage in the above example by pointing our browser at its ClusterIP. But let’s play by the rules and follow the official instructions on exposing the services via NodePort:

> wget
> microk8s.kubectl create -f bookinfo-gateway.yaml

To get to the productpage through ingress we shamelessly copy the example instructions:

> microk8s.kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}'

And our node is the localhost so we can point our browser to http://localhost:31380/productpage

Show me some graphs!

Of course graphs look nice in a blog post, so here you go.

The Grafana Service

You will need to grab the ClusterIP of the Grafana service:

microk8s.kubectl -n istio-system get svc grafana

Prometheus is also available in the same way.

microk8s.kubectl -n istio-system get svc prometheus
The Prometheus Service

And for traces you will need to look at the jaeger-query.

microk8s.kubectl -n istio-system get service/jaeger-query
The Jaeger Service

The servicegraph endpoint is available with:

microk8s.kubectl -n istio-system get svc servicegraph
The ServiceGraph

I should stop here. Go and checkout the Istio documentation for more details on how to take advantage of what Istio is offering.

What to keep from this post


Microk8s puts up its Istio and sails away was originally published in ITNEXT on Medium, where people are continuing the conversation by highlighting and responding to this story.

Read more
Colin Ian King

Static Analysis Trends on Linux Next

I've been running static analysis using CoverityScan on linux-next for 2 years with the aim to find bugs (and try to fix some) before they are merged into Linux.  I have also been gathering the defect count data and tracking the defect trends:

As one can see from above, CoverityScan has found a considerable number of defects and these are being steadily fixed by the Linux developer community.  The encouraging fact is that the outstanding issues are reducing over time. Some of the spikes in the data are because of changes in the analysis that I'm running (e.g. getting more coverage), but even so, one can see a definite downward trend in the total defects in the kernel.

With static analysis, some of these reported defects are false positives or corner cases that are in fact impossible to occur in real life and I am slowly working through these and annotating them so they don't get reported in the defect count.

It must also be noted that over these two years the kernel has grown from around 14.6 million to 17.1 million lines of code, so the defect density has dropped from 1 defect in every ~2100 lines to 1 defect in every ~3000 lines over the past 2 years.  All in all, it is a remarkable improvement for such a large and complex codebase that is growing in size at such a rate.
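As a back-of-the-envelope check, those densities imply roughly the following outstanding defect counts (just arithmetic on the figures quoted above, with awk doing the division):

```shell
# lines of code divided by lines-per-defect
then_defects=$(awk 'BEGIN { printf "%d", 14600000 / 2100 }')
now_defects=$(awk 'BEGIN { printf "%d", 17100000 / 3000 }')
echo "two years ago: ~$then_defects outstanding defects"
echo "today:         ~$now_defects outstanding defects"
```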

Read more

Ubuntu 18.04 marked the transition to a new, more granular, packaging of the NVIDIA drivers, which, unfortunately, combined with a change in logind, and with the previous migration from Lightdm to Gdm3, caused (Intel+NVIDIA) hybrid laptops to stop working the way they used to in Ubuntu 16.xx and older.

The following are the main issues experienced by our users:

  • An increase in power consumption when using the power saving profile (i.e. when the discrete GPU is off).
  • The inability to switch between power profiles on log out (thus requiring a reboot).

We have backported a commit to solve the problem with logind, and I have worked on a few changes in gpu-manager, and in the other key components, to improve the experience when using Gdm3.

NOTE: fixes for Lightdm, and for SDDM still need some work, and will be made available in the next update.

Both issues should be fixed in Ubuntu 18.10, and I have backported my work to Ubuntu 18.04, which is now available for testing.

If you run Ubuntu 18.04, own a hybrid laptop with an Intel and an NVIDIA GPU (supported by the 390 NVIDIA driver),  we would love to get your feedback on the updates in Ubuntu 18.04.

If you are interested, head over to the bug report, follow the instructions at the end of the bug description, and let us know about your experience.

Read more

Hello MAASters!

I’m happy to announce that MAAS 2.5.0 beta 1 has been released. The beta 1 now features

  • Complete proxying of machine communication through the rack controller. This includes DNS, HTTP to the metadata server, proxying with Squid and, new in 2.5.0 beta 1, syslog.
  • CentOS 7 & RHEL 7 storage support (Requires a new Curtin version available in PPA).
  • Full networking for KVM pods.
  • ESXi network configuration

For more information, please refer to MAAS Discourse [1].


Read more
Christian Brauner

Today a new firmware update enabled the long-missing S3 support for the 6th generation Lenovo ThinkPad X1 Carbon. After getting the new update via:

sudo fwupdmgr refresh
sudo fwupdmgr get-updates

You should see:

20KHCTO1WW System Firmware has firmware updates:
GUID:                    a4b51dca-8f97-4310-8821-3330f83c9135
GUID:                    230c8b18-8d9b-53ec-838b-6cfc0383493a
ID:                      com.lenovo.ThinkPadN23ET.firmware
Update Version:          0.1.30
Update Name:             ThinkPad X1 Carbon 6th
Update Summary:          Lenovo ThinkPad X1 Carbon 6th System Firmware
Update Remote ID:        lvfs
Update Checksum:         SHA1(1a528d1b227e500bcaedbd4c7026a477c5f4a5ca)
Update Location:
Update Description:      Lenovo ThinkPad X1 Carbon 6th System Firmware
                         CHANGES IN THIS RELEASE
                         Version 1.30
                         [Important updates]
                          • Nothing.
                         [New functions or enhancements]
                          • Support Optimized Sleep State for Linux in ThinkPad Setup - Config - Power.
                          • (Note) "Linux"option is optimized for Linux OS, Windows user must select
                          • "Windows 10" option
                         [Problem fixes]
                          • Nothing.

After installing the update via:

sudo fwupdmgr update

S3 will still not be enabled. To enable it fully you must enter the BIOS on boot and change the sleep state option (ThinkPad Setup - Config - Power) from "Windows 10" to "Linux":

alt text


alt text


Once the system has booted with the new setting,

dmesg | grep S3

should show

[    0.236226] ACPI: (supports S0 S3 S4 S5)


Read more
K. Tsakalozos

Snap upgrades are a matter of trust.

Do you trust that the vendor (Canonical in the case of microk8s) will not break your deployment and/or drop features? Isn’t this similar to what all OS vendors are doing right now? Updates are pushed transparently to you, and you trust them to do so.

Now, specifically for snaps: as you move from the edge to the beta and then the candidate and stable channels, things get more trustworthy. Software updates are pushed from the application vendor and you can schedule when these updates will land on your system. You can delay updates to a snap for up to 90 days. If you need to never update, you can grab the snap revision you want from the store and deploy it as if it were a local snap (not recommended). If you are a corporation, you will need to have your own snap store to gate any updates. The software update options you have with snaps are VERY similar to what you get with any other application distribution method; what changes is the default approach. The snap default is that updates to all applications are pushed when the application author decides, and we both know you trust the application author because you are already using the app. If the application author is not trustworthy, then the application will just not sell.
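The scheduling and delaying mentioned above can be sketched with snapd's refresh settings; the refresh.hold and refresh.timer options are snapd system settings, and the 30-day window here is an arbitrary example:

```shell
# Compute a hold date 30 days out, in the RFC 3339 form snapd expects
hold_until=$(date --date='+30 days' +%Y-%m-%dT%H:%M:%S%:z)
echo "$hold_until"

# Apply it (requires root; shown commented out for illustration):
# sudo snap set system refresh.hold="$hold_until"
# sudo snap get system refresh.hold
```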

This is a long discussion. I have to admit I was initially not convinced by the snap approach; it might not be suitable for all software out there. But after giving it a bit of thought, snaps present an elegant way to solve the problem of unpatched and outdated software.

Read more
K. Tsakalozos

A friend once asked, why would one prefer microk8s over minikube?… We never spoke since. True story!

That was a hard question, especially for an engineer. The answer is not so obvious largely because it has to do with personal preferences. Let me show you why.

Microk8s-wise this is what you have to do to have a local Kubernetes cluster with a registry:

sudo snap install microk8s --edge --classic
microk8s.enable registry

How is this great?

  • It is super fast! A couple of hundred MB over the internet tubes and you are all set.
  • You skip the pain of going through the docs for setting up and configuring Kubernetes with persistent storage and the registry.

So why is this bad?

  • As a Kubernetes engineer you may want to know what happens under the hood. What got deployed? What images? Where?
  • As a Kubernetes user you may want to configure the registry. Where are the images stored? Can you change any access credentials?

Do you see why this is a matter of preference? Minikube is a mature solution for setting up Kubernetes in a VM. It runs everywhere (even on Windows) and it does only one thing: set up a Kubernetes cluster.

On the other hand, microk8s offers Kubernetes as an application. It is opinionated and it takes a step towards automating common development workflows. Speaking of development workflows...

The full story with the registry

The registry shipped with microk8s is available on port 32000 of the localhost. It is an insecure registry because, let’s be honest, who cares about security when doing local development :) .

And it’s getting better, check this out! The docker daemon used by microk8s is configured to trust this insecure registry. It is this daemon we talk to when we want to upload images. The easiest way to do so is by using the microk8s.docker command coming with microk8s:

# Let's get a Dockerfile first
# And build and push it
microk8s.docker build -t localhost:32000/nginx:testlocal .
microk8s.docker push localhost:32000/nginx:testlocal

If you prefer to use an external docker client you should point it to the socket dockerd is listening on:

docker -H unix:///var/snap/microk8s/docker.sock ps
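If you instead push from a separate docker daemon running on your host, that daemon must also be told to trust the insecure registry; a sketch of the usual daemon.json entry (the /etc/docker/daemon.json path and restart step assume a stock Docker install, not the microk8s one):

```json
{
  "insecure-registries": ["localhost:32000"]
}
```

Typically this goes in /etc/docker/daemon.json, followed by sudo systemctl restart docker.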

To use an image from the local registry just reference it in your manifests:

apiVersion: v1
kind: Pod
metadata:
  name: my-nginx
  namespace: default
spec:
  containers:
  - name: nginx
    image: localhost:32000/nginx:testlocal
  restartPolicy: Always

And deploy with:

microk8s.kubectl create -f the-above-awesome-manifest.yaml

Microk8s and registry

What to keep from this post?

You want Kubernetes? We deliver it as a (sn)app!

You want to see your tool-chain in microk8s? Drop us a line. Send us a PR!

We are pleased to see happy Kubernauts!

Those of you who are here for the gossip. He was not that good of a friend (obviously!). We only met in a meetup :) !


Microk8s Docker Registry was originally published in ITNEXT on Medium, where people are continuing the conversation by highlighting and responding to this story.

Read more

Hello MAASTers

MAAS 2.4.1 has now been released and it is a bug fix release. Please see more details in [1].


Read more

A short-lived ride. After some time on Kubuntu on this new laptop, I just re-discovered that I did not want to live in the Plasma world anymore. While I do value all the work the team behind it does, the user interface is just not for me, as it feels rather busy for my liking. In the aforementioned post I wrote about running the Ubuntu Report Tool on this system; it is not part of the Kubuntu install or first-boot experience, but you can install it by running apt install ubuntu-report, followed by running ubuntu-report to actually create the report and, if you want, send it too.

Read more
Christian Brauner

Unprivileged File Capabilities

alt text


File capabilities (fcaps) are capabilities associated with - well - files, usually a binary. They can be used to temporarily elevate privileges for unprivileged users in order to accomplish a privileged task. This allows various tools to drop the dangerous setuid (suid) or setgid (sgid) bits in favor of fcaps.

While fcaps have been supported since Linux 2.6.24, they could only be set in the initial user namespace. If root in a non-initial user namespace had been allowed to set them, any unprivileged user on the host would have been able to map their own uid to root in a new user namespace, set fcaps that would grant more privileges, and then execute the binary with elevated privileges on the host. This also means that until recently it was not safe to use fcaps in unprivileged containers, i.e. containers using user namespaces. The good news is that starting with Linux kernel version 4.14 it is possible to set fcaps in user namespaces.

Kernel Patchset

The patchset to enable this has been contributed by Serge Hallyn, a co-maintainer and core developer of the LXD and LXC projects:

commit 8db6c34f1dbc8e06aa016a9b829b06902c3e1340
Author: Serge E. Hallyn <>
Date:   Mon May 8 13:11:56 2017 -0500

    Introduce v3 namespaced file capabilities

LXD Now Preserves File Capabilities In User Namespaces

In parallel to the kernel patchset we have now enabled LXD to preserve fcaps in user namespaces. This means that if your kernel supports namespaced fcaps, LXD will preserve them whenever unprivileged containers are created or when their idmapping is changed. No matter if you go from privileged to unprivileged or the other way around, your filesystem capabilities will be right there with you. In other news, there is now little to no use for the suid and sgid bits even in unprivileged containers.

This is something that the Linux Containers Project has wanted for a long time and we are happy that we are the first runtime to fully support this feature.

If all of the above either makes no sense to you, or you’re asking yourself what is so great about this because some distros have been using fcaps for a long time, don’t worry: we’ll try to shed some light on all of this.

The dark ages: suid and sgid binaries

Not too long ago the suid and sgid bits were the only well-known mechanism to temporarily grant elevated privileges to unprivileged users executing a binary. At one point, some or all of the following binaries were suid or sgid binaries on most distros:

  • ping
  • newgidmap
  • newuidmap
  • mtr-packet

The binary that most developers will already have used is the ping binary. It’s convenient to just check whether a connection to the internet has been established successfully by pinging a random website. It’s such a common tool that most people don’t even think about it needing any sort of privilege. In fact, it does require privileges: ping wants to open sockets of type SOCK_RAW, but the kernel prevents unprivileged users from using sockets of type SOCK_RAW because it would allow them to e.g. send ICMP packets directly. Yet ping seems like a binary that is useful to unprivileged users as well as safe. Short of a better mechanism, the most obvious choice is to have it be owned by uid 0 and set the suid bit.

> perms /bin/ping
-rwsr-xr-x 4755 /bin/ping

You can see the little s in the permissions. This indicates that this version of ping has the suid bit set. Hence, if called it will run as uid 0 independent of the uid of the caller. In short, if my user has uid 1000 and calls the ping binary ping will still run with uid 0.

While the suid mechanism gets the job done it is also wildly inappropriate. ping does need elevated privileges in one specific area. But by setting the suid bit and having ping be owned by uid 0 we’re granting it all kinds of privileges, in fact all privileges. If there ever is a major security sensitive bug in a suid binary it is trivial for anyone to exploit the fact that it runs as uid 0.

Of course, the kernel has all kinds of security mechanisms to reduce the impact of the suid and sgid bits. If you strace a suid binary the suid bit will be ignored, there are complex rules regarding execve()ing a binary that has the suid bit set, and the suid bit is also dropped when the owner of the binary in question changes, i.e. when you call chown() on it. Still, these are all mitigations for something that is inherently dangerous because it grants too much for too little gain. It’s like someone asking for a little sugar and you handing out the key to your house. To quote Eric:

Frankly being able to raise the priveleges of an existing process is such a dangerous mechanism and so limiting on system design that I wish someone would care, and remove all suid, sgid, and capabilities use from a distro. It is hard to count how many neat new features have been shelved because of the requirement to support suid root executables.

Capabilities and File Capabilities

This is where capabilities come into play 1. Capabilities start from the idea that the root privilege could just as well be split into subsets of privileges. Whenever something requests to perform an operation that requires privileges it doesn’t have, we can grant it a very specific subset instead of all privileges at once 2. For example, the ping binary would only need the CAP_NET_RAW capability, because that is the capability that regulates whether a process can open SOCK_RAW sockets.

Capabilities are associated with processes and files. Granted, Linux capabilities are not the cleanest or easiest concept to grasp, but I’ll try to shed some light. In essence, capabilities can be present in four different types of sets (glossing over details now for the sake of brevity):

  • effective capabilities: the capabilities the process has at the time of trying to perform an operation. The kernel performs its permission checks against this set.
  • permitted capabilities: the capabilities a process is allowed to raise in the effective set.
  • inheritable capabilities: capabilities that should be (but are only under certain restricted conditions) preserved across an execve().
  • ambient capabilities: these exist to fix the shortcomings of inheritable capabilities, i.e. to allow unprivileged processes to preserve capabilities across an execve() call. 3

Last but not least we have file capabilities, i.e. capabilities that are attached to a file. When such a file is execve()ed the associated fcaps are taken into account when calculating the permissions after the execve().

Extended attributes and File Capabilities

The part most users are confused about is how capabilities get associated with files. This is where extended attributes (xattrs) come into play. xattrs are <key>:<value> pairs that can be associated with files. They are stored on disk as part of a file’s metadata. The <key> of an xattr is always a string identifying the attribute in question, whereas the <value> can be arbitrary data, i.e. another string or binary data. Note that it is neither guaranteed nor required by the kernel that a filesystem supports xattrs. The virtual filesystem (vfs) handles all core permission checks, i.e. it verifies that the caller is allowed to set the requested xattr, but the actual operation of writing out the xattr on disk is left to the filesystem. Without going into the specifics, the call chain currently is:

SYSCALL_DEFINE5(setxattr, const char __user *, pathname,
                const char __user *, name, const void __user *, value,
                size_t, size, int, flags)
-> static int path_setxattr(const char __user *pathname,
                            const char __user *name, const void __user *value,
                            size_t size, int flags, unsigned int lookup_flags)
   -> static long setxattr(struct dentry *d, const char __user *name,
                           const void __user *value, size_t size, int flags)
      -> int vfs_setxattr(struct dentry *dentry, const char *name,
                          const void *value, size_t size, int flags)
         -> int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
                                      const void *value, size_t size, int flags)

and finally __vfs_setxattr_noperm() will call

int __vfs_setxattr(struct dentry *dentry, struct inode *inode, const char *name,
                   const void *value, size_t size, int flags)
{
        const struct xattr_handler *handler;

        handler = xattr_resolve_name(inode, &name);
        if (IS_ERR(handler))
                return PTR_ERR(handler);
        if (!handler->set)
                return -EOPNOTSUPP;
        if (size == 0)
                value = "";  /* empty EA, do not remove */
        return handler->set(handler, dentry, inode, name, value, size, flags);
}

The __vfs_setxattr() function will then call xattr_resolve_name(), which finds and returns the appropriate handler for the xattr from the corresponding filesystem’s list of struct xattr_handler entries. If the filesystem has a handler for the xattr in question, the attribute will be set; if not, EOPNOTSUPP is surfaced to the caller.

For this article we will only focus on the permission checks that the vfs performs, not on the filesystem specifics. An important thing to note is that different xattrs are subject to different permission checks by the vfs. First, the vfs regulates what types of xattrs are supported in the first place. If you look at the xattr.h header you will find all supported xattr namespaces. An xattr namespace is essentially nothing but a prefix like security.. Let’s look at a few examples from the xattr.h header:

#define XATTR_SECURITY_PREFIX "security."

#define XATTR_SYSTEM_PREFIX "system."

#define XATTR_TRUSTED_PREFIX "trusted."

#define XATTR_USER_PREFIX "user."

Based on the detected prefix, the vfs decides what permission checks to perform. For example, the user. namespace is not subject to very strict permission checks, since it exists to allow users to store arbitrary information. However, some xattrs are subject to very strict permission checks, since they allow privileges to be changed. This affects the security. namespace, for example. In fact, the xattr.h header even exposes a specific capability suffix to use with the security. namespace:

#define XATTR_CAPS_SUFFIX "capability"

As you might have figured out file capabilities are associated with the security.capability xattr.
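Before we get to the security.capability specifics, the basic mechanics of setting an xattr are easy to try out in the loosely checked user. namespace. A sketch (tag_file and read_tag are hypothetical helpers; note that a filesystem may legitimately refuse xattrs with EOPNOTSUPP):

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

/* Attach a user. xattr to a file. The user. namespace is the least
 * restricted one: only the usual file ownership/permission checks
 * apply. Returns 0 or a negative errno. */
int tag_file(const char *path, const char *value)
{
        if (setxattr(path, "user.demo", value, strlen(value), 0) < 0)
                return -errno;
        return 0;
}

/* Read the tag back. Returns the value length or a negative errno. */
ssize_t read_tag(const char *path, char *buf, size_t len)
{
        ssize_t ret = getxattr(path, "user.demo", buf, len);

        return ret < 0 ? -errno : ret;
}
```

Setting security.capability goes through exactly the same setxattr(2) entry point; the difference is purely in the permission checks and value format the vfs applies, as described next.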

In contrast to other xattrs the value associated with the security.capability xattr key is not a string but binary data. The actual implementation is a C struct that contains bitmasks of capability flags. To actually set file capabilities userspace would usually use the libcap library because the low-level bits of the implementation are not very easy to use. Let’s say a user wanted to associate the CAP_NET_RAW capability with the ping binary on a system that only supports non-namespaced file capabilities. Then this is the minimum that you would need to do in order to set CAP_NET_RAW in the effective and permitted set of the file:

/*
 * Do not simply copy this code. For the sake of brevity I e.g. omitted
 * handling the necessary endianness translation. (Not to speak of the
 * apparent ugliness and missing documentation of my sloppy macros.)
 */

struct vfs_cap_data xattr = {0};

#define raise_cap_permitted(x, cap_data)   cap_data.data[(x)>>5].permitted   |= (1<<((x)&31))
#define raise_cap_inheritable(x, cap_data) cap_data.data[(x)>>5].inheritable |= (1<<((x)&31))

raise_cap_permitted(CAP_NET_RAW, xattr);
/* request v2 fcaps and mark the effective bit */
xattr.magic_etc = VFS_CAP_REVISION_2 | VFS_CAP_FLAGS_EFFECTIVE;

setxattr("/bin/ping", "security.capability", &xattr, sizeof(xattr), 0);

After having done this we can look at the ping binary and use the getcap binary to check whether we successfully set the CAP_NET_RAW capability on the ping binary. Here’s a little demo:


Setting Unprivileged File Capabilities

On kernels that support namespaced file capabilities the straightforward way to set a file capability is to attach to the user namespace in question as root and then simply perform the above operations. The kernel will then transparently handle the translation between a non-namespaced and a namespaced capability by recording the rootid from the kernel’s perspective (the kuid).

However, it is also possible to set file capabilities on behalf of another user namespace. In order to do this the code above needs to be changed slightly:

/*
 * Do not simply copy this code. For the sake of brevity I e.g. omitted
 * handling the necessary endianness translation. (Not to speak of the
 * apparent ugliness and missing documentation of my sloppy macros.)
 */

struct vfs_ns_cap_data ns_xattr = {0};

#define raise_cap_permitted(x, cap_data)   cap_data.data[(x)>>5].permitted   |= (1<<((x)&31))
#define raise_cap_inheritable(x, cap_data) cap_data.data[(x)>>5].inheritable |= (1<<((x)&31))

raise_cap_permitted(CAP_NET_RAW, ns_xattr);
/* request v3 (namespaced) fcaps and mark the effective bit */
ns_xattr.magic_etc = VFS_CAP_REVISION_3 | VFS_CAP_FLAGS_EFFECTIVE;
ns_xattr.rootid = 1000000;

setxattr("/bin/ping", "security.capability", &ns_xattr, sizeof(ns_xattr), 0);

As you can see, the struct we use has changed. Instead of struct vfs_cap_data we are now using struct vfs_ns_cap_data, which has gained an additional field, rootid. In this example we set the rootid to 1000000, which is the host uid that uid 0 in the container’s user namespace maps to. Additionally, we set the fcap version bits in magic_etc to VFS_CAP_REVISION_3, the version the vfs is expected to support.


As you can see from the asciicast, we can’t execute the ping binary as an unprivileged user on the host since the fcap is namespaced and associated with uid 1000000. But if we copy that binary to a container where this uid is mapped to uid 0, we can call ping as an unprivileged user.

So let’s look at an actual unprivileged container and let’s set the CAP_NET_RAW capability on the ping binary in there:


Some Implementation Details

As you have seen above a new struct vfs_ns_cap_data has been added to the kernel:

/* same as vfs_cap_data but with a rootid at the end */
struct vfs_ns_cap_data {
        __le32 magic_etc;
        struct {
                __le32 permitted;    /* Little endian */
                __le32 inheritable;  /* Little endian */
        } data[VFS_CAP_U32];
        __le32 rootid;
};
In the end, this struct is what the kernel expects to be passed and what it will use to calculate fcaps. The location of the permitted and inheritable sets in struct vfs_ns_cap_data is obvious, but the effective set seems to be missing. Whether or not effective caps are set on the file is determined by raising the VFS_CAP_FLAGS_EFFECTIVE bit in the magic_etc mask. The magic_etc member is also used to tell the kernel which fcaps version the vfs is expected to support.

The kernel will verify that either XATTR_CAPS_SZ_2 or XATTR_CAPS_SZ_3 is passed as size and is correctly paired with the VFS_CAP_REVISION_2 or VFS_CAP_REVISION_3 flag. If XATTR_CAPS_SZ_2 is set, the kernel will not look for a rootid field in the struct it received; i.e. even if you pass a struct vfs_ns_cap_data with a rootid but set XATTR_CAPS_SZ_2 as the size parameter and VFS_CAP_REVISION_2 in magic_etc, the kernel will ignore the rootid field and instead use the rootid of the current user namespace. This allows the kernel to transparently translate from VFS_CAP_REVISION_2 to VFS_CAP_REVISION_3 fcaps. The main translation mechanism can be found in cap_convert_nscap() and rootid_from_xattr():

/*
 * User requested a write of security.capability.  If needed, update the
 * xattr to change from v2 to v3, or to fixup the v3 rootid.
 * If all is ok, we return the new size, on error return < 0.
 */
int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size)
{
        struct vfs_ns_cap_data *nscap;
        uid_t nsrootid;
        const struct vfs_cap_data *cap = *ivalue;
        __u32 magic, nsmagic;
        struct inode *inode = d_backing_inode(dentry);
        struct user_namespace *task_ns = current_user_ns(),
                *fs_ns = inode->i_sb->s_user_ns;
        kuid_t rootid;
        size_t newsize;

        if (!*ivalue)
                return -EINVAL;
        if (!validheader(size, cap))
                return -EINVAL;
        if (!capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
                return -EPERM;
        if (size == XATTR_CAPS_SZ_2)
                if (ns_capable(inode->i_sb->s_user_ns, CAP_SETFCAP))
                        /* user is privileged, just write the v2 */
                        return size;

        rootid = rootid_from_xattr(*ivalue, size, task_ns);
        if (!uid_valid(rootid))
                return -EINVAL;

        nsrootid = from_kuid(fs_ns, rootid);
        if (nsrootid == -1)
                return -EINVAL;

        newsize = sizeof(struct vfs_ns_cap_data);
        nscap = kmalloc(newsize, GFP_ATOMIC);
        if (!nscap)
                return -ENOMEM;
        nscap->rootid = cpu_to_le32(nsrootid);
        nsmagic = VFS_CAP_REVISION_3;
        magic = le32_to_cpu(cap->magic_etc);
        if (magic & VFS_CAP_FLAGS_EFFECTIVE)
                nsmagic |= VFS_CAP_FLAGS_EFFECTIVE;
        nscap->magic_etc = cpu_to_le32(nsmagic);
        memcpy(&nscap->data, &cap->data, sizeof(__le32) * 2 * VFS_CAP_U32);

        *ivalue = nscap;
        return newsize;
}
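The size/revision pairing that validheader() enforces can be sketched in a few lines. This is a simplified model covering only the v2/v3 cases discussed here (the kernel also still accepts the legacy v1 layout); the constants and their values are from linux/capability.h:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Constants as defined in linux/capability.h. */
#define VFS_CAP_REVISION_MASK 0xFF000000
#define VFS_CAP_REVISION_2    0x02000000
#define VFS_CAP_REVISION_3    0x03000000
#define XATTR_CAPS_SZ_2       20  /* sizeof(struct vfs_cap_data)    */
#define XATTR_CAPS_SZ_3       24  /* sizeof(struct vfs_ns_cap_data) */

/* Simplified mirror of the kernel's pairing rule: a v2 xattr must be
 * 20 bytes, a v3 xattr (with the extra rootid) 24 bytes. */
int fcap_header_valid(unsigned int magic_etc, size_t size)
{
        switch (magic_etc & VFS_CAP_REVISION_MASK) {
        case VFS_CAP_REVISION_2:
                return size == XATTR_CAPS_SZ_2;
        case VFS_CAP_REVISION_3:
                return size == XATTR_CAPS_SZ_3;
        default:
                return 0;
        }
}
```

Note that a v3 header with the VFS_CAP_FLAGS_EFFECTIVE bit set (e.g. magic_etc of 0x03000001) still validates: the revision comparison only looks at the masked top byte.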


Having fcaps available in user namespaces just makes the argument for always using unprivileged containers even stronger. The Linux Containers Project is also working on a bunch of other kernel and userspace features to improve unprivileged containers even more. Stay tuned! :)


  1. While capabilities provide a better mechanism to temporarily and selectively grant privileges to unprivileged processes, they are by no means inherently safe. Setting fcaps should still be done rarely. Whether privilege escalation happens via suid or sgid bits or via fcaps doesn’t matter in the end: it’s still a privilege escalation. 

  2. Exactly how to split up the root privilege and how exactly privileges should be implemented (e.g. should they be attached to file descriptors, should they be attached to inodes, etc.) is a good argument to have. For the sake of this article we will skip this discussion and assume the Linux implementation of POSIX capabilities. 

  3. If people are super keen and request this I can write a longer post about how exactly they all relate to each other and possibly look at some of the implementation details too. 

Read more

Prologue. After a week away from my computer, I want to organize my thoughts on the progress made towards build VMs with this write-up, since that forum post can be a bit overwhelming if you are casually trying to keep up to date. The reason for this feature work, for those not up to speed, is that we want a very consistent build environment, so that anyone building a project can expect a predictable outcome: a working snap (or a non-working one if it really doesn’t work).

Read more