We’ve made a number of improvements to the Launchpad build farm in the last month, with the aim of improving its performance and robustness. This sort of work is usually invisible to users except when something goes wrong, so we thought it would be worth taking some time to give you a summary. Some of this work was on the Launchpad software itself, while some was on the launchpad.net hardware.
(To understand some of the rest of this post, it’s useful to be aware of the distinction between virtualised and devirtualised builders in Launchpad. Virtualised builders are used for most PPAs: they build untrusted code in a Xen guest which is initialised from scratch at the start of each build, and are only available for i386, amd64, and a small number of ARM builds by way of user-mode QEMU. Devirtualised builders run on ordinary hardware with less strict containment, and are used for Ubuntu distribution builds and a few specialised PPAs.)
ARM builders have been a headache for some time. For our devirtualised builders, we were using a farm of PandaBoards, having previously used BeagleBoards and Babbage boards. These largely did the job, but they're really development boards rather than server-class hardware, and it showed in places: disk performance wasn't up to our needs, and we saw build failures due to data corruption much more frequently than we were comfortable with. We recently installed a cluster of Calxeda Highbank nodes, which have been performing much more reliably.
It has long been possible to cancel builds on virtualised builders: this is easy because we can just reset the guest. However, it was never possible to cancel builds on devirtualised builders: killing the top-level build process isn't sufficient for builds that are stuck in various creative ways, so you need to go round repeatedly killing every process in the build chroot until they've all gone away. We've now hooked this up properly, and it is possible for build daemon maintainers to cancel builds on devirtualised builders without operator assistance, which should eliminate situations where we need urgent builds to jump the queue but can't because all builders are occupied by long-running builds. (People with upload privileges can currently cancel builds too, which is intended mainly to allow cancelling your own builds; please don't abuse this or we may need to tighten up the permissions.) As a bonus, cancelling a build no longer loses the build log.
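To illustrate the "repeatedly kill everything in the chroot" approach described above, here is a minimal sketch in Python. It is not the build daemon's actual code: it assumes a Linux host, and identifies chrooted processes by reading each process's `/proc/<pid>/root` symlink, looping until no process rooted in the chroot remains.

```python
import os
import signal
import time


def kill_chroot_processes(chroot_path, timeout=60):
    """Repeatedly SIGKILL every process running inside chroot_path until
    none remain, or the timeout expires.

    A sketch only: the real build daemon's logic differs in detail.
    Returns True once the chroot is empty, False on timeout.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        victims = []
        for pid in os.listdir('/proc'):
            if not pid.isdigit():
                continue
            try:
                # /proc/<pid>/root is a symlink to the process's root
                # directory; for a chrooted process it points into the
                # build chroot.
                root = os.readlink('/proc/%s/root' % pid)
            except OSError:
                continue  # process exited, or we lack permission
            if root == chroot_path or root.startswith(chroot_path + '/'):
                victims.append(int(pid))
        if not victims:
            return True  # chroot is empty; safe to tear it down
        for pid in victims:
            try:
                os.kill(pid, signal.SIGKILL)
            except OSError:
                pass  # process already gone
        time.sleep(1)  # give the kernel a moment, then re-scan
    return False
```

The loop matters: killing one process can unblock another (for example, a parent waiting on a child), so a single pass is not enough; the scan repeats until the chroot is genuinely empty.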
Finally, we have been putting quite a bit of work into build farm reliability. A few problems have led to excessively long queues on virtual builders:
- Builders hung for some time when they should have timed out, due to a recent change in su; this is now fixed in the affected Ubuntu series.
- Xen guests often fail to restore for one reason or another, and when this happened builders would fail in ways that required an operator to fix. We had been dealing with this by having our operators do semi-automatic builder fixing runs a few times a day, but in recent months the frequency of failures has been difficult to keep up with in this way, especially at the weekend. Some of this is probably related to our current use of a rather old version of Xen, but the builder management code in Launchpad could also handle this much better by trying to reset the guest again in the same way that we do at the start of each build. As of this morning’s code deployment, we now do this, and the build farm seems to be holding up much more robustly.
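The retry behaviour added in that deployment can be sketched as a small loop, shown here in Python. The `reset_guest` and `guest_is_ok` callables are hypothetical stand-ins for Launchpad's real virtualisation-management hooks, not actual Launchpad APIs.

```python
import time


def ensure_guest_ready(reset_guest, guest_is_ok, attempts=3, delay=30):
    """Try to bring a virtualised builder's guest to a clean state,
    resetting it again on failure rather than immediately marking the
    builder as broken for an operator to fix.

    reset_guest and guest_is_ok are hypothetical callables standing in
    for the real virtualisation-management hooks.  Returns True once
    the guest comes up healthy, False after exhausting all attempts.
    """
    for attempt in range(1, attempts + 1):
        reset_guest()  # same reset we perform at the start of each build
        if guest_is_ok():
            return True
        time.sleep(delay)  # back off before trying again
    return False  # escalate to an operator only after repeated failures
```

The point of the change is visible in the return values: a transient restore failure now costs an automatic retry rather than a manual builder-fixing run, and operators are only involved when every attempt fails.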
This should make things better for everyone, but we aren’t planning to stop here. We’re intending to convert the virtual builders to an OpenStack deployment, which should allow us to scale them much more flexibly. We plan to take advantage of more reliable build cancellation to automatically cancel in-progress builds that have been superseded by new source uploads, so that we don’t spend resources on builds that will be rejected on upload. And we plan to move Ubuntu live file system building into Launchpad so that we can consolidate those two build farms and make better use of our available hardware.