The Bazaar package importer is a service that we run to allow people to use
Bazaar for Ubuntu development by importing any source package uploads in to
bzr. It's not something that most Ubuntu developers will interact with directly,
but is of increasing importance.
I've spent a lot of time working in the background on this project, and while
the details have never been secret, and in fact the code has been available
for a while, I'm sure most people don't know what goes on. I wanted to rectify
that, and so started with some wiki documentation on the internals. This post
is more abstract, talking about the archtecture.
While it has a common pattern of requirements, and so those familiar with
the architecture of job systems will recognise the solution, the devil is in the
details. I therefore present this as a case-study of one such system that can
be used to constrast other similar sytstems as an aid to learning how differing
requirements affect the finished product.
The Problem
For the Ubuntu Distributed Development initative we have a need for a
process that imports packages in to bzr on an ongoing basis as they are
uploaded to Ubuntu. This is so that we can have a smooth transition rather
than a flag day where everyone switches. For those that are familiar with
them think Launchpad's code imports but with Debian/Ubuntu packages as the
source, rather than a foreign VCS.
This process is required to watch for uploads to Debian and Ubuntu and trigger
a run to import that upload to the bzr branches, pushing the result to LP. It
should be fast, though we currently have a publication delay in Ubuntu that
means we are used to latencies of an hour, so it doesn't have to be greased
lightning to get acceptance. It is more important to be reliable, so that the
bzr branches can be assumed to be up to date, that is crucial for acceptance.
It should also keep an audit trail of what it thinks is in the branches. As
we open up write access to the resulting branches to Ubuntu developers we can
not rely on the content of the branches not being tampered with. I don't expect
this will ever be a problem, but I wanted to ensure that we could at least detect
tampering, even if we couldn't know exactly what had happened by keeping private
copies of everything.
The Building Blocks
The first building block of the solution is the import script for a single package.
You can run this at any time and it will figure out what is unimported, and do
the import of the rest, so you can trigger it as many times as you like without
worrying that it will cause problems. Therefore the requirement is only to trigger
it at least once when there has been an upload since the last time it was run, which
is a nicer requirement than "exactly once per upload" or similar.
However, as it may import to a number of branches (both lucid and karmic-security
in the case of a security upload, say), and these must be consistent on Launchpad,
only one instance can run at once. There is no way to do atomic operations on sets
of branches on Launchpad, therefore we use locks to ensure that only one process
is running per-package at any one time. I would like to explore ways to remove this
requirement, such as avoiding race conditions by operating on the Launchpad branches
in a consistent manner, as this would give more freedom to scale out.
The other part of the system is a driver process. We use separate processes so that
any faults in the import script can be caught in the supervisor process, with the
errors being logged. The driver process picks a package to import and triggers a run
of the script for it. It uses something like the following to do that:
write_failure(package, "died")
try:
import(package)
except:
write_failure(packge, stderr)
finally:
remove_failure(package)
write_failure creates a record that the package failed to import with a reason. This
provides a list of problems to work through, and also means that we can avoid trying
to import a package if we know it has failed. This ensures that previous failures are
dealt with properly without giving them a chance to corrupt things later.
Queuing
I said that the driver picks a package and imports it. To do this it simply queries
the database for the highest priority job waiting, dispatching the result, or
sleeping if there are no waiting jobs. It can actually dispatch multiple jobs in
parallel as it uses processes to do the work.
The queue is filled by a couple of other processes triggered by cron. This is useful
as it means that further threads are not required, and there is less code running
in the monitor process, and so less chance that bugs will bring it down.
The first process is one that checks for new uploads since the last check and adds a
job for them, see below for the details. The second is one that looks at the current
list of failures and retries some of them automatically, if the failure looks like it
was likely to be transient, such as a timeout error trying to reach Launchpad. It
only retries after a timeout of a couple of hours has elapsed, and also if that package
hasn't failed in that same way several times in a row (to protect against e.g. the data
that job is sending to LP causing it to crash and so give timeout errors.)
It may be better to use an AMQP broker or a job server such as Gearman for this task,
rather that just using the database. However, we don't really need any of the more
advanced features that these provide, and already have some degree of loose-coupling,
so using fewer moving parts seems sensible.
Reacting to new uploads
I find this to be a rather neat solution, thanks to the Launchpad team. We use
the API for this, notably a method on IArchive called getPublishedSources().
They key here is the parameter "created_since_date". We keep track of this and
pass it to the API calls to get the uploads since the last time we ran, and
then act on those. Once we processed them all then we update the stored
date and go around again.
This has some nice properties, it is a poll interface, but has some things in
common with an event-based one. Key in my eyes is that we don't have to have
perfect uptime in order to ensure we never miss events.
However, I am not convinced that we will never get a publication that appears
later than one that we have dealt with, but that reports an earlier time.
If this happens we would never see it. The times we use always come from
LP, so don't require synchronised clocks between the machine where this
runs and the LP machines, but it could still happen inside LP.
To avoid this I subtract a delta when I send the request, so assuming
the skew would not be greater than that delta we won't get hit. This does
mean that you repeatedly try and import the same things, but that
is just a mild inefficiency.
Synchronisation
There is a synchronisation point when we push to Launchpad. Before and after
this critical period we can blow away what we are doing with no issues. During
it though we will have an inconsistent state of the world if we did that.
Therefore I used a protocol to ensure that we guard this section.
As we know locking ensures that only one process runs at a time, meaning that
the only way to race is with "yourself." All the code is written to assume
that things can go down at any time as I said, the supervisor catches this
and marks the failures, and even guards against itself dying. Therefore
when it picks back up and restarts the jobs that it was processing before
dying it needs to ensure that it wasn't in the critical section.
To do this we use a three-phase commit on the audit data to accomany the push.
When we are doing the import we track the additions to the audit data separately
from the committed data. Then if we die before we reach the critical section
we can just drop it again, returning to the inital state.
The next phase marks in the database that the critical section has begun. We then
start the push back. If we die here we know we were in the critical section and can
restart the push. Only once the push has fully completed do we move the new audit
data in to place.
The next step cleans up the local branches, dying here means we can just carry
on with the cleanup. Finally the mark that we are in the critical section is
removed, and we are back to the start state, indicating that the last run was
clean, and any subsequent run can proceed.
All of this means that if the processes go down for any reason, they will clean
up or continue as they restart as normal.
Dealing with Launchpad API issues
The biggest area of operational headaches I have tends to come from using the
Launchpad API. Overall the API is great to have, and generally a pleasure to
use, but I find that it isn't as robust as I would like. I have spent quite
some time trying to deal with that, and I would like to share some tips from
my experience. I'm also keen to help diagnose the issues further if any Launchpad
developers would like so that it can be more robust off the bat.
The first tip is: partition the data. Large datasets combined with fluctuating
load may mean that you suddenly hit a timeout error. Some calls allow you to
partition the data that you request. For instance, getPublishedSources that
I spoke about above allows you to specify a distro_series parameter. Doing
distro.main_archive.getPublishedSources()
is far far more likely to timeout than
for s in distro.series:
distro.main_archive.getPublishedSources(distro_series=s)
in fact, for Ubuntu, the former is guaranteed to timeout, it is a lot of data.
This is more coding, and not the natural way to do it, therefore it would be
great if launchpadlib automatically partioned and recombined the data.
The second tip is: expect failure. This one should be obvious, but the API doesn't
make it clear, unlike something like python-couchdb. It is a webservice, so you
will sometimes get HTTP exceptions, such as when LP goes offline for a rollout.
I've implemented randomized exponential backoff to help with this, as I tend
to get frequent errors that don't apparently correspond to service issues.
I very frequently see 502 return codes, on both edge and production, which I believe
means that apache can't reach the appservers in time.
Summary
Overall, I think this architecture is good, given the synchronisation requirements
we have for pushing to LP, without those it could be more loosely coupled.
The amount of day-to-day hand-holding required has reduced as I have learnt about
the types of issues that are encountered and changed the code to recognise and act
on them.