Friday, February 4th, 2011
Hi, I’m Phil. I work in the Operations and Foundations group, part of Online Services at Canonical. We work on keeping Ubuntu One up and humming along and on improving its core technologies.
I wanted to take a moment to apologize for the extension of our planned downtime on Tuesday morning. I did not anticipate the problem, and since it happened roughly three-quarters of the way through the total process, it wasn’t possible to roll back and restore the service to its previous state.
We had run through the upgrade process (database servers upgraded to 10.04, Postgres upgraded from 8.3 to 8.4, and a series of database patches rolled up and applied) across all our database servers, and everything ran well within the time window that we defined in our initial outage announcement. During the production upgrade, the last storage shard spent a surprisingly long time re-adding a foreign key constraint. I spent longer than I would have liked waiting the process out, thinking that we’d incur less total downtime than by starting that import over from scratch. That didn’t end up being a smart decision.
I eventually reached the conclusion that this process wasn’t going to complete, and the import process was restarted. Two hours later the import was complete, all servers were restarted, and the service was restored to 100% functionality. Every developer and admin involved heaved a pretty serious sigh of relief.
So what did I do wrong, and what could we do better next time? First, I’m going to do a much better job of scheduling downtime going forward. This was scheduled for a low-traffic period where we had standard developer and admin coverage on a weekday; that gave us only a small low-traffic window, and with the unexpected increase in process time, we quickly ran into prime time. I should have scheduled it for a Sunday evening, giving us a much longer low-traffic window to work in where a minimum of users would be disrupted.
Second, we’ll do a better job of trusting our math. It was clear something was wrong much earlier than when we finally pulled the trigger on restarting the process; I could have saved a couple of hours by trusting our initial analysis.
Finally, we’ll continue to work hard to extend our architecture to remove downtime and perform rolling upgrades. Perhaps zero downtime is an unrealistic expectation, but I’m going to make sure we get as close to that as we can.
Thanks so much for your patience. I hope you continue to enjoy the features we’ve added recently and the ones we have coming up for Ubuntu One.
- Philip Fibiger