About our recent upgrade and…downtime

Hi, I’m Phil. I work at Canonical in the Online Services group, on the Operations and Foundations team. We work on keeping Ubuntu One up and humming along and on improving its core technologies.

I wanted to take a moment to apologize for the extension of our planned downtime on Tuesday morning. I was unable to anticipate the problem, and since it happened roughly three-quarters of the way through the total process, it wasn’t possible to roll back and restore the service to its previous state.

We had run through the upgrade process (database servers upgraded to 10.04, Postgres upgraded from 8.3 to 8.4, and a series of database patches rolled up and applied) across all our database servers, and everything ran well within the time window that we defined in our initial outage announcement. During the production upgrade, the last storage shard spent a surprisingly long time re-adding a foreign key constraint. I spent longer than I would have liked hoping to wait the process out, thinking that waiting would mean less total downtime than starting that import over from scratch. That didn’t end up being a smart decision.

I eventually reached the conclusion that this process wasn’t going to complete, and the import process was restarted. Two hours later the import was complete, all servers were restarted, and the service was restored to 100% functionality. Every developer and admin involved heaved a pretty serious sigh of relief.

So what did I do wrong, or what could we do better next time? First, I’m going to do a much better job of scheduling downtime going forward. This was scheduled for a low-traffic period where we had standard developer and admin coverage on a weekday; that gave us only a small low-traffic window, and with the unexpected increase in process time, we quickly ran into prime time. I should have scheduled it for a Sunday evening, giving us a much longer low-traffic window to work in, where a minimum of users would be disrupted.

Second, we’ll do a better job of trusting our math. It was clear something was wrong much earlier than when we finally pulled the trigger on restarting the process; I could have saved a couple of hours by trusting our initial analysis.

Finally, we’ll continue to work hard to extend our architecture to remove downtime and perform rolling upgrades. Perhaps zero downtime is an unrealistic expectation, but I’m going to make sure we get as close to that as we can.

Thanks so much for your patience. I hope you continue to enjoy the features we’ve added recently and the ones still to come for Ubuntu One.

- Philip Fibiger

16 Responses to “About our recent upgrade and…downtime”

  1. Vince Says:

    No Worries! I love everything about Ubuntu! It’s the only O.S. running my machine. If we never made any mistakes, we’d never learn a thing.

  2. Gerhard Says:

    Hi Philip,

    Many thanks for this information. I really appreciate your openness. :-)

    Gerhard

  3. Kelton Says:

    Ditto what Gerhard said!

  4. Steve G Says:

    What a refreshing change, an honest description about the issues faced during the upgrade. MS and Apple should take a leaf from your book!

    Keep up the good work.

  5. Farth Says:

    You are great. I love that you work so fast and are so open to users. Thanks.

  6. Vulpine Says:

    Yeah, these things do happen. It’s only in fairyland that everything goes according to plans. Thanks for the info and the openness.

  7. Magnus Says:

    +1

  8. Christophe de la Fabrique du Multimédia Says:

    Sometimes bugs occur. This may upset users, but you and your staff were certainly more upset than us.

  9. Serge Says:

    Philip,
    thanks a lot for the detailed and honest report. It really builds trust and confidence.
    Thanks and best regards,

  10. Chauncellor Says:

    Don’t worry about it. I’m still alive. Learn from your mistakes, that’s what the best do.

  11. Paul T Says:

    ….Sorry to see what happened to you :>( about the down time. That’s the way it goes sometimes – just have to live with it. :>) Must be patient.

    ….I tried to sign up for ubuntu one several weeks ago but no joy. Just did it again — still no joy. I’ve got an account ( I think) but the only thing I can do is stare at the pages with noplace to go or things to do. I’d like to upload some files but there is no way to do it. The only way to get off the page is to enter a URL for some other web site.

    ….Not even a thank-you for doing it. Makes one feel like a real du-aaa for trying a second time and getting nowhere !!

    Paul.E.T

  12. Joel Says:

    Thank you. This blogpost should be a model for developers, as a reminder that an apology and an explanation can go a long long way.

  13. Chris Says:

    Thank you for being technical. This gives me a lot more confidence than the “Sorry, we fixed it” type of messages.

  14. Stephen Says:

    Nothing got lost so no worries. Perhaps more frequent updates but less included in the update to reduce potential conflicts within the update. Also if the downtime was at the same time every month or week then we would all realise why the downtime was happening and work around it.

  15. Stuart Says:

    WOW, how about that. An open, honest, and even human explanation for downtime. You would never get this from microflop or crapple. Nice work mate, Tuff luck on the update, it happens to the best of us :-)
    I am now a paying customer.

  16. Milf Soup Says:

    I’ve been visiting your blog for a while now and I always find a gem in your new posts. Thanks for sharing.
