Canonical Voices

What Sidnei da Silva talks about

Posts tagged with 'python'

Sidnei

As part of my job as Operations Engineer on the Ubuntu One team, I’m constantly looking for ways to improve the reliability and operational efficiency of the service as a whole.

Very high on my list of things to fix I have an item to look into Vaurien, and make some tweaks to the service to cope better with outages in other parts of the system.

As you probably realized by now if you’re somehow involved into the design and maintenance of any kind of distributed systems, network partitions are a big deal, and we’ve had some of those affect our service in the past, some with very interesting effects.

Take our application servers for example. They’ve been through many generations of rewrites, and switches from one WSGI server to another in the past (not long before I joined the team), each of them with a particular issue. Either they didn’t scale, or crashed constantly, or had memory leaks. Or maybe none of those (it was before I joined the team, so I wouldn’t know for sure). By the time I joined, Paste, of all the things, was one of the WSGI servers in use in part of the service, and Twisted WSGI was used in another part.

(The actual setup of those services is very interesting on itself. It’s a mix of Twisted and Django (and many others have done this before, so it’s not very unique. But there are internal details which are quite interesting. More below.)

Having moved from another team that used Twisted heavily, I decided to call it out and settle on Twisted WSGI, which seemed just fine.

As for the stability and memory issues, we started ironing them out one by one. Turns out the majority of the problems had nothing to do with the WSGI server itself, but everything to do with not cleaning up resources correctly, be it temporary files, file descriptors, and cycles between producers and consumers.

And everything was perfect.

But then we got a few networking issues and hardware issues. Some of the servers were eventually moved to a different datacenter and things got even more interesting. I’ll go into the details of the specific problems that I’m hoping to approach with Vaurien on a different post, but suffice to say that talking to many external services in a threaded server doesn’t get pretty when there’s a network blip.

So speaking of threaded and Twisted, and coming to the subject of this post.

In front of a subset of our services we currently have 4 HAProxy instances in different servers. They are all set up to use httpchk every 2 seconds, which by default sends an OPTIONS request to ‘/’. If you’re still following, we have a Django app running there, and depending on how you have your Django app configured, it might just take that OPTIONS request and treat it just like a GET, effectively (in our case) rendering a response just as if a normal browser had requested it. Turns out that page is not particularly lean in our case.

So you take 4 servers, effectively doing a GET request to your homepage every 2s each one, times many processes serving that page across a couple hosts, and you have a full plate for someone looking for things to optimize.

To make it more fun, early on I added monitoring of the thread pool used by Twisted WSGI, sending metrics to Graphite. Whenever we had a network blip we saw the queue growing and growing without bound. This was actually a combination of a couple things, which I’m still working on fixing:

  1. HAProxy will keep triggering the httpchk after the service is taken out of rotation.
  2. Twisted WSGI will keep accepting requests and throwing them in the thread pool queue, even if the thread pool is busy and the queue is building up.
  3. We do a terrible job at timing out connections to external services currently so a minor blip can easily cause the thread pool queue to build up.

As a strategy to alleviate that problem I came up with the following solution:

  1. Implement a custom thread pool that triggers a callback when going from busy -> free and from free -> busy (where busy is defined as: there are more requests queued than idle threads).
  2. Changed the response to the HAProxy httpchk to simply check that busy/free state.
  3. Changed the handling of that HAProxy check to *not* go through the thread pool.

(There’s a few more details that I won’t get into in this post, but that’s the high-level summary.)

I have good confidence that this will fix (or at least alleviate) the main issue, which is the queue growing without bounds in the thread pool, and it will instead move the queueing to HAProxy. But after looking through the metrics today I saw an unintended consequence of the changes.

busy-threads

load-avg

In retrospect, it seems fairly obvious that this was to be one of the expected outcomes. I was simply surprised to see it since it was not the immediate goal of the proposed changes, but simply a side effect of them.

I hope you enjoyed this glimpse into what goes on at the heart of my job. I expect to write more about this soon, and maybe explore some of the details that I didn’t get into, since this post is already too long.

Read more
Sidnei

Due to some unplanned traveling I ended up near the Bay Area last week, more specifically Canonical was holding an internal Cloud Sprint in Oakland, CA, and Martin asked me to participate and push our agenda for the upcoming click packages upload and download services, which need to be live by October at least on its simplest form. But I’ll tell you more about that in a separate post.

What I want to share with you today is the joy of being able to connect with old friends and recollect memories, as I mentioned I was longing for in my last post. In those few days I was in California I managed to catch up with Limi and Philipp, said an en passant hi to Rob Miller at the Mozilla SF office, had dinner with Gustavo, walked around the city with Fernando, Alberto, and Geoff, ending up at an amazing Chinese restaurant pretty much by accident, paid a visit to Marlon, who took me on a guided tour of the Facebook HQ followed by lunch at The Cheesecake Factory which I couldn’t refuse. It was exausting, but really great catching up with everyone!

A recurring topic between all of us was the general issues that all of our companies (Mozilla, Canonical, Facebook) have with general public perception. Most interestingly perhaps is the similarity between Canonical and Facebook when it comes down to privacy matters, how there seems to be a disconnect between the internal and external messaging on those matters, and how much the public perception is biased by the media and the very loud minority of privacy tinfoil hat zealots. I wish I could do more to help with solving that. Perhaps pushing for more transparency, better communication at least from the technical side of things could be a way to improve that.

Tech talks aside, I was simply overwhelmed by how much my kids’ pictures and videos are popular amongst friends. Every single person that I talked to was quick to mention that as the very first thing. Oddly, that generally does not reflect in likes and comments on those Facebook posts, which is an interesting observation. Are people generally afraid of clicking that Like link or is it too much effort for them? I’m sure it would do for a great usability study.

I hope to explore a bit more on the outcome of the sprint on a later post. Suffice to say that I was really glad to be present and contribute some feedback to all the planning that’s going into the next cycle, and the opportunity to meet some old friends while at it was invaluable. Looking forward to be doing more of that in the coming months, at FISL and PythonBrasil.

As an article I’ve read yesterday mentioned, we tech heads seem to live on a bubble that mostly bounces between social networks and having post work hours drinks with colleagues, usually from the same company. I wish we could all be more social in the physical world, and talk more about things that are not so tech-related. About life, and family, and non-work things, and enjoy ourselves more.

And headed straight into the shining sun.

Read more
Sidnei

Over the past few months, friends and family have been very curious about how my new job is going, and it’s been hard to stop for a moment and go into detail about it. I’ve been simply nodding and saying “It’s fine”.

This is an attempt at summarizing all the activity that happened in the last five months, though it’s far from being a short summary. If I had to pick a two words to describe my first five months at Canonical, it would be “Pure Awesomeness“. For a more detailed view, grab a cup of joe or your favorite other beverage and keep on reading.

Today is a special day. Exactly five months ago, on January 5th, I joined Canonical to work on the Landscape project.

It has been quite a ride so far, with two sprints with the Landscape team, AllHands and UDS in Barcelona just a week ago, and lots of excitement about the future. Saying that I’m completely stunned by the work everyone at Canonical has been putting together and how much the teams have grown in the last few months doesn’t do enough justice to it.

At AllHands and UDS we got a short preview of the things that are coming out in the next cycle and beyond. An example of that is the newly formed Design and User Experience (DUX) team which will not only be focusing on Ubuntu itself, but on many other areas across Canonical and the whole Open Source community in general.

At UDS, the DUX team had a special ‘booth’ where any person from any project could walk to them and get advice about their personal or favorite application. One person which has favorably used such advice already was Seif Lofty, from Gnome Zeitgeist. I met Seif during breakfast at the first day of UDS and I cannot describe how excited he was about simply *being* at UDS. And he was certainly twice as happy when he left.

Another person I had the joy to meet was David Siegel, of Gnome Do fame. We teamed during AllHands on the infamous “Fun In The Woods” activity, which had some people walking like zombies on the day after.

Even more importantly, I was able to meet many many more other colleagues from different teams and wrap up some loose ends from tasks that I got started during the first couple months. I’m definitely impressed by the amount of plain brilliant people that are part of Canonical as of today.

In a sense, being at a Canonical event is very much like being at a Plone conference. Everyone seems to be very receptive about new ideas and very friendly and laid back. And as a bonus, I was able to exercise my (not so secret now) power of throwing some crazy ideas around and see how they influence people. And man, I’m already impressed by the outcomes, just a week after the fact.

To me, this has been the most rewarding thing so far, to be able make big contributions not in lines of code, but in ideas that can make a concrete difference in the hands of the right people. This is something that can only be possible at a company the size Canonical is at the moment, where it’s just big enough that you can grab a mind or two to push an agenda without affecting the rest of the team and still small enough that you can influence decisions.

As my colleague Jamu would best describe, “I’m PUMPED!”. :)

But that’s not all. Software-wise, I was able to make some big contributions too. The Landscape team just finished a 6 month development cycle that brought many cool features to life. I’m really happy with that, and specially with the speed that this team can get features from the black board into reality. It’s also a much different environment than what I was used to, with very well-defined and refined processes for ensuring the overall quality of anything that is produced. One process that I’m specially enjoying is the requirement for two positive reviews before landing a branch. I hope to talk more about that soon (that is, sooner than 5 months from now *wink*).

As for my role in the team, it is quite different than what I’m used to. I’ve been focusing a lot on the UI aspects of Landscape, on ways to make things more obvious and more streamlined. I’ve been also writing a ton of Javascript, and collaborating with other teams to define better policies for Javascript testing in general. And finally, we will now have a person from the DUX team dedicated to working with us, which will push work on the Landscape UI even further.

I also had the chance of interacting with the Launchpad team, which has a much more refined process due to the size of their team. Over at Launchpad, I started a branch back in mid-December, even before starting at Canonical, to allow Launchpad to use the Chameleon Template Engine.

That was another wild ride, and during the course of this project I was able to contribute tons of fixes upstream to Malthe Borch to make Chameleon even more compatible with plain old ZPT. In fact, it is so compatible at the moment that due to the magic of z3c.ptcompat Launchpad will be able to run *both* Chameleon and ZPT with the flip of an environment variable. Even more stunning, the changes required to code were minimal, basically changing imports to use z3c.ptcompat, and in templates we’ve had to fix some non-XHTML compliant ones and remove unused i18n tags. I am happy to announce that this branch will soon be merged (it was submitted to PQM, successfully accepted and is waiting to land the buildbot queue). The bad news is that not all tests pass at the moment with Chameleon enabled, but we will be dogfooding and fixing those tests as we go. It was too much pain already to maintain a nearly 6 month old branch outside the main tree. ;)

I am really interested in many of the things that the Launchpad team is doing, process-wise. The PQM seems like a very nice idea for a bigger team like theirs, though it would probably be useful to our smaller team in Landscape too, and to others in general. Hopefully I will get a chance to explore it more and talk about it during the upcoming FISL 10, in Porto Alegre.

Lastly, but not least important, I’m also working on getting nightly builds of the Bzr Installer for Windows rolling, and a more streamlined process for the official builds. Karl Fogel, of Producing OSS fame, and our Launchpad Ombudsman is making sure I keep my promises about that, which is yet another great incentive.

All in all, there’s of course a ton of things I forgot to talk about and which happened in the last 5 months, but this post is already getting too long so I will stop right here and save some of the meat for a future one. Stay tuned!

Read more