Now that we've open sourced the code for Ubuntu One filesync, I thoughts I'd highlight some of the interesting challenges we had while building and scaling the service to several million users.
The teams that built the service were roughly split into two: the foundations team, who was responsible for the lowest levels of the service (storage and retrieval of files, data model, client and server protocol for syncing) and the web team, focused on user-visible services (website to manage files, photos, music streaming, contacts and Android/iOS equivalent clients).
I joined the web team early on and stayed with it until we shut it down, so that's where a lot of my stories will be focused on.
Today I'm going to focus on the challenge we faced when launching the Photos and Music streaming services. Given that by the time we launched them we had a few years of experience serving files at scale, our challenge turned out to be in presenting and manipulating the metadata quickly to each user, and be able to show the data in appealing ways to users (showing music by artist, genre and searching, for example). Photos was a similar story, people tended to have many thousands of photos and songs and we needed to extract metadata, parse it, store it and then be able to present it back to users quickly in different ways. Easy, right? It is, until a certain scale
Our architecture for storing metadata at the time was about 8 PostgreSQL master databases where we sharded metadata across (essentially your metadata lived on a different DB server depending on your user id) plus at least one read-only slave per shard. These were really beefy servers with a truck load of CPUs, more than 128GB of RAM and very fast disks (when reading this, remember this was 2009-2013, hardware specs seem tiny as time goes by!). However, no matter how big these DB servers got, given how busy they were and how much metadata was stored (for years, we didn't delete any metadata, so for every change to every file we duplicated the metadata) after a certain time we couldn't get a simple listing of a user's photos or songs (essentially, some of their files filtered by mimetype) in a reasonable time-frame (less than 5 seconds). As it grew we added caches, indexes, optimized queries and code paths but we quickly hit a performance wall that left us no choice but a much feared major architectural change. I say much feared, because major architectural changes come with a lot of risk to running services that have low tolerance for outages or data loss, whenever you change something that's already running in a significant way you're basically throwing out most of your previous optimizations. On top of that as users we expect things to be fast, we take it for granted. A 5 person team spending 6 months to make things as you expect them isn't really something you can brag about in the middle of a race with many other companies to capture a growing market.
In the time since we had started the project, NoSQL had taken off and matured enough for it to be a viable alternative to SQL and seemed to fit many of our use cases much better (webscale!). After some research and prototyping, we decided to generate pre-computed views of each user's data in a NoSQL DB (Cassandra), and we decided to do that by extending our existing architecture instead of revamping it completely. Given our code was pretty well built into proper layers of responsibility we hooked up to the lowest layer of our code,-database transactions- an async process that would send messages to a queue whenever new data was written or modified. This meant essentially duplicating the metadata we stored for each user, but trading storage for computing is usually a good trade-off to make, both in cost and performance. So now we had a firehose queue of every change that went on in the system, and we could build a separate piece of infrastructure who's focus would only be to provide per-user metadata *fast* for any type of file so we could build interesting and flexible user interfaces for people to consume back their own content. The stated internal goals were: 1) Fast responses (under 1 second), 2) Less than 10 seconds between user action and UI update and 3) Complete isolation from existing infrastructure.
Here's a rough diagram of how the information flowed throw the system:
It's a little bit scary when looking at it like that, but in essence it was pretty simple: write each relevant change that happened in the system to a temporary table in PG in the same transaction that it's written to the permanent table. That way you get transactional guarantees that you won't loose any data on that layer for free and use PG's built in cache that keeps recently added records cheaply accessible.
Then we built a bunch of workers that looked through those rows, parsed them, sent them to a persistent queue in RabbitMQ and once it got confirmation it was queued it would delete it from the temporary PG table.
Following that we took advantage of Rabbit's queue exchange features to build different types of workers that processes the data differently depending on what it was (music was stored differently than photos, for example).
Once we completed all of this, accessing someone's photos was a quick and predictable read operation that would give us all their data back in an easy-to-parse format that would fit in memory. Eventually we moved all the metadata accessed from the website and REST APIs to these new pre-computed views and the result was a significant reduction in load on the main DB servers, while now getting predictable sub-second request times for all types of metadata in a horizontally scalable system (just add more workers and cassandra nodes).
All in all, it took about 6 months end-to-end, which included a prototype phase that used memcache as a key/value store.
You can see the code that wrote and read from the temporary PG table if you branch the code and look under: src/backends/txlog/
The worker code, as well as the web ui is still not available but will be in the future once we finish cleaning it up to make it available. I decided to write this up and publish it now because I believe the value is more in the architecture rather than the code itself