Canonical Voices

Posts tagged with 'cloud'

niemeyer

Note: This is a candidate version of the specification. This note will be removed once v1 is closed, and any changes will be described at the end. Please get in touch if you’re implementing it.

Contents

  • Introduction
  • Supported values
  • Representation
  • Variable-length encoding
  • Reference implementation
  • Changes

Introduction

This specification defines strepr, a stable representation that enables computing hashes and cryptographic signatures out of a defined set of composite values that are commonly found across a number of languages and applications.

Although the defined representation is a serialization format, it isn’t meant to be used as a traditional one. It may not be seen entirely in memory at once, or written to disk, or sent across the network. Its role is specifically in aiding the generation of hashes and signatures for values that are serialized via other means (JSON, BSON, YAML, HTTP headers or query parameters, configuration files, etc).

The format is designed with the following principles in mind:

Understandable — The representation must be easy to understand to increase the chances of it being implemented correctly.

Portable — The defined logic works properly when the data is being transferred across different platforms and implementations, independently from the choice of protocol and serialization implementation.

Unambiguous — As a natural requirement for producing stable hashes, there is a single way to process any supported value being held in the native form of the host language.

Meaning-oriented — The stable representation holds the meaning of the data being transferred, not its type. For example, the number 7 must be represented in the same way whether it’s being held in a float64 or in a uint16.


Supported values

The following values are supported:

  • nil: the nil/null/none singleton
  • bool: the true and false singletons
  • string: raw sequence of bytes
  • integers: positive, zero, and negative integer numbers
  • floats: IEEE754 binary floating point numbers
  • list: sequence of values
  • map: associative value→value pairs


Representation

nil = 'z'

The nil/null/none singleton is represented by the single byte 'z' (0x7a).

bool = 't' / 'f'

The true and false singletons are represented by the bytes 't' (0x74) and 'f' (0x66), respectively.

unsigned integer = 'p' <value>

Positive and zero integers are represented by the byte 'p' (0x70) followed by the variable-length encoding of the number.

For example, the number 131 is always represented as {0x70, 0x81, 0x03}, independently from the type that holds it in the host language.

negative integer = 'n' <absolute value>

Negative integers are represented by the byte 'n' (0x6e) followed by the variable-length encoding of the absolute value of the number.

For example, the number -131 is always represented as {0x6e, 0x81, 0x03}, independently from the type that holds it in the host language.

string = 's' <num bytes> <bytes>

Strings are represented by the byte 's' (0x73) followed by the variable-length encoding of the number of bytes in the string, followed by the specified number of raw bytes. If the string holds a list of Unicode code points, the raw bytes must contain their UTF-8 encoding.

For example, the string hi is represented as {0x73, 0x02, 'h', 'i'}.
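
As a minimal Go sketch of that rule (not the reference implementation), assuming the string is short enough for its length to fit in a single length byte — the general case uses the variable-length encoding described further below:

package main

import "fmt"

// appendString encodes a short string as 's' + length + raw bytes.
// Lengths of 128 or more would need the full variable-length encoding.
func appendString(buf []byte, s string) []byte {
    if len(s) >= 128 {
        panic("sketch only handles short strings")
    }
    buf = append(buf, 's', byte(len(s)))
    return append(buf, s...)
}

func main() {
    fmt.Printf("%#v\n", appendString(nil, "hi")) // []byte{0x73, 0x2, 0x68, 0x69}
}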

Due to the complexity involved, Unicode normalization is not required by this specification. Consequently, Unicode strings that would be equal if normalized may have different stable representations.

binary float = 'd' <binary64>

32-bit or 64-bit IEEE754 binary floating point numbers that are not holding integers are represented by the byte 'd' (0x64) followed by the big-endian 64-bit IEEE754 binary floating point encoding of the number.

There are two exceptions to that rule:

1. If the floating point value is holding a NaN, it must necessarily be encoded by the following sequence of bytes: {0x64, 0x7f, 0xf8, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}. This ensures all NaN values have a single representation.

2. If the floating point value is holding an integer number it must instead be encoded as an unsigned or negative integer, as appropriate. Floating point values that hold integer numbers are defined as those where floor(v) == v && abs(v) != ∞.

For example, the value 1.1 is represented as {0x64, 0x3f, 0xf1, 0x99, 0x99, 0x99, 0x99, 0x99, 0x9a}, but the value 1.0 is represented as {0x70, 0x01}, and -0.0 is represented as {0x70, 0x00}.

This distinction means all supported numbers have a single representation, independently from the data type used by the host language and serialization format.
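
As a rough sketch of that rule (illustrative only, not the reference implementation; floatRep is a hypothetical name), integer-valued floats are routed to the integer encodings while everything else, including NaN, gets the 'd' form:

package main

import (
    "encoding/binary"
    "fmt"
    "math"
)

// floatRep returns the 'd' representation for floats that are not
// integer-valued, canonicalizing NaN; integer-valued floats must be
// encoded with 'p'/'n' instead (not shown here).
func floatRep(v float64) []byte {
    if math.Floor(v) == v && !math.IsInf(v, 0) {
        return nil // defer to the unsigned/negative integer encodings
    }
    bits := math.Float64bits(v)
    if math.IsNaN(v) {
        bits = 0x7ff8000000000000 // single canonical NaN
    }
    buf := make([]byte, 9)
    buf[0] = 'd' // 0x64
    binary.BigEndian.PutUint64(buf[1:], bits)
    return buf
}

func main() {
    fmt.Printf("% x\n", floatRep(1.1)) // 64 3f f1 99 99 99 99 99 9a
    fmt.Printf("% x\n", floatRep(1.0)) // (empty: 1.0 takes the integer path)
}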

list = 'l' <num items> [<item> ...]

Lists of values are represented by the byte 'l' (0x6c), followed by the variable-length encoding of the number of items in the list, followed by the stable representation of each item in the list in the original order.

For example, the value [131, -131] is represented as {0x6c, 0x02, 0x70, 0x81, 0x03, 0x6e, 0x81, 0x03}.

map = 'm' <num pairs> [<item key> <item value>  ...]

Associative maps of values are represented by the byte 'm' (0x6d) followed by the variable-length encoding of the number of pairs in the map, followed by an ordered sequence of the stable representation of each key and value in the map. The pairs must be sorted so that the stable representation of the keys is in ascending lexicographical order. A map must not have multiple keys with the same representation.

For example, the map {"a": 4, 5: "b"} is always represented as {0x6d, 0x02, 0x70, 0x05, 0x73, 0x01, 'b', 0x73, 0x01, 'a', 0x70, 0x04}.
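
The key ordering in that example follows from sorting the stable representations of the keys, as this small sketch (illustrative only) shows:

package main

import (
    "bytes"
    "fmt"
    "sort"
)

func main() {
    // Stable representations of the keys from the example above:
    // 5 => {0x70, 0x05}, "a" => {0x73, 0x01, 'a'}
    keys := [][]byte{
        {0x73, 0x01, 'a'},
        {0x70, 0x05},
    }
    // Pairs are emitted with the key representations in ascending
    // lexicographical order, so the pair for 5 comes before "a".
    sort.Slice(keys, func(i, j int) bool {
        return bytes.Compare(keys[i], keys[j]) < 0
    })
    fmt.Printf("%x\n", keys) // [7005 730161]
}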


Variable-length encoding

Integers are variable-length encoded so that they can be represented in short space and with unbounded size. In an encoded number, the last byte holds the 7 least significant bits of the unsigned value, and zero as the eighth bit. If there are remaining non-zero bits, the previous byte holds the next 7 bits, with the eighth bit set to flag the continuation to the next byte. The process continues while there are non-zero bits remaining. The most significant bits end up in the first byte of the encoded value, which must necessarily not be 0x80.

For example, the number 128 is variable-length encoded as {0x81, 0x00}.
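
A minimal Go sketch of that encoding (illustrative, not the reference implementation) might look like this:

package main

import "fmt"

// appendUvarint encodes an unsigned value as described above: the last
// byte carries the 7 least significant bits with the eighth bit clear,
// and earlier bytes carry higher bits with the eighth bit set.
func appendUvarint(buf []byte, v uint64) []byte {
    groups := []byte{byte(v & 0x7f)}
    for v >>= 7; v > 0; v >>= 7 {
        groups = append(groups, byte(v&0x7f)|0x80)
    }
    // Reverse so the most significant group comes first.
    for i := len(groups) - 1; i >= 0; i-- {
        buf = append(buf, groups[i])
    }
    return buf
}

func main() {
    fmt.Printf("%#v\n", appendUvarint(nil, 128)) // []byte{0x81, 0x0}
    fmt.Printf("%#v\n", appendUvarint(nil, 131)) // []byte{0x81, 0x3}
}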


Reference implementation

A reference implementation is available, including a test suite which should be considered when implementing the specification.


Changes

draft1 → draft2

  • Enforce the use of UTF-8 for Unicode strings and explain why normalization is being left out.
  • Enforce a single NaN representation for floats.
  • Explain that map key uniqueness refers to the representation.
  • Don’t claim the specification is easy to implement; floats require attention.
  • Mention reference implementation.

niemeyer

The very first time the concepts behind the juju project were presented, by then still under the prototype name of Ubuntu Pipes, was about four years ago, in July of 2009. It was a short meeting with Mark Shuttleworth, Simon Wardley, and myself, when Canonical still had an office on a tall building by the Thames. That was just the seed of a long road of meetings and presentations that eventually led to the codification of these ideas into what today is a major component of the Ubuntu strategy on servers.

Despite having covered the core concepts many times in those meetings and presentations, it recently occurred to me that they were never properly written down in any reasonable form. This is an omission that I’ll attempt to fix with this post while still holding the proper context in mind and while things haven’t changed too much.

It’s worth noting that I stepped aside as the project’s technical lead in January, which makes it more likely that some of these ideas will take a turn, but they are still of historical value, and true for the time being.

Contents

This post is long enough to deserve an index, but these sections do build up concepts incrementally, so for a full understanding sequential reading is best:

  • Classical deployments
  • Preparing a blank slate
  • Service topologies
  • Scaling services horizontally
  • Charms
  • Relations
  • Configuration
  • Taking from here

Classical deployments

In a simplistic sense, deploying an application means configuring and running a set of processes in one or more machines to compose an integrated system. This procedure includes not only configuring the processes for particular needs, but also appropriately interconnecting the processes that compose the system.

The following figure depicts a simple example of such a scenario, with two frontend machines that had the Wordpress software configured on them to serve the same content out of a single backend machine running the MySQL database.

Deploying even that simple environment already requires the administrator to deal with a variety of tasks, such as setting up physical or virtual machines, provisioning the operating system, installing the applications and the necessary dependencies, configuring web servers, configuring the database, configuring the communication across the processes including addresses and credentials, firewall rules, and so on. Then, once the system is up, the deployed system must be managed throughout its whole lifecycle, with upgrades, configuration changes, new services integrated, and more.

The lack of a good mechanism to turn all of these tasks into high-level operations that are convenient, repeatable, and extensible, is what motivated the development of juju. The next sections provide an overview of how these problems are solved.


Preparing a blank slate

Before diving into the way in which juju environments are organized, a few words must be said about what a juju environment is in the first place.

All resources managed by juju are said to be within a juju environment, and such an environment may be prepared by juju itself as long as the administrator has access to one of the supported infrastructure providers (AWS, OpenStack, MAAS, etc).

In practice, creating an environment is done by running juju’s bootstrap command:

$ juju bootstrap

This will start a machine in the configured infrastructure provider and prepare the machine for running the juju state server to control the whole environment. Once the machine and the state server are up, they’ll wait for future instructions that are provided via follow up commands or alternative user interfaces.


Service topologies

The high-level perspective that juju takes about an environment and its lifecycle is similar to the perspective that a person has about them. For instance, although the classical deployment example provided above is simple, the mental model that describes it is even simpler, and consists of just a couple of communicating services:

That’s pretty much the model that an administrator using juju has to input into the system for that deployment to be realized. This may be achieved with the following commands:

$ juju deploy cs:precise/wordpress
$ juju deploy cs:precise/mysql
$ juju add-relation wordpress mysql

These commands will communicate with the previously bootstrapped environment, and will input into the system the desired model. The commands themselves don’t actually change the current state of the deployed software, but rather inform the juju infrastructure of the state that the environment should be in. After the commands take place, the juju state server will act to transform the current state of the deployment into the desired one.

In the example described, for instance, juju starts by deploying two new machines that are able to run the service units responsible for Wordpress and MySQL, and configures the machines to run agents that manipulate the system as needed to realize the requested model. An intermediate stage of that process might conceptually be represented as:

topology-step-1

The service units are then provided with the information necessary to configure and start the real software that is responsible for the requested workload (Wordpress and MySQL themselves, in this example), and are also provided with a mechanism that enables service units that were related together to easily exchange data such as addresses, credentials, and so on.

At this point, the service units are able to realize the requested model:

topology-step-2

This is close to the original scenario described, except that there’s a single frontend machine running Wordpress. The next section details how to add that second frontend machine.


Scaling services horizontally

The next step to match the original scenario described is to add a second service unit that can run Wordpress, and that can be achieved by the single command:

$ juju add-unit wordpress

No further commands or information are necessary, because the juju state server understands what the model of the deployment is. That model includes both the configuration of the involved services and the fact that units of the wordpress service should talk to units of the mysql service.

This final step makes the deployed system look equivalent to the original scenario depicted:

topology-step-3

Although that is equivalent to the classic deployment first described, as hinted by these examples an environment managed by juju isn’t static. Services may be added, removed, reconfigured, upgraded, expanded, contracted, and related together, and these actions may take place at any time during the lifetime of an environment.

The way that the service reacts to such changes isn’t enforced by the juju infrastructure. Instead, juju delegates service-specific decisions to the charm that implements the service behavior, as described in the following section.


Charms

A juju-managed environment wouldn't be nearly as interesting if everything it could do were constrained by preconceived ideas that the juju developers had about what services should be supported and how they should interact among themselves and with the world.

Instead, the activities within a service deployed by juju are all orchestrated by a juju charm, which is generally named after the main software it exposes. A charm is defined by its metadata, one or more executable hooks that are called after certain events take place, and optionally some custom content.

The charm metadata contains basic declarative information, such as the name and description of the charm, relationships the charm may participate in, and configuration options that the charm is able to handle.

The charm hooks are executable files with well-defined names that may be written in any language. These hooks are run non-concurrently to inform the charm that something happened, and they give the charm a chance to react to such events in arbitrary ways. There are hooks to inform the charm that the service is supposed to be first installed, or started, or configured, or that a relation was joined or departed, and so on.

This means that in the previous example the service units depicted are in fact reporting relevant events to the hooks that live within the wordpress charm, and those hooks are the ones responsible for bringing the Wordpress software and any other dependencies up.

wordpress-service-unit

The interface offered by juju to the charm implementation is the same, independently from which infrastructure provider is being used. As long as the charm author takes some care, one can create entire service stacks that can be moved around among different infrastructure providers.


Relations

In the examples above, the concept of service relationships was introduced naturally, because it’s indeed a common and critical aspect of any system that depends on more than a single process. Interestingly, despite it being such a foundational idea, most management systems in fact pay little attention to how the interconnections are modeled.

With juju, it’s fair to say that service relations were part of the system since inception, and have driven the whole mindset around it.

Relations in juju have three main properties: an interface, a kind, and a name.

The relation interface is simply a unique name that represents the protocol that is conventionally followed by the service units to exchange information via their respective hooks. As long as the name is the same, the charms are assumed to have been written in a compatible way, and thus the relation is allowed to be established via the user interface. Relations with different interfaces cannot be established.

The relation kind informs whether a service unit that deploys the given charm will act as a provider, a requirer, or a peer in the relation. Providers and requirers are complementary, in the sense that a service that provides an interface can only have that specific relation established with a service that requires the same interface, and vice-versa. Peer relations are automatically established internally across the units of the service that declares the relation, and make it easy to cluster these units together to set up masters and slaves, rings, or any other structural organization that the underlying software supports.

The relation name uniquely identifies the given relation within the charm, and allows a single charm (and service and service units that use it) to have multiple relations with the same interface but different purposes. That identifier is then used in hook names relative to the given relation, user interfaces, and so on.
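
Purely as an illustration of those three properties (a real charm declares them in its metadata, not in Go, and the names below are hypothetical), a relation declaration can be thought of as:

package main

import "fmt"

// RelationDecl models the three properties discussed above.
type RelationDecl struct {
    Name      string // local name, unique within the charm, e.g. "db"
    Interface string // conventional protocol name, e.g. "mysql"
    Kind      string // "provider", "requirer", or "peer"
}

func main() {
    wordpress := RelationDecl{Name: "db", Interface: "mysql", Kind: "requirer"}
    mysql := RelationDecl{Name: "server", Interface: "mysql", Kind: "provider"}
    // The relation may be established because both sides agree on the
    // interface and the kinds are complementary.
    fmt.Println(wordpress.Interface == mysql.Interface) // true
}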

For example, the two communicating services described in examples might hold relations defined as:

wordpress-mysql-relation-details

When that service model is realized, juju will eventually inform all service units of the wordpress service that a relation was established with the respective service units of the mysql service. That event is communicated via hooks being called on both units, in a way resembling the following representation:

wordpress-mysql-relation-workflow

As depicted above, such an exchange might take the following form:

  1. The administrator establishes a relation between the wordpress service and the mysql service, which causes the service units of these services (wordpress/1 and mysql/0 in the example) to relate.
  2. Both service units concurrently call the relation-joined hook for the respective relation. Note that the hook is named after the local relation name for each unit. Given the conventions established for the mysql interface, the requirer side of the relation does nothing, and the provider side hands over the credentials and database name that should be used.
  3. The requirer side of the relation is informed that relation settings have changed via the relation-changed hook. This hook implementation may pick up the provided settings and configure the software to talk to the remote side.
  4. The Wordpress software itself is run, and establishes the required TCP connection to the configured database.

In that workflow, neither side knows for sure what service it is being related to. It would be feasible (and probably welcome) to have the mysql service replaced by a mariadb service that provided a compatible mysql interface, and the wordpress charm wouldn’t have to be changed to communicate with it.

Also, although this example and many real world scenarios will have relations reflecting TCP connections, this may not always be the case. It’s reasonable to have relations conveying any kind of metadata across the related services.


Configuration

Service configuration follows the same model of metadata plus executable hooks that was described above for relations. A charm can declare what configuration settings it expects in its metadata, and how to react to setting changes in an executable hook named config-changed. Then, once a valid setting is changed for a service, all of the respective service units will have that hook called to reflect the new configuration.

Changing a service setting via the command line may be as simple as:

$ juju set wordpress title="My Blog"

This will communicate with the juju state server, record the new configuration, and consequently prompt the service units to realize the new configuration as described. For clarity, this process may be represented as:

config-changed


Taking from here

This conceptual overview hopefully provides some insight into the original thinking that went into designing the juju project. For more in-depth information on any of the topics covered here, the following resources are good starting points:

niemeyer

Today ubuntufinder.com was updated with the latest image data for Ubuntu 13.04 and all the previous releases as well. Rather than simply hardcoding the values again, though, the JavaScript code was changed so that it imports the new JSON-based feeds that Canonical has been publishing for the official Ubuntu images that are available in EC2, thanks to recent work by Scott Moser. This means the site is always up-to-date, with no manual actions.

Although the new feeds made that quite straightforward, there was a small detail to sort out: the Ubuntu Finder is visually dynamic, but it is actually a fully static web site served from S3, and the JSON feeds are served from the Canonical servers. This means the same-origin policy won’t allow that kind of cross-domain import to be easily done without further action.

The typical workaround for this kind of situation is to put a tiny proxy within the site server to load the JSON and dispatch it to the browser from the same origin. Unfortunately, this isn’t an option in this case because there’s no custom server backing the data. There’s a similar option that actually works, though: deploying that tiny proxy server in some other corner and forwarding the JSON payload as JSONP or with cross-origin resource sharing enabled, so that browsers can bypass the same-origin restriction, and that’s what was done.

Rather than once again doing a special tiny server for that one service, though, this time around a slightly more general tool has emerged, and as an experiment it has been put live so anyone can use it. The server logic is pretty simple, and the idea is even simpler. Using the services from jsontest.com as an example, the following URL will serve a JSON document that can only be loaded from a page that is in a location allowed by the same-origin policy:

If one wanted to load that page from a different location, it might be transformed into a JSONP document by loading it from:

Alternatively, modern browsers that support the cross-origin resource sharing can simply load pure JSON by omitting the jsonpeercb parameter. The jsonpeer server will emit the proper header to allow the browser to load it:

This service is backed by a tiny Go server that lives in App Engine so it’s fast, secure (hopefully), and maintenance-less.
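
The deployed code isn’t reproduced here, but the core idea can be sketched as a plain net/http handler; this is only an illustration under assumed names (the proxy handler and the example.com upstream are hypothetical), not the actual jsonpeer source:

package main

import (
    "encoding/json"
    "fmt"
    "io"
    "net/http"
    "regexp"
)

// Callback names are restricted to a safe character set.
var callbackRe = regexp.MustCompile(`^[_.a-zA-Z0-9]+$`)

func proxy(w http.ResponseWriter, r *http.Request) {
    // Hypothetical mapping of the request path onto an upstream URL.
    upstream := "http://example.com" + r.URL.Path
    resp, err := http.Get(upstream)
    if err != nil {
        http.Error(w, err.Error(), http.StatusBadGateway)
        return
    }
    defer resp.Body.Close()
    // Cap the upstream size and require valid JSON.
    body, err := io.ReadAll(io.LimitReader(resp.Body, 500*1024))
    if err != nil || !json.Valid(body) {
        http.Error(w, "upstream did not return valid JSON", http.StatusBadGateway)
        return
    }
    if cb := r.FormValue("jsonpeercb"); cb != "" {
        // JSONP: wrap the document in the requested callback.
        if !callbackRe.MatchString(cb) {
            http.Error(w, "invalid callback name", http.StatusBadRequest)
            return
        }
        w.Header().Set("Content-Type", "application/javascript")
        fmt.Fprintf(w, "%s(%s);", cb, body)
        return
    }
    // Default: plain JSON with cross-origin resource sharing enabled.
    w.Header().Set("Access-Control-Allow-Origin", "*")
    w.Header().Set("Content-Type", "application/json")
    w.Write(body)
}

func main() {
    http.HandleFunc("/", proxy)
    http.ListenAndServe(":8080", nil)
}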

Some further details about the service:

  • Results are JSON with cross-origin resource sharing by default
  • With a query parameter jsonpeercb=<callback name>, results are JSONP
  • The callback name must consist of characters in the set [_.a-zA-Z0-9]
  • Query parameters provided to jsonpeer are used when doing the upstream request
  • HTTP headers are discarded in both directions
  • Results are cached for 5 minutes on memcache before being re-fetched
  • Upstream results must be valid JSON
  • Upstream results must have Content-Type application/json or text/plain
  • Upstream results must be under 500kb
  • Both http and https work; just tweak the URL and the path accordingly

Have fun if you need it, and please get in touch before abusing it.

UPDATE: The service and blog post were tweaked so that it defaults to returning plain JSON with CORS enabled, thanks to a suggestion by James Henstridge.

niemeyer

A small and fun experiment is out:

Gustavo Niemeyer

About a year after development started on Ensemble, today the stars finally aligned just the right way (review queue mostly empty, no other pressing needs, etc.) for me to start writing the specification for the repository system we’ve been jointly planning for a long time. This is the system that the Ensemble client will communicate with for discovering which formulas are available, for publishing new formulas, for obtaining formula files for deployment, and so on.

We of course would have liked for this part of the project to have been specified and written a while ago, but unfortunately that wasn’t possible for several reasons. That said, there are also good sides to having an important piece flying around in minds and conversations for such a long time: sitting down to specify the system and describe the inner-working details has been a breeze. Even details such as the namespacing of formulas, which hadn’t been entirely clear in my mind, were just streamed into the document as the ideas we’ve been evolving finally came together in written form.

One curious detail: this is the first long term project at Canonical that will be developed in Go, rather than Python or C/C++, which are the most used languages for projects within Canonical. Not only that, but we’ll also be using MongoDB for a change, rather than the traditional PostgreSQL, and will also use (you guessed) the mgo driver which I’ve been pushing entirely as a personal project for about 8 months now.

Naturally, with so many moving parts that are new to the company culture, this is still being seen as a closely watched experiment. Still, this makes me highly excited, because when I started developing mgo, the MongoDB driver for Go, my hopes that the Go, MongoDB, and mgo trio would eventually be used at Canonical were very low, precisely because they were all alien to the culture. We only got here after quite a lot of internal debate, experiments, and trust too.

All of that means these are happy times. Important feature in Ensemble being specified and written, very exciting tools, home grown software being useful..

Awesomeness.

niemeyer

One more Go library oriented towards building distributed systems hot off the presses: govclock. This one offers full vector clock support for the Go language. Vector clocks allow recording and analyzing the inherent partial ordering of events in a distributed system in a comfortable way.

The following features are offered by govclock, in addition to basic event tracking:

  • Compact serialization and deserialization
  • Flexible truncation (min/max entries, min/max update time)
  • Unit-independent update times
  • Traditional merging
  • Fast and memory efficient

If you’d like to know more about vector clocks, the Basho guys did a great job in the following pair of blog posts:

The following sample program demonstrates some sequential and concurrent events, dumping and loading, as well as merging of clocks. For more details, please look at the web page. The project is available under a BSD license.


package main

import (
    "launchpad.net/govclock"
    "fmt"
)

func main() {
    vc1 := govclock.New()
    vc1.Update([]byte("A"), 1)

    vc2 := vc1.Copy()
    vc2.Update([]byte("B"), 0)

    fmt.Println(vc2.Compare(vc1, govclock.Ancestor))   // => true
    fmt.Println(vc1.Compare(vc2, govclock.Descendant)) // => true

    vc1.Update([]byte("C"), 5)


    fmt.Println(vc1.Compare(vc2, govclock.Descendant)) // => false
    fmt.Println(vc1.Compare(vc2, govclock.Concurrent)) // => true

    vc2.Merge(vc1)

    fmt.Println(vc1.Compare(vc2, govclock.Descendant)) // => true

    data := vc2.Bytes()
    fmt.Printf("%#v\n", string(data))
    // => "\x01\x01\x01\x01A\x01\x01\x01B\x01\x00\x01C"

    vc3, err := govclock.FromBytes(data)
    if err != nil { panic(err.String()) }

    fmt.Println(vc3.Compare(vc2, govclock.Equal))      // => true
}

Gustavo Niemeyer

A bit of history

I don’t know exactly why, but I’ve always enjoyed IRC bots. Perhaps it’s the fact that it emulates a person in an easy-to-program way, or maybe it’s about having a flexible and shared “command line” tool, or maybe it’s just the fact that it helps people perceive things in an asynchronous way without much effort. Probably a bit of everything, actually.

My bot programming started with pybot many years ago, when I was still working at Conectiva. Besides having many interesting features, this bot eventually fell into an abandonware state, since Canonical already had pretty much equivalent features available when I joined, and I had other interests which got in the way. The code was a bit messy as well.. it was a time when I wasn’t very used to testing software properly (a friend has a great excuse for that kind of messy software: “I was young, and needed the money!”).

Then, a couple of years ago, while working in the Landscape project, there was an opportunity to make some information more visible to the team. Coincidentally, it was also a time when I wanted to get some practice with the concepts of Erlang, so I decided to write a bot from scratch with some nice support for plugins, just to get a feeling of how the promised stability of Erlang actually took place for real. This bot is called mup (Mup Pet, more formally), and its code is available publicly through Launchpad.

This was a nice experiment indeed, and I did learn quite a bit about the ins and outs of Erlang with it. Somewhat unexpected, though, was the fact that the bot grew up a few extra features which multiple teams in Canonical started to appreciate. This was of course very nice, but it also made it more obvious that the egocentric reason for having a bot written in Erlang would now hurt, because most of Canonical’s own coding is done in Python, and that’s what internal tools should generally be written in for everyone to contribute and help maintaining the code.

That’s where the desire of migrating mup into a Python-based brain again came from, and having a new feature to write was the perfect motivator for this.

LDAP and two-way SMSing over IRC

Canonical is a very distributed company. Employees are distributed over dozens of countries, literally. Not only that, but most people also work from their homes, rather than in an office. Many different countries also means many different timezones, and working from home with people from different timezones means flexible timing. All of that means communication gets… well.. interesting.

How do we reach someone that should be in an online meeting and is not? Or someone that is traveling to get to a sprint? Or how can someone that has no network connectivity reach an IRC channel to talk to the team? There are probably several answers to this question, but one of them is of course SMS. It’s not exactly cheap if we consider the cost of the data being transferred, but pretty much everyone has a mobile phone which can do SMS, and the model is not that far away from IRC, which is the main communication system used by the company.

So, the itch was itching. Let’s scratch it!

Getting the mobile phone of employees was already a solved problem for mup, because it had a plugin which could interact with the LDAP directory, allowing people to do something like this:

<joe> mup: poke gustavo
<mup> joe: niemeyer is Gustavo Niemeyer <…@canonical.com> <time:…> <mobile:…>

This just had to be migrated from Erlang into a Python-based brain for the reasons stated above. This time, though, there was no reason to write something from scratch. I could even have used pybot itself, but there was also supybot, an IRC bot which started around the same time I wrote the first version of pybot, and unlike the latter, supybot’s author was much more diligent in evolving it. There is quite a comprehensive list of plugins for supybot nowadays, and it includes means for testing plugins and so on. The choice of using it was straightforward, and getting “poke” support ported into a plugin wasn’t hard at all.

So, on to SMSing. Canonical already had a contract with an SMS gateway company which we established to test-drive some ideas on Landscape. With the mobile phone numbers coming out of the LDAP directory in hands and an SMS contract established, all that was needed was a plugin for the bot to talk to the SMS gateway. That “conversation” with the SMS gateway allows not only sending messages, but also receiving SMS messages which were sent to a specific number.

In practice, this means that people who are connected to IRC can very easily deliver an SMS to someone using their nicks. Something like this:

<joe> @sms niemeyer Where are you? We’re waiting!

And this would show up in the mobile screen as:

joe> Where are you? We’re waiting!

In addition to this, people who have no connectivity can also contact individuals and channels on IRC, with mup working as a middle man. The message would show up on IRC in a similar way to:

<mup> [SMS] <niemeyer> Sorry, the flight was delayed. Will be there in 5.

The communication from the bot to the gateway happens via plain HTTPS. The communication back is a bit more complex, though. There is a small proxy service deployed in Google App Engine to receive messages from the SMS gateway. This was done to avoid losing messages when the bot itself is taken down for maintenance. The SMS gateway doesn’t handle this case very well, so it’s better to have something which will be up most of the time buffering messages.

A picture is worth 210 words, so here is a simple diagram explaining how things got linked together:

This is now up for experimentation, and so far it’s working nicely. I’m hoping that in the next few weeks we’ll manage to port the rest of mup into the supybot-based brain.

Gustavo Niemeyer

Recovering a bootable EBS image

Scott Moser has just announced this week that the new Ubuntu images which boot out of an EBS-based root filesystem in EC2, and thus will persist across reboots, are available for testing.

As usual with something that just left the oven and is explicitly labeled for testing purposes, there was a minor bug in the first iteration of images which was even mentioned in the announcement itself. The bug, if not worked around as specified in the announcement, will prevent the image from rebooting.

Having a bootable EBS image which can’t reboot is quite an interesting (and ironic) problem. You have an image which persists, but suddenly you have no way to see what is inside the image anymore because you can’t boot it. Naturally, even if the said bug didn’t exist in the first place, it’s fairly easy to get into such a situation accidentally if you’re fiddling with the image configuration.

So, in this post we’ll see how to recover from a situation where a bootable EBS image can’t boot.

Getting started

To start this up, we’ll boot one of the EBS images which Scott mentioned in his announcement: ami-8bec03e2. As we see in the output of ec2-describe-images, this is an EBS-based image for i386:

% ec2-describe-images ami-8bec03e2
IMAGE ami-8bec03e2 099720109477/ebs/ubuntu-images-testing/ubuntu-lucid-daily-i386-server-20100305 099720109477 available public i386 machine aki-3fdb3756 ebs
BLOCKDEVICEMAPPING /dev/sda1 snap-f1efd098 15

Let’s run this image. Remember to replace the value passed in the -k command line option with your own key pair name.

% ec2-run-instances -k gsg-keypair ami-8bec03e2
RESERVATION r-9e4615f6 626886203892 default
INSTANCE i-e3e33a88 ami-8bec03e2 pending gsg-keypair 0 m1.small 2010-03-09T20:04:12+0000 us-east-1c aki-3fdb3756 monitoring-disabled ebs

There we go. We got an instance allocated in the availability zone us-east-1c. It’s important to keep track of this information, since EBS volumes are zone-specific.

As part of the above command, we must have been allocated an EBS volume automatically, and it should be attached to the instance we just started. We can investigate it with the ec2-describe-volumes command:

% ec2-describe-volumes
VOLUME vol-edca1684 15 snap-f1efd098 us-east-1c in-use 2010-03-09T20:04:20+0000
ATTACHMENT vol-edca1684 i-e3e33a88 /dev/sda1 attached 2010-03-09T20:04:24+0000

Now, we’ll get into the running instance and do some arbitrary modifications, just as a way to demonstrate that the data we don’t want to lose actually survives the recovering operation. Note that the domain name is obtained with the ec2-describe-instances command.

% ssh -i ~/.ssh/id_dsa_gsg-keypair ubuntu@ec2-184-73-51-147.compute-1.amazonaws.com
(…)

ubuntu@domU-12-31-39-0E-A0-03:~$ echo "Important data" > important-data
ubuntu@domU-12-31-39-0E-A0-03:~$ ls -l important-data
-rw-r--r-- 1 ubuntu ubuntu 15 Mar 9 20:15 important-data

ubuntu@domU-12-31-39-0E-A0-03:~$ sudo reboot
Broadcast message from ubuntu@domU-12-31-39-0E-A0-03
(/dev/pts/0) at 20:18 …
The system is going down for reboot NOW!

Note that we didn’t actually fix the problem reported by Scott, so our machine won’t really reboot. If we wait a while, we can even see that the problem is exactly what was reported in the announcement (note it really takes a bit for the output to be synced up):

% ec2-get-console-output i-e3e33a88 | tail -4
mount: special device ephemeral0 does not exist
mountall: mount /mnt [294] terminated with status 32
mountall: Filesystem could not be mounted: /mnt

Alright, now what? Machine is dead.. and can’t reboot. How do we get to our important data?

Fixing the problem

The first thing we do is to stop the instance. Do not terminate it, or you’ll lose the EBS volume! After stopping it, we’ll detach the EBS volume that was being used as the root filesystem, so that we can attach it somewhere else.

% ec2-stop-instances i-e3e33a88
INSTANCE i-e3e33a88 running stopping

% ec2-detach-volume vol-edca1684
ATTACHMENT vol-edca1684 i-e3e33a88 /dev/sda1 detaching 2010-03-09T20:04:22+0000

Now, we need to attach this volume to an instance which actually boots, so that we can fix it. For this experiment, we’ll pick one of the daily Lucid images, but we could use any other working image really. Just bear in mind that the image must be running in the same availability zone as our previous instance, since the EBS volume won’t be accessible otherwise.

% ec2-run-instances -k gsg-keypair -z us-east-1c ami-b5f619dc
RESERVATION r-967427fe 626886203892 default
INSTANCE i-fd08d196 ami-b5f619dc pending gsg-keypair 0 m1.small 2010-03-09T21:10:11+0000 us-east-1c aki-3fdb3756 monitoring-disabled instance-store

% ec2-attach-volume vol-edca1684 -i i-fd08d196 -d /dev/sdh1
ATTACHMENT vol-edca1684 i-fd08d196 /dev/sdh1 attaching 2010-03-09T21:10:51+0000

With the instance running and the EBS root device attached with an alternative device name, we can then login to fix the original problem which prevented the image from booting correctly. In our case, we’ll simply do what Scott suggested in the announcement.

% ssh -i ~/.ssh/id_dsa_gsg-keypair ubuntu@ec2-204-236-194-196.compute-1.amazonaws.com
(…)
$ mkdir ebs-root
$ sudo mount /dev/sdh1 ebs-root
$ sudo sed -i 's/^ephemeral0/#ephemeral0/' ebs-root/etc/fstab
$ sudo umount ebs-root
$ logout
Connection to ec2-204-236-194-196.compute-1.amazonaws.com closed.

Done! Our EBS volume is now correct, and it should boot alright. We’ll detach the volume from the temporary instance we created, and will reattach it back to the old bootable EBS instance which is stopped. Note that we won’t yet terminate the temporary instance, because we may need it in case something else is still wrong, and we are already paying to use it for the hour anyway. We just have to remind ourselves to terminate it once we’re fully done.

% ec2-detach-volume vol-edca1684
ATTACHMENT vol-edca1684 i-fd08d196 /dev/sdh1 detaching 2010-03-09T21:10:51+0000

% ec2-attach-volume vol-edca1684 -i i-e3e33a88 -d /dev/sda1
ATTACHMENT vol-edca1684 i-e3e33a88 /dev/sda1 attaching 2010-03-09T21:24:55+0000

% ec2-describe-volumes vol-edca1684
VOLUME vol-edca1684 15 snap-f1efd098 us-east-1c in-use 2010-03-09T20:04:20+0000
ATTACHMENT vol-edca1684 i-e3e33a88 /dev/sda1 attached 2010-03-09T21:24:55+0000

Okay! It should all be good now. It’s time to restart our instance, and see if it is working. Note that since you stopped and started the instance, the public domain name most probably has changed, and thus we need to find it out again with ec2-describe-instances once the instance is running.

% ec2-start-instances i-e3e33a88
INSTANCE i-e3e33a88 stopped pending

% ec2-describe-instances i-e3e33a88
RESERVATION r-9e4615f6 626886203892 default
INSTANCE i-e3e33a88 ami-8bec03e2 ec2-184-73-72-214.compute-1.amazonaws.com domU-12-31-39-03-B8-21.compute-1.internal running gsg-keypair 0 m1.small 2010-03-09T21:28:43+0000 us-east-1c aki-3fdb3756 monitoring-disabled 184.73.72.214 10.249.187.207 ebs
BLOCKDEVICE /dev/sda1 vol-edca1684 2010-03-09T21:24:55.000Z

% ssh -i ~/.ssh/id_dsa_gsg-keypair ubuntu@ec2-184-73-72-214.compute-1.amazonaws.com
(…)
$ cat important-data
Important data

$ logout
Connection to ec2-184-73-72-214.compute-1.amazonaws.com closed.

It worked, and our important data is still there!

Don’t forget to kill the temporary instance you’ve used to fix it after you’re comfortable with the result:

% ec2-terminate-instances i-fd08d196
INSTANCE i-fd08d196 running shutting-down

Conclusion

Concluding, in this post we have seen how to fix a bootable EBS machine which can’t actually boot. The technique consists of detaching the volume from the stopped instance, attaching it to a temporary instance, fixing the image, and then reattaching it back to the original image. This back and forth of EBS volumes is quite useful in many circumstances, so keep it in your tool belt.

Gustavo Niemeyer

Some interesting changes have been happening in my professional life, so I wanted to share them here to update friends and also for me to keep track of things over time (at some point I will be older and will certainly laugh at what I called “interesting changes” in the ol’days). Given the goal, I apologize but this may come across as more egocentric than usual, so please feel free to jump over to your next blog post at any time.

It’s been a little more than four years since I left Conectiva / Mandriva and joined Canonical, in August of 2005. Shortly after I joined, I had the luck of spending a few months working on the different projects which the company was pushing at the time, including Launchpad, then Bazaar, then a little bit on some projects which didn’t end up seeing much light. It was a great experience by itself, since all of these projects were abundant in talent. Following that, in the beginning of 2006, counting on the trust of people who knew more than I did, I was requested/allowed to lead the development of a brand new project the company wanted to attempt. After a few months of research I had the chance to sit next to Chris Armstrong and Jamu Kakar to bootstrap the development of what is now known as the Landscape distributed systems management project.

Fast forward three and a half years, to mid 2009, and Landscape became a massive project with hundreds of thousands of very well tested lines, spawning not only a client branch, but also external child projects such as the Storm Object Relational Mapper, in use also by Launchpad and Ubuntu One. On the commercial side of things it looks like Landscape’s life is just starting, with its hosted and standalone versions getting more and more attention from enterprise customers. And the three guys who started the project didn’t do it alone, for sure. The toy project of early 2006 has grown to become a well structured team, with added talent spreading areas such as development, business and QA.

While I wasn’t watching, though, something happened. In the face of all that action, my attention was slowly being spread thinly among management, architecture, development, testing, code reviews, meetings, and other tasks, sometimes in areas not entirely related, but very interesting of course. The net result of increased attention sprawl isn’t actually good, though. If it persists, even when the several small tasks may be individually significant, the achievement just doesn’t feel significant given the invested effort as a whole. At least not for someone that truly enjoys being a software architect, and loves to feel that the effort invested in the growth of a significant working software is really helping people out in the same magnitude of that investment. In simpler words, it felt like my position within the team just wasn’t helping the team out the same way it did before, and thus it was time for a change.

Last July an external factor helped to catapult that change. Eucalyptus needed a feature to be released with Ubuntu 9.10, due in October, to greatly simplify the installation of some standard machine images.. an Image Store. It felt like a very tight schedule, even more considering that I hadn’t been doing Java for a while, and Eucalyptus uses some sexy (and useful) new technology called the Google Web Toolkit, something I had to get acquainted with. Two months looked like a tight schedule, and a risky bet overall, but it also felt like a great opportunity to strongly refocus on a task that needed someone’s attention urgently. Again I was blessed with trust I’m thankful for, and by now I’m relieved to look back and perceive that it went alright, certainly thanks to the help of other people like Sidnei da Silva and Mathias Gug. Meanwhile, on the Landscape side, my responsibilities were distributed within the team so that I could be fully engaged on the problem.

Moving this forward a little bit we reach the current date. Right now the Landscape project has a new organizational structure, and it actually feels like it’s moving along quite well. Besides the internal changes, a major organizational change also took place around Landscape over that period, and the planned restructuring led me to my current role. In practice, I’m now engaging into the research of a new concept which I’m hoping to publish openly quite soon, if everything goes well. It’s challenging, it’s exciting, and most importantly, allows me to focus strongly on something which has a great potential (I will stop teasing you now). In addition to this, I’ll definitely be spending some of that time on the progress of Landscape and the Image Store, but mostly from an architectural point of view, since both of these projects will have bright hands taking care of them more closely.

Sit by the fireside if you’re interested in the upcoming chapters of that story. ;-)

Gustavo Niemeyer

Google won’t kill standalone GPS

It was already dead. In some senses, anyway.

Google announced a couple of days ago that they’re advancing into the business of GPS guided navigation, rather than staying with their widely popular offering of mapping and positioning only. This announcement affected the rest of the industry immediately, and some of the industry leaders in the area have quickly taken a hit on their share value.

As usual, Slashdot caught up on the news and asked the question: Will Google and Android kill standalone GPS?

Let me point out that the way the facts were covered by Slashdot was quite misguided. Google may be giving a hand to change the industry dynamics a bit faster, but both Garmin and TomTom, the companies which reportedly had an impact in their share value, have phone-based offerings of their own, so it’s not like Google suddenly had an idea for phone-based navigation software which will replace every other offering. The world has been slowly converging towards a multi-purpose device for quite a while, and these multi-purpose devices are putting GPSes in the hands of people that in many cases never considered buying a GPS.

The real reason why these companies are taking a hit in their shares now is that Google announced it will offer for free something that these companies charge good money for at the moment, be it in a standalone GPS or not.

Gustavo Niemeyer

More than 40 years ago, a guy named Douglas Parkhill described the concept of utility computing. He described it as containing features such as:

  • Essentially simultaneous use of the system by many remote users.
  • Concurrent running of different multiple programs.
  • Availability of at least the same range of facilities and capabilities at the remote stations as the user would expect if he were the sole operator of a private computer.
  • A system of charging based upon a flat service charge and a variable charge based on usage.
  • Capacity for indefinite growth, so that as the customer load increases, the system can be expanded without limit by various means.

Fast forward 40 years, and we now name pretty much this same concept as Cloud Computing, and everyone is very excited about the possibilities that exist within this new world. Different companies are pushing this idea in different ways. One of the pioneers in that area is of course Amazon, which managed to create a quite good public cloud offering through their Amazon Web Services product.

This kind of publicly consumable infrastructure is very interesting, because it allows people to do exactly what Douglas Parkhill described 40 years ago, so individuals and organizations can rent computing resources with minimum initial investment, and pay for as much as they need, no more no less.

This is all good, but one of the details is that not every organization can afford to send data or computations to a public cloud like Amazon’s AWS. There are many potential reasons for this, from legal regulations to volume cost. Out of these issues the term Private Cloud was coined. It basically represents exactly the same ideas that Douglas Parkhill described, but rather than using third party infrastructure, some organizations opt to use the same kind of technology, such as the Eucalyptus project deployed in a private infrastructure, so that the teams within the organization can still benefit from the mentioned features.

So we have the Public Cloud and the Private Cloud. Now, what would a Virtual Private Cloud be?

Well, it turns out that this is just a marketing term, purposefully coined to blur the line between a Private and a Public Cloud.

The term was used in the announcement Amazon has made yesterday:

Amazon VPC enables enterprises to connect their existing infrastructure to a set of isolated AWS compute resources via a Virtual Private Network (VPN) connection, (…)

So, what is interesting about this is that this is actually not a Private Cloud, because the resources on the other side of the VPN are actually public infrastructure, and as such it doesn’t solve any of the problems that private clouds were created to solve in the first place.

Not only that, but it creates the false impression that organizations would have their own isolated resources. What isolated resources? A physical computer? Storage? Network? Of course, isolating these is not economically viable if you are charging 10 cents an hour per computer instance:

Each month, you pay for VPN Connection-hours and the amount of data transferred via the VPN connections. VPCs, subnets, VPN gateways, customer gateways, and data transferred between subnets within the same VPC are free. Charges for other AWS services, including Amazon EC2, are billed separately at published standard rates.

That doesn’t quite fit together, does it?

To complete the plot, Werner Vogels runs to his blog and screams out loud “Private Cloud is not the Cloud”, while announcing the Virtual Private Cloud which is actually a VPN to his Public Cloud, with infrastructure shared with the world.

Sure. What can I say? Well, maybe that Virtual Private Cloud is not the Private Cloud.
