Canonical Voices

Gustavo Niemeyer

The job market for Go

I know there’s a lot of curiosity regarding Go’s growth and usage, and some people often ponder about whether they should invest on it since they don’t see hundreds of jobs in public sites, so would like to point something obvious, but which people often don’t realize:

You are responsible for the market.

If you appreciate programming in Go, and understand the qualities of the language and the principles that the core team behind its development follow, it’s in your hands to make the language well known or not.

This can be as simple as blogging about it, or telling your coworkers or programmer friends about it, or maybe talking in user groups and conferences, running coding contests, etc. The core idea in all of that is making sure the right people are exposed to the language.

Note that this is not some kind of evil plan. We all like hearing about nice tools, and I would personally be glad to hear about other tools and languages people have to expose as well. The real point is just that we’re the ones building the future marketplace right now, and if we want Go to have a significant chunk of that marketplace because we like using it for our daily tasks, we better talk about why that is so.

It’s not about convincing.. it’s about exposing and explaining. People will have gone through similar pain as you have and will correlate to your experience, and then they can easily decide by themselves if Go is in a position to offer them a less painful life.

Read more
Gustavo Niemeyer

Back in 2009 I quickly talked about the obvious revolution in computing that was rolling in the form of mobile phone as computer, and mentioned as well the fact that touch-based interfaces were going to dominate the marketplace because of that.

Move forward a couple of years, and last week I got my first tablet, running Android (a Samsung Galaxy Tab 10.1, if you’re curious). I didn’t know exactly why I needed one, but being in the tech industry I always have that nice excuse for myself of buying things early on for learning about the experience of using them. Last night, I could clearly see this can be a real claim in some cases (in others it’s just an excuse for the wife).

After getting the tablet last week, I’ve started by experimenting with the usual stuff any person would (email, browser, etc), and then downloaded a few games to take on board a longish flight. Some of them were pretty good.. a vertical scrolling shooter, a puzzle-solver, and so on. On all of them, though, it took just a few minutes before the novelty of holding the screen in my hands for interacting with the game got old, and the interest went away with it.

This last night, though, I’ve decided to try another game from the top list, named Cut the Rope, and this time I was immediately hooked into it. That was certainly one of the most enjoyable gaming experiences I had in quite a while, and when going to bed I started to ponder about what was different there.

The game is obviously well executed, with cute drawings and sounds, and also smooth, but I think there was something else as well. In retrospect, the other games felt a lot like ports of a desktop/laptop experience. The side scrolling game, for instance, was quite well suited for a joystick, and at least one other game had an actual joystick emulated on the screen, which is an enabler, but far from nice to be honest.

This one game, though, felt very well suited for a hands-based interaction: quickly drawing lines for cutting ropes, tapping on balloons to push air out, moving levers around, etc. In some more advanced levels, it was clear that my dexterity (or lack thereof) was playing a much more important role in accomplishing the tasks than the traditional button/joystick version of it. This felt like an entirely novel gaming experience that just hadn’t happened yet.

It’s funny and ironic that I had this experience within a week from Microsoft reportedly saying (again!) that a tablet is just another PC. It’s not, and if they tried it out with some minimum attention they’d see why it’s so clearly not.

In that experience, the joystick felt familiar but at the same quite awkward to use, but using my hands naturally in an environment where that was suitable felt very pleasing. We can generalize that a bit and note a common way to relate to innovation: we first try to reuse the knowledge we have when facing a new concept, but when we understand the concept better quite often we’re able to come up with more effective and interesting ways to relate to it.

In the tablet vs. laptop/desktop thread, you probably won’t want to be typing long documents in a tablet, but would most likely prefer to shuffle items in an agenda with your fingers. Also, you likely wouldn’t want to do that detailed CAD work with a fat finger in a screen, but would certainly be happy to review code or a document sitting in your backyard with the birds (no whales).

So, let’s please put that hammer away for a second while creating a most enjoyable touch-based experience.

Read more
niemeyer

Circular buffers are based on an algorithm well known by any developer who’s got past the “Hello world!” days. They offer a number of key characteristics with wide applicability such as constant and efficient memory use, efficient FIFO semantics, etc.

One feature which is not always desired, though, it the fact that circular buffers traditionally will either overwrite the last element, or raise an overflow error, since they are generally implemented as a buffer of constant size. This is an unwanted property when one is attempting to consume items from the buffer and it is not an option to blindly drop items, for instance.

This post presents an efficient (and potentially novel) algorithm for implementing circular buffers which preserves most of the key aspects of the traditional version, while also supporting dynamic expansion when the buffer would otherwise have its oldest entry overwritten. It’s not clear if the described approach is novel or not (most of my novel ideas seem to have been written down 40 years ago), so I’ll publish it below and let you decide.

Traditional circular buffers

Before introducing the variant which can actually expand during use, let’s go through a quick review on traditional circular buffers, so that we can then reuse the nomenclature when extending the concept. All the snippets provided in this post are written in Python, as a better alternative to pseudo-code, but the concepts are naturally portable to any other language.

So, the most basic circular buffer needs the buffer itself, its total capacity, and a position where the next write should occur. The following snippet demonstrates the concept in practice:

buf = [None, None, None, None, None]
bufcap = len(buf)
pushi = 0   

for elem in range(7):
    buf[pushi] = elem
    pushi = (pushi + 1) % bufcap
    
print buf # => [5, 6, 2, 3, 4]

In the example above, the first two elements of the series (0 and 1) were overwritten once the pointer wrapped around. That’s the specific feature of circular buffers which the proposal in this post will offer an alternative for.

The snippet below provides a full implementation of the traditional approach, this time including both the pushing and popping logic, and raising an error when an overflow or underflow would occur. Please note that these snippets are not necessarily idiomatic Python. The intention is to highlight the algorithm itself.

class CircBuf(object):

    def __init__(self):
        self.buf = [None, None, None, None, None]
        self.buflen = self.pushi = self.popi = 0
        self.bufcap = len(self.buf)

    def push(self, x):
        assert self.buflen == 0 or self.pushi != self.popi, 
               "Buffer overflow!"
        self.buf[self.pushi] = x
        self.pushi = (self.pushi + 1) % self.bufcap
        self.buflen += 1

    def pop(self):
        assert self.buflen != 0, "Buffer underflow!"
        x = self.buf[self.popi]
        self.buf[self.popi] = None
        self.buflen -= 1
        self.popi = (self.popi + 1) % self.bufcap
        return x

With the basics covered, let’s look at how to extend this algorithm to support dynamic expansion in case of overflows.

Dynamically expanding a circular buffer

The approach consists in imagining that the same buffer can contain both a circular buffer area (referred to as the ring area from here on), and an overflow area, and that it is possible to transform a mixed buffer back into a pure circular buffer again. To clarify what this means, some examples are presented below. The full algorithm will be presented afterwards.

First, imagine that we have an empty buffer with a capacity of 5 elements as per the snippet above, and then the following operations take place:

for i in range(5):
    circbuf.push(i)

circbuf.pop() # => 0
circbuf.pop() # => 1

circbuf.push(5)
circbuf.push(6)

print circbuf.buf # => [5, 6, 2, 3, 4]

At this point we have a full buffer, and with the original implementation an additional push would raise an assertion error. To implement expansion, the algorithm will be changed so that those items will be appended at the end of the buffer. Following the example, pushing two additional elements would behave the following way:

circbuf.push(7)
circbuf.push(8)

print circbuf.buf # => [5, 6, 2, 3, 4, 7, 8]

In that example, elements 7 and 8 are part of the overflow area, and the ring area remains with the same capacity and length of the original buffer. Let’s perform a few additional operations to see how it would behave when items are popped and pushed while the buffer is split:

circbuf.pop() # => 2
circbuf.pop() # => 3
circbuf.push(9)

print circbuf.buf # => [5, 6, None, None, 4, 7, 8, 9]

In this case, even though there are two free slots available in the ring area, the last item pushed was still appended at the overflow area. That’s necessary to preserve the FIFO semantics of the circular buffer, and means that the buffer may expand more than strictly necessary given the space available. In most cases this should be a reasonable trade off, and should stop happening once the circular buffer size stabilizes to reflect the production vs. consumption pressure (if you have a producer which constantly operates faster than a consumer, though, please look at the literature for plenty of advice on the problem).

The remaining interesting step in that sequence of events is the moment when the ring area capacity is expanded to cover the full allocated buffer again, with the previous overflow area being integrated into the ring area. This will happen when the content of the previous partial ring area is fully consumed, as shown below:

circbuf.pop() # => 4
circbuf.pop() # => 5
circbuf.pop() # => 6
circbuf.push(10)

print circbuf.buf # => [10, None, None, None, None, 7, 8, 9]

At this point, the whole buffer contains just a ring area and the overflow area is again empty, which means it becomes a traditional circular buffer.

Sample algorithm

With some simple modifications in the traditional implementation presented previously, the above semantics may be easily supported. Note how the additional properties did not introduce significant overhead. Of course, this version will incur in additional memory allocation to support the buffer expansion, bu that’s inherent to the problem being solved.

class ExpandingCircBuf(object):

    def __init__(self):
        self.buf = [None, None, None, None, None]
        self.buflen = self.ringlen = self.pushi = self.popi = 0
        self.bufcap = self.ringcap = len(self.buf)

    def push(self, x):
        if self.ringlen == self.ringcap or 
           self.ringcap != self.bufcap:
            self.buf.append(x)
            self.buflen += 1
            self.bufcap += 1
            if self.pushi == 0: # Optimization.
                self.ringlen = self.buflen
                self.ringcap = self.bufcap
        else:
            self.buf[self.pushi] = x
            self.pushi = (self.pushi + 1) % self.ringcap
            self.buflen += 1
            self.ringlen += 1

    def pop(self):
        assert self.buflen != 0, "Buffer underflow!"
        x = self.buf[self.popi]
        self.buf[self.popi] = None
        self.buflen -= 1
        self.ringlen -= 1
        if self.ringlen == 0 and self.buflen != 0:
            self.popi = self.ringcap
            self.pushi = 0
            self.ringlen = self.buflen
            self.ringcap = self.bufcap
        else:
            self.popi = (self.popi + 1) % self.ringcap
        return x

Note that the above algorithm will allocate each element in the list individually, but in sensible situations it may be better to allocate additional space for the overflow area in advance, to avoid potentially frequent reallocation. In a situation when the rate of consumption of elements is about the same as the rate of production, for instance, there are advantages in doubling the amount of allocated memory per expansion. Given the way in which the algorithm works, the previous ring area will be exhausted before the mixed buffer becomes circular again, so with a constant rate of production and an equivalent consumption it will effectively have its size doubled on expansion.

UPDATE: Below is shown a version of the same algorithm which not only allows allocating more than one additional slot at a time during expansion, but also incorporates it in the overflow area immediately so that the allocated space is used optimally.

class ExpandingCircBuf2(object):

    def __init__(self):
        self.buf = []
        self.buflen = self.ringlen = self.pushi = self.popi = 0
        self.bufcap = self.ringcap = len(self.buf)

    def push(self, x):
        if self.ringcap != self.bufcap:
            expandbuf = (self.pushi == 0)
            expandring = False
        elif self.ringcap == self.ringlen:
            expandbuf = True
            expandring = (self.pushi == 0)
        else:
            expandbuf = False
            expandring = False

        if expandbuf:
            self.pushi = self.bufcap
            expansion = [None, None, None]
            self.buf.extend(expansion)
            self.bufcap += len(expansion)
            if expandring:
                self.ringcap = self.bufcap

        self.buf[self.pushi] = x
        self.buflen += 1
        if self.pushi < self.ringcap:
            self.ringlen += 1
        self.pushi = (self.pushi + 1) % self.bufcap

    def pop(self):
        assert self.buflen != 0, "Buffer underflow!"
        x = self.buf[self.popi]
        self.buf[self.popi] = None
        self.buflen -= 1
        self.ringlen -= 1
        if self.ringlen == 0 and self.buflen != 0:
            self.popi = self.ringcap
            self.ringlen = self.buflen
            self.ringcap = self.bufcap
        else:
            self.popi = (self.popi + 1) % self.ringcap
        return x

Conclusion

This blog post presented an algorithm which supports the expansion of circular buffers while preserving most of their key characteristics. When not faced with an overflowing buffer, the algorithm should offer very similar performance characteristics to a normal circular buffer, with a few additional instructions and constant space for registers only. When faced with an overflowing buffer, the algorithm maintains the FIFO property and enables using contiguous allocated memory to maintain both the original circular buffer and the additional elements, and follows up reusing the full area as part of a new circular buffer in an attempt to find the proper size for the given use case.

Read more
niemeyer

One more Go library oriented towards building distributed systems hot off the presses: govclock. This one offers full vector clock support for the Go language. Vector clocks allow recording and analyzing the inherent partial ordering of events in a distributed system in a comfortable way.

The following features are offered by govclock, in addition to basic event tracking:

  • Compact serialization and deserialization
  • Flexible truncation (min/max entries, min/max update time)
  • Unit-independent update times
  • Traditional merging
  • Fast and memory efficient

If you’d like to know more about vector clocks, the Basho guys did a great job in the following pair of blog posts:

The following sample program demonstrates some sequential and concurrent events, dumping and loading, as well as merging of clocks. For more details, please look at the web page. The project is available under a BSD license.


package main

import (
    "launchpad.net/govclock"
    "fmt"
)

func main() {
    vc1 := govclock.New()
    vc1.Update([]byte("A"), 1)

    vc2 := vc1.Copy()
    vc2.Update([]byte("B"), 0)

    fmt.Println(vc2.Compare(vc1, govclock.Ancestor))   // => true
    fmt.Println(vc1.Compare(vc2, govclock.Descendant)) // => true

    vc1.Update([]byte("C"), 5)


    fmt.Println(vc1.Compare(vc2, govclock.Descendant)) // => false
    fmt.Println(vc1.Compare(vc2, govclock.Concurrent)) // => true

    vc2.Merge(vc1)

    fmt.Println(vc1.Compare(vc2, govclock.Descendant)) // => true

    data := vc2.Bytes()
    fmt.Printf("%#vn", string(data))
    // => "x01x01x01x01Ax01x01x01Bx01x00x01C"

    vc3, err := govclock.FromBytes(data)
    if err != nil { panic(err.String()) }

    fmt.Println(vc3.Compare(vc2, govclock.Equal))      // => true
}

Read more
Gustavo Niemeyer

ZooKeeper is a clever generic coordination server for distributed systems, and is one of the core softwares which facilitate the development of Ensemble (project for automagic IaaS deployments which we push at Canonical), so it was a natural choice to experiment with.

Gozk is a complete binding for ZooKeeper which explores the native features of Go to facilitate the interaction with a ZooKeeper server. To avoid reimplementing the well tested bits of the protocol in an unstable way, Gozk is built on top of the standard C ZooKeeper library.

The experience of integrating ZooKeeper with Go was certainly valuable on itself, and worked as a nice way to learn the details of integrating the Go language with a C library. If you’re interested in learning a bit about Go, ZooKeeper, or other details related to the creation of bindings and asynchronous programming, please fasten the seatbelt now.

Basics of C wrapping in Go

Creating the binding on itself was a pretty interesting experiment already. I have worked on the creation of quite a few bindings and language bridges before, and must say I was pleasantly surprised with the experience of creating the Go binding. With Cgo, the name given to the “foreign function interface” mechanism for C integration, one basically declares a special import statement which causes a pre-processor to look at the comment preceding it. Something similar to this:

// #include <zookeeper.h>
import "C"

The comment doesn’t have to be restricted to a single line, or to #include statements even. The C code contained in the comment will be transparently inserted into a helper C file which is compiled and linked with the final object file, and the given snippet will also be parsed and inclusions processed. In the Go side, that “C” import is simulated as if it were a normal Go package so that the C functions, types, and values are all directly accessible.

As an example, a C function with this prototype:

int zoo_wexists(zhandle_t *zh, const char *path, watcher_fn watcher,
                void *context, struct Stat *stat);

In Go may be used as:

cstat := C.struct_Stat{}
rc, cerr := C.zoo_wexists(zk.handle, cpath, nil, nil, &cstat)

When the C function is used in a context where two result values are requested, as done above, Cgo will save the well known errno variable after the function has finished executing and will return it wrapped into an os.Errno value.

Also, note how the C struct is defined in a way that can be passed straight to the C function. Interestingly, the allocation of the memory backing the structure is going to be performed and tracked by the Go runtime, and will be garbage collected appropriately once no more references exist within the Go runtime. This fact has to be kept in mind since the application will crash if a value allocated normally within Go is saved with a foreign C function and maintained after all the Go references are gone. The alternative in these cases is to call the usual C functions to get hold of memory for the involved values. That memory won’t be touched by the garbage collector, and, of course, must be explicitly freed when no longer necessary. Here is a simple example showing explicit allocation:

cbuffer := (*C.char)(C.malloc(bufferSize))
defer C.free(unsafe.Pointer(cbuffer))

Note the use of the defer statement above. Even when dealing with foreign functionality, it comes in handy. The above call will ensure that the buffer is deallocated right before the current function returns, for instance, so it’s a nice way to ensure no leaks happen, even if in the future the function suddenly gets a new exit point which didn’t consider the allocation of resources.

In terms of typing, Go is more strict than C, and Cgo-based logic will also ensure that the types returned and passed into the foreign C functions are correctly typed, in the same way done for the native types. Note above, for instance, how the call to the free() function has to explicitly convert the value into an unsafe.Pointer, even though in C no casting would be necessary to pass a pointer into a void * parameter.

The unsafe.Pointer is in fact a very special type within Go. Using it, one can convert any pointer type into any other pointer type in an unsafe way (thus the package name), and also back and forth into a uintptr value with the address of the memory referenced by the pointer. For every other type conversion, Go will ensure at compilation time that doing the conversion at runtime is a safe operation.

With all of these resources, including the ability to use common Go syntax and functionality even when dealing with foreign types, values, and function calls, the integration task turns out to be quite a pleasant experience. That said, some of the things may still require some good thinking to get right, as we’ll see shortly.

Watch callbacks and channels

One of the most interesting (and slightly tricky) aspects of mapping the ZooKeeper concepts into Go was the “watch” functionality. ZooKeeper allows one to attach a “watch” to a node so that the server will report back when changes happen to the given node. In the C library, this functionality is exposed via a callback function which is executed once the monitored node aspect is modified.

It would certainly be possible to offer this functionality in Go using a similar mechanism, but Go channels provide a number of advantages for that kind of asynchronous notification: waiting for multiple events via the select statement, synchronous blocking until the event happens, testing if the event is already available, etc.

The tricky bit, though, isn’t the use of channels. That part is quite simple. The tricky detail is that the C callback function execution happens in a C thread started by the ZooKeeper library, and happens asynchronously, while the Go application is doing its business elsewhere. Right now, there’s no straightforward way to transfer the execution of this asynchronous C function back into the Go land. The solution for this problem was found with some help from the folks at the golang-nuts mailing list, and luckily it’s not that hard to support or understand. That said, this is a good opportunity to get some coffee or your preferred focus-enhancing drink.

The solution works like this: when the ZooKeeper C library gets a watch notification, it executes a C callback function which is inside a Gozk helper file. Rather than transferring control to Go right away, this C function simply appends data about the event onto a queue, and signals a pthread condition variable to notify that an event is available. Then, on the Go side, once the first ZooKeeper connection is initialized, a new goroutine is fired and loops waiting for events to be available. The interesting detail about this loop, is that it blocks within a foreign C function waiting for an event to be available, through the signaling of the shared pthread condition variable. In the Go side, that’s how the call looks like, just to give a more practical feeling:

// This will block until there's a watch available.
data := C.wait_for_watch()

Then, on the C side, here is the function definition:

watch_data *wait_for_watch() {
    watch_data *data = NULL;
    pthread_mutex_lock(&watch_mutex);
    if (first_watch == NULL)
        pthread_cond_wait(&watch_available, &watch_mutex);
    data = first_watch;
    first_watch = first_watch->next;
    pthread_mutex_unlock(&watch_mutex);
    return data;
}

As you can see, not really a big deal. When that kind of blocking occurs inside a foreign C function, the Go runtime will correctly continue the execution of other goroutines within other operating system threads.

The result of this mechanism is a nice to use interface based on channels, which may be explored in different ways depending on the application needs. Here is a simple example blocking on the event synchronously, for instance:

stat, watch, err := zk.ExistsW("/some/path")
if stat == nil && err == nil {
    event := <-watch
    // Use event ...
}

Concluding

Those were some of the interesting aspects of implementing the ZooKeeper binding. I would like to speak about some additional details, but this post is rather long already, so I'll keep that for a future opportunity. The code is available under the LGPL, so if you're curious about some other aspect, or would like to use ZooKeeper with Go, please move on and check it out!

Read more
Gustavo Niemeyer

Continuing the sequence of experiments I’ve been running with the Go language, I’ve just made available a tiny but useful new package: gommap. As one would imagine, this new package provides access to low-level memory mapping for files and devices, and it allowed exploring a few new edges of the language implementation. Note that, strictly speaking, some of the details ahead are really more about the implementation than the language itself.

There were basically two main routes to follow when implementing support for memory mapping in Go. The first one is usually the way higher-level languages handle it. In Python, for instance, this is the way one may use a memory mapped file:

>>> import mmap
>>> file = open("/etc/passwd")
>>> mm = mmap.mmap(file.fileno(), size, access=PROT_READ)
>>> mm[0:4]
'root'

The way this was done has an advantage and a disadvantage which are perhaps non entirely obvious on a first look. The advantage is that the memory mapped area is truly hidden behind that interface, so any improper attempt to access a region which was already unmapped, for instance, may be blocked within the application with a nice error message which explains the issue. The disadvantage, though, is that this interface usually comes with a restriction that the way to use the memory region with normal libraries, is via copying of data. In the above example, for instance, the “root” string isn’t backed by the original mapped memory anymore, and is rather a copy of its contents (see PEP 3118 for a way to improve a bit this aspect with Python).

The other path, which can be done with Go, is to back a normal native array type with the allocated memory. This means that normal libraries don’t need to copy data out of the mapped memory, or to use a special memory saving interface, to deal with the memory mapped region. As a simple example, this would get the first line in the given file:

mmap, err := gommap.Map(file.Fd(), PROT_READ, MAP_PRIVATE)
if err == nil {
    end := bytes.Index(mmap, []byte{'\n'})
    firstLine := mmap[:end]
}

In the procedure above, mmap is defined as an alias to a native []byte array, so even though the standard bytes module was used, at no point was the data from the memory mapped region copied out or any auxiliary buffers allocated, so this is a very fast operation. To give an idea about this, let’s pretend for a moment that we want to increase a simple 8 bit counter in a file. This might be done with something as simple as:

mmap[13] += 1

This line of code would be compiled into something similar to the following assembly (amd64):

MOVQ    mmap+-32(SP),BX
CMPL    8(BX),$13
JHI     ,68
CALL    ,runtime.panicindex+0(SB)
MOVQ    (BX),BX
INCB    ,13(BX)

As you can see, this is just doing some fast index checking before incrementing the value directly in memory. Given that one of the important reasons why memory mapped files are used is to speed up access to disk files (sometimes large disk files), this advantage in performance is actually meaningful in this context.

Unfortunately, though, doing things this way also has an important disadvantage, at least right now. There’s no way at the moment to track references to the underlying memory, which was allocated by means not known to the Go runtime. This means that unmapping this memory is not a safe operation. The munmap system call will simply take the references away from the process, and any further attempt to touch those areas will crash the application.

To give you an idea about the background “magic” which is going on to achieve this support in Go, here is an interesting excerpt from the underlying mmap syscall as of this writing:

addr, _, errno := syscall.Syscall6(syscall.SYS_MMAP, (...))
(...)
dh := (*reflect.SliceHeader)(unsafe.Pointer(&mmap))
dh.Data = addr
dh.Len = int(length)
dh.Cap = dh.Len

As you can see, this is taking apart the memory backing the slice value into its constituting structure, and altering it to point to the mapped memory, including information about the length mapped so that bound checking as observed in the assembly above will work correctly.

In case the garbage collector is at some point extended to track references to these foreign regions, it would be possible to implement some kind of UnmapOnGC() method which would only unmap the memory once the last reference is gone. For now, though, the advantages of being able to reference memory mapped regions directly, at least to me, surpass the danger of having improper slices of the given region being used after unmapping. Also, I expect that usage of this kind of functionality will generally be encapsulated within higher level libraries, so it shouldn’t be too hard to keep the constraint in mind while using it this way.

For those reasons, gommap was implemented with the latter approach. In case you need memory mapping support for Go, just move ahead and goinstall launchpad.net/gommap.

UPDATE (2010-12-02): The interface was updated so that mmap itself is an array, rather than mmap.Data, and this post was changed to reflect this.

Read more
Gustavo Niemeyer

A while ago Martin Pool made a very interesting post on the design of interfaces, inspired by a talk from Rusty Russel from 2003.

Besides the interesting scale of interface quality explained there, this is a very insightful comment, often overlooked:


Once the code gets too big for one person, it’s all about damage control. Interfaces make damage control possible… except when the interfaces themselves are the problem.

Designing a system as a group of people requires splitting tasks up among team members, and/or the community, and perhaps even separate teams. Interfaces are the touch points of that splitting, and is what represents the functionality offered within the module/library/file/command/service/whatever. Too often, people spend a long time working on the implementation details, thinking really deep about how to obtain the desired behavior, and forget to define clearly what is the interface to that behavior.

Having good interfaces is a key aspect of software development, and getting it correctly offers a number of important benefits:

Good encapsulation

Having good encapsulation is pretty much a synonym of having good interfaces. Too often, though, people focus on the encapsulation of the small pieces (the functions, the classes, etc), and forget about the encapsulation of the larger blocks (the libraries, modules, packages, commands, etc).

Also, in my experience trying to encourage good architectures, I have found that stating “We need good encapsulation!” gives developers no tangible line of action. It reminds me of a parent telling the child “You should be responsible!”. Sure, encapsulation and responsibility both sound great, but.. what does that really mean?

When inviting developers to think about the interfaces of the system parts they are responsible for, encapsulation becomes a natural outcome. It’s clear that there must be a line drawn between that part of the system and the rest, and the shape of this line must be considered while (or even better, before) the behavior is implemented.

Given well designed interfaces, the additional requirement of only using other parts of the system through their public interfaces seals the achievement of good encapsulation. Ideally, this barrier would be a natural property of the language used to develop the system (see the interface quality scale in Martin’s post). In other cases, this must be achieved through conventions, agreements, and good documentation.

Improved scope and communication

By inviting developers to think about the interfaces of the parts they are responsible for, one is basically encouraging the consideration of the interaction between those pieces and the rest of the system. This process gives an interesting perspective, both in terms of the external expectations (what do I need to offer other people?), as well as the internal goals (what do I need to implement for satisfying what other people need?).

Besides helping people to figure the scope and goal of the piece being developed, this will also give a nice structure to some of the communication which must inevitably happen to integrate correctly the separate parts of the system being developed.

Improved testing and experimentation

If an interface is well designed and defined, and encapsulates well part of the functionality of the system, it improves significantly the testing and experimentation related to that part of the system. Again, this has an effect internally and externally to the interface.

Internally in the sense that there’s a clear boundary between the part in development and the rest of the system, and thus it should be easier to verify that the bits which compose it are working according to plan without dragging the whole system together, and also to verify that the interface itself is behaving as intended (and hopefully as documented).

Externally in the sense that, given that there’s agreement regarding what is the public interface to the part being considered, one may easily provide a test double (a fake, or dummy, or mock) to simulate that part of the system. This is well known to be useful in a number of ways:

  • Dependent work may be run in parallel by different people
  • Real implementation backing the given interface may be postponed, until the idea is proven useful, and the interface feels suitable
  • External systems which would be hard to run locally may be simulated so that tests run fast and cheap, even without network connections
  • Faults may be injected in the system via the test doubles to verify behavior in hostile conditions

and so on.

Quality isolation

This point is also my understanding of what Rusty refers to as damage control in his talk. This property is very useful when designing a system, but even then it’s often missed when discussing interfaces and encapsulation.

If there’s a well defined interface to a piece of functionality in the system, and that interface was carefully considered to cover the needs of the system, the implementation of that interface may not start as the most beautiful, or most scalable, or even most reliable piece of software. As any developer responsible for a successful startup will happily point out, a half-baked implementation is often good enough to get things going, prove the concept, and extend the project runway.

Good interfaces play an important role in this kind of situation. They are, in this sense, a way to be better prepared for success (or, for failure, depending on the perspective). If the interface implementation suddenly becomes an issue for whatever reason, the implementation itself may be replaced by something which better suits the current reality, while preserving the interaction with the rest of the system.

Of course, it’s still very hard to predict future system behavior when facing a completely different reality. Changing the scale requirements for the system a few orders of magnitude, for instance, may easily break existing assumptions, and interfaces designed around these assumptions. Still, even if good interfaces won’t be enough to avoid modifications in the architecture and integration points in many cases, they will certainly help framing the conversations which will take place when this happens and new interfaces must be developed.

Conclusion

When developing non-trivial software products, there’s no other way but to split out the problem solving in several layers and components. Looking at the points where these layers and components touch each other is a very useful and natural way to organize conversations and structure work which must take place to push the product forward.

It’s quite revealing to look at the points above, and note that it’s not simply the existence of interfaces themselves which presents the advantages described, but the process which they encourage around them. Software architecture is essentially about people.

Read more
Gustavo Niemeyer

It’s time to release my “side project” which has been evolving over the last several months: Gocheck. I’ve been watching Go for some time, and have been getting more and more interested in the language. My first attempt to write something interesting in it made it obvious that there would be benefit in having a richer testing platform than what is available in the standard library. That said, I do understand why the standard one is slim: it’s pretty minimalist, because it’s used by itself to test the rest of the platform. With Gocheck, though, I don’t have that requirement. I’m able to trust that the standard library works well, and focus on having features which will make me more productive while writing tests, including features such as:

  • Better error reporting
  • Richer test helpers: assertions which interrupt the test immediately, deep multi-type comparisons, string matching, etc
  • Suite-based grouping of tests
  • Fixtures: per suite and/or per test set up and tear down
  • Management of temporary directories
  • Panic-catching logic, with proper error reporting
  • Proper counting of successes, failures, panics, missed tests, skips, etc
  • Support for expected failures
  • Fully tested (yes, it manages to test itself reliably!)

That last point was actually quite fun to get right. It’s the first time I wrote a testing framework from the ground up, and of course I wanted to have it fully tested by itself, but I didn’t want to simply use a foreign testing framework to test it. So what it does is basically to have a “bootstrapping” phase, which ensures that the very basic parts of the library work, without trusting on pretty much any internal functionality (e.g. it verifies the number of executed functions, and works with low-level panics). Then, once the lower layers are trusted, tests for higher functionality was introduced by building on the trusted bits.

Gocheck is actually mostly ready for some time now, but I’ve been polishing edges with some real world usage before releasing it. Since both the real world usage and Gocheck itself are side projects, you can imagine that took a bit of time. Today, though, I’ve managed to fix the last few things which were bothering me, so it’s up for world consumption.

I hope you enjoy it, and make some good use of it so that we can all have more reliable software. ;-)

Read more
Gustavo Niemeyer

When I started programming in Python long ago, one of the features which really hooked me up was the quality interactive interpreter offered with the language implementation. It was (and still is) a fantastic way to experiment with syntax, semantics, modules, and whatnot. So much so that many first-class Python practitioners will happily tell you that the interactive interpreter is used not only as a programming sandbox, but many times as the their personal calculator too. This kind of interactive interpreter is also known as a REPL, standing for Read Eval Print Loop, and many languages have pretty advanced choices in that area by now.

After much rejoice with Python’s REPL, though, and as a normal human being, I’ve started wishing for more. The problem has a few different levels, which are easy to understand.

First, we’re using Python Twisted in Ensemble, one of the projects being pushed at Canonical. Twisted is an event-driven framework, which among other things means it works a lot with closures and callbacks. Having to redefine multi-line functions frequently to drive experiments isn’t exactly fun in a line-based interactive interpreter. Then, some of the languages I’ve started playing with, such as Erlang, have limited REPLs which differ in functionality significantly compared to what may be done in a text file. And finally, other languages I’ve been programming with recently, such as Go, lack a reasonable REPL altogether (there are only unusable hacks around).

Alright, so here is the idea: what if instead of being given an interactive REPL, you were presented with your favorite text editor, and whenever you wrote the file down, it was executed and results presented? That’s The Hacking Sandbox, or hsandbox. It supports 11 different programming languages out of the box, and given its nature it should be trivial to support any other language.

Here is a screenshot to clarify the idea:

Note that if you open a sandbox for a language like C or Go, the skeleton of what’s needed to run a program will already be in place, so you just have to “fill the blanks”.

For more details and download information, please check the hsandbox web page.

Read more
Gustavo Niemeyer

I’ve just read a post by Brett Cannon where, basically, he complains about complainers.

If you don’t know who Brett is, you’re probably not a heavy Python user. Brett is a very important Python core developer which has been around for a while and who does a great job at it. His post, though, makes me a bit sad.

Brett points out that there are two types of personalities which do not contribute to open source. The first one he defines as:

The first type is the “complainer”. This is someone who finds something they don’t like, points out that the thing they don’t like is suboptimal, but then offers no solutions.

And the second one is defined as:

(…) This is someone who, upon finding out about a decision that they think was sub-optimal, decides to bring up new ideas and solutions. The person is obviously trying to be helpful by bringing up new ideas and solutions, thinking that the current one is simply going to flop and they need to stop people from making a big mistake. The thing is, this person is not helping. (…)

This, on itself, is already shortsighted. If you’re tired of hearing the same arguments again and again for 10 years, from completely different people, there’s a pretty good chance that there’s an actual issue with your project, and your users are trying in their way to contribute and interact with you in the hope that it might get fixed.

This is really important: They are people, which use your project, and are trying to improve it. If you can’t stand that, you should stop maintaining an open source project now, or pick something which no one cares about.

The other issue which took my attention in his post is his example: the Python GIL. Look at the way in which Brett dismisses the problem:

(I am ignoring the fact that few people write CPU-intensive code requiring true threading support, that there is the multiprocessing library, true power users have extension modules which do operate with full threading, and that there are multiple VMs out there with a solution that have other concurrency solutions)

Brett, we can understand that the GIL is hard to remove, but it’s a fundamental flaw in the most important Python implementation, and being dismissive about it will either draw further complaints at you, or will simply drive users away from the language entirely.

I can understand why you think this way, though. Guido presents the same kind of feeling about the GIL for a very long time. Here is one excerpt from a mail thread about it:

Nevertheless, you’re right the GIL is not as bad as you would initially think: you just have to undo the brainwashing you got from Windows and Java proponents who seem to consider threads as the only way to approach concurrent activities.

Just Say No to the combined evils of locking, deadlocks, lock granularity, livelocks, nondeterminism and race conditions.

I apologize, but I have a very hard time reading this and not complaining.

In my world, the golden days of geometric growth in vertical processing power is over, multi-processed machines are here to stay, and the amount of traffic flowing through networks is just increasing. It feels reasonable to desire a less naïve approach to deal with real world problems, such as executing tasks concurrently.

I actually would love to not worry about things like non-determinism and race conditions, and would love even more to have a programming language which helps me with that!

Python, though, has a Global Interpreter Lock (yes, I’m talking about CPython, the most important interpreter). Python programs execute in sequence. No Fork/Join frameworks, no coroutines, no lightweight processes, nothing. Your Python code will execute in sequence if it lives in the same process space.

The answer from Brett and Guido to concurrency? Develop your code in C, or write your code to execute in multiple processes. If they really want people to get rid of non-determinism, locking issues, race conditions, and so on, they’re not helping at all.

I know this is just yet another complaint, though. I honestly cannot fix the problem either, and rather just talk about it in the hope that someone who’s able to do it will take care of it. That said, I wish that the language maintainers would do the same, and tell the world that it’s an unfortunate problem, and that they wished someone else would go there and fix it! If, instead, maintainers behave in a ridiculously dismissive way, like Guido did in that mail thread, and like Brett is doing in his post, the smart people that could solve the problem get turned down. People like to engage with motivated maintainers.. they like to solve problems that others are interested in seeing solved.

Perhaps agreeing with the shortcomings won’t help, though, and no one will show up to fix the problem either. But then, at least users will know that the maintainers are on the same side of the fence, and the hope that it will get fixed survives. If the maintainers just complain about the users which complain, and dismiss the problem, users are put in an awkward position. I can’t complain.. I can’t provide ideas or solutions.. I can’t fix the problem.. they don’t even care about the problem. Why am I using this thing at all?

Would you rather have users, or have no complainers?

Read more
Gustavo Niemeyer

The patch submitted last weekend to support Launchpad and Bazaar in goinstall went in!

This means that once the next release of Go is out (or you sync up with the tip code) you’ll be able to host modules for Go in Launchpad and people will be able to install it by running something similar to:

goinstall launchpad.net/yourproject

Then, you can simply use that in your source code as:

import “launchpad.net/yourproject”

The following URLs are supported:

import “launchpad.net/project”
import “launchpad.net/project/series”
import “launchpad.net/project/series/sub/directory”

import “launchpad.net/~user/project/branch”
import “launchpad.net/~user/project/branch/sub/directory”

import “launchpad.net/~user/+junk/branch”
import “launchpad.net/~user/+junk/branch/sub/directory”

Where you see sub/directory above, it means the module to be compiled, installed, and used, may be inside a subdirectory in the Bazaar branch itself. This is a convention used in all supported backends of goinstall.

Read more
Gustavo Niemeyer

After a few years in development, version 1.0 of Mocker is now available! Check out the changes since 0.10.1, the supported features, or go straight to the download page.

Read more
Gustavo Niemeyer

A bit of history

I don’t know exactly why, but I’ve always enjoyed IRC bots. Perhaps it’s the fact that it emulates a person in an easy-to-program way, or maybe it’s about having a flexible and shared “command line” tool, or maybe it’s just the fact that it helps people perceive things in an asynchronous way without much effort. Probably a bit of everything, actually.

My bot programming started with pybot many years ago, when I was still working at Conectiva. Besides having many interesting features, this bot eventually got in an abandonware state, since Canonical already had pretty much equivalent features available when I joined, and I had other interests which got in the way. The code was a bit messy as well.. it was a time when I wasn’t very used to testing software properly (a friend has a great excuse for that kind of messy software: “I was young, and needed the money!”).

Then, a couple of years ago, while working in the Landscape project, there was an opportunity of getting some information more visible to the team. Coincidently, it was also a time when I wanted to get some practice with the concepts of Erlang, so I decided to write a bot from scratch with some nice support for plugins, just to get a feeling of how the promised stability of Erlang actually took place for real. This bot is called mup (Mup Pet, more formally), and its code is available publicly through Launchpad.

This was a nice experiment indeed, and I did learn quite a bit about the ins and outs of Erlang with it. Somewhat unexpected, though, was the fact that the bot grew up a few extra features which multiple teams in Canonical started to appreciate. This was of course very nice, but it also made it more obvious that the egocentric reason for having a bot written in Erlang would now hurt, because most of Canonical’s own coding is done in Python, and that’s what internal tools should generally be written in for everyone to contribute and help maintaining the code.

That’s where the desire of migrating mup into a Python-based brain again came from, and having a new feature to write was the perfect motivator for this.

LDAP and two-way SMSing over IRC

Canonical is a very distributed company. Employees are distributed over dozens of countries, literally. Not only that, but most people also work from their homes, rather than in an office. Many different countries also means many different timezones, and working from home with people from different timezones means flexible timing. All of that means communication gets… well.. interesting.

How do we reach someone that should be in an online meeting and is not? Or someone that is traveling to get to a sprint? Or how can someone that has no network connectivity reach an IRC channel to talk to the team? There are probably several answers to this question, but one of them is of course SMS. It’s not exactly cheap if we consider the cost of the data being transfered, but pretty much everyone has a mobile phone which can do SMS, and the model is not that far away from IRC, which is the main communication system used by the company.

So, the itch was itching. Let’s scratch it!

Getting the mobile phone of employees was already a solved problem for mup, because it had a plugin which could interact with the LDAP directory, allowing people to do something like this:

<joe> mup: poke gustavo
<mup> joe: niemeyer is Gustavo Niemeyer <…@canonical.com> <time:…> <mobile:…>

This just had to be migrated from Erlang into a Python-based brain for the reasons stated above. This time, though, there was no reason to write something from scratch. I could even have used pybot itself, but there was also supybot, an IRC bot which started around the same time I wrote the first version of pybot, and unlike the latter, supybot’s author was much more diligent in evolving it. There is quite a comprehensive list of plugins for supybot nowadays, and it includes means for testing plugins and so on. The choice of using it was straighforward, and getting “poke” support ported into a plugin wasn’t hard at all.

So, on to SMSing. Canonical already had a contract with an SMS gateway company which we established to test-drive some ideas on Landscape. With the mobile phone numbers coming out of the LDAP directory in hands and an SMS contract established, all that was needed was a plugin for the bot to talk to the SMS gateway. That “conversation” with the SMS gateway allows not only sending messages, but also receiving SMS messages which were sent to a specific number.

In practice, this means that people which are connected to IRC can very easily deliver an SMS to someone using their nicks. Something like this:

<joe> @sms niemeyer Where are you? We’re waiting!

And this would show up in the mobile screen as:

joe> Where are you? We’re waiting!

In addition to this, people which have no connectivity can also contact individuals and channels on IRC, with mup working as a middle man. The message would show up on IRC in a similar way to:

<mup> [SMS] <niemeyer> Sorry, the flight was delayed. Will be there in 5.

The communication from the bot to the gateway happens via plain HTTPS. The communication back is a bit more complex, though. There is a small proxy service deployed in Google App Engine to receive messages from the SMS gateway. This was done to avoid losing messages when the bot itself is taken down for maintenance. The SMS gateway doesn’t handle this case very well, so it’s better to have something which will be up most of the time buffering messages.

A picture is worth 210 words, so here is a simple diagram explaining how things got linked together:

This is now up for experimentation, and so far it’s working nicely. I’m hoping that in the next few weeks we’ll manage to port the rest of mup into the supybot-based brain.

Read more
Gustavo Niemeyer

Released editmoin 1.15

Version 1.15 of editmoin is now available.

The following changes were made:

  • Moin used to work with numerical IDs for identification, and editmoin was still based on this model. This release adds support for direct authentication as available in current Moin releases. This was inspired by Reimar Bauer.
  • The new file ~/.moin_users is now parsed to obtain usernames, supporting the feature above. Shortcuts are also supported in this file.
  • Added support for textcha question handling.

Read more
Gustavo Niemeyer

I was just rambling randomly yesterday, in the usual microblogging platforms, about how result checking seems to be ignored or done badly. The precise wording was:

It’s really amazing how little attention error handling receives in most software development. Even *tutorials* often ignore it.

It indeed does amaze me. It sometimes feels like we write code for theoretical perfect worlds.. “If the processor executes exactly in this order, and the weather is calm, this program will work.”. There are countless examples of bad assumptions.. someday I will come with some statistics of the form “Every N seconds someone forgets to check the result of write().”.

If you are a teacher, or a developer that enjoys writing snippets of code to teach people, please join me in the quest of building a better future. Do not tell us that you’re “avoiding result checking for terseness”, because that’s exactly what we people will do (terseness is good, right?). On the contrary, take this chance to make us feel bad about avoiding result checking. You might do this by putting a comment like “If you don’t do this, you’re a bad programmer.” right next to the logic which is handling the result, and might take this chance to teach people how proper result handling is done.

Of course, there’s another forgotten art related to result checking. It sits on the other side of the fence. If you are a library author, do think through about how you plan to make us check conditions which happen inside your library, and try to imagine how to make our lives easier. If we suck at handling results when there are obvious ways to handle it, you can imagine what happens when you structure your result logic badly.

Here is a clear example of what not to do, coming straight from Python’s standard library, in the imaplib module:

    def login(self, user, password):
        typ, dat = self._simple_command('LOGIN', user, self._quote(password))
        if typ != 'OK':
            raise self.error(dat[-1])
        self.state = 'AUTH'
        return typ, dat

You see the problem there? How do you handle errors from this library? Should we catch the exception, or should we verify the result code? “Both!” is the right answer, unfortunately, because the author decided to do us a little favor and check the error condition himself in some arbitrary cases and raise the error, while letting it go through and end up in the result code in a selection of other arbitrary cases.

I may provide some additional advice on result handling in the future, but for now I’ll conclude with the following suggestion: please check the results from your actions, and help others to check theirs. That’s a good life-encompassing recommendation, actually.

Read more
Gustavo Niemeyer

In a hurry?

Go check it out!

The context

A while ago I found out about Sikuli, a very interesting project which allows people to script actions in GUIs based on screenshot excerpts. The idea is that you basically take images representing portions of your screen, like a button, or a label, or an icon, and then create a script which can detect a position in the screen which resembles one of these images, and perform actions on them, such as clicking, or hovering.

I had never imagined something like this, and the idea got me really excited about the possibilities. Imagine, for instance, what can be done in terms of testing. Testing of GUIs is unfortunately not yet a trivial task nowadays. We do have frameworks which are based on accessibility hooks, for instance, but these sometimes can’t be used because the hook is missing, or is even far off in terms of the context being tested (imagine testing that a browser can open a specific flash site successfully, for instance).

So, Sikuli opened my eyes to the possibility of using image matching technology in a GUI automation context, and I really wanted to play with it. In the days following the discovery, I fiddled a bit, communicated with the author, and even submitted some changes to make it work well in Ubuntu.

Then, the idea cooled down in my head, and I moved on with life. Well… until two weeks ago.

Right before heading to the Ubuntu Developer Summit for the next Ubuntu release, the desire of automating GUIs appeared again in the context of the widely scoped Ubuntu-level testing suite. Then, over the first few days last week, I was able to catch up with quite a few people which were interested in the concept of automating GUIs, with different purposes (testing, design approval, etc), which of course was all I needed to actually push that old desire forward.

Trying to get Sikuli to work, though, was quite painful. Even though I had sent patches upstream before, it looks like the build process isn’t working in Ubuntu again for other reasons (it’s not a polished build process, honestly), and even if I managed to make it work and contributed that to the upstream, in the end the path to integrate the Java-based tool in the Python-based testing framework which Ubuntu uses (Mago) wasn’t entirely straightforward either.

Reinventing the wheel

So, the the itch was in place, and there was a reason to let the NIH syndrome take over a bit. Plus, image processing is something I’d like to get a foot in anyway, so it felt like a good chance to have a closer look and at the same time contribute a small bit to potential quality improvements of Ubuntu.

That’s when Xpresser was born. Xpresser is a clean room implementation of the concepts explored by Sikuli, in the form of a Python library which can be used standalone, or embedded into other programs and testing frameworks such as Mago.

The project is sponsored by Canonical, and licensed under the LGPL.

Internally, it makes use of opencv for the image matching, pyatspi for the event generation (mouse clicks, etc), gtk for screen capturing and testing (of itself), and numpy for matrix operations. Clearly, the NIH syndrome, wasn’t entirely active. :-) As a side note, I haven’t played with numpy and gtk for some time, and I’m always amazed by the quality of these modules.

Contribute code and ideas

Concluding this post, which is already longer than I expected, the basics of Xpresser are in place, so go ahead and play with it! That said, there are quite a few low hanging fruits to get it to a point of being a really compelling GUI-driving library, so if you have any interest in the concept, I invite you to play with the code and submit contributions too. If you want ideas of what else could be done, let’s have a chat.

Read more
Gustavo Niemeyer

Recovering a bootable EBS image

Scott Moser has just announced this week that the new Ubuntu images which boot out of an EBS-based root filesystem in EC2, and thus will persist across reboots, are available for testing.

As usual with something that just left the oven and is explicitly labeled for testing purposes, there was a minor bug in the first iteration of images which was even mentioned in the announcement itself. The bug, if not worked around as specified in the announcement, will prevent the image from rebooting.

Having an bootable EBS image which can’t reboot is a quite interesting (and ironic) problem. You have an image which persists, but suddenly you have no way to see what is inside the image anymore because you can’t boot it. Naturally, even if the said bug didn’t exist in the first place, it’s fairly easy to get into such a situation accidentally if you’re fiddling with the image configuration.

So, in this post we’ll see how to recover from a situation where a bootable EBS image can’t boot.

Getting started

To start this up, we’ll boot one of the EBS images which Scott mentioned in his announcement: ami-8bec03e2. As we see in the output of ec2-describe-images, this is an EBS-based image for i386:

% ec2-describe-images ami-8bec03e2
IMAGE ami-8bec03e2 099720109477/ebs/ubuntu-images-testing/ubuntu-lucid-daily-i386-server-20100305 099720109477 available public i386 machine aki-3fdb3756 ebs
BLOCKDEVICEMAPPING /dev/sda1 snap-f1efd098 15

Let’s run this image. Remember to replace the value passed in the -k command line option with your own key pair name.

% ec2-run-instances -k gsg-keypair ami-8bec03e2
RESERVATION r-9e4615f6 626886203892 default
INSTANCE i-e3e33a88 ami-8bec03e2 pending gsg-keypair 0 m1.small 2010-03-09T20:04:12+0000 us-east-1c aki-3fdb3756 monitoring-disabled ebs

There we go. We got an instance allocated in the availability zone us-east-1c. It’s important to keep track of this information, since EBS volumes are zone-specific.

As part of the above command, we must have been allocated an EBS volume automatically, and it should be attached to the instance we just started. We can investigate it with the ec2-describe-volumes command:

% ec2-describe-volumes
VOLUME vol-edca1684 15 snap-f1efd098 us-east-1c in-use 2010-03-09T20:04:20+0000
ATTACHMENT vol-edca1684 i-e3e33a88 /dev/sda1 attached 2010-03-09T20:04:24+0000

Now, we’ll get into the running instance and do some arbitrary modifications, just as a way to demonstrate that the data we don’t want to lose actually survives the recovering operation. Note that the domain name is obtained with the ec2-describe-instances command.

% ssh -i ~/.ssh/id_dsa_gsg-keypair ubuntu@ec2-184-73-51-147.compute-1.amazonaws.com
(…)

ubuntu@domU-12-31-39-0E-A0-03:~$ echo “Important data” > important-data
ubuntu@domU-12-31-39-0E-A0-03:~$ ls -l important-data
-rw-r–r– 1 ubuntu ubuntu 15 Mar 9 20:15 important-data

ubuntu@domU-12-31-39-0E-A0-03:~$ sudo reboot
Broadcast message from ubuntu@domU-12-31-39-0E-A0-03
(/dev/pts/0) at 20:18 …
The system is going down for reboot NOW!

Note that we didn’t actually fix the problem reported by Scott, so our machine won’t really reboot. If we wait a while, we can even see that the problem is exactly what was reported in the announcement (note it really takes a bit for the output to be synced up):

% ec2-get-console-output i-e3e33a88 | tail -4
mount: special device ephemeral0 does not exist
mountall: mount /mnt [294] terminated with status 32
mountall: Filesystem could not be mounted: /mnt

Alright, now what? Machine is dead.. and can’t reboot. How do we get to our important data?

Fixing the problem

The first thing we do is to stop the instance. Do not terminate it, or you’ll lose the EBS volume! After stopping it, we’ll detach the EBS volume that was being used as the root filesystem, so that we can attach somewhere else.

% ec2-stop-instances i-e3e33a88
INSTANCE i-e3e33a88 running stopping

% ec2-detach-volume vol-edca1684
ATTACHMENT vol-edca1684 i-e3e33a88 /dev/sda1 detaching 2010-03-09T20:04:22+0000

Now, we need to attach this volume in an image which actually boots, so that we can fix it. For this experiment, we’ll pick one of the daily Lucid images, but we could use any other working image really. Just remind that the image must be running in the same availability zone as our previous instance, since the EBS volume won’t be accessible otherwise.

% ec2-run-instances -k gsg-keypair -z us-east-1c ami-b5f619dc
RESERVATION r-967427fe 626886203892 default
INSTANCE i-fd08d196 ami-b5f619dc pending gsg-keypair 0 m1.small 2010-03-09T21:10:11+0000 us-east-1c aki-3fdb3756 monitoring-disabled instance-store

% ec2-attach-volume vol-edca1684 -i i-fd08d196 -d /dev/sdh1
ATTACHMENT vol-edca1684 i-fd08d196 /dev/sdh1 attaching 2010-03-09T21:10:51+0000

With the instance running and the EBS root device attached with an alternative device name, we can then login to fix the original problem which prevented the image from booting correctly. In our case, we’ll simply do what Scott suggested in the announcement.

% ssh -i ~/.ssh/id_dsa_gsg-keypair ubuntu@ec2-204-236-194-196.compute-1.amazonaws.com
(…)
$ mkdir ebs-root
$ sudo mount /dev/sdh1 ebs-root
$ sudo sed -i ‘s/^ephemeral0/#ephemeral0/’ ebs-root/etc/fstab
$ sudo umount ebs-root
$ logout
Connection to ec2-204-236-194-196.compute-1.amazonaws.com closed.

Done! Our EBS volume is now correct, and it should boot alright. We’ll detach the volume from the temporary instance we created, and will reattach it back to the old bootable EBS instance which is stopped. Note that we won’t yet terminate the temporary instance, because we may need it in case something else is still wrong, and we are already paying to use it for the hour anyway. We just have to remind ourselves to terminate it once we’re fully done.

% ec2-detach-volume vol-edca1684
ATTACHMENT vol-edca1684 i-fd08d196 /dev/sdh1 detaching 2010-03-09T21:10:51+0000

% ec2-attach-volume vol-edca1684 -i i-e3e33a88 -d /dev/sda1
ATTACHMENT vol-edca1684 i-e3e33a88 /dev/sda1 attaching 2010-03-09T21:24:55+0000

% ec2-describe-volumes vol-edca1684
VOLUME vol-edca1684 15 snap-f1efd098 us-east-1c in-use 2010-03-09T20:04:20+0000
ATTACHMENT vol-edca1684 i-e3e33a88 /dev/sda1 attached 2010-03-09T21:24:55+0000

Okay! It should all be good now. It’s time to restart our instance, and see if it is working. Note that since you stopped and started the instance, the public domain name most probably has changed, and thus we need to find it out again with ec2-describe-instances once the instance is running.

% ec2-start-instances i-e3e33a88
INSTANCE i-e3e33a88 stopped pending

% ec2-describe-instances i-e3e33a88
RESERVATION r-9e4615f6 626886203892 default
INSTANCE i-e3e33a88 ami-8bec03e2 ec2-184-73-72-214.compute-1.amazonaws.com domU-12-31-39-03-B8-21.compute-1.internal running gsg-keypair 0 m1.small 2010-03-09T21:28:43+0000 us-east-1c aki-3fdb3756 monitoring-disabled 184.73.72.214 10.249.187.207 ebs
BLOCKDEVICE /dev/sda1 vol-edca1684 2010-03-09T21:24:55.000Z

% ssh -i ~/.ssh/id_dsa_gsg-keypair ubuntu@ec2-184-73-72-214.compute-1.amazonaws.com
(…)
$ cat important-data
Important data

$ logout
Connection to ec2-184-73-72-214.compute-1.amazonaws.com closed.

It worked, and our important data is still there!

Don’t forget to kill the temporary instance you’ve used to fix it after you’re comfortable with the result:

% ec2-terminate-instances i-fd08d196
INSTANCE i-fd08d196 running shutting-down

Conclusion

Concluding, in this post we have seen how to fix a bootable EBS machine which can’t actually boot. The technique consists of detaching the volume from the stopped instance, attaching it to a temporary instance, fixing the image, and then reattaching it back to the original image. This back and forth of EBS volumes is quite useful in many circumstances, so keep it in your tool belt.

Read more
Gustavo Niemeyer

In the last post, we’ve seen some security issues which exist in the Android password manager gbaSafe version 1.1.0a, by analyzing the security description provided in its web site. As described there, even though the system depends on a “master key” which might be secure, the security of the system is seriously compromised by the encouragement of very weak keys (a few digits only) in what is named an “unlock key”, used to encrypt the master key itself. All of that in an application which claims to strongly protect people’s data from unwanted eyes.

In this post, we will play a bit with the Linux-based Android OS to actually explore these security deficiencies, demonstrating that such issues are very real, and that the claims of being hard to unveil the data is unfounded. Since the most serious weakness lies in the key itself, we’ll run a simple brute force attack to try to find arbitrary unlock keys.

This procedure is actually mentioned by the author of gbaSafe himself in the web page, except he overestimates the work involved in producing such a mechanism:

Theoretically, somebody could write a program that tries to decrypt the master key by trying all possible values of the short key (with 4 digits there are only 10000 possibilities), but this would still be much work, as more information about the crypting algorithm is needed (e.g. salt bytes, iteration count).

So let’s get started.

As a first step, we’ll need the Android SDK with a working emulator (I’ve used API level 5, revision 1), and a copy of the application itself. I got a trial version of the application from AndAppStore.com.

The application downloaded is bundled within that .apk file, which is really a .zip file that may be opened up normally with any tool which understands this file format.

Once that’s done, we get access to all the information needed to run the application, including icons, interface layouts, and most importantly in this case, the bytecode which targets the Dalvik VM. This bytecode is the end result of a sequence of translations which happen when the program’s Java source code is compiled, so that’s what we’ll have to fiddle with to figure details of the application we want to investigate.

The bytecode is located inside the classes.dex file, and as expected it’s not easy to read in its native format. Luckily, though, a smart guy has already written a couple of tools, smali and baksmali, which allow people to decompile and recompile that bytecode format to/from something that is much easier to understand.

After downloading these tools, the following command should decompile the file:

$ java -jar baksmali.jar –output classes classes.dex

We now have a classes/ directory full of .smali files.

Before going any further, let’s ponder for a moment about what we want to do. A brute force attack is when we attempt sequentially many possible keys, and given the context already presented, what we’re looking after is to attempt different “unlock keys”. With that in mind, we’ll introduce a very small modification in the application so that it will attempt to enter the unlock key automatically, rather than reporting an error when the key entered in the unlock dialog is invalid.

With that in mind, after some quick research, it looks like the onClick() method within the DlgUnlock.smali file is a pretty good candidate. This is supposedly called when the button in the unlock dialog is clicked, so something interesting about the password being correct or not must happen there.

Before doing anything there, I’ve increased the number of registers in the function to 12, to get some additional registers to play with, and then initialized a register with the value zero, to serve as a monotonically increasing number (our keys!):

.method public onClick(Landroid/view/View;)V
- .registers 9
+ .registers 12
.parameter “view”
+ const/16 v9, 0×0

Then, we have to instruct the program to use these keys rather than whatever is typed in the dialog box. Some lines down, we’ll see a call to the checkUnlockKey() method, which is certainly what we’re looking after. Let’s do this now:

+ :mykey
+ invoke-static {v9}, Ljava/lang/String;->valueOf(I)Ljava/lang/String;
+ move-result-object v2
invoke-static {v2}, Lcom/gbizapps/safeA/Crypt;->checkUnlockKey(Ljava/lang/String;)I

Now, what if this key is wrong? We don’t want the master key to be removed as mentioned in the software description. We want to simply attempt the next key. With some analysis, we see that in case of errors, the next couple of lines below the above code will instruct the VM to jump to an error branch. Rather than following up with the normal error logic, we’ll increment the key, and jump back to the above code:

:cond_6c
+ add-int/lit8 v9, v9, 0×1
+ goto :mykey

Now we just have to rebundle this and put it into the emulator. I won’t go over it in too much detail here, since there’s plenty of information available online, but the steps to do that are:

  1. Recreate a modified classes.dex with smali
  2. Recreate a modified .apk file by just zipping the modified content
  3. Sign and zipalign the new .apk file
  4. Install it

And that’s it, seriously! This would be enough to break the software security if it was working correctly.

Interestingly, though, the software wasn’t working correctly with this change. Instead, it was Force Closing on certain keys. To test it out, use the master key “master key”, and the unlock key “999999″, and then once you close and open the application again, try to unlock it with the key “1175″. Instead of showing an error message, it will break badly.

Now, for the proof of concept to work, I actually had to fix the bug, which felt a bit funny to do given the context.

Looking at the traceback trough adb logcat, I found out that there was a null being dereferenced in the file Crypt.smali, so I fixed the problem by injecting some error checking at this position and jumping the flow into an existing error branch:

+ if-eqz v3, :cond_5a
const-string v4, “ucpmhkexov85MDKhdfdfFGQPYxywq7209fcrqhghjkuiopy”

With this in place came the biggest surprise of the experiment. The keys which were crashing the application were special, in the sense that they actually decode the master key successfully! That’s right: whatever the algorithm is doing, that six-digit “999999″ encrypts the master key in such a way that attempting the “1175″ key works, so even big keys are rendered extremely weak with the logic used to encrypt the master key.

At this point, I added some trivial logic to display the key found with a Toast, just to ensure the whole thing was working correctly:

Toast displaying unlock key found

Note that the key generation implemented above is a bit simplistic, in the sense that it doesn’t attempt keys with leading zeros, but this would be trivial to implement, and my intention here isn’t to actually break any keys for real, but just to show how the promised security in this application is not to be trusted at all. Just the logic above will already be enough for a brute force attack against the application, and has broken all the keys I’ve tried in mere seconds, in a slow emulator.

As a conclusion, if you want to put your data in a secure place, rather than picking an application which promises security because the salt is hidden somewhere or because it’s too much work to figure its logic, pick an open source application with logic which is publicly verifiable and has already had many eyes over it. Chances are that doing something like what was described in this post won’t be so trivial. Then, choose your keys wisely! The most secure application won’t be enough if you pick a bad key.

Read more
Gustavo Niemeyer

For some time now I’ve been wanting to research more deeply about the internals of Android. Until now, though, this was just a sentiment. Then, a couple of weeks ago I’ve finally managed to replace my iPhone for an Android phone, and that was the final motivator for me to actually get into learning more about the inner workings of the Linux-based OS.

Now, I just had to pick an actual task for digging into. The Dalvik VM is certainly one of the most innovative and advertised technical details about the OS, so something around it would be a nice start.. some kind of bytecode fiddling perhaps, but what? Luckily, even without trying too hard, I eventually stumbled upon an interesting case for researching upon.

The “victim” of this research is the application gbaSafe version 1.1.0a, which claims to protect user passwords using unbreakable algorithms (how’s that for a hint of a Snake oil case?).

Before we get into some hacking, let’s see some words on the software security by the author himself, and then render some analysis on conceptual issues on it:

The confidential data can only be decrypted if the master key is known. You should choose a long key (at least 16 characters) with mixed case and unreadable text. Of course you cannot enter this key each time you want to access the confidential data, so it is stored in the user settings encrypted with a shorter key (4 to 6 digits) and normally you only have to enter this unlock key. Theoretically it is possible to try all possible values (brute force attack), but then you must use another program, since gbaSafe deletes the encrypted master key from the user settings when you enter the unlock key wrong three times repeatedly, and then you must enter the master key. If you wrote a program to decrypt the master key, you would have to know the algorithm used, the salt bytes and iteration count (used to augment the short unlock key), which are very hard to extract from the binary program module gbaSafe.

If you have some security background, I’m sure that by now you’re already counting the issues on this single paragraph.

The most obvious issue is the fact that there’s a “strong key” and a “weak key”, and the strong key is encrypted with the weak one. This is a very common cryptography sin, as would say my friend and coworker Andreas Hasenack (a security researcher himself). A security system is only as secure as its weakest spot. It obviously makes little difference for an attacker if he has to attempt decrypting a master key or the actual data, since decrypting the master key will give access to the data.

Then, it mentions en passant that the software enforces the use of digits for the weak key. This ensures that the weak key is really weak! Four digits is basically ten thousand attempts, which is absolutely nothing for nowadays’s hardware. This number would move up to about 15 million by simply allowing upper and lowercase letters as well (which isn’t great either, but a few orders of magnitude never hurt in this scenario).

It follows up encouraging people to think that it’s actually hard to figure the algorithm and other implementation details. Considering that there’s absolutely nothing preventing people from getting their hands in the implementation itself, this is in fact asserting that the security mechanism is based on the ignorance of the attacker. Counting on the ignorance of people is bad at all times, and in a security context it’s a major error.

There’s a final security issue in this description which is a bit more subtle, but further analysis on the logic used leaves no doubt. In cryptography, the salt is supposed to increase the work needed in a brute force attack by strengthening the number of bits of the actual passphrase, in a case where the salt is actually unavailable, or at least prevent that a single large word dictionary can be used to attack several encryptions or hashes at once, in a case where the salt is known but variable. In the latter case, it helps because encrypting a single key with two different salts must be done twice, rather than once, so it increases the computational task when attacking multiple items. A salt which is known and does not change across all processed items is worth pretty close to nothing.

So, indeed, considering the many security issues here, this isn’t something I’d store my passwords or credit card numbers on, and I suggest you don’t do it either.

In my next post on this topic I’ll actually implement a trivial brute force attack to prove that these issues are very real, and that, actually, it’s not even hard to break into a security system like this.

The application author has been contacted about this blog post, since he’ll likely want to fix some of these issues.

Read more
Gustavo Niemeyer

Some interesting changes have been happening in my professional life, so I wanted to share it here to update friends and also for me to keep track of things over time (at some point I will be older and will certainly laugh at what I called “interesting changes” in the ol’days). Given the goal, I apologize but this may come across as more egocentric than usual, so please feel free to jump over to your next blog post at any time.

It’s been little more than four years since I left Conectiva / Mandriva and joined Canonical, in August of 2005. Shortly after I joined, I had the luck of spending a few months working on the different projects which the company was pushing at the time, including Launchpad, then Bazaar, then a little bit on some projects which didn’t end up seeing much light. It was a great experience by itself, since all of these projects were abundant in talent. Following that, in the beginning of 2006, counting on the trust of people which knew more than I did, I was requested/allowed to lead the development of a brand new project the company wanted to attempt. After a few months of research I had the chance to sit next to Chris Armstrong and Jamu Kakar to bootstrap the development of what is now known as the Landscape distributed systems management project.

Fast forward three and a half years, in mid 2009, and Landscape became a massive project with hundreds of thousands of very well tested lines, sprawling not only a client branch, but also external child projects such as the Storm Object Relational Mapper, in use also by Launchpad and Ubuntu One. In the commercial side of things it looks like Landscape’s life is just starting, with its hosted and standalone versions getting more and more attention from enterprise customers. And the three guys which started the project didn’t do it alone, for sure. The toy project of early 2006 has grown to become a well structured team, with added talent spreading areas such as development, business and QA.

While I wasn’t watching, though, something happened. Facing that great action, my attention was slowly being spread thinly among management, architecture, development, testing, code reviews, meetings, and other tasks, sometimes in areas not entirely related, but very interesting of course. The net result of increased attention sprawl isn’t actually good, though. If it persists, even when the several small tasks may be individually significant, the achievement just doesn’t feel significant given the invested effort as a whole. At least not for someone that truly enjoys being a software architect, and loves to feel that the effort invested in the growth of a significant working software is really helping people out in the same magnitude of that investment. In simpler words, it felt like my position within the team just wasn’t helping the team out the same way it did before, and thus it was time for a change.

Last July an external factor helped to catapult that change. Eucalyptus needed a feature to be released with Ubuntu 9.10, due in October, to greatly simplify the installation of some standard machine images.. an Image Store. It felt like a very tight schedule, even more considering that I hadn’t been doing Java for a while, and Eucalyptus uses some sexy (and useful) new technology called the Google Web Toolkit, something I had to get acquainted with. Two months looked like a tight schedule, and a risky bet overall, but it also felt like a great opportunity to strongly refocus on a task that needed someone’s attention urgently. Again I was blessed with trust I’m thankful for, and by now I’m relieved to look back and perceive that it went alright, certainly thanks to the help of other people like Sidnei da Silva and Mathias Gug. Meanwhile, on the Landscape side, my responsibilities were distributed within the team so that I could be fully engaged on the problem.

Moving this forward a little bit we reach the current date. Right now the Landscape project has a new organizational structure, and it actually feels like it’s moving along quite well. Besides the internal changes, a major organizational change also took place around Landscape over that period, and the planned restructuring led me to my current role. In practice, I’m now engaging into the research of a new concept which I’m hoping to publish openly quite soon, if everything goes well. It’s challenging, it’s exciting, and most importantly, allows me to focus strongly on something which has a great potential (I will stop teasing you now). In addition to this, I’ll definitely be spending some of that time on the progress of Landscape and the Image Store, but mostly from an architectural point of view, since both of these projects will have bright hands taking care of them more closely.

Sit by the fireside if you’re interested in the upcoming chapters of that story. ;-)

Read more