Canonical Voices

Gustavo Niemeyer

Continuing the sequence of experiments I’ve been running with the Go language, I’ve just made available a tiny but useful new package: gommap. As one would imagine, this new package provides access to low-level memory mapping for files and devices, and writing it allowed me to explore a few new edges of the language implementation. Note that, strictly speaking, some of the details ahead are really more about the implementation than the language itself.

There were basically two main routes to follow when implementing support for memory mapping in Go. The first one is the way higher-level languages usually handle it. In Python, for instance, this is how one may use a memory mapped file:

>>> import mmap
>>> file = open("/etc/passwd")
>>> mm = mmap.mmap(file.fileno(), 0, prot=mmap.PROT_READ)
>>> mm[0:4]
'root'

The way this was done has an advantage and a disadvantage which are perhaps not entirely obvious at first look. The advantage is that the memory mapped area is truly hidden behind that interface, so any improper attempt to access a region which was already unmapped, for instance, may be blocked within the application with a nice error message which explains the issue. The disadvantage, though, is that this interface usually comes with the restriction that normal libraries can only use the memory region by copying data out of it. In the above example, for instance, the “root” string isn’t backed by the original mapped memory anymore, but is rather a copy of its contents (see PEP 3118 for a way to improve this aspect a bit in Python).

The other path, which can be done with Go, is to back a normal native array type with the allocated memory. This means that normal libraries don’t need to copy data out of the mapped memory, or to use a special memory saving interface, to deal with the memory mapped region. As a simple example, this would get the first line in the given file:

mmap, err := gommap.Map(file.Fd(), gommap.PROT_READ, gommap.MAP_PRIVATE)
if err == nil {
    end := bytes.Index(mmap, []byte{'\n'})
    firstLine := mmap[:end]
}

In the procedure above, mmap is a value whose type is a plain alias for a native []byte slice, so even though the standard bytes package was used, at no point was the data from the memory mapped region copied out, nor were any auxiliary buffers allocated, so this is a very fast operation. To give an idea about this, let’s pretend for a moment that we want to increase a simple 8 bit counter in a file. This might be done with something as simple as:

mmap[13] += 1

This line of code would be compiled into something similar to the following assembly (amd64):

MOVQ    mmap+-32(SP),BX
CMPL    8(BX),$13
JHI     ,68
CALL    ,runtime.panicindex+0(SB)
MOVQ    (BX),BX
INCB    ,13(BX)

As you can see, this is just doing some fast index checking before incrementing the value directly in memory. Given that one of the important reasons why memory mapped files are used is to speed up access to disk files (sometimes large disk files), this advantage in performance is actually meaningful in this context.

Unfortunately, though, doing things this way also has an important disadvantage, at least right now. There’s no way at the moment to track references to the underlying memory, which was allocated by means not known to the Go runtime. This means that unmapping this memory is not a safe operation. The munmap system call will simply take the references away from the process, and any further attempt to touch those areas will crash the application.

To give you an idea about the background “magic” which is going on to achieve this support in Go, here is an interesting excerpt from gommap’s wrapping of the underlying mmap syscall as of this writing:

addr, _, errno := syscall.Syscall6(syscall.SYS_MMAP, (...))
(...)
dh := (*reflect.SliceHeader)(unsafe.Pointer(&mmap))
dh.Data = addr
dh.Len = int(length)
dh.Cap = dh.Len

As you can see, this is taking apart the slice value into its constituent structure, and altering it to point to the mapped memory, including information about the length mapped, so that the bounds checking observed in the assembly above will work correctly.

In case the garbage collector is at some point extended to track references to these foreign regions, it would be possible to implement some kind of UnmapOnGC() method which would only unmap the memory once the last reference is gone. For now, though, the advantages of being able to reference memory mapped regions directly, at least to me, outweigh the danger of having improper slices of the given region used after unmapping. Also, I expect that usage of this kind of functionality will generally be encapsulated within higher level libraries, so it shouldn’t be too hard to keep the constraint in mind while using it this way.

For those reasons, gommap was implemented with the latter approach. In case you need memory mapping support for Go, just move ahead and goinstall launchpad.net/gommap.
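
To tie the pieces above together, here is a rough sketch of what a small program using gommap might look like. This is illustrative only: the exact constant names, the Map signature, and the UnsafeUnmap method are assumptions based on the snippets above, and may differ between releases.

package main

import (
    "bytes"
    "fmt"
    "os"

    "launchpad.net/gommap"
)

func main() {
    file, err := os.Open("/etc/passwd")
    if err != nil {
        panic(err)
    }
    defer file.Close()

    // Map the whole file read-only; the result behaves as a plain []byte.
    mmap, err := gommap.Map(file.Fd(), gommap.PROT_READ, gommap.MAP_PRIVATE)
    if err != nil {
        panic(err)
    }

    // Slice the mapped region directly; no data is copied out of it.
    end := bytes.IndexByte(mmap, '\n')
    fmt.Printf("First line: %s\n", mmap[:end])

    // Unmapping invalidates every slice into the region, so it must be
    // the very last thing done with the mapped data.
    mmap.UnsafeUnmap()
}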

UPDATE (2010-12-02): The interface was updated so that mmap itself is the slice, rather than mmap.Data, and this post was changed to reflect this.

Gustavo Niemeyer

A while ago Martin Pool made a very interesting post on the design of interfaces, inspired by a talk given by Rusty Russell in 2003.

Besides the interesting scale of interface quality explained there, this is a very insightful comment, often overlooked:


Once the code gets too big for one person, it’s all about damage control. Interfaces make damage control possible… except when the interfaces themselves are the problem.

Designing a system as a group of people requires splitting tasks up among team members, and/or the community, and perhaps even separate teams. Interfaces are the touch points of that splitting, and are what represent the functionality offered by each module/library/file/command/service/whatever. Too often, people spend a long time working on the implementation details, thinking really deeply about how to obtain the desired behavior, and forget to clearly define the interface to that behavior.

Having good interfaces is a key aspect of software development, and getting them right offers a number of important benefits:

Good encapsulation

Having good encapsulation is pretty much a synonym for having good interfaces. Too often, though, people focus on the encapsulation of the small pieces (the functions, the classes, etc), and forget about the encapsulation of the larger blocks (the libraries, modules, packages, commands, etc).

Also, in my experience trying to encourage good architectures, I have found that stating “We need good encapsulation!” gives developers no tangible line of action. It reminds me of a parent telling the child “You should be responsible!”. Sure, encapsulation and responsibility both sound great, but.. what does that really mean?

When inviting developers to think about the interfaces of the system parts they are responsible for, encapsulation becomes a natural outcome. It’s clear that there must be a line drawn between that part of the system and the rest, and the shape of this line must be considered while (or even better, before) the behavior is implemented.

Given well designed interfaces, the additional requirement of only using other parts of the system through their public interfaces seals the achievement of good encapsulation. Ideally, this barrier would be a natural property of the language used to develop the system (see the interface quality scale in Martin’s post). In other cases, this must be achieved through conventions, agreements, and good documentation.

Improved scope and communication

By inviting developers to think about the interfaces of the parts they are responsible for, one is basically encouraging them to consider the interaction between those pieces and the rest of the system. This process gives an interesting perspective, both in terms of the external expectations (what do I need to offer other people?) and the internal goals (what do I need to implement to satisfy what other people need?).

Besides helping people figure out the scope and goal of the piece being developed, this will also give a nice structure to some of the communication which must inevitably happen to correctly integrate the separate parts of the system being developed.

Improved testing and experimentation

If an interface is well designed and defined, and encapsulates part of the functionality of the system well, it significantly improves the testing and experimentation related to that part of the system. Again, this has an effect both internally and externally to the interface.

Internally in the sense that there’s a clear boundary between the part in development and the rest of the system, so it should be easier to verify that the bits which compose it are working according to plan without dragging in the whole system, and also to verify that the interface itself is behaving as intended (and hopefully as documented).

Externally in the sense that, given that there’s agreement regarding what is the public interface to the part being considered, one may easily provide a test double (a fake, or dummy, or mock) to simulate that part of the system. This is well known to be useful in a number of ways:

  • Dependent work may be run in parallel by different people
  • The real implementation backing the given interface may be postponed until the idea is proven useful and the interface feels suitable
  • External systems which would be hard to run locally may be simulated so that tests run fast and cheap, even without network connections
  • Faults may be injected in the system via the test doubles to verify behavior in hostile conditions

and so on.
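
As a minimal sketch of the idea, in Go and with purely illustrative names, the code under test depends only on a small interface, and a trivial test double can stand in for the real implementation:

package notify

// MessageSender is the public interface through which the rest of the
// system reaches this piece of functionality.
type MessageSender interface {
    Send(to, text string) error
}

// FakeSender is a test double: it records calls and can inject faults,
// so dependent code may be tested without the real system behind it.
type FakeSender struct {
    Sent []string
    Err  error
}

func (f *FakeSender) Send(to, text string) error {
    if f.Err != nil {
        return f.Err
    }
    f.Sent = append(f.Sent, to+": "+text)
    return nil
}

// NotifyAll depends only on the interface, never on a concrete type,
// so it runs unchanged against the real sender or the fake one.
func NotifyAll(sender MessageSender, people []string, text string) error {
    for _, person := range people {
        if err := sender.Send(person, text); err != nil {
            return err
        }
    }
    return nil
}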

Quality isolation

This point is also my understanding of what Rusty refers to as damage control in his talk. This property is very useful when designing a system, but even then it’s often missed when discussing interfaces and encapsulation.

If there’s a well defined interface to a piece of functionality in the system, and that interface was carefully considered to cover the needs of the system, the implementation of that interface may not start as the most beautiful, or most scalable, or even most reliable piece of software. As any developer responsible for a successful startup will happily point out, a half-baked implementation is often good enough to get things going, prove the concept, and extend the project runway.

Good interfaces play an important role in this kind of situation. They are, in this sense, a way to be better prepared for success (or, for failure, depending on the perspective). If the interface implementation suddenly becomes an issue for whatever reason, the implementation itself may be replaced by something which better suits the current reality, while preserving the interaction with the rest of the system.

Of course, it’s still very hard to predict future system behavior when facing a completely different reality. Changing the scale requirements for the system by a few orders of magnitude, for instance, may easily break existing assumptions, and interfaces designed around these assumptions. Still, even if good interfaces won’t be enough to avoid modifications in the architecture and integration points in many cases, they will certainly help frame the conversations which will take place when this happens and new interfaces must be developed.

Conclusion

When developing non-trivial software products, there’s no other way but to split the problem solving into several layers and components. Looking at the points where these layers and components touch each other is a very useful and natural way to organize conversations and structure the work which must take place to push the product forward.

It’s quite revealing to look at the points above, and note that it’s not simply the existence of interfaces themselves which presents the advantages described, but the process which they encourage around them. Software architecture is essentially about people.

Gustavo Niemeyer

It’s time to release my “side project” which has been evolving over the last several months: Gocheck. I’ve been watching Go for some time, and have been getting more and more interested in the language. My first attempt to write something interesting in it made it obvious that there would be benefit in having a richer testing platform than what is available in the standard library. That said, I do understand why the standard one is slim: it has to stay minimalist, since it’s used to test the rest of the platform itself. With Gocheck, though, I don’t have that requirement. I’m able to trust that the standard library works well, and focus on having features which will make me more productive while writing tests, including features such as:

  • Better error reporting
  • Richer test helpers: assertions which interrupt the test immediately, deep multi-type comparisons, string matching, etc
  • Suite-based grouping of tests
  • Fixtures: per suite and/or per test set up and tear down
  • Management of temporary directories
  • Panic-catching logic, with proper error reporting
  • Proper counting of successes, failures, panics, missed tests, skips, etc
  • Support for expected failures
  • Fully tested (yes, it manages to test itself reliably!)
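
To give a flavor of how this looks in practice, here is a rough sketch of a small suite. The helper names (TestingT, Suite, Equals, Matches, MkDir) reflect my reading of the package’s API, so treat this as illustrative rather than authoritative:

package example_test

import (
    "testing"

    . "launchpad.net/gocheck"
)

// Hook the gocheck runner into the standard testing infrastructure.
func Test(t *testing.T) { TestingT(t) }

// A suite groups related tests and may hold fixture state.
type MySuite struct {
    dir string
}

var _ = Suite(&MySuite{})

// SetUpTest runs before each test method in the suite.
func (s *MySuite) SetUpTest(c *C) {
    s.dir = c.MkDir() // temporary directory managed by the framework
}

func (s *MySuite) TestArithmetic(c *C) {
    c.Assert(6*7, Equals, 42)
    c.Check("gocheck", Matches, "go.*")
}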

That last point was actually quite fun to get right. It’s the first time I wrote a testing framework from the ground up, and of course I wanted to have it fully tested by itself, but I didn’t want to simply use a foreign testing framework to test it. So what it does is basically to have a “bootstrapping” phase, which ensures that the very basic parts of the library work without relying on pretty much any internal functionality (e.g. it verifies the number of executed functions, and works with low-level panics). Then, once the lower layers are trusted, tests for higher-level functionality were introduced by building on the trusted bits.

Gocheck has actually been mostly ready for some time now, but I’ve been polishing edges with some real world usage before releasing it. Since both the real world usage and Gocheck itself are side projects, you can imagine that took a bit of time. Today, though, I’ve managed to fix the last few things which were bothering me, so it’s up for world consumption.

I hope you enjoy it, and make some good use of it so that we can all have more reliable software. ;-)

Gustavo Niemeyer

When I started programming in Python long ago, one of the features which really hooked me was the quality of the interactive interpreter offered with the language implementation. It was (and still is) a fantastic way to experiment with syntax, semantics, modules, and whatnot. So much so that many first-class Python practitioners will happily tell you that the interactive interpreter is used not only as a programming sandbox, but many times as their personal calculator too. This kind of interactive interpreter is also known as a REPL, standing for Read Eval Print Loop, and many languages have pretty advanced choices in that area by now.

After much rejoicing over Python’s REPL, though, and as a normal human being, I’ve started wishing for more. The problem has a few different levels, which are easy to understand.

First, we’re using Python Twisted in Ensemble, one of the projects being pushed at Canonical. Twisted is an event-driven framework, which among other things means it works a lot with closures and callbacks. Having to redefine multi-line functions frequently to drive experiments isn’t exactly fun in a line-based interactive interpreter. Then, some of the languages I’ve started playing with, such as Erlang, have limited REPLs which differ significantly in functionality from what may be done in a text file. And finally, other languages I’ve been programming with recently, such as Go, lack a reasonable REPL altogether (there are only unusable hacks around).

Alright, so here is the idea: what if, instead of being given an interactive REPL, you were presented with your favorite text editor, and whenever you saved the file, it was executed and the results presented? That’s The Hacking Sandbox, or hsandbox. It supports 11 different programming languages out of the box, and given its nature it should be trivial to support any other language.

Here is a screenshot to clarify the idea:

Note that if you open a sandbox for a language like C or Go, the skeleton of what’s needed to run a program will already be in place, so you just have to “fill the blanks”.

For more details and download information, please check the hsandbox web page.

Gustavo Niemeyer

I’ve just read a post by Brett Cannon where, basically, he complains about complainers.

If you don’t know who Brett is, you’re probably not a heavy Python user. Brett is a very important Python core developer who has been around for a while and who does a great job at it. His post, though, makes me a bit sad.

Brett points out that there are two types of personalities which do not contribute to open source. The first one he defines as:

The first type is the “complainer”. This is someone who finds something they don’t like, points out that the thing they don’t like is suboptimal, but then offers no solutions.

And the second one is defined as:

(…) This is someone who, upon finding out about a decision that they think was sub-optimal, decides to bring up new ideas and solutions. The person is obviously trying to be helpful by bringing up new ideas and solutions, thinking that the current one is simply going to flop and they need to stop people from making a big mistake. The thing is, this person is not helping. (…)

This, in itself, is already shortsighted. If you’re tired of hearing the same arguments again and again for 10 years, from completely different people, there’s a pretty good chance that there’s an actual issue with your project, and your users are trying in their way to contribute and interact with you in the hope that it might get fixed.

This is really important: they are people who use your project and are trying to improve it. If you can’t stand that, you should stop maintaining an open source project now, or pick something which no one cares about.

The other issue which took my attention in his post is his example: the Python GIL. Look at the way in which Brett dismisses the problem:

(I am ignoring the fact that few people write CPU-intensive code requiring true threading support, that there is the multiprocessing library, true power users have extension modules which do operate with full threading, and that there are multiple VMs out there with a solution that have other concurrency solutions)

Brett, we can understand that the GIL is hard to remove, but it’s a fundamental flaw in the most important Python implementation, and being dismissive about it will either draw further complaints at you, or will simply drive users away from the language entirely.

I can understand why you think this way, though. Guido has presented the same kind of feeling about the GIL for a very long time. Here is one excerpt from a mail thread about it:

Nevertheless, you’re right the GIL is not as bad as you would initially think: you just have to undo the brainwashing you got from Windows and Java proponents who seem to consider threads as the only way to approach concurrent activities.

Just Say No to the combined evils of locking, deadlocks, lock granularity, livelocks, nondeterminism and race conditions.

I apologize, but I have a very hard time reading this and not complaining.

In my world, the golden days of geometric growth in vertical processing power are over, multi-processor machines are here to stay, and the amount of traffic flowing through networks is only increasing. It feels reasonable to desire a less naïve approach to deal with real world problems, such as executing tasks concurrently.

I actually would love to not worry about things like non-determinism and race conditions, and would love even more to have a programming language which helps me with that!

Python, though, has a Global Interpreter Lock (yes, I’m talking about CPython, the most important interpreter). Python programs execute in sequence. No Fork/Join frameworks, no coroutines, no lightweight processes, nothing. Your Python code will execute in sequence if it lives in the same process space.

The answer from Brett and Guido to concurrency? Develop your code in C, or write your code to execute in multiple processes. If they really want people to get rid of non-determinism, locking issues, race conditions, and so on, they’re not helping at all.

I know this is just yet another complaint, though. I honestly cannot fix the problem either, and would rather just talk about it in the hope that someone who’s able to do it will take care of it. That said, I wish that the language maintainers would do the same, and tell the world that it’s an unfortunate problem, and that they wished someone else would go there and fix it! If, instead, maintainers behave in a ridiculously dismissive way, like Guido did in that mail thread, and like Brett is doing in his post, the smart people that could solve the problem get turned away. People like to engage with motivated maintainers.. they like to solve problems that others are interested in seeing solved.

Perhaps agreeing with the shortcomings won’t help, though, and no one will show up to fix the problem either. But then, at least users will know that the maintainers are on the same side of the fence, and the hope that it will get fixed survives. If the maintainers just complain about the users who complain, and dismiss the problem, users are put in an awkward position. I can’t complain.. I can’t provide ideas or solutions.. I can’t fix the problem.. they don’t even care about the problem. Why am I using this thing at all?

Would you rather have users, or have no complainers?

Gustavo Niemeyer

The patch submitted last weekend to support Launchpad and Bazaar in goinstall went in!

This means that once the next release of Go is out (or you sync up with the tip code) you’ll be able to host modules for Go in Launchpad, and people will be able to install them by running something similar to:

goinstall launchpad.net/yourproject

Then, you can simply use that in your source code as:

import "launchpad.net/yourproject"

The following URLs are supported:

import "launchpad.net/project"
import "launchpad.net/project/series"
import "launchpad.net/project/series/sub/directory"

import "launchpad.net/~user/project/branch"
import "launchpad.net/~user/project/branch/sub/directory"

import "launchpad.net/~user/+junk/branch"
import "launchpad.net/~user/+junk/branch/sub/directory"

Where you see sub/directory above, it means that the module to be compiled, installed, and used may live inside a subdirectory of the Bazaar branch itself. This is a convention used in all supported backends of goinstall.
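
For instance, assuming a hypothetical project named myproject with a Go package inside the stats subdirectory of its trunk series, usage would look something like the sketch below (the package and its Mean function are made up for illustration):

package main

import (
    "fmt"

    "launchpad.net/myproject/trunk/stats"
)

func main() {
    // Mean is a made-up function standing in for whatever the
    // hypothetical stats package would actually export.
    fmt.Println(stats.Mean([]float64{1, 2, 3}))
}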

Gustavo Niemeyer

After a few years in development, version 1.0 of Mocker is now available! Check out the changes since 0.10.1, the supported features, or go straight to the download page.

Gustavo Niemeyer

A bit of history

I don’t know exactly why, but I’ve always enjoyed IRC bots. Perhaps it’s the fact that it emulates a person in an easy-to-program way, or maybe it’s about having a flexible and shared “command line” tool, or maybe it’s just the fact that it helps people perceive things in an asynchronous way without much effort. Probably a bit of everything, actually.

My bot programming started with pybot many years ago, when I was still working at Conectiva. Besides having many interesting features, this bot eventually fell into an abandonware state, since Canonical already had pretty much equivalent features available when I joined, and I had other interests which got in the way. The code was a bit messy as well.. it was a time when I wasn’t very used to testing software properly (a friend has a great excuse for that kind of messy software: “I was young, and needed the money!”).

Then, a couple of years ago, while working on the Landscape project, there was an opportunity to make some information more visible to the team. Coincidentally, it was also a time when I wanted to get some practice with the concepts of Erlang, so I decided to write a bot from scratch with some nice support for plugins, just to get a feeling of how the promised stability of Erlang actually played out in practice. This bot is called mup (Mup Pet, more formally), and its code is available publicly through Launchpad.

This was a nice experiment indeed, and I did learn quite a bit about the ins and outs of Erlang with it. Somewhat unexpected, though, was the fact that the bot grew a few extra features which multiple teams in Canonical started to appreciate. This was of course very nice, but it also made it more obvious that the egocentric reason for having a bot written in Erlang would now hurt, because most of Canonical’s own coding is done in Python, and that’s what internal tools should generally be written in for everyone to be able to contribute and help maintain the code.

That’s where the desire to migrate mup back onto a Python-based brain came from, and having a new feature to write was the perfect motivator for it.

LDAP and two-way SMSing over IRC

Canonical is a very distributed company. Employees are distributed over dozens of countries, literally. Not only that, but most people also work from their homes, rather than in an office. Many different countries also means many different timezones, and working from home with people from different timezones means flexible timing. All of that means communication gets… well.. interesting.

How do we reach someone that should be in an online meeting and is not? Or someone that is traveling to get to a sprint? Or how can someone that has no network connectivity reach an IRC channel to talk to the team? There are probably several answers to these questions, but one of them is of course SMS. It’s not exactly cheap if we consider the cost of the data being transferred, but pretty much everyone has a mobile phone which can do SMS, and the model is not that far away from IRC, which is the main communication system used by the company.

So, the itch was itching. Let’s scratch it!

Getting the mobile phone of employees was already a solved problem for mup, because it had a plugin which could interact with the LDAP directory, allowing people to do something like this:

<joe> mup: poke gustavo
<mup> joe: niemeyer is Gustavo Niemeyer <…@canonical.com> <time:…> <mobile:…>

This just had to be migrated from Erlang into a Python-based brain for the reasons stated above. This time, though, there was no reason to write something from scratch. I could even have used pybot itself, but there was also supybot, an IRC bot which started around the same time I wrote the first version of pybot, and whose author, unlike me, was much more diligent in evolving it. There is quite a comprehensive list of plugins for supybot nowadays, including means for testing plugins and so on. The choice of using it was straightforward, and getting “poke” support ported into a plugin wasn’t hard at all.

So, on to SMSing. Canonical already had a contract with an SMS gateway company, established to test-drive some ideas on Landscape. With the mobile phone numbers coming out of the LDAP directory in hand and an SMS contract established, all that was needed was a plugin for the bot to talk to the SMS gateway. That “conversation” with the SMS gateway allows not only sending messages, but also receiving SMS messages which were sent to a specific number.

In practice, this means that people who are connected to IRC can very easily deliver an SMS to someone using their nicks. Something like this:

<joe> @sms niemeyer Where are you? We’re waiting!

And this would show up in the mobile screen as:

joe> Where are you? We’re waiting!

In addition to this, people who have no connectivity can also contact individuals and channels on IRC, with mup working as a middle man. The message would show up on IRC in a similar way to:

<mup> [SMS] <niemeyer> Sorry, the flight was delayed. Will be there in 5.

The communication from the bot to the gateway happens via plain HTTPS. The communication back is a bit more complex, though. There is a small proxy service deployed in Google App Engine to receive messages from the SMS gateway. This was done to avoid losing messages when the bot itself is taken down for maintenance. The SMS gateway doesn’t handle this case very well, so it’s better to have something which will be up most of the time buffering messages.

A picture is worth 210 words, so here is a simple diagram explaining how things got linked together:

This is now up for experimentation, and so far it’s working nicely. I’m hoping that in the next few weeks we’ll manage to port the rest of mup into the supybot-based brain.

Gustavo Niemeyer

Released editmoin 1.15

Version 1.15 of editmoin is now available.

The following changes were made:

  • Moin used to work with numerical IDs for identification, and editmoin was still based on this model. This release adds support for direct authentication as available in current Moin releases. This was inspired by Reimar Bauer.
  • The new file ~/.moin_users is now parsed to obtain usernames, supporting the feature above. Shortcuts are also supported in this file.
  • Added support for textcha question handling.

Gustavo Niemeyer

I was just rambling randomly yesterday, on the usual microblogging platforms, about how result checking seems to be ignored or done badly. The precise wording was:

It’s really amazing how little attention error handling receives in most software development. Even *tutorials* often ignore it.

It indeed does amaze me. It sometimes feels like we write code for theoretical perfect worlds.. “If the processor executes exactly in this order, and the weather is calm, this program will work.”. There are countless examples of bad assumptions.. someday I will come up with some statistics of the form “Every N seconds someone forgets to check the result of write().”.

If you are a teacher, or a developer who enjoys writing snippets of code to teach people, please join me in the quest of building a better future. Do not tell us that you’re “avoiding result checking for terseness”, because that’s exactly what people will do (terseness is good, right?). On the contrary, take this chance to make us feel bad about avoiding result checking. You might do this by putting a comment like “If you don’t do this, you’re a bad programmer.” right next to the logic which is handling the result, and take the chance to teach people how proper result handling is done.
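
As a minimal illustration, and just a sketch in Go rather than anything tied to a particular tutorial, properly handling the result of a write costs only a couple of extra lines, and those lines are exactly what teaching material should show:

package main

import (
    "fmt"
    "os"
)

func main() {
    f, err := os.Create("/tmp/example.txt")
    if err != nil {
        fmt.Fprintf(os.Stderr, "create failed: %v\n", err)
        os.Exit(1)
    }
    // Write reports both how much was written and whether it failed;
    // ignoring either silently drops problems such as a full disk.
    n, err := f.Write([]byte("some data\n"))
    if err != nil {
        fmt.Fprintf(os.Stderr, "write failed after %d bytes: %v\n", n, err)
        os.Exit(1)
    }
    // Close can fail too, and its error deserves the same attention.
    if err := f.Close(); err != nil {
        fmt.Fprintf(os.Stderr, "close failed: %v\n", err)
        os.Exit(1)
    }
}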

Of course, there’s another forgotten art related to result checking. It sits on the other side of the fence. If you are a library author, do think through how you plan to make us check conditions which happen inside your library, and try to imagine how to make our lives easier. If we suck at handling results when there are obvious ways to handle them, you can imagine what happens when you structure your result logic badly.

Here is a clear example of what not to do, coming straight from Python’s standard library, in the imaplib module:

    def login(self, user, password):
        typ, dat = self._simple_command('LOGIN', user, self._quote(password))
        if typ != 'OK':
            raise self.error(dat[-1])
        self.state = 'AUTH'
        return typ, dat

You see the problem there? How do you handle errors from this library? Should we catch the exception, or should we verify the result code? “Both!” is the right answer, unfortunately, because the author decided to do us a little favor and check the error condition himself in some arbitrary cases and raise the error, while letting it go through and end up in the result code in a selection of other arbitrary cases.

I may provide some additional advice on result handling in the future, but for now I’ll conclude with the following suggestion: please check the results from your actions, and help others to check theirs. That’s a good life-encompassing recommendation, actually.

Gustavo Niemeyer

In a hurry?

Go check it out!

The context

A while ago I found out about Sikuli, a very interesting project which allows people to script actions in GUIs based on screenshot excerpts. The idea is that you basically take images representing portions of your screen, like a button, or a label, or an icon, and then create a script which can detect a position on the screen which resembles one of these images, and perform actions on it, such as clicking or hovering.

I had never imagined something like this, and the idea got me really excited about the possibilities. Imagine, for instance, what can be done in terms of testing. Testing of GUIs is unfortunately still not a trivial task nowadays. We do have frameworks which are based on accessibility hooks, for instance, but these sometimes can’t be used because the hook is missing, or is too far removed from the context being tested (imagine testing that a browser can open a specific flash site successfully, for instance).

So, Sikuli opened my eyes to the possibility of using image matching technology in a GUI automation context, and I really wanted to play with it. In the days following the discovery, I fiddled a bit, communicated with the author, and even submitted some changes to make it work well in Ubuntu.

Then, the idea cooled down in my head, and I moved on with life. Well… until two weeks ago.

Right before heading to the Ubuntu Developer Summit for the next Ubuntu release, the desire to automate GUIs appeared again in the context of the widely scoped Ubuntu-level testing suite. Then, over the first few days last week, I was able to catch up with quite a few people who were interested in the concept of automating GUIs, with different purposes (testing, design approval, etc), which of course was all I needed to actually push that old desire forward.

Trying to get Sikuli to work, though, was quite painful. Even though I had sent patches upstream before, the build process isn’t working in Ubuntu again for other reasons (it’s not a polished build process, honestly). And even if I managed to make it work and contributed that upstream, the path to integrate the Java-based tool into the Python-based testing framework which Ubuntu uses (Mago) wasn’t entirely straightforward either.

Reinventing the wheel

So, the itch was in place, and there was a reason to let the NIH syndrome take over a bit. Plus, image processing is something I’d like to get a foot into anyway, so it felt like a good chance to have a closer look and at the same time contribute a small bit to potential quality improvements of Ubuntu.

That’s when Xpresser was born. Xpresser is a clean room implementation of the concepts explored by Sikuli, in the form of a Python library which can be used standalone, or embedded into other programs and testing frameworks such as Mago.

The project is sponsored by Canonical, and licensed under the LGPL.

Internally, it makes use of opencv for the image matching, pyatspi for the event generation (mouse clicks, etc), gtk for screen capturing and testing (of itself), and numpy for matrix operations. Clearly, the NIH syndrome wasn’t entirely active. :-) As a side note, I haven’t played with numpy and gtk for some time, and I’m always amazed by the quality of these modules.

Contribute code and ideas

Concluding this post, which is already longer than I expected: the basics of Xpresser are in place, so go ahead and play with it! That said, there are quite a few low-hanging fruits to get it to the point of being a really compelling GUI-driving library, so if you have any interest in the concept, I invite you to play with the code and submit contributions too. If you want ideas of what else could be done, let’s have a chat.

Gustavo Niemeyer

Recovering a bootable EBS image

Scott Moser has just announced this week that the new Ubuntu images which boot out of an EBS-based root filesystem in EC2, and thus will persist across reboots, are available for testing.

As usual with something that just left the oven and is explicitly labeled for testing purposes, there was a minor bug in the first iteration of images which was even mentioned in the announcement itself. The bug, if not worked around as specified in the announcement, will prevent the image from rebooting.

Having a bootable EBS image which can’t reboot is quite an interesting (and ironic) problem. You have an image which persists, but suddenly you have no way to see what is inside the image anymore because you can’t boot it. Naturally, even if the said bug didn’t exist in the first place, it’s fairly easy to get into such a situation accidentally if you’re fiddling with the image configuration.

So, in this post we’ll see how to recover from a situation where a bootable EBS image can’t boot.

Getting started

To start this up, we’ll boot one of the EBS images which Scott mentioned in his announcement: ami-8bec03e2. As we see in the output of ec2-describe-images, this is an EBS-based image for i386:

% ec2-describe-images ami-8bec03e2
IMAGE ami-8bec03e2 099720109477/ebs/ubuntu-images-testing/ubuntu-lucid-daily-i386-server-20100305 099720109477 available public i386 machine aki-3fdb3756 ebs
BLOCKDEVICEMAPPING /dev/sda1 snap-f1efd098 15

Let’s run this image. Remember to replace the value passed in the -k command line option with your own key pair name.

% ec2-run-instances -k gsg-keypair ami-8bec03e2
RESERVATION r-9e4615f6 626886203892 default
INSTANCE i-e3e33a88 ami-8bec03e2 pending gsg-keypair 0 m1.small 2010-03-09T20:04:12+0000 us-east-1c aki-3fdb3756 monitoring-disabled ebs

There we go. We got an instance allocated in the availability zone us-east-1c. It’s important to keep track of this information, since EBS volumes are zone-specific.

As part of the above command, we must have been allocated an EBS volume automatically, and it should be attached to the instance we just started. We can investigate it with the ec2-describe-volumes command:

% ec2-describe-volumes
VOLUME vol-edca1684 15 snap-f1efd098 us-east-1c in-use 2010-03-09T20:04:20+0000
ATTACHMENT vol-edca1684 i-e3e33a88 /dev/sda1 attached 2010-03-09T20:04:24+0000

Now, we’ll get into the running instance and do some arbitrary modifications, just as a way to demonstrate that the data we don’t want to lose actually survives the recovering operation. Note that the domain name is obtained with the ec2-describe-instances command.

% ssh -i ~/.ssh/id_dsa_gsg-keypair ubuntu@ec2-184-73-51-147.compute-1.amazonaws.com
(…)

ubuntu@domU-12-31-39-0E-A0-03:~$ echo "Important data" > important-data
ubuntu@domU-12-31-39-0E-A0-03:~$ ls -l important-data
-rw-r--r-- 1 ubuntu ubuntu 15 Mar 9 20:15 important-data

ubuntu@domU-12-31-39-0E-A0-03:~$ sudo reboot
Broadcast message from ubuntu@domU-12-31-39-0E-A0-03
(/dev/pts/0) at 20:18 …
The system is going down for reboot NOW!

Note that we didn’t actually fix the problem reported by Scott, so our machine won’t really reboot. If we wait a while, we can even see that the problem is exactly what was reported in the announcement (note it really takes a bit for the output to be synced up):

% ec2-get-console-output i-e3e33a88 | tail -4
mount: special device ephemeral0 does not exist
mountall: mount /mnt [294] terminated with status 32
mountall: Filesystem could not be mounted: /mnt

Alright, now what? Machine is dead.. and can’t reboot. How do we get to our important data?

Fixing the problem

The first thing we do is to stop the instance. Do not terminate it, or you’ll lose the EBS volume! After stopping it, we’ll detach the EBS volume that was being used as the root filesystem, so that we can attach somewhere else.

% ec2-stop-instances i-e3e33a88
INSTANCE i-e3e33a88 running stopping

% ec2-detach-volume vol-edca1684
ATTACHMENT vol-edca1684 i-e3e33a88 /dev/sda1 detaching 2010-03-09T20:04:22+0000

Now, we need to attach this volume to an image which actually boots, so that we can fix it. For this experiment, we’ll pick one of the daily Lucid images, but we could really use any other working image. Just remember that the new instance must be running in the same availability zone as our previous one, since the EBS volume won’t be accessible otherwise.

% ec2-run-instances -k gsg-keypair -z us-east-1c ami-b5f619dc
RESERVATION r-967427fe 626886203892 default
INSTANCE i-fd08d196 ami-b5f619dc pending gsg-keypair 0 m1.small 2010-03-09T21:10:11+0000 us-east-1c aki-3fdb3756 monitoring-disabled instance-store

% ec2-attach-volume vol-edca1684 -i i-fd08d196 -d /dev/sdh1
ATTACHMENT vol-edca1684 i-fd08d196 /dev/sdh1 attaching 2010-03-09T21:10:51+0000

With the instance running and the EBS root device attached with an alternative device name, we can then login to fix the original problem which prevented the image from booting correctly. In our case, we’ll simply do what Scott suggested in the announcement.

% ssh -i ~/.ssh/id_dsa_gsg-keypair ubuntu@ec2-204-236-194-196.compute-1.amazonaws.com
(…)
$ mkdir ebs-root
$ sudo mount /dev/sdh1 ebs-root
$ sudo sed -i 's/^ephemeral0/#ephemeral0/' ebs-root/etc/fstab
$ sudo umount ebs-root
$ logout
Connection to ec2-204-236-194-196.compute-1.amazonaws.com closed.

Done! Our EBS volume is now correct, and it should boot alright. We’ll detach the volume from the temporary instance we created, and will reattach it back to the old bootable EBS instance which is stopped. Note that we won’t yet terminate the temporary instance, because we may need it in case something else is still wrong, and we are already paying to use it for the hour anyway. We just have to remind ourselves to terminate it once we’re fully done.

% ec2-detach-volume vol-edca1684
ATTACHMENT vol-edca1684 i-fd08d196 /dev/sdh1 detaching 2010-03-09T21:10:51+0000

% ec2-attach-volume vol-edca1684 -i i-e3e33a88 -d /dev/sda1
ATTACHMENT vol-edca1684 i-e3e33a88 /dev/sda1 attaching 2010-03-09T21:24:55+0000

% ec2-describe-volumes vol-edca1684
VOLUME vol-edca1684 15 snap-f1efd098 us-east-1c in-use 2010-03-09T20:04:20+0000
ATTACHMENT vol-edca1684 i-e3e33a88 /dev/sda1 attached 2010-03-09T21:24:55+0000

Okay! It should all be good now. It’s time to restart our instance and see if it is working. Note that since we stopped and started the instance, the public domain name has most probably changed, and thus we need to find it out again with ec2-describe-instances once the instance is running.

% ec2-start-instances i-e3e33a88
INSTANCE i-e3e33a88 stopped pending

% ec2-describe-instances i-e3e33a88
RESERVATION r-9e4615f6 626886203892 default
INSTANCE i-e3e33a88 ami-8bec03e2 ec2-184-73-72-214.compute-1.amazonaws.com domU-12-31-39-03-B8-21.compute-1.internal running gsg-keypair 0 m1.small 2010-03-09T21:28:43+0000 us-east-1c aki-3fdb3756 monitoring-disabled 184.73.72.214 10.249.187.207 ebs
BLOCKDEVICE /dev/sda1 vol-edca1684 2010-03-09T21:24:55.000Z

% ssh -i ~/.ssh/id_dsa_gsg-keypair ubuntu@ec2-184-73-72-214.compute-1.amazonaws.com
(…)
$ cat important-data
Important data

$ logout
Connection to ec2-184-73-72-214.compute-1.amazonaws.com closed.

It worked, and our important data is still there!

Don’t forget to kill the temporary instance you’ve used to fix it after you’re comfortable with the result:

% ec2-terminate-instances i-fd08d196
INSTANCE i-fd08d196 running shutting-down

Conclusion

Concluding, in this post we have seen how to fix a bootable EBS machine which can’t actually boot. The technique consists of detaching the volume from the stopped instance, attaching it to a temporary instance, fixing the image, and then reattaching it back to the original image. This back and forth of EBS volumes is quite useful in many circumstances, so keep it in your tool belt.

Gustavo Niemeyer

In the last post, we’ve seen some security issues which exist in the Android password manager gbaSafe version 1.1.0a, by analyzing the security description provided on its web site. As described there, even though the system depends on a “master key” which might be secure, the security of the system is seriously compromised by the encouragement of very weak keys (a few digits only) in what is named an “unlock key”, used to encrypt the master key itself. All of that in an application which claims to strongly protect people’s data from unwanted eyes.

In this post, we will play a bit with the Linux-based Android OS to actually explore these security deficiencies, demonstrating that such issues are very real, and that the claim that the data is hard to unveil is unfounded. Since the most serious weakness lies in the key itself, we’ll run a simple brute force attack to try to find arbitrary unlock keys.

This procedure is actually mentioned by the author of gbaSafe himself on the web page, except that he overestimates the work involved in producing such a mechanism:

Theoretically, somebody could write a program that tries to decrypt the master key by trying all possible values of the short key (with 4 digits there are only 10000 possibilities), but this would still be much work, as more information about the crypting algorithm is needed (e.g. salt bytes, iteration count).

So let’s get started.

As a first step, we’ll need the Android SDK with a working emulator (I’ve used API level 5, revision 1), and a copy of the application itself. I got a trial version of the application from AndAppStore.com.

The application downloaded is bundled within that .apk file, which is really a .zip file that may be opened up normally with any tool which understands this file format.

Once that’s done, we get access to all the information needed to run the application, including icons, interface layouts, and most importantly in this case, the bytecode which targets the Dalvik VM. This bytecode is the end result of a sequence of translations which happen when the program’s Java source code is compiled, so that’s what we’ll have to fiddle with to figure details of the application we want to investigate.

The bytecode is located inside the classes.dex file, and as expected it’s not easy to read in its native format. Luckily, though, a smart guy has already written a couple of tools, smali and baksmali, which allow people to decompile and recompile that bytecode format to/from something that is much easier to understand.

After downloading these tools, the following command should decompile the file:

$ java -jar baksmali.jar --output classes classes.dex

We now have a classes/ directory full of .smali files.

Before going any further, let’s ponder for a moment about what we want to do. A brute force attack is when we attempt sequentially many possible keys, and given the context already presented, what we’re looking after is to attempt different “unlock keys”. With that in mind, we’ll introduce a very small modification in the application so that it will attempt to enter the unlock key automatically, rather than reporting an error when the key entered in the unlock dialog is invalid.

With that in mind, after some quick research, it looks like the onClick() method within the DlgUnlock.smali file is a pretty good candidate. This is supposedly called when the button in the unlock dialog is clicked, so something interesting about the password being correct or not must happen there.

Before doing anything there, I’ve increased the number of registers in the function to 12, to get some additional registers to play with, and then initialized a register with the value zero, to serve as a monotonically increasing number (our keys!):

.method public onClick(Landroid/view/View;)V
- .registers 9
+ .registers 12
.parameter “view”
+ const/16 v9, 0x0

Then, we have to instruct the program to use these keys rather than whatever is typed in the dialog box. Some lines down, we’ll see a call to the checkUnlockKey() method, which is certainly what we’re looking after. Let’s do this now:

+ :mykey
+ invoke-static {v9}, Ljava/lang/String;->valueOf(I)Ljava/lang/String;
+ move-result-object v2
invoke-static {v2}, Lcom/gbizapps/safeA/Crypt;->checkUnlockKey(Ljava/lang/String;)I

Now, what if this key is wrong? We don’t want the master key to be removed as mentioned in the software description. We want to simply attempt the next key. With some analysis, we see that in case of errors, the next couple of lines below the above code will instruct the VM to jump to an error branch. Rather than following up with the normal error logic, we’ll increment the key, and jump back to the above code:

:cond_6c
+ add-int/lit8 v9, v9, 0x1
+ goto :mykey

Now we just have to rebundle this and put it into the emulator. I won’t go over it in too much detail here, since there’s plenty of information available online, but the steps to do that are:

  1. Recreate a modified classes.dex with smali
  2. Recreate a modified .apk file by just zipping the modified content
  3. Sign and zipalign the new .apk file
  4. Install it

And that’s it, seriously! This would be enough to break the software security if it was working correctly.

Interestingly, though, the software wasn’t working correctly with this change. Instead, it was Force Closing on certain keys. To test it out, use the master key “master key”, and the unlock key “999999”, and then once you close and open the application again, try to unlock it with the key “1175”. Instead of showing an error message, it will break badly.

Now, for the proof of concept to work, I actually had to fix the bug, which felt a bit funny to do given the context.

Looking at the traceback through adb logcat, I found out that a null was being dereferenced in the file Crypt.smali, so I fixed the problem by injecting some error checking at this position and jumping the flow into an existing error branch:

+ if-eqz v3, :cond_5a
const-string v4, "ucpmhkexov85MDKhdfdfFGQPYxywq7209fcrqhghjkuiopy"

With this in place came the biggest surprise of the experiment. The keys which were crashing the application were special, in the sense that they actually decode the master key successfully! That’s right: whatever the algorithm is doing, that six-digit “999999” encrypts the master key in such a way that attempting the “1175” key works, so even big keys are rendered extremely weak with the logic used to encrypt the master key.

At this point, I added some trivial logic to display the key found with a Toast, just to ensure the whole thing was working correctly:

Toast displaying unlock key found

Note that the key generation implemented above is a bit simplistic, in the sense that it doesn’t attempt keys with leading zeros, but this would be trivial to implement, and my intention here isn’t to actually break any keys for real, but just to show how the promised security in this application is not to be trusted at all. Just the logic above will already be enough for a brute force attack against the application, and has broken all the keys I’ve tried in mere seconds, in a slow emulator.

As a conclusion, if you want to put your data in a secure place, rather than picking an application which promises security because the salt is hidden somewhere or because it’s too much work to figure out its logic, pick an open source application whose logic is publicly verifiable and has already had many eyes over it. Chances are that doing something like what was described in this post won’t be so trivial. Then, choose your keys wisely! The most secure application won’t be enough if you pick a bad key.

Gustavo Niemeyer

For some time now I’ve been wanting to research the internals of Android more deeply. Until now, though, this was just a sentiment. Then, a couple of weeks ago I finally managed to replace my iPhone with an Android phone, and that was the final motivator for me to actually get into learning more about the inner workings of the Linux-based OS.

Now, I just had to pick an actual task to dig into. The Dalvik VM is certainly one of the most innovative and advertised technical details about the OS, so something around it would be a nice start.. some kind of bytecode fiddling perhaps, but what? Luckily, even without trying too hard, I eventually stumbled upon an interesting case to research.

The “victim” of this research is the application gbaSafe version 1.1.0a, which claims to protect user passwords using unbreakable algorithms (how’s that for a hint of a Snake oil case?).

Before we get into some hacking, let’s look at what the author himself says about the software’s security, and then analyze some of the conceptual issues with it:

The confidential data can only be decrypted if the master key is known. You should choose a long key (at least 16 characters) with mixed case and unreadable text. Of course you cannot enter this key each time you want to access the confidential data, so it is stored in the user settings encrypted with a shorter key (4 to 6 digits) and normally you only have to enter this unlock key. Theoretically it is possible to try all possible values (brute force attack), but then you must use another program, since gbaSafe deletes the encrypted master key from the user settings when you enter the unlock key wrong three times repeatedly, and then you must enter the master key. If you wrote a program to decrypt the master key, you would have to know the algorithm used, the salt bytes and iteration count (used to augment the short unlock key), which are very hard to extract from the binary program module gbaSafe.

If you have some security background, I’m sure that by now you’re already counting the issues on this single paragraph.

The most obvious issue is the fact that there’s a “strong key” and a “weak key”, and the strong key is encrypted with the weak one. This is a very common cryptography sin, as my friend and coworker Andreas Hasenack (a security researcher himself) would say. A security system is only as secure as its weakest spot. It obviously makes little difference for an attacker whether he has to attempt decrypting a master key or the actual data, since decrypting the master key will give access to the data.

Then, it mentions en passant that the software enforces the use of digits for the weak key. This ensures that the weak key is really weak! Four digits means basically ten thousand attempts, which is absolutely nothing for today’s hardware. This number would move up to about 15 million by simply allowing upper and lowercase letters as well (which isn’t great either, but a few orders of magnitude never hurt in this scenario).

It follows up encouraging people to think that it’s actually hard to figure out the algorithm and other implementation details. Considering that there’s absolutely nothing preventing people from getting their hands on the implementation itself, this is in fact asserting that the security mechanism is based on the ignorance of the attacker. Counting on the ignorance of people is bad at all times, and in a security context it’s a major error.

There’s a final security issue in this description which is a bit more subtle, but further analysis of the logic used leaves no doubt. In cryptography, the salt is supposed to increase the work needed in a brute force attack by strengthening the number of bits of the actual passphrase, in cases where the salt is actually unavailable, or at least to prevent a single large word dictionary from being used to attack several encryptions or hashes at once, in cases where the salt is known but variable. In the latter case, it helps because encrypting a single key with two different salts must be done twice rather than once, so it increases the computational cost when attacking multiple items. A salt which is known and does not change across all processed items is worth pretty close to nothing.

So, indeed, considering the many security issues here, this isn’t something I’d store my passwords or credit card numbers on, and I suggest you don’t do it either.

In my next post on this topic I’ll actually implement a trivial brute force attack to prove that these issues are very real, and that, actually, it’s not even hard to break into a security system like this.

The application author has been contacted about this blog post, since he’ll likely want to fix some of these issues.

Gustavo Niemeyer

Some interesting changes have been happening in my professional life, so I wanted to share it here to update friends and also for me to keep track of things over time (at some point I will be older and will certainly laugh at what I called “interesting changes” in the ol’days). Given the goal, I apologize but this may come across as more egocentric than usual, so please feel free to jump over to your next blog post at any time.

It’s been little more than four years since I left Conectiva / Mandriva and joined Canonical, in August of 2005. Shortly after I joined, I had the luck of spending a few months working on the different projects which the company was pushing at the time, including Launchpad, then Bazaar, then a little bit on some projects which didn’t end up seeing much light. It was a great experience by itself, since all of these projects were abundant in talent. Following that, in the beginning of 2006, counting on the trust of people which knew more than I did, I was requested/allowed to lead the development of a brand new project the company wanted to attempt. After a few months of research I had the chance to sit next to Chris Armstrong and Jamu Kakar to bootstrap the development of what is now known as the Landscape distributed systems management project.

Fast forward three and a half years, to mid 2009, and Landscape has become a massive project with hundreds of thousands of very well tested lines, spawning not only a client branch, but also external child projects such as the Storm Object Relational Mapper, in use also by Launchpad and Ubuntu One. On the commercial side of things it looks like Landscape’s life is just starting, with its hosted and standalone versions getting more and more attention from enterprise customers. And the three guys who started the project didn’t do it alone, for sure. The toy project of early 2006 has grown to become a well structured team, with added talent spread across areas such as development, business and QA.

While I wasn’t watching, though, something happened. Facing that great action, my attention was slowly being spread thinly among management, architecture, development, testing, code reviews, meetings, and other tasks, sometimes in areas not entirely related, but very interesting of course. The net result of increased attention sprawl isn’t actually good, though. If it persists, even when the several small tasks may be individually significant, the achievement just doesn’t feel significant given the invested effort as a whole. At least not for someone that truly enjoys being a software architect, and loves to feel that the effort invested in the growth of a significant working software is really helping people out in the same magnitude of that investment. In simpler words, it felt like my position within the team just wasn’t helping the team out the same way it did before, and thus it was time for a change.

Last July an external factor helped to catapult that change. Eucalyptus needed a feature to be released with Ubuntu 9.10, due in October, to greatly simplify the installation of some standard machine images: an Image Store. It felt like a very tight schedule, even more considering that I hadn’t been doing Java for a while, and Eucalyptus uses some sexy (and useful) new technology called the Google Web Toolkit, something I had to get acquainted with. Two months was a risky bet overall, but it also felt like a great opportunity to strongly refocus on a task that urgently needed someone’s attention. Again I was blessed with trust I’m thankful for, and by now I’m relieved to look back and perceive that it went alright, certainly thanks to the help of other people like Sidnei da Silva and Mathias Gug. Meanwhile, on the Landscape side, my responsibilities were distributed within the team so that I could be fully engaged on the problem.

Moving this forward a little bit, we reach the current date. Right now the Landscape project has a new organizational structure, and it actually feels like it’s moving along quite well. Besides the internal changes, a major organizational change also took place around Landscape over that period, and the planned restructuring led me to my current role. In practice, I’m now engaging in the research of a new concept which I’m hoping to publish openly quite soon, if everything goes well. It’s challenging, it’s exciting, and most importantly, it allows me to focus strongly on something which has great potential (I will stop teasing you now). In addition to this, I’ll definitely be spending some of that time on the progress of Landscape and the Image Store, but mostly from an architectural point of view, since both of these projects will have bright hands taking care of them more closely.

Sit by the fireside if you’re interested in the upcoming chapters of that story. ;-)

Gustavo Niemeyer

Google won’t kill standalone GPS

It was already dead. In some senses, anyway.

Google announced a couple of days ago that they’re advancing into the business of GPS guided navigation, rather than staying with their widely popular offering of mapping and positioning only. This announcement affected the rest of the industry immediately, and some of the industry leaders in the area have quickly taken a hit on their share value.

As usual, Slashdot caught up on the news and asked the question: Will Google and Android kill standalone GPS?

Let me point out that the way the facts were covered by Slashdot was quite misguided. Google may be giving a hand to change the industry dynamics a bit faster, but both Garmin and TomTom, the companies which reportedly took a hit in their share value, have phone-based offerings of their own, so it’s not like Google suddenly had an idea for creating phone-based navigation software which will replace every other offering. The world has been slowly converging towards a multi-purpose device for quite a while, and these multi-purpose devices are putting GPSes in the hands of people who in many cases never considered buying a GPS.

The real reason why these companies are taking a hit in their shares now is that Google announced it will offer for free something that these companies charge quality money for at the moment, be it in a standalone GPS or not.

Gustavo Niemeyer

Geocaching on Easter Island

This post is not about what you think it is, unfortunately. I actually do hope to go to Easter Island at some point, but this post is about a short story which involves geohash.org, Groundspeak (from geocaching.com), and very, very poor-minded behavior.

The context

So, before anything else, it’s important to understand what geohash.org is. As announced when the service was launched (also as a post on Groundspeak’s own forum), geohash.org offers short URLs which encode a latitude/longitude pair, so that referencing them in emails, forums, and websites is more convenient, and that’s pretty much it.

When people go to geohash.org, they can enter the geographic coordinates they want to encode, and they get back a nice little map with the location, some links to useful services, and most importantly the actual Geohash they can use to link to the location; as an example, they could be redirected to the URL http://geohash.org/6gkzwgjf3.
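
For the curious, the encoding itself is simple: latitude and longitude are repeatedly bisected, the resulting bits are interleaved (longitude first), and every five bits become one character of a base 32 alphabet. Here is a minimal Python sketch of that standard encoding, not the actual geohash.org code:

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=9):
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    result, bits, ch, even = "", 0, 0, True
    while len(result) < precision:
        rng = lon_range if even else lat_range   # alternate lon/lat bits
        val = lon if even else lat
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            ch = (ch << 1) | 1                   # upper half: emit a 1 bit
            rng[0] = mid
        else:
            ch = ch << 1                         # lower half: emit a 0 bit
            rng[1] = mid
        even = not even
        bits += 1
        if bits == 5:                            # five bits form one character
            result += BASE32[ch]
            bits, ch = 0, 0
    return result

print(geohash_encode(0.0, 0.0))  # prints "s00000000"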

Of course, it’s pretty boring to be copy & pasting coordinates around, so shortly after the service launched, the support for geocoding addresses was also announced, which means people could type a human oriented address and get back the Geohash page for it. Phew.. much more practical.

The problem

All was going well, until a couple of months ago, when a user reported that the geocoding of addresses wasn’t working anymore. After some investigation, it turned out that geohash.org was indeed going over the free daily quota allowed by the geocoding provider used. But, that didn’t quite fit with the overall usage reports for the system, so I went on to investigate what was up in the logs.

The cause

Something was wrong indeed. The system was getting thousands of queries a day from some application, and not only that, but the queries were entirely unrelated to Geohashes. The application was purely interested in the geocoding of addresses, which the site supported for the benefit of Geohash users. Alright, that wasn’t a nice thing to do, but I took it lightly since the interface implemented could perhaps give the impression that the site was a traditional geocoding system. So, to fix the situation, the non-Geohash API was removed at this point, and requests for the old API then started to get an error saying something like 403 Forbidden: For geocoding without geohashes, please look elsewhere.

Unfortunately, that wasn’t the end of the issue. Last week I went on to see the logs, and the damn application was back, and this time it was using Geohashes, so I became curious about who was doing that. Could I be mistakingly screwing up some real user of Geohashes? So, based on the logs, I went on to search for who could possibly be using the system in such a way. It wasn’t too long until I found out that, to my surprise, it was Groundspeak’s iPhone application. Groundspeak’s paid iPhone application, to be more precise, because the address searching feature is only available for paying users.

Looking at the release notes for the application, there was no doubt. Version 2.3.1, sent to Apple on September 10th (shortly after the old API was blocked), “fixes the Search by Address/Postal Code feature”, says the maintainer, and there’s even a thread discussing the breakage where the maintainer mentions:

The geocoding service we’ve been using just turned their service off. That’s why things are failing; it was relying on an external service for this feature. We’re fixing the issue on our end and using a service that shouldn’t fail as easily. Unfortunately we’ll have to do an update to the store to get this feature out to the users. This will take some time, but in version 2.4 this will work.

Wait, ok, so let’s see this again. First, they were indeed not using Geohashes at all, and instead were using geohash.org purely as a geocoding service. Then, when the API they used was disabled with hints that the Geohash service is not a pure geocoding service, they worked around this by decoding the retrieved Geohash and grabbing the coordinates, so that they could still use it as a pure geocoding service. At the same time, they told their users that they changed to “a service that shouldn’t fail as easily”. At no point did they contact someone at geohash.org to see what was going on (it shouldn’t be necessary, really, but assuming immaculate innocence, sending an email would be pretty cool).

Redirecting users to Easter Island

So, yeah, sorry, but I didn’t see many reasons to sustain the situation. Not only because it looks like unfriendly behavior overall, but also because, on their way to using an unrelated free service to sustain their paid application, they were killing the free geocoding feature of geohash.org with thousands of geocoding requests a day, which ate into the daily quota the service has for itself.

So, what to do? I could just disable the service again, or maybe contact the maintainers and ask them to please stop using the service in such a way; after all, there are dozens of real geocoding services out there! But… hmmm… I figured a friendly poke could be nice at this point, before actually bringing up that whole situation.

And that’s what happened: rather than blocking their client, the service was modified so that all of their geocoding requests translated into the geographic coordinates of the Easter Island.

Of course, users quickly noticed it and started reporting the problem again.

The answer from Groundspeak

After users started complaining loudly, Bryan Roth, who signs as co-founder of Groundspeak, finally contacted me for the first time, asking if there was a way to keep the service alive. Unfortunately, I really couldn’t keep it, and I provided the whole explanation to Bryan, even mentioning that I actually use Google as the upstream geocoding provider and that I would be breaking its terms of service by doing this, but I offered to redirect their requests to their own servers if necessary.

Their answer to this? Pretty bad I must say. I got nothing via email, but they posted this in the forum:

But seriously, this bug actually has nothing to do with our app and everything to do with the external service we’ve been using to convert an address into GPS coordinates. For the next app update, we’re completely dropping that provider since they’ve now failed us twice. We’ll be using only Google from that point on, so hopefully their data will be more accurate.

I can barely believe what I read. They blame the upstream service, as if they were using a first class geocoding provider somewhere rather than sucking resources from a site they felt cool linking their paid application to, they take my suggestion of using Google for geocoding, and they lie about the fact that the data will be more accurate (it obviously won’t be, since it was already Google that was being used).

I mentioned something about this in the forum itself, but I was moderated out immediately, of course.

Way to go Groundspeak.

UPDATE

After some back and forth with Bryan and Josh, the last post got edited to remove the misleading details, and Bryan clarified the case in the forum. Then, we actually settled on my proposal of redirecting the iPhone Geocaching.com application requests to Groundspeak’s own servers, so that users of previous versions of the application wouldn’t miss the feature while they work on the new release.

If such communication had taken place way back when the feature was being planned, or when it was “fixed” the first time, the whole situation would never have happened.

No matter what, I’m glad it ended up being sorted towards a more friendly solution.

Gustavo Niemeyer

More than 40 years ago, a guy named Douglas Parkhill described the concept of utility computing. He described it as containing features such as:

  • Essentially simultaneous use of the system by many remote users.
  • Concurrent running of different multiple programs.
  • Availability of at least the same range of facilities and capabilities at the remote stations as the user would expect if he were the sole operator of a private computer.
  • A system of charging based upon a flat service charge and a variable charge based on usage.
  • Capacity for indefinite growth, so that as the customer load increases, the system can be expanded without limit by various means.

Fast forward 40 years, and we now call pretty much this same concept Cloud Computing, and everyone is very excited about the possibilities that exist within this new world. Different companies are pushing this idea in different ways. One of the pioneers in that area is of course Amazon, which managed to create a quite good public cloud offering through their Amazon Web Services product.

This kind of publicly consumable infrastructure is very interesting, because it allows people to do exactly what Douglas Parkhill described 40 years ago, so individuals and organizations can rent computing resources with minimum initial investment, and pay for as much as they need, no more no less.

This is all good, but one of the details is that not every organization can afford to send data or computations to a public cloud like Amazon’s AWS. There are many potential reasons for this, from legal regulations to volume cost. Out of these concerns the term Private Cloud was coined. It basically represents exactly the same ideas that Douglas Parkhill described, but rather than using third party infrastructure, some organizations opt to use the same kind of technology, such as the Eucalyptus project, deployed in a private infrastructure, so that the teams within the organization can still benefit from the mentioned features.

So we have the Public Cloud and the Private Cloud. Now, what would a Virtual Private Cloud be?

Well, it turns out that this is just a marketing term, purposefully coined to blur the line between a Private and a Public Cloud.

The term was used in the announcement Amazon has made yesterday:

Amazon VPC enables enterprises to connect their existing infrastructure to a set of isolated AWS compute resources via a Virtual Private Network (VPN) connection, (…)

So, what is interesting here is that this is actually not a Private Cloud, because the resources on the other side of the VPN are actually public infrastructure, and as such it doesn’t solve any of the problems which private clouds were created to solve in the first place.

Not only that, but it creates the false impression that organizations would have their own isolated resources. What isolated resources? A physical computer? Storage? Network? Of course, isolating these is not economically viable if you are charging 10 cents an hour per computer instance:

Each month, you pay for VPN Connection-hours and the amount of data transferred via the VPN connections. VPCs, subnets, VPN gateways, customer gateways, and data transferred between subnets within the same VPC are free. Charges for other AWS services, including Amazon EC2, are billed separately at published standard rates.

That doesn’t quite fit together, does it?

To complete the plot, Werner Vogels runs to his blog and screams out loud “Private Cloud is not the Cloud”, while announcing the Virtual Private Cloud which is actually a VPN to his Public Cloud, with infrastructure shared with the world.

Sure. What can I say? Well, maybe that Virtual Private Cloud is not the Private Cloud.

Gustavo Niemeyer

Alright, so I appreciate the idea of RESTful Web Services, but I’ve got a small dilemma I’d appreciate some opinions on.

In the RESTful Web Services book, by Leonard Richardson and Sam Ruby, there’s emphasis on making the programmable web look like the human web, by following an architecture oriented towards addressable resources rather than one oriented towards remote procedure calls. Throughout the book, the RPC style (or REST-RPC, when mixed with some RESTful characteristics) is clearly downplayed. In some cases, though, it’s unclear to me what the extent of this advice is. Humans and computers are of course very different in the nature of tasks they perform, and in how well they perform them. To illustrate the point clearly, let me propose a short example.

Let’s imagine the following scenario: we are building a web site with information on a large set of modern books. In this system, we want to follow RESTful principles strictly: each book is addressable at http://example.com/book/<id>, and we can get a list of book URIs by accessing http://example.com/book/list?filter=<words>.

Now, we want to allow people to easily become aware of the newest edition of a given book. To do that, we again follow RESTful characteristics and add a, let’s say, new-editions field to the data which composes a book resource. This field contains a list of URIs of books which are more recent editions of the given book. So far so good. Looks like a nice design.

Now, we want to implement a feature which allows people to access the list of all recent editions of books in their home library, given that they know the URIs for the books because a client program stored the URIs locally on their machines. How would we go about implementing this? We certainly wouldn’t want to do 200 queries to learn about updates for the 200 books which a given person has, since that’s unnecessarily heavy on the client computer, on the network, and on the server. It’s also hard to encode the resource scope (as defined in the book) in the URI, since the amount of data needed to define the scope (the 200 books in our case) can be arbitrarily large. This actually feels like a perfect fit for an RPC: “Hey, server, here are 200 URIs in my envelope.. let me know what are the updated books and their URIs.” I can imagine some workarounds for this, like saving a temporary list of books with PUT and then doing the query on that temporary list’s URI, but this feels like a considerably more complex design just for the sake of purity.
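
For the sake of argument, the RPC-ish interaction I have in mind would look roughly like the sketch below, using the third-party requests library for brevity; the /book/updates endpoint and the payload shape are hypothetical, of course:

import requests

# The URIs a client program stored locally for the books in the home library.
book_uris = [
    "http://example.com/book/1842",
    "http://example.com/book/2990",
    # ... up to a couple hundred of these
]

# One round trip: send all URIs, get back only the books with new editions.
resp = requests.post("http://example.com/book/updates", json={"books": book_uris})
for entry in resp.json()["updated"]:
    print(entry["book"], "has new editions:", entry["new-editions"])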

When I read examples of RESTful interfaces, I usually see examples about how a Google Search API can be RESTful, for instance. Of course, Google Search is actually meant to be operated by humans, with a simple search string. But computers, unlike humans, can thankfully handle a large volume of data for us, and let us know about the interesting details only. It feels a bit like once the volume of data and the complexity of operations on that data go up, the ability to do a proper RESTful design goes down, and an RPC-style interface becomes an interesting option again.

I would be happy to learn about a nice RESTful approach to solve this kind of problem, though.

Gustavo Niemeyer

Backwards and forwards compatibility is an art. In its most basic and generic form, it consists of organizing the introduction of new concepts while allowing people to keep their existing assets working. In some cases, the new concepts introduced are disruptive, in the sense that they prevent the original form of the asset from being preserved completely, and then some careful consideration has to go into creating a migration path which is technically viable, and which at the same time helps people keep the process in mind. A great example of what not to do when introducing such disruptive changes has happened in Python recently.

Up to Python 2.5, any string you put within normal quotes (without a leading character marker in front of it) would be considered to be of the type str, which originally was used for both binary data and textual data, but in modern times was seen as the type to be used for binary data only. For textual information, the unicode type was introduced in Python 2.0, and it provides easy access to all the goodness of Unicode. Besides converting to and from str, it’s also possible to use Unicode literals in the code by preceding the quotes with a leading u character.

This evolution has happened quite cleanly, but it introduced one problem: each of these two types was, at some point in time, seen as the main way to hold textual data, and the language syntax makes it very easy to use either type interchangeably. That sounds good in theory, but the types are not interchangeable, and what is worse: in many cases the problem is only seen at runtime, when incompatible data passes through the code. This is what gives form to the interminable UnicodeDecodeError problem you may have heard about.
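
Here is the classic failure mode as a small Python 2 interpreter session; everything looks fine until non-ASCII data actually flows through the mixed str/unicode code path:

>>> name = u"Ação"                  # unicode, textual data
>>> data = name.encode("utf-8")     # str, binary data holding UTF-8
>>> name + data.decode("utf-8")     # fine: both sides are unicode
u'A\xe7\xe3oA\xe7\xe3o'
>>> name + data                     # implicit ASCII decode of the str side
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

So what can be done about this? Enter Python 3.0.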

In Python 3.0 an attempt is being made to sanitize this, by promoting the unicode type to a more prominent position, removing the original str type, and introducing a similar but incompatible bytes type which is more clearly oriented towards binary data.

So far so good. The motivation is good, the target goal is a good one too. As usual, the details may complicate things a bit. Before we go into what was actually done, let’s look at an ideal scenario for such an incompatible change.

As mentioned above, when introducing disruptive changes like this, we want a good migration path, and we want to help people keep the procedure in mind, so that they do the right thing even when they’re not spending too many brain cycles on it. Here is a suggested schema of what might have happened to achieve the above goal: in Python 2.6, introduce the bytes type, with exactly the same semantics it will have in Python 3.0. During 2.6, encourage people to migrate str references in their code to either the previously existent unicode type, when dealing with textual data, or to the new bytes type, when handling binary data. When 3.0 comes along, simply kill the old str type, and we’re done. People can easily write code in 2.6 which supports 3.0, and if they see a reference to str they know something must be done. No big deal, and apparently quite straightforward.

Now, let’s see how to do it in a bad way.

Python 2.6 introduces the bytes type, but it’s not actually a new type. It’s simply an alias to the existing str type. This means that if you write code to support bytes in 2.6, you are actually not writing code which is compatible with Python 3.0. Why on earth someone would introduce an alias in 2.6 which generates code incompatible with 3.0 is beyond me. It must be some kind of anti-migration pattern. Then, Python 3.0 renames unicode to str, and kills the old str. So, the result is quite bad: Python 3.0 has both str and bytes, and they both mean something else than they did in 2.6, which is the first version which was supposedly meant to help the migration, and not a single one of the three types from 2.6 got both its name and semantics preserved in 3.0. In fact, only unicode survived at all, and it did so under a different name.
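
The alias problem is easy to see in the interpreter: the same literal means quite different things in the two versions, so code written against 2.6’s bytes silently changes meaning on 3.0:

# Python 2.6
>>> bytes is str
True
>>> list(b"abc")       # iterating bytes gives one-character strings
['a', 'b', 'c']

# Python 3.0
>>> bytes is str
False
>>> list(b"abc")       # iterating bytes gives integers
[97, 98, 99]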

There you go. I’ve heard people learn better from counter-examples. Here we have a good one to keep in mind and avoid repeating.
