A tale of software quality improvement

A long time ago, at the dawn of the new millennium, there lived a man. He worked as a software developer in a medium-sized company. His life was pretty good, all things considered. For the purposes of this narrative, let us call him Junior Developer.

At his workplace, CVS was used for revision control. This was okay, but every now and then problems arose. Because it could not do atomic commits, sometimes two people would check in things at the same time, which broke everything. Sometimes this was immediately apparent and caused people to scramble around to quickly fix the code. At other times the bugs would lie dormant and break things at the worst possible moments.

This irritated everyone, but since the system mostly worked, it was seen as an unfortunate fact of life. Then Junior Developer found out about a brand new thing called Subversion. It seemed to be just the thing they needed. It was used in almost exactly the same way, but it was so much better. All commits were atomic, which made interleaved commits impossible. One could even rename files, a feature thus far unheard of.

This filled the heart of Junior Developer with joy. With one stroke they could eliminate most of their tooling problems and thereby improve their product’s quality. Overjoyed, he waited for the next team meeting where internal processes were discussed.

Once the meeting started, the Junior Developer briefly reminded people of the problems and then explained how Subversion would fix most of the problems that CVS was causing.

At the other end of the table sat a different kind of man, whom we shall call the Senior Developer. He was a man of extraordinary skill. He had personally designed and coded many of the systems the company’s products relied on. Whenever anyone had a difficult technical issue, they would go ask the Senior Developer. His knowledge of the trade had extensive depth and breadth.

Before anyone else had time to comment on the Junior Developer’s suggestion, the Senior Developer grabbed the stage and let out a reply in a stern voice.

- This is not the sort of discussion we should be getting into. CVS is good and we’ll keep using it. Next issue.

The Junior Developer was shocked. He tried to form some sort of a reply but words just refused to come out of his mouth. Why was his suggestion for improvement shot down so fast? Had someone already done a Subversion test he had not heard about? Had he presented his suggestion too brazenly and offended the Senior Developer? What was going on?

These and other questions raced around in his head for the next few days. Eventually he gathered enough willpower and approached the Senior Developer during a coffee break.

- Hey, remember that discussion on Subversion and CVS we had a few days ago? Why did you declare it useless so quickly? Have you maybe tried it and found it lacking?

- I figured you might mention this. Look, I’m sure you are doing your best but the fact of the matter is that CVS is a good tool and it does everything we want it to do.

- But that’s just the thing. It does not do what we want. For example, it does not have atomic commits, which would prevent check-in conflicts.

The Senior Developer lifted his coffee cup to his lips and took a swig. Not to play for time, but simply to give emphasis to his message.

- CVS is a great tool. You just have to be careful when using it and problems like this don’t happen.

The Junior Developer then tried to explain that even though they were being careful, breakages still happened, and had for as long as anyone could remember. He tried explaining that in Subversion one could rename files and retain their version control history. He tried to explain how these and other features would be good, how they would allow developers to spend more of their time on actual code and less on working around the quirks of their tools.

The Senior Developer countered each of these issues with one of two points. Number one was that you could achieve roughly the same thing with CVS if that was really what was wanted (with an unspoken but very clear implication that it was not). Number two was that functionality such as renames could be achieved by manually editing the repository files. This was considered a major plus, since one could not do this kind of maintenance work on Subversion repositories, which were not plain text files. Any database runs the risk of corruption, which was unacceptable for something as important as source code.

The discussion went on but eventually the Senior Developer had finished his coffee and walked over to the dishwasher to put his cup away.

- Look, I appreciate the thinking you have obviously put into this but let me tell you a little something.

He put his cup away and turned to face the Junior Developer who at this point was majorly frustrated.

- I have used CVS for over 10 years. It is the best system there is. In this time there have been dozens of revision control systems that claimed to provide the same benefits this Subversion of yours does. I have tried many of them and none have delivered. Some have been worse than the systems CVS replaced, if you can believe that. The same will happen with Subversion. The advantages it claims to provide simply are not there, and that’s the sad truth.

The Junior Developer felt like shouting but controlled himself, realizing that no good could come of losing his temper. He slouched in his chair. The Senior Developer looked at the clearly disillusioned Junior Developer and realized the argument had ended in his favor. He headed back to his office.

At the coffee room door he turned around and said his final words to the Junior Developer.

- Besides, even if the atomic commits you seem to hold so dear did work, the improvements they provide would be minimal. They are a nice-to-have toy and will never amount to anything more than that.

Epilogue

This story is not true. But it could be.

Variants of this discussion are being held every single day in software development companies and communities around the world.

The true source of quality

Software quality has received a lot of attention recently. There have been tons of books, blog posts, conferences and the like on improving quality. Tools and practices such as TDD, automatic builds, agile methods, pair programming and static code analysers are praised for improving code quality.

And, indeed, that is what they have done.

But one should never confuse the tool with the person using it. All these wonderful tools are just that: tools. They are not the source of quality, only facilitators of it. The true essence of quality does not flow from them. It comes from somewhere else entirely. When distilled down to its core, there is only one source of true quality.

Caring.

The only way to get consistently high quality code is for the people who write it to care about it. This means that they have a personal interest in their code tree. They want it to succeed and flourish. In the best case they are even proud of it. This is the foundation all quality tools rest on.

If caring does not exist, even the best of tools cannot help. This is because human beings are very, very good at avoiding work they don’t want to do. As an example, let’s look at code review. A caring person will review code to the best of their abilities, because they want the end result to be the best it can be. A non-caring one will shrug, think “yeah, sure, fine, whatever” and push the accept button, because it’s less work for them and they know that their own merge requests will go in more easily if there is a general (though unspoken) consensus of doing things half-assed.

Unfortunately caring is not something you can buy; it is something you must cultivate. Free food and other perks provided by companies such as Valve and Google can be seen as one way of achieving this. If a company sincerely cares about its employees, they will in return care about the quality of their work.

All that said, here is my proposal for a coder’s mascot:

Introducing libcolumbus, a fast online approximate matching library

Every day computer users do a variation of a simple task: selecting an element from a list of choices. Examples include installing packages with ‘apt-get install packagename’, launching an application from the Dash with its name, selecting your country from a list on web pages and so on.

What all these use cases have in common is intolerance of errors. If there is even one typo in the text, the correct choice will not be found. The only way around this is to erase the query and type it again from scratch. This is something people have learned to do without thinking.

It’s not very good usability, though. If the user searches for, say, Firefox by typing “friefox” by accident, surely the computer should be able to detect what the user meant and offer that as an alternative.

The first user-facing program in Ubuntu to offer this kind of error tolerance was the HUD. It used the Levenshtein distance as a way of determining user intent. In computer science terminology this is called approximate string matching or, informally, fuzzy matching.
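
As an illustration, here is a minimal sketch of the classic dynamic-programming formulation of the Levenshtein distance. This is not the HUD’s actual code, just the textbook algorithm it builds on:

    #include <algorithm>
    #include <string>
    #include <vector>
    #include <iostream>

    // Classic dynamic-programming Levenshtein distance: the minimum
    // number of single-character inserts, deletes and substitutions
    // needed to turn string a into string b. Only two rows of the
    // DP table are kept in memory.
    int levenshtein(const std::string &a, const std::string &b) {
        std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
        for (size_t j = 0; j <= b.size(); ++j)
            prev[j] = int(j);              // distance from "" to b[0..j)
        for (size_t i = 1; i <= a.size(); ++i) {
            cur[0] = int(i);               // distance from a[0..i) to ""
            for (size_t j = 1; j <= b.size(); ++j)
                cur[j] = std::min({prev[j] + 1,      // delete from a
                                   cur[j - 1] + 1,   // insert into a
                                   prev[j - 1] + (a[i - 1] != b[j - 1])});
            std::swap(prev, cur);
        }
        return prev[b.size()];
    }

    int main() {
        // "friefox" -> "firefox" needs two edits (the swapped letters).
        std::cout << levenshtein("friefox", "firefox") << "\n"; // prints 2
        return 0;
    }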

Once the HUD was deployed, the need for this kind of error correction everywhere became apparent. Thus we set out to create a library that makes error tolerant matching easy to embed. This library is called libcolumbus.

Technical info

Libcolumbus has been designed with the following goals in mind:

  • it must be small
  • it must be fast
  • it must be easy to embed
  • it must be optimized for online typing

The last of these means that you can do queries at any time, even if the user is still typing.

At the core of libcolumbus is the Levenshtein distance algorithm. It is a well known and established way of doing fuzzy matching. Implementations are used in lots of different places, ranging from heavy-duty document retrieval engines such as Lucene and Xapian all the way down to Bash command completion. There is even a library that does fuzzy regexp matching.

What sets libcolumbus apart from these are two things, both of which are well known and documented but less often used: a fast search implementation and custom errors.

The first feature is about performance. The fast Levenshtein implementation in libcolumbus is taken almost verbatim from this public domain implementation. The main speedup comes from storing the words in a trie instead of iterating over all items on every query. As a rough estimate, a brute force implementation can do 50-100 queries per second on a data set of 3,000 words. The trie version can do 600 queries per second on a data set of 50,000 words.
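
To make the trie idea concrete, here is a small sketch of the technique (modeled on the same well-known public domain approach, not libcolumbus’s actual code): the words are stored in a trie, and the search walks the trie carrying one DP row per edge, so shared prefixes are computed only once and whole subtrees are pruned when every entry of a row exceeds the error budget.

    #include <algorithm>
    #include <map>
    #include <memory>
    #include <string>
    #include <utility>
    #include <vector>
    #include <iostream>

    struct TrieNode {
        std::map<char, std::unique_ptr<TrieNode>> children;
        bool isWord = false;
    };

    void insert(TrieNode &root, const std::string &word) {
        TrieNode *node = &root;
        for (char c : word) {
            auto &child = node->children[c];
            if (!child)
                child.reset(new TrieNode);
            node = child.get();
        }
        node->isWord = true;
    }

    // Recursive search: 'prevRow' is the DP row of the parent node,
    // 'edge' is the character on the edge leading to 'node'.
    void search(const TrieNode *node, char edge, const std::string &query,
                const std::vector<int> &prevRow, std::string &prefix, int maxDist,
                std::vector<std::pair<std::string, int>> &results) {
        std::vector<int> row(query.size() + 1);
        row[0] = prevRow[0] + 1;  // query exhausted: delete 'edge'
        for (size_t i = 1; i <= query.size(); ++i)
            row[i] = std::min({row[i - 1] + 1, prevRow[i] + 1,
                               prevRow[i - 1] + (query[i - 1] != edge)});
        if (node->isWord && row.back() <= maxDist)
            results.push_back(std::make_pair(prefix, row.back()));
        // Prune: if every entry exceeds maxDist, no longer word can match.
        if (*std::min_element(row.begin(), row.end()) <= maxDist)
            for (const auto &kv : node->children) {
                prefix.push_back(kv.first);
                search(kv.second.get(), kv.first, query, row, prefix,
                       maxDist, results);
                prefix.pop_back();
            }
    }

    std::vector<std::pair<std::string, int>> query(const TrieNode &root,
            const std::string &q, int maxDist) {
        std::vector<int> firstRow(q.size() + 1);
        for (size_t i = 0; i <= q.size(); ++i)
            firstRow[i] = int(i);
        std::vector<std::pair<std::string, int>> results;
        std::string prefix;
        for (const auto &kv : root.children) {
            prefix.push_back(kv.first);
            search(kv.second.get(), kv.first, q, firstRow, prefix,
                   maxDist, results);
            prefix.pop_back();
        }
        return results;
    }

    int main() {
        TrieNode root;
        insert(root, "firefox");
        insert(root, "files");
        // Finds "firefox" at distance 2; "files" is pruned away.
        for (const auto &r : query(root, "friefox", 2))
            std::cout << r.first << " (distance " << r.second << ")\n";
        return 0;
    }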

The second feature is about quality of results. It is best illustrated with an example. Suppose there are two items to choose from, “abc” and “abp”. If the user types “abo”, which one of these should be chosen? In the classical Levenshtein sense both of the choices are identical: they are one replace operation away from the query string.

However, from a usability point of view “abp” is the correct answer, because the letter p is right next to the letter o and very far from the letter c. The user probably meant to hit the key o but just missed it slightly. Libcolumbus allows you to set custom errors for these kinds of substitutions. If the standard substitution error is 100, one could set the substitution error for adjacent keys to a smaller value, say 20. This causes words with “simple typos” to be ranked higher automatically.
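
Here is a hedged sketch of how such a weighted distance can be computed. The cost table is made up for illustration (it only knows the o/p pair from the example above and one number pad mapping from the list below); it is not taken from libcolumbus:

    #include <algorithm>
    #include <string>
    #include <vector>
    #include <iostream>

    const int STANDARD_ERROR = 100;  // the example values from the text
    const int ADJACENT_ERROR = 20;
    const int INSERT_DELETE_ERROR = 100;

    // Toy cost function: a real implementation would consult a full
    // keyboard layout table instead of these hardcoded pairs.
    int substitutionError(char from, char to) {
        if (from == to)
            return 0;
        if ((from == 'o' && to == 'p') || (from == 'p' && to == 'o'))
            return ADJACENT_ERROR;   // neighbouring keys on the keyboard
        if (from == '3' && (to == 'd' || to == 'e' || to == 'f'))
            return ADJACENT_ERROR;   // number pad: key 3 carries d, e, f
        return STANDARD_ERROR;
    }

    int weightedDistance(const std::string &query, const std::string &word) {
        std::vector<int> prev(word.size() + 1), cur(word.size() + 1);
        for (size_t j = 0; j <= word.size(); ++j)
            prev[j] = int(j) * INSERT_DELETE_ERROR;
        for (size_t i = 1; i <= query.size(); ++i) {
            cur[0] = int(i) * INSERT_DELETE_ERROR;
            for (size_t j = 1; j <= word.size(); ++j)
                cur[j] = std::min({prev[j] + INSERT_DELETE_ERROR,
                                   cur[j - 1] + INSERT_DELETE_ERROR,
                                   prev[j - 1] +
                                       substitutionError(query[i - 1], word[j - 1])});
            std::swap(prev, cur);
        }
        return prev[word.size()];
    }

    int main() {
        // "abo" is now much closer to "abp" (error 20) than to "abc" (error 100).
        std::cout << weightedDistance("abo", "abp") << "\n";  // prints 20
        std::cout << weightedDistance("abo", "abc") << "\n";  // prints 100
        return 0;
    }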

There are several other uses for custom errors:

  • diacritical characters such as ê, é and è can be mapped to have very small errors to each other
  • fuzzy number pad typing can be enabled by assigning small errors from each number to its corresponding letters (e.g. ‘3’ to ‘d’, ‘e’ and ‘f’) as well as to adjacent letters (i.e. those on number keys ‘2’ and ‘6’)
  • spam can be detected by assigning low errors between letters and numbers that look similar, such as ‘1’ -> ‘i’ and ‘4’ -> ‘a’, to match ‘v14gr4’ to ‘viagra’

Libcolumbus contains sample implementations for all of these except the last one. It also allows setting insert and delete errors at the beginning and end of the match. When set to low values, this makes the algorithm do a fuzzy substring search. The online matching discussed above is implemented with this. It allows the library to match the query term “fier” to “firefox” very quickly.
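
The substring trick can be sketched in a few lines. With free gaps at both ends, the first DP row is all zeros (the match may start anywhere in the word) and the result is the minimum of the last row (the match may end anywhere). This is the standard textbook formulation, not libcolumbus’s exact code:

    #include <algorithm>
    #include <string>
    #include <vector>
    #include <iostream>

    // Distance from 'query' to the best-matching substring of 'word':
    // skipping characters at the start and end of 'word' is free.
    int substringDistance(const std::string &query, const std::string &word) {
        std::vector<int> prev(word.size() + 1, 0);  // free start gap
        std::vector<int> cur(word.size() + 1);
        for (size_t i = 1; i <= query.size(); ++i) {
            cur[0] = int(i);
            for (size_t j = 1; j <= word.size(); ++j)
                cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1,
                                   prev[j - 1] + (query[i - 1] != word[j - 1])});
            std::swap(prev, cur);
        }
        return *std::min_element(prev.begin(), prev.end());  // free end gap
    }

    int main() {
        // "fier" matches the start of "firefox" with a single edit,
        // even though the full-word distance would be much larger.
        std::cout << substringDistance("fier", "firefox") << "\n";  // prints 1
        return 0;
    }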

Get the code

Our goal for the coming cycle is to enable error tolerant matching in as many locations as possible. Those developers who wish to try it on their application can get the source code here.

The library is implemented in C++0x. The recommended API is the C++ one. However, since many applications cannot link against C++ libraries, we also provide a plain C API. It is not yet as extensive as the C++ one, but we hope to provide full coverage there too.

The main thing to understand is the data model. Libcolumbus deals in terms of documents. A document consists of a (user provided) document ID and a named collection of texts. The ID field is guaranteed to be large enough to hold a pointer. Here’s an example of what a document could look like:

id: 42
  name: packagename
  description: This package does something.

Each line is a single-word field name followed by the text it contains. A document can contain an arbitrary number of fields. This is roughly analogous to what MongoDB uses. It should be noted that libcolumbus does not read any data files. The user needs to create document objects programmatically. The example above is just a visualisation.

When the documents have been created and passed to the main matcher object for processing, the system is ready to be queried. The result of a query is a list of document IDs and corresponding relevancies. Relevancy is just a number whose meaning is roughly “bigger relevancy means better”. The exact values are arbitrary and may change even between queries. End-user applications usually don’t need to bother with them.
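
To make the model concrete, here is a hedged sketch of how an application might build and query such an index. The class and method names (Document, Matcher, addDocument, match) are illustrative assumptions, not the real libcolumbus API, and the brute-force scoring merely stands in for the trie-based search described earlier:

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>

    typedef uint64_t DocumentID;  // "large enough to hold a pointer"

    struct Document {
        DocumentID id;
        std::map<std::string, std::string> fields;  // field name -> text
    };

    struct MatchResult {
        DocumentID id;
        double relevancy;  // only "bigger is better" is meaningful
    };

    // Plain Levenshtein distance, reused as the scoring core.
    int distance(const std::string &a, const std::string &b) {
        std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
        for (size_t j = 0; j <= b.size(); ++j)
            prev[j] = int(j);
        for (size_t i = 1; i <= a.size(); ++i) {
            cur[0] = int(i);
            for (size_t j = 1; j <= b.size(); ++j)
                cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1,
                                   prev[j - 1] + (a[i - 1] != b[j - 1])});
            std::swap(prev, cur);
        }
        return prev[b.size()];
    }

    class Matcher {
    public:
        void addDocument(const Document &doc) { docs.push_back(doc); }

        // Brute-force scoring over every word of every field; this only
        // illustrates the interface, not how the real library works.
        std::vector<MatchResult> match(const std::string &query) const {
            std::vector<MatchResult> results;
            for (const auto &doc : docs) {
                int best = int(query.size()) + 1;  // worst possible score
                for (const auto &field : doc.fields) {
                    std::istringstream words(field.second);
                    std::string w;
                    while (words >> w)
                        best = std::min(best, distance(query, w));
                }
                results.push_back({doc.id, 1.0 / (1 + best)});
            }
            return results;
        }

    private:
        std::vector<Document> docs;
    };

    int main() {
        Document pkg;
        pkg.id = 42;
        pkg.fields["name"] = "packagename";
        pkg.fields["description"] = "This package does something.";

        Matcher m;
        m.addDocument(pkg);
        for (const auto &r : m.match("packagenme"))  // note the typo
            std::cout << r.id << " " << r.relevancy << "\n";
        return 0;
    }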

There is one thing to be mindful of, though. The current implementation has a memory backend only. Its memory usage is moderate, but it has not yet been thoroughly optimized. If your data set is a few hundred unique words, you probably don’t have to care. A few thousand words take around 5 MB, which may be a problem on low memory devices. Tens of thousands of words take tens of megabytes, which may be too much for many use cases. Both memory optimizations and a disk backend are planned, but for now you might want to stick to smallish data sets.