The code documentation fallacy

Say you start work on a new code base. Would you, as a user, rather have 90% or 10% of its API functions commented with Doxygen or something similar?

Using my psychic powers I suspect that you chose 90%.

It seems like the obvious thing. Lots of companies even mandate that all API functions (or at least 90% of them) must be documented. Not having comments is just bad. A perfectly obvious no-brainer.

But is it really?

Unfortunately there are some problems with this assumption. The main one is that the comments will be written by human beings, so what they end up being is probably something like this:

/**
 * Takes a foobar and frobnicates it.
 *
 * @param f the foobar to be frobnicated.
 * @param strength how strongly to frobnicate.
 * @return the frobnicated result.
 */
int frobnicate_foobar(Foobar f, int strength);

This is something I like to call documentation by word order shuffle. Now we can ask the truly relevant question: what additional information does this kind of comment provide?

The answer is, of course, absolutely nothing. It is only noise. No, actually it is even worse: it is noise that has a very large probability of being wrong. When some coder changes the function, it is very easy to forget to update the comments.

On the other hand, if only 10% of the functions are documented, most functions don’t have any comments, but the ones that do probably have something like this:

/** 
 * The Foobar argument must not have been initialized in a different
 * thread because that can lead to race conditions.
 */ 
int frobnicate_foobar(Foobar f, int strength);

This is the kind of comment that is actually useful. Naturally it would be better to check for the specified condition inside the function, but sometimes you can’t. Having it in a comment is the right thing to do in those cases. Not having tons of junk documentation makes these kinds of remarks stand out. This means, paradoxically, that having fewer comments leads to better documentation and a better user experience.

As a rough estimate, 95% of functions in any code base should be so simple and specific that their signature is all you need to use them. If they are not, API design has failed: back to the drawing board.
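As a hypothetical example, a signature like this needs no comment at all; the names and types tell the whole story:

double celsius_to_fahrenheit(double celsius);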

Dealing with bleeding edge software with btrfs snapshots

Developing with the newest of the new packages is always a bit tricky. Every now and then they break in interesting ways. Sometimes they corrupt the system so badly that downgrading becomes impossible. Extreme circumstances may corrupt the system’s package database, and so on. Traditionally fixing this has meant reinstalling the entire system, which is unpleasant and time-consuming. Fortunately there is now a better way: snapshotting the system with btrfs.

The following guide assumes that you are running btrfs as your root file system. The newest Quantal can boot off a btrfs root, but there may be issues, so please read the documentation in the wiki.

The basic concept in snapshotting your system is called a subvolume. It is kind of like a subpartition inside the main btrfs partition. By default Ubuntu’s installer creates a btrfs root partition with two subvolumes called @ and @home. The first of these is mounted as the root directory and the second as the home directory.
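You can check the layout with the btrfs tool (assuming your root file system is on btrfs):

sudo btrfs subvolume list /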

Suppose you are going to do something really risky, and want to preserve your system. First you mount the raw btrfs partition somewhere:

sudo mkdir /mnt/root
sudo mount /dev/sda1 /mnt/root
cd /mnt/root

Here /dev/sda1 is your root partition. You can mount it like this even though the subvolumes are already mounted. If you do an ls, you see two subdirectories, @ and @home. Snapshotting is simple:

sudo btrfs subvolume snapshot @ @snapshot-XXXX

This takes maybe one second, and when the command returns the system is secured. You are now free to trash your system in whatever way you want, though you might want to unmount /mnt/root first so you don’t accidentally destroy your snapshots.

Restoring the snapshot is just as simple. Mount /mnt/root again and do:

sudo mv @ @broken
sudo btrfs subvolume snapshot @snapshot-XXXX @

If you are sure you don’t need @snapshot-XXXX any more, you can just rename it to @ instead of taking a snapshot of it. You can do this even if you are booted into the system, i.e. are using @ as your current system root fs.
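In other words, after moving the broken @ out of the way as above:

sudo mv @snapshot-XXXX @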

Reboot your machine and your system has been restored to the state it was in when you ran the snapshot command. As an added bonus your home directory does not roll back, but retains all changes made during the trashing, which is what you want most of the time. If you want to roll back home as well, just snapshot it at the same time as the root directory.
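For example, when taking the root snapshot you could also run:

sudo btrfs subvolume snapshot @home @home-snapshot-XXXX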

You can get rid of useless and broken snapshots with this command:

sudo btrfs subvolume delete @useless-snapshot

You can’t remove subvolumes with rm -r, even if run with sudo.

Relative popularity of build systems

The conventional wisdom in build systems is that GNU Autotools is the one true established standard and that other build systems are only rarely used.

But is this really true?

I created a script that downloads all original source packages from Ubuntu’s main pool. If there were multiple versions of the same project, only the newest was chosen. Then I created a second script that goes through those packages and checks which build system each one actually uses. Here’s the breakdown:

CMake:           348     9%
Autofoo:        1618    45%
SCons:            10     0%
Ant:             149     4%
Maven:            41     1%
Distutil:        313     8%
Waf:               8     0%
Perl:            341     9%
Make(ish):       351     9%
Customconf:       45     1%
Unknown:         361    10%

Here Make(ish) means packages that don’t have any other build system but do have a makefile. This usually indicates building via custom makefiles. Correspondingly, Customconf is for projects that don’t have any other build system but do have a configure file. This is usually a handwritten shell or Python script.
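The actual scripts are linked below, but as a rough sketch, the detection boils down to checking for marker files in priority order. The marker lists here are my assumption of how such a classification could work, not necessarily what the real script does:

import os

# Build systems and their marker files, checked in priority order so that
# a project shipping both CMakeLists.txt and a Makefile counts as CMake.
# NOTE: these marker lists are an assumption, not the real script's rules.
MARKERS = [
    ('CMake',      ('CMakeLists.txt',)),
    ('Autofoo',    ('configure.ac', 'configure.in')),
    ('SCons',      ('SConstruct',)),
    ('Ant',        ('build.xml',)),
    ('Maven',      ('pom.xml',)),
    ('Distutil',   ('setup.py',)),
    ('Waf',        ('wscript',)),
    ('Perl',       ('Makefile.PL', 'Build.PL')),
    ('Customconf', ('configure',)),
    ('Make(ish)',  ('Makefile', 'makefile', 'GNUmakefile')),
]

def detect_build_system(srcdir):
    """Classify an unpacked source tree by its top-level marker files."""
    files = set(os.listdir(srcdir))
    for name, markers in MARKERS:
        if any(m in files for m in markers):
            return name
    return 'Unknown'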

This data is skewed by the fact that the pool is just a jumble of packages. It would be interesting to run this analysis separately for Precise, Oneiric and so on to see the progression over time. For truly interesting results you would run it against the whole of Debian.

The relative popularity of CMake and Autotools is roughly 20/80. This shows that Autotools is no longer the sole dominant player for C/C++ that it once was. It’s still far and away the most popular, though.

The Unknown set contains stuff such as Rake builds; I simply did not have time to add them all. It also has a lot of fonts, which makes sense, since you don’t really build those in the traditional sense.

The scripts can be downloaded here. A word of warning: to run the analysis you need to download 13 GB of source, so don’t do it just for the heck of it. The parser script does not download anything; it just produces a list of URLs. Download the packages with wget -i.

Some orig packages are compressed with xz, which Python’s tarfile module can’t handle. You have to repack them yourself prior to running the analysis script.
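Something along these lines should do the repacking (the file name is just an example):

xz -d somepackage.orig.tar.xz
gzip somepackage.orig.tar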