Build speed again: intuition lies to you

The problem

Suppose you have a machine with 8 cores. Also suppose you have the following source packages that you want to compile from scratch.

eog_3.8.2.orig.tar.xz
grilo_0.2.6.orig.tar.xz
libxml++2.6_2.36.0.orig.tar.xz
Python-3.3.2.tar.bz2
glib-2.36.4.tar.xz
libjpeg-turbo_1.3.0.orig.tar.gz
wget_1.14.orig.tar.gz
grail-3.1.0.tar.bz2
libsoup2.4_2.42.2.orig.tar.xz

You want to achieve this as fast as possible. How would you do it?

Think carefully before proceeding.

The solution

Most of you probably came up with the basic idea of compiling one package after the other with 'make -j 8' or equivalent. There are several reasons to do this, the main one being that it saturates the CPU.

The other choice would be to start the compilation in all the source directories at the same time, but with 'make -j 1'. You could also run two parallel build jobs with '-j 4', or four with '-j 2'.
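For concreteness, the second family of strategies can be sketched with xargs -P. This is a hypothetical sketch: the package names are stand-ins, and the sleep stands in for the real './configure && make -j 4' step inside each source directory.

```shell
# Run at most two "builds" concurrently; -P controls how many run at once.
# The echo/sleep pair is a stand-in for 'cd "$pkg" && ./configure && make -j 4'.
printf '%s\n' eog grilo wget glib | \
    xargs -n 1 -P 2 sh -c 'echo "start $0"; sleep 0.1; echo "done $0"'
```

Swapping the -P and -j values trades package-level parallelism against per-package parallelism while keeping the machine fully loaded.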

But surely that would be pointless. Doing one thing at a time maximises data locality, so the different build trees don’t have to compete with each other for cache.

Right?

Well, let’s measure what actually happens.

[Bar chart: total build time for each parallelisation strategy]

The first bar shows the time when running with '-j 8'. It is slower than all the other combinations; in fact it is over 40% (one minute) slower than the fastest one, while the alternatives are all roughly as fast as each other.

Why is this?

In addition to the compilation and linking processes, there are parts of the build that cannot be parallelised. There are two main ones in this case. Can you guess what they are?

What all of these projects have in common is that they are built with Autotools. The configure step takes a long time and cannot be parallelised with -j. When building packages consecutively, even with perfect parallelisation the total build time can never drop below the sum of the configure script run times. That is easily half a minute per package for any non-trivial project, even on the fastest i7 machine that money can buy.

The second thing is the time that is lost inside Make itself. Its data model makes it very hard to optimise. See all the gory details here.

The end result of all this is a hidden productivity sink, a minute lost here, one there and a third one over there. Sneakily. In secret. In a way people have come to expect.

These are the worst kinds of productivity losses because people honestly believe that this is just the way things are, have always been and shall be evermore. That is what their intuition and experience tells them.

The funny thing about intuition is that it lies to you. Big time. Again and again.

The only way out is measurements.

 

The cost of #include

Using libraries in C++ is simple. You just write #include <libname> and all the necessary definitions appear in your source file.

But what is the cost of this single line of code?

Quite a lot, it turns out. Measuring the effect is straightforward. GCC has a compiler flag, -E, which runs only the preprocessor on the given source file. The cost of an include can therefore be measured with a source file containing a single directive: #include <filename>. The number of lines in the resulting output tells how much code the compiler needs to parse in order to use the library.

Here is a table of measurements. They were run on a regular desktop PC with 4 GB of RAM and an SSD. The tests were run several times to ensure that everything was in cache. The machine was running 64-bit Ubuntu 12.10 and the compiler was gcc.

Header                     LOC    Time (s)

map                       8751    0.02
unordered_map             9728    0.03
vector                    9964    0.02
Python.h                 11577    0.05
string                   15791    0.07
memory                   17339    0.04
sigc++/sigc++.h          21900    0.05
boost/regex.h            22285    0.06
iostream                 23496    0.06
unity/unity.h            28254    0.14
xapian.h                 36023    0.08
algorithm                40628    0.12
gtk/gtk.h                52379    0.26
gtest/gtest.h            53588    0.12
boost/proto/proto.hpp    78000    0.63
gmock/gmock.h            82021    0.18
QtCore/QtCore            82090    0.22
QtWebKit/QtWebKit        95498    0.23
QtGui/QtGui             116006    0.29
boost/python.hpp        132158    3.41
Nux/Nux.h               158429    0.71

It should be noted that the elapsed time covers only running the code through the preprocessor, which is relatively simple compared to parsing the code and generating the corresponding machine instructions. I ran the test with Clang as well and the times were roughly similar.

Even the most common headers, such as vector, add almost 10k lines of code whenever they are included. This is quite a lot more than most source files that use them. On the other end of the spectrum is stuff like Boost.Python, which takes over three seconds to include. An interesting question is why it is so much slower than Nux, even though it contains fewer lines.

This is the main reason why spurious includes need to be eliminated. Merely having the include directive costs a lot of time, even if the features in question are never used. Placing a slow include in a much-used header file can cause massive slowdowns. So if you could go ahead and not do that, that would be great.