What can you achieve with a single thread

Threads are a bit like fetishes: some people can’t get enough of them and other people just can’t see what the point is. This leads to eternal battles between “we need the power” and “this is too complex”. These have a tendency to never end well.

One inescapable fact about multithreaded and asynchronous programming is that it is hard. A rough estimate says that a multithreaded solution is between ten and 1000 times harder to design, write, debug and maintain than a single threaded one. Clearly, this should not be done without heavy duty performance needs. But how much is that?

Let’s do an experiment to find out. Let’s create a simple C++ network echo server the source code of which can be downloaded here. It can serve an arbitrary amount of clients but it uses only one thread to do so. The implementation uses a simple epoll loop over the open connections.

For our test we use 10 clients that do 10 000 queries each. To reduce the effects of network latency, the clients run on the same machine. The test hardware is a Nexus 4 running the latest Ubuntu phone.

The test finishes in 11 seconds, which means that a single threaded server can serve roughly 10 000 requests a second using basic ARM hardware. It should be noted that because the clients run on the same machine, they are stealing CPU time from the server. The service rates would be bigger if the server process got its own processor. It would also be bigger if compiler optimizations had been enabled but who needs those, anyway.

The end result of all this is that unless you need massive amounts of queries per second or your backend is incredibly slow, multithreading probably won’t do you much good and you’ll be much better of doing everything single-threaded. You’ll spend a lot less time in a debugger and will be generally happier as well.

Even if you need these, multithreading might still not be the way to go. There are other ways of parallelization, such as using multiple processes, which provides additional memory safety and error tolerance as well. This is not to say threads are bad. They are a wonderful tool for many different use cases. You should just be aware the some times the best way to use threads is not to use them at all.

Actually, make that “most times”.

Build speed again: intuition lies to you

The problem

Suppose you have a machine with 8 cores. Also suppose you have the following source packages that you want to compile from scratch.

eog_3.8.2.orig.tar.xz
grilo_0.2.6.orig.tar.xz
libxml++2.6_2.36.0.orig.tar.xz
Python-3.3.2.tar.bz2
glib-2.36.4.tar.xz
libjpeg-turbo_1.3.0.orig.tar.gz
wget_1.14.orig.tar.gz
grail-3.1.0.tar.bz2
libsoup2.4_2.42.2.orig.tar.xz

You want to achieve this as fast as possible. How would you do it?

Think carefully before proceeding.

The solution

Most of you probably came up with the basic idea of compiling one after the other with ‘make -j 8′ or equivalent. There are several reasons to do this, the main one being that this saturates the CPU.

The other choice would be to start the compilation on all subdirs at the same time but with ‘make -j 1′. You could also run two parallel build jobs with ‘-j 4′ or four with ‘-j 2′.

But surely that would be pointless. Doing one thing at the time maximises data locality so the different build trees don’t have to compete with each other for cache.

Right?

Well, let’s measure what actually happens.

timez

The first bar shows the time when running with ‘-j 8′. It is slower than all other combinations. In fact it is over 40% (one minute) slower than the fastest one, although all alternatives are roughly as fast.

Why is this?

In addition to compilation and linking processes, there are parts in the build that can not be parallelised. There are two main things in this case. Can you guess what they are?

What all of these projects had in common is that they are built with Autotools. The configure step takes a very long time and can’t be parallelised with -j. When building consecutively, even with perfect parallelisation, the build time can never drop below the sum of configure script run times. This is easily half a minute each on any non-trivial project even on the fastest i7 machine that money can buy.

The second thing is time that is lost inside Make. Its data model makes it very hard to optimize. See all the gory details here.

The end result of all this is a hidden productivity sink, a minute lost here, one there and a third one over there. Sneakily. In secret. In a way people have come to expect.

These are the worst kinds of productivity losses because people honestly believe that this is just the way things are, have always been and shall be evermore. That is what their intuition and experience tells them.

The funny thing about intuition is that it lies to you. Big time. Again and again.

The only way out is measurements.

 

The cost of #include

Using libraries in C++ is simple. You just do #include<libname> and all necessary definitions appear in your source file.

But what is the cost of this single line of code?

Quite a lot, it turns out. Measuring the effect is straightforward. Gcc has a compiler flag -E, which only runs the preprocessor on the given source file. The cost of an include can be measured by writing a source file that has only one command: #include<filename>. The amount of lines in the resulting file tells how much code the compiler needs to parse in order to use the library.

Here is a table with measurements. They were run on a regular desktop PC with 4 GB of RAM and an SSD disk. The tests were run several times to insure that everything was in cache. The machine was running Ubuntu 12/10 64 bit and the compiler was gcc.

Header                     LOC    Time

map                       8751    0.02
unordered_map             9728    0.03
vector                    9964    0.02
Python.h                 11577    0.05
string                   15791    0.07
memory                   17339    0.04
sigc++/sigc++.h          21900    0.05
boost/regex.h            22285    0.06
iostream                 23496    0.06
unity/unity.h            28254    0.14
xapian.h                 36023    0.08
algorithm                40628    0.12
gtk/gtk.h                52379    0.26
gtest/gtest.h            53588    0.12
boost/proto/proto.hpp    78000    0.63
gmock/gmock.h            82021    0.18
QtCore/QtCore            82090    0.22
QtWebKit/QtWebKit        95498    0.23
QtGui/QtGui             116006    0.29
boost/python.hpp        132158    3.41
Nux/Nux.h               158429    0.71

It should be noted that the elapsed time is only the amount it takes to run the code through the preprocessor. This is relatively simple compared to parsing the code and generating the corresponding machine instructions. I ran the test with Clang as well and the times were roughly similar.

Even the most common headers such as vector add almost 10k lines of code whenever they are included. This is quite a lot more than most source files that use them. On the other end of the spectrum is stuff like Boost.Python, which takes over three seconds to include. An interesting question is why it is so much slower than Nux, even though it has less code.

This is the main reason why spurious includes need to be eliminated. Simply having include directives causes massive loss of time, even if the features in question are never used. Placing a slow include in a much used header file can cause massive slowdowns. So if you could go ahead and not do that, that would be great.