Canonical Voices

Posts tagged with 'c++'

niemeyer

A few years ago, when I started pondering the possibility of porting juju to the Go language, one of the first pieces of the puzzle to be put in place was goyaml: a Go package to parse and serialize YAML documents. This was just an experiment and, as a sane route to get started, a Go layer that does all the language-specific handling was written on top of the libyaml C scanner, parser, and serializer library.

This was a good initial plan, but for a number of reasons the end goal was always to have a pure Go implementation. Having a C layer in a Go program slows down builds significantly due to the time taken to build the C code, makes compiling on other platforms and cross-compiling harder, has certain runtime penalties, and also forces the application to drop the memory safety guarantees offered by Go.

For these reasons, over the last couple of weeks I took a few hours a day to port the C backend to Go. Counted in full-time work days, the total would be equivalent to about a week's worth of work.

The work started on the scanner and parser side of the library. This took most of the time, not only because it encompassed more than half of the code base, but also because the shared logic had to be ported too, and there was a need to understand which patterns were used in the old code and how they would be converted across in a reasonable way.

The whole scanner and parser plus header files, or around 5000 code lines of C, were ported over in a single shot without intermediate runs. To steer the process in a sane direction, gofmt was called often to reformat the converted code, and then the project was compiled every once in a while to make sure that the pieces were hanging together properly enough.

It’s worth highlighting how useful gofmt was in that process. The C code was converted in whatever way was most convenient to type, and gofmt would then quickly put it all together in a familiar form for analysis. Quite often, it would also point out trivial syntactic issues. A double win.

After the scanner and parser were finally converted completely, the pre-existing Go unmarshaling logic was shifted to the new pure implementation, and the reading side of the test suite could run as-is. Naturally, though, it didn’t work out of the box.

To quickly pick up the errors in the new implementation, the C logic and the Go port were put side-by-side to run the same tests, and tracing was introduced in strategic points of the scanner and parser. With that, it was easy to spot where they diverged and pinpoint the human errors.

It took about two hours to get the full suite to run successfully, with a handful of bugs uncovered. Out of curiosity, the issues were:

  • An improperly dropped parenthesis affected the precedence of an expression
  • A slice was being iterated with copying semantics where a reference was necessary
  • A pointer arithmetic conversion missed the base where there was base+offset addressing
  • An inner scoped variable improperly shadowed the outer scope

The same process of porting and test-fixing was then repeated on the serializing side of the project, in a much shorter time frame for the reasons cited.

The resulting code isn’t yet idiomatic Go. There are several signs in it that it was ported over from C: the naming conventions, the use of custom solutions for buffering and reader/writer abstractions, the excessive copying of data due to the need to track data ownership so the simple deallocating destructors don’t double-free, etc. It’s also been deoptimized, due to changes such as the removal of macros (and, in many cases, of their inlining) and the direct expansion of large unions, which causes some core objects to grow significantly.

At this point, though, it’s easy to gradually move the code base towards the common idiom in small increments and as time permits, cleaning up the artifacts that were left behind.

This code will be made public over the next few days via a new goyaml release. Meanwhile, some quick facts about the process and outcome follow.

Lines of code

According to cloc, there was a total of 7070 lines of C code in .c and .h files. Of those, 6727 were ported, and 342 were 12 functions that were left unconverted as being unnecessary right now. Those 6727 lines of C became 5039 lines of Go code in a mostly one-to-one dumb translation.

That difference comes mainly from garbage collection, lack of forward declarations, standard helpers such as append, range-based for loops, first class slice type with length and capacity, internal OOM handling, and so on.

Future work can easily increase the difference further by replacing some of the ported logic with more sensible options available in Go, such as the standard abstractions for readers and writers, the buffered writing support available in the standard library, etc.

Code clarity and safety

In the specific context of the work done, which is of a scanner, parser and serializer, the slice abstraction is responsible for noticeable clarity gains in the code, when compared to the equivalent logic based on pointer arithmetic. It also gives a much more comforting guarantee of correctness of the written code due to bound-checking.

Performance

While curious, this shouldn’t be taken as a performance comparison between the two languages. It compares a fine-tuned C implementation with something that is worse than a direct one-to-one port. Not only has no time at all been spent on preventing waste, but the original logic was also deoptimized by changes such as the removal of inlining macros and the expansion of large unions. There are many obvious changes to be done for improving performance.

With that out of the way, in a simple decoding benchmark the C-backed decoder runs in about 37% of the time taken by the out-of-the-box, deoptimized Go port.

Output size

The previous goyaml.a Go package file was 1463 kB. The new one is 1016 kB. This difference includes the glue code generated for the C integration.

Considering only the .c and .h files involved in the port, the C object code generated with the standard flags used by the go build tool (-g -O2) sums up to 789 kB. The equivalent Go code compiled with the standard settings is 664 kB. The 12 functions not ported also account for part of that difference, so the difference is pretty much negligible.

Build time

Building the 8 .c files alone takes 3.6 seconds with the standard flags used by the go build tool (-g -O2). After the port, building the entire Go project with the standard settings takes 0.3 seconds.

Mechanical changes

Many of the mechanical changes were done using regular expressions. Excluding the trivial ones, about a dozen regular expressions were used to swap variable and type names, drop parentheses, place brackets in the right locations, convert function declarations, and so on.

niemeyer

Last week I was part of a rant with a couple of coworkers about the fact that Go handles errors for expected scenarios by returning an error value instead of using exceptions or a similar mechanism. This is a rather controversial topic because people have grown used to having errors out of their way via exceptions. Go brings back an improved version of a well-known pattern, previously adopted by a number of languages (including C), where errors are communicated via return values. This means that errors are in the programmer’s face and have to be dealt with all the time. In addition, the controversy extends to the fact that, in languages with exceptions, every unadorned error comes with a full traceback of what happened and where, which in some cases is convenient.

All this convenience has a cost, though, which is rather simple to summarize:

Exceptions teach developers to not care about errors.

A sad corollary is that this is relevant even if you are a brilliant developer, as you’ll be affected by the world around you being lenient towards error handling. The problem will show up in the libraries that you import, in the applications that are sitting in your desktop, and in the servers that back your data as well.

Raymond Chen described the issue back in 2004 as:

Writing correct code in the exception-throwing model is in a sense harder than in an error-code model, since anything can fail, and you have to be ready for it. In an error-code model, it’s obvious when you have to check for errors: When you get an error code. In an exception model, you just have to know that errors can occur anywhere.

In other words, in an error-code model, it is obvious when somebody failed to handle an error: They didn’t check the error code. But in an exception-throwing model, it is not obvious from looking at the code whether somebody handled the error, since the error is not explicit.
(…)
When you’re writing code, do you think about what the consequences of an exception would be if it were raised by each line of code? You have to do this if you intend to write correct code.

That’s exactly right. Every line that may raise an exception holds a hidden “else” branch for the error scenario that is very easy to forget about. Even if it sounds like a pointless repetitive task to be entering that error handling code, the exercise of writing it down forces developers to keep the alternative scenario in mind, and pretty often it doesn’t end up empty.

This isn’t the first time I’ve written about this, and given the controversy that surrounds these claims, I generally try to find one or two examples that bring the issue home. So here is the best example I could find today, within the pty module of Python 3.3’s standard library:

def spawn(argv, master_read=_read, stdin_read=_read):
    """Create a spawned process."""
    if type(argv) == type(''):
        argv = (argv,)
    pid, master_fd = fork()
    if pid == CHILD:
        os.execlp(argv[0], *argv)
    (...)

Every time someone calls this logic with an improper executable in argv there will be a new Python process lying around, uncollected, and unknown to the application, because execlp will fail, and the process just forked will be disregarded. It doesn’t matter if a client of that module catches that exception or not. It’s too late. The local duty wasn’t done. Of course, the bug is trivial to fix by adding a try/except within the spawn function itself. The problem, though, is that this logic looked fine for everybody that ever looked at that function since 1994 when Guido van Rossum first committed it!

Here is another interesting one:

$ make clean
Sorry, command-not-found has crashed! Please file a bug report at:

https://bugs.launchpad.net/command-not-found/+filebug

Please include the following information with the report:

command-not-found version: 0.3
Python version: 3.2.3 final 0
Distributor ID: Ubuntu
Description:    Ubuntu 13.04
Release:        13.04
Codename:       raring
Exception information:

unsupported locale setting
Traceback (most recent call last):
  File "/.../CommandNotFound/util.py", line 24, in crash_guard
    callback()
  File "/usr/lib/command-not-found", line 69, in main
    enable_i18n()
  File "/usr/lib/command-not-found", line 40, in enable_i18n
    locale.setlocale(locale.LC_ALL, '')
  File "/usr/lib/python3.2/locale.py", line 541, in setlocale
    return _setlocale(category, locale)
locale.Error: unsupported locale setting

That’s a pretty harsh crash for the lack of locale data in a system-level application that is, ironically, supposed to tell users what packages to install when commands are missing. Note that at the top of the stack there’s a reference to crash_guard. This function has the intent of catching all exceptions right at the edge of the call stack, and displaying a detailed system specification and traceback to aid in fixing the problem.

Such “parachute catching” is a fairly common pattern in exception-oriented programming and tends to give developers the false sense of having good error handling within the application. Rather than actually guarding the application, though, it’s just a useful way to crash. The proper thing to have done in the case above would be to print a warning, if at all, and then let the program run as usual. This would have been achieved by simply wrapping that one line as in:

try:
    locale.setlocale(locale.LC_ALL, '')
except Exception as e:
    print("Cannot change locale:", e)

Clearly, it was easy to handle that one. The problem, again, is that it was very natural to not do it in the first place. In fact, it’s more than natural: it actually feels good to not be looking at the error path. It’s less code, more linear, and what’s left is the most desired outcome.

The consequence, unfortunately, is that we’re immersing ourselves in a world of brittle software and pretty whales. Although more verbose, the error result style builds the correct mindset: does that function or method have a possible error outcome? How is it being handled? Is that system-interacting function not returning an error? What is being done with the problem that, of course, can happen?

A surprising number of crashes and plain misbehavior are the result of such unconscious negligence.

niemeyer

This weekend the right conditions came together to sort out a pet peeve that shows up every once in a while when coding. Writing logic that interacts with other applications in the system via their stdin and stdout streams is often more involved than it should be, which seems pretty ironic when sitting in front of a Unix-like system.

Rather than going to the trouble of setting up pipes and hooking them up in a custom way, applications often end up just delegating the job to /bin/sh. That is not ideal for a number of reasons:

  • argument formatting isn’t straightforward;
  • injecting custom application-defined logic is hard, so even simple tasks that could easily be done in the language itself end up shelling out to further external applications;
  • and so on.

In an attempt to address that, I’ve spent some time working on an experimental Go package that is being released today: pipe.

I hope you like it as well, and please drop me a note if you find any issues.

Jussi Pakkanen

A relatively large portion of software development time is not spent on writing, running, debugging or even designing code, but on waiting for it to finish compiling. This is usually seen as a necessary evil and accepted as an unfortunate fact of life. This is a shame, because spending some time optimizing the build system can yield quite dramatic productivity gains.

Suppose a build system takes some thirty seconds to run for even trivial changes. This means that even in theory you can do at most two changes a minute. In practice the rate is a lot lower. If the build step takes only a few seconds, trying out new code becomes a lot faster. It is easier to stay in the zone when you don’t have to pause every so often to wait for your tools to finish doing their thing.

Making fundamental changes in the code often triggers a complete rebuild. If this takes an hour or more (there are code bases that take 10+ hours to build), people try to avoid fundamental changes as much as possible. This causes loss of flexibility. It becomes very tempting to just do a band-aid tweak rather than thoroughly fix the issue at hand. If the entire rebuild could be done in five to ten minutes, this issue would become moot.

In order to make things fast, we first have to understand what is happening when C/C++ software is compiled. The steps are roughly as follows:

  1. Configuration
  2. Build tool startup
  3. Dependency checking
  4. Compilation
  5. Linking

We will now look at each step in more detail focusing on how they can be made faster.

Configuration

This is the first step when starting a build. It usually means running a configure script or CMake, Gyp, SCons or some other tool. This can take anything from one second to several minutes for very large Autotools-based configure scripts.

This step happens relatively rarely: it only needs to be rerun when the build configuration itself changes. Short of changing build systems, there is not much to be done to make this step faster.

Build tool startup

This is what happens when you run make or click on the build icon in an IDE (which is usually an alias for make). The build tool binary starts and reads its configuration files as well as the build configuration, which are usually the same thing.

Depending on build complexity and size, this can take anywhere from a fraction of a second to several seconds. By itself this would not be so bad. Unfortunately most make-based build systems cause make to be invoked tens to hundreds of times for every single build. Usually this is caused by recursive use of make (which is bad).

It should be noted that the reason Make is so slow is not an implementation bug. The syntax of Makefiles has some quirks that make a really fast implementation all but impossible. This problem is even more noticeable when combined with the next step.

Dependency checking

Once the build tool has read its configuration, it has to determine what files have changed and which ones need to be recompiled. The configuration files contain a directed acyclic graph describing the build dependencies. This graph is usually built during the configure step. Suppose we have a file called SomeClass.cc which contains this line of code:

#include "OtherClass.hh"

This means that whenever OtherClass.hh changes, the build system needs to rebuild SomeClass.cc. Usually this is done by comparing the timestamp of SomeClass.o against OtherClass.hh. If the object file is older than the source file or any header it includes, the source file is rebuilt.
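To make the timestamp rule concrete, here is a minimal C++17 sketch of the check a build tool performs for a single object file. It assumes the list of inputs (the source file plus every header it includes) has already been read from the dependency graph. The function name `needs_rebuild` is hypothetical, and real tools such as Make and Ninja do considerably more bookkeeping than this.

```cpp
#include <filesystem>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Minimal sketch of a timestamp-based rebuild check (C++17).
// `inputs` is assumed to be the source file plus every header it includes,
// as recorded in the build system's dependency graph.
bool needs_rebuild(const fs::path& object, const std::vector<fs::path>& inputs) {
  std::error_code ec;
  const fs::file_time_type object_time = fs::last_write_time(object, ec);
  if (ec) {
    return true;  // No object file yet (or it can't be read): must compile.
  }
  for (const fs::path& input : inputs) {
    const fs::file_time_type input_time = fs::last_write_time(input, ec);
    // Treat an unreadable input as out of date. A newer input means the
    // object was built from stale sources or headers.
    if (ec || input_time > object_time) {
      return true;
    }
  }
  return false;  // The object is at least as new as everything it depends on.
}
```

For the example above, the call would look like `needs_rebuild("SomeClass.o", {"SomeClass.cc", "OtherClass.hh"})`. Touching OtherClass.hh makes it return true, even though SomeClass.cc itself did not change.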

Build tool startup and dependency checking happen on every single build. Their combined runtime sets the lower bound of the edit-compile-debug cycle. For small projects this is usually a few seconds or so, which is tolerable.

The problem is that Make scales terribly to large projects. As an example, running Make on the codebase of the Clang compiler with no changes takes over half a minute, even if everything is in cache. The sad truth is that in practice large projects cannot be built fast with Make. They will be slow and there’s nothing that can be done about it.

There are alternatives to Make. The fastest of them is Ninja, which was built by Google engineers for Chromium. When run on the same Clang code as above, it finishes in one second. The difference is even bigger when building Chromium. This is a massive boost in productivity; it’s one of those things that make the difference between tolerable and pleasant.

If you are using CMake or Gyp to build, just switch to their Ninja backends. You don’t have to change anything in the build files themselves, just enjoy the speed boost. Ninja is not packaged on most distributions, though, so you might have to install it yourself.

If you are using Autotools, you are forever married to Make. This is because the syntax of Autotools is defined in terms of Make. There is no way to separate the two without a complete rewrite that breaks backwards compatibility. What this means in practice is that Autotools build systems are slow by design and can never be made fast.

Compilation

At this point we finally invoke the compiler. Cutting some corners, here are the approximate steps taken.

  1. Merging includes
  2. Parsing the code
  3. Code generation/optimization

Let’s look at these one at a time. The explanations given below are not 100% accurate descriptions of what happens inside the compiler. They have been simplified to emphasize the facets important to this discussion. For a more thorough description, have a look at any compiler textbook.

The first step joins all source code in use into one clump. Whenever the compiler finds an include statement like #include "somefile.h", it finds that particular source file and replaces the #include with the full contents of that file. If that file contains other #includes, they are inserted recursively. The end result is one big self-contained source file.
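The merging step can be modelled with a toy expander. This is a simplified sketch, not how a real preprocessor works:

  • files come from an in-memory map, and the `FileMap` type and `expand` functions are hypothetical;
  • only lines of the exact form #include "name" are handled;
  • there are no include guards, so a header included twice is pasted twice, and an include cycle is reported as an error.

```cpp
#include <map>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical in-memory "file system": file name -> file contents.
using FileMap = std::map<std::string, std::string>;

// Recursively replaces each `#include "name"` line with the expanded
// contents of `name`. `stack` holds the files currently being expanded.
std::string expand(const std::string& name, const FileMap& files,
                   std::set<std::string>& stack) {
  if (stack.count(name)) throw std::runtime_error("include cycle: " + name);
  auto it = files.find(name);
  if (it == files.end()) throw std::runtime_error("file not found: " + name);
  stack.insert(name);

  const std::string prefix = "#include \"";
  std::istringstream in(it->second);
  std::string line, out;
  while (std::getline(in, line)) {
    if (line.rfind(prefix, 0) == 0 && line.size() > prefix.size() &&
        line.back() == '"') {
      // Strip `#include "` and the closing quote, then paste the contents.
      std::string inner =
          line.substr(prefix.size(), line.size() - prefix.size() - 1);
      out += expand(inner, files, stack);
    } else {
      out += line + "\n";  // Ordinary code is copied through unchanged.
    }
  }
  stack.erase(name);
  return out;
}

std::string expand(const std::string& name, const FileMap& files) {
  std::set<std::string> stack;
  return expand(name, files, stack);
}
```

Expanding a file that includes a.h, which in turn includes b.h, yields b.h's text, then a.h's own text, then the original file's code: one self-contained clump.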

The next step is parsing. This means analyzing the source file, splitting it into tokens and building an abstract syntax tree. This step translates the human understandable source code into a computer understandable unambiguous format. It is what allows the compiler to understand what the user wants the code to do.

Code generation takes the syntax tree and transforms it into machine code sequences called object code. This code is almost ready to run on a CPU.

Each one of these steps can be slow. Let’s look at ways to make them faster.

Faster #includes

Including by itself is not slow; the slowness comes from the cascade effect. Including even one other file causes everything included in it to be included as well. In the worst case every single source file depends on every header file. This means that touching any header file causes the recompilation of every source file, whether or not it uses that particular header’s contents.

Cutting down on interdependencies is straightforward. Only #include those headers that you actually use. In addition, header files must not include any other header files if at all possible. The main tool for this is called forward declaration. Basically what it means is that instead of having a header file that looks like this:

#include "SomeClass.hh"

class MyClass {
  SomeClass s;
};

You have this:

class SomeClass;

class MyClass {
  SomeClass *s;
};

Because the definition of SomeClass is not known in the header, you have to use pointers or references to it there.

Remember that #including MyClass.hh would have caused SomeClass.hh and all its #includes to be added to the original source file. Now they aren’t, so the compiler’s work has been reduced. We also don’t have to recompile the users of MyClass if SomeClass changes. Cutting the dependency chain like this everywhere in the code base can have a major effect in build time, especially when combined with the next step. For a more detailed analysis including measurements and code, see here.
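The following single-file sketch shows where the line falls between "a forward declaration is enough" and "the full definition is required". The header part only names SomeClass, which is enough for pointer members and reference parameters. The source part needs the complete type as soon as a member is actually accessed. The `value` and `describe` functions are hypothetical, added only for illustration.

```cpp
// ---- What would live in MyClass.hh: no #include "SomeClass.hh" needed ----
class SomeClass;  // Forward declaration: SomeClass is an incomplete type here.

class MyClass {
 public:
  explicit MyClass(SomeClass* s) : s_(s) {}  // Storing a pointer is fine.
  int value() const;                         // Using *s_ must wait for the .cc.

 private:
  SomeClass* s_;
};

int describe(const SomeClass& s);  // Reference parameters are fine too.

// ---- What would live in SomeClass.hh and MyClass.cc: full definitions ----
class SomeClass {
 public:
  int x = 42;
};

int MyClass::value() const { return s_->x; }  // Needs the complete type.
int describe(const SomeClass& s) { return s.x + 1; }
```

If MyClass instead held a SomeClass by value, or called its members inline in the header, the compiler would need the full definition there, and the #include would have to come back.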

Faster parsing

The most popular C++ libraries, STL and Boost, are implemented as header-only libraries. That is, they don’t provide a dynamically linkable library; instead, the code is generated anew into every binary that uses them. Compared to most C++ code, STL and Boost are complex. Really, really complex. In fact they are most likely the hardest pieces of code a C++ compiler has to compile. Boost is often used as a stress test for C++ compilers, because it is so difficult to compile.

It is not an exaggeration to say that for most C++ code using STL, parsing the STL headers is up to 10 times slower than parsing all the rest. This leads to massively slow build times because of class headers like this:

#include <vector>

class SomeClass {
private:
  std::vector<int> numbers;

public:
  // ... public interface ...
};

As we learned in the previous section, this means that every single file that includes this header must parse STL’s vector definition. That holds even though vector is an internal implementation detail of SomeClass, and even if those files never use vector themselves. Add some other class include that uses a map, one for unordered_map, a few Boost includes, and what do you end up with? A code base where compiling any file requires parsing all of STL and possibly Boost. This is a factor of 3-10 slowdown in compile times.

Getting around this is relatively simple, though it takes a bit of work. It is known as the pImpl idiom. One way of achieving it is this:

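---header---

struct someClassPrivate;

class SomeClass {
public:
  SomeClass();
  ~SomeClass();
  // Copying would make two objects delete the same p, so disable it.
  SomeClass(const SomeClass &) = delete;
  SomeClass &operator=(const SomeClass &) = delete;

private:
  someClassPrivate *p;
};

---implementation---
#include "SomeClass.hh"
#include <vector>

struct someClassPrivate {
  std::vector<int> numbers;
};

SomeClass::SomeClass() : p(new someClassPrivate) {}

SomeClass::~SomeClass() {
  delete p;
}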

Now the dependency chain is cut and users of SomeClass don’t have to parse vector. As an added bonus the vector can be changed to a map or anything else without needing to recompile files that use SomeClass.
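In modern C++ the raw pointer and manual delete can be replaced with std::unique_ptr. This is one common variant, not the only way to write pImpl. The sketch assumes C++14 or later (for std::make_unique) and puts the would-be header and implementation into one file for brevity. The `add` and `count` members are hypothetical.

The key detail is where the destructor and move operations are defined. They must live in the implementation file, where someClassPrivate is a complete type, because std::unique_ptr needs the full type in order to delete it.

```cpp
#include <cstddef>
#include <memory>
#include <vector>  // In a real split, only SomeClass.cc would include this.

// ---- What would live in SomeClass.hh ----
struct someClassPrivate;  // Declared only; its contents stay private.

class SomeClass {
 public:
  SomeClass();
  ~SomeClass();  // Defined below, where someClassPrivate is complete.
  SomeClass(SomeClass&&) noexcept;
  SomeClass& operator=(SomeClass&&) noexcept;
  // Copy operations are implicitly deleted (a move constructor is declared
  // and std::unique_ptr is move-only).

  void add(int n);  // Hypothetical interface for the example.
  std::size_t count() const;

 private:
  std::unique_ptr<someClassPrivate> p;
};

// ---- What would live in SomeClass.cc ----
struct someClassPrivate {
  std::vector<int> numbers;  // Could become a map without touching users.
};

SomeClass::SomeClass() : p(std::make_unique<someClassPrivate>()) {}
SomeClass::~SomeClass() = default;
// A moved-from SomeClass holds a null p: only destroy it or assign to it.
SomeClass::SomeClass(SomeClass&&) noexcept = default;
SomeClass& SomeClass::operator=(SomeClass&&) noexcept = default;

void SomeClass::add(int n) { p->numbers.push_back(n); }
std::size_t SomeClass::count() const { return p->numbers.size(); }
```

Compared with the raw-pointer version, this removes the manual delete and the risk of double deletion. The cost is that the special members have to be declared in the header and defined out of line.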

Faster code generation

Code generation is mostly an implementation detail of the compiler, and there’s not much that can be done about it. There are a few ways to make it faster, though.

Optimizing code is slow. In everyday development all optimizations should be disabled. Most build systems do this by default, but Autotools builds optimized binaries by default. In addition to being slow, this makes debugging a massive pain, because most of the time trying to print the value of some variable just prints out “value optimised out”.

Making Autotools build non-optimized binaries is relatively straightforward. You just have to run configure like this: ./configure CFLAGS='-O0 -g' CXXFLAGS='-O0 -g'. Unfortunately many people mangle their Autotools CFLAGS in config files, so the above command might not work. In this case the only fix is to inspect all the Autotools config files and fix them yourself.

The other trick is to reduce the amount of generated code. If two different source files use vector<int>, the compiler has to generate the vector code in both of them. During linking (discussed in a later section) one copy is just discarded. There is a way to tell the compiler not to generate the code in one of the files: a feature called extern templates, introduced in C++11 (then still known as C++0x). They are used like this:

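file A:

#include <vector>
// Explicit instantiation definition: emit the code for all of
// std::vector<int>'s members in this translation unit.
template class std::vector<int>;

void func() {
  std::vector<int> numbers;
}

file B:

#include <vector>
// Explicit instantiation declaration: the code exists in another
// translation unit, so don't instantiate it here.
extern template class std::vector<int>;

void func2() {
  std::vector<int> idList;
}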

This instructs the compiler not to generate the vector code when compiling file B. The linker then resolves file B’s calls against the code generated in file A.

Build speedup tools

CCache is an application that stores compiled object code into a global cache. If the same code is compiled again with the same compiler flags, it grabs the object file from the cache rather than running the compiler. If you have to recompile the same code multiple times, CCache may offer noticeable speedups.
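The core idea behind such a cache fits in a few lines. The class below is only a conceptual, in-memory model. The `CompileCache` name and the `compile` callback, which stands in for the real compiler, are hypothetical. The lookup key combines the compiler flags with the source text, so changing either one forces a real compile.

The real ccache differs in several ways. It hashes the compiler, its arguments and the preprocessed source (or the source and its headers), and it stores object files on disk.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Conceptual in-memory model of a compiler cache: key = flags + source.
class CompileCache {
 public:
  // Stand-in for invoking the real compiler: (flags, source) -> object code.
  using Compiler =
      std::function<std::string(const std::string&, const std::string&)>;

  explicit CompileCache(Compiler compile) : compile_(std::move(compile)) {}

  std::string get(const std::string& flags, const std::string& source) {
    // '\0' separates the parts so ("a", "bc") and ("ab", "c") differ.
    const std::string key = flags + '\0' + source;
    auto it = cache_.find(key);
    if (it != cache_.end()) {
      ++hits_;
      return it->second;  // Cache hit: the compiler is not run at all.
    }
    ++misses_;
    std::string object = compile_(flags, source);  // Cache miss: compile...
    cache_.emplace(key, object);                    // ...and remember it.
    return object;
  }

  int hits() const { return hits_; }
  int misses() const { return misses_; }

 private:
  Compiler compile_;
  std::unordered_map<std::string, std::string> cache_;
  int hits_ = 0;
  int misses_ = 0;
};
```

This also shows why a cache only helps when the same code is compiled more than once with identical flags. Any change to the flags or the source produces a new key and a full compile.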

A tool often mentioned alongside CCache is DistCC, which increases parallelism by spreading the build to many different machines. If you have a monster machine it may be worth it. On regular laptop/desktop machines the speed gains are minor (it might even be slower).

Precompiled headers

Precompiled headers are a feature of some C++ compilers that basically serializes the in-memory representation of parsed code into a binary file. This can then be read back directly into memory instead of reparsing the header file when it is used again. This is a feature that can provide massive speedups.

Out of all the speedup tricks listed in this post, this has by far the biggest payoff. It turns the massively slow STL includes into, effectively, no-ops.

So why are they so rarely used?

Mostly it comes down to poor toolchain support. Precompiled headers are fickle beasts. For example with GCC they only work between two different compilation units if the compiler switches are exactly the same. Most people don’t know that precompiled headers exist, and those that do don’t want to deal with getting all the details right.

CMake does not have direct support for them. There are a few modules floating around the Internet, but I have not tested them myself. Autotools is extremely precompiled header hostile, because its syntax allows for wacky and dynamic alterations of compiler flags.

Faster Linking

When the compiler compiles a file and comes to a function call that is somewhere outside the current file, such as in the standard library or some other source file, it effectively writes a placeholder saying “at this point jump to function X”. The linker takes all these different compiled files and connects the jump points to their actual locations. When linking is done, the binary is ready to use.

Linking is surprisingly slow. It can easily take minutes on relatively large applications. As an extreme case, linking the Chromium browser on ARM needs 3 GB of RAM and takes 18 hours.

Yes, hours.

The main reason for this is that the standard GNU linker is quite slow. Fortunately there is a new, faster linker called Gold. It is not the default linker yet, but hopefully it will be soon. In the meantime you can install and use it manually.

A different way of making linking faster is to cut down on the number of symbols the linker has to handle, using a technique called symbol visibility. The gist of it is that you hide all non-public symbols from the list of exported symbols. This means less work and memory use for the linker, which makes it faster.
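With GCC and Clang, the usual approach is to compile with the -fvisibility=hidden option and then explicitly mark the handful of functions that form the public API. The sketch below shows the attributes involved. The MYLIB_API and MYLIB_LOCAL macro names and the functions are hypothetical, and on other compilers the macros simply expand to nothing. Hidden visibility matters mostly when building shared libraries; in an executable the attributes compile but have little visible effect.

```cpp
// GCC/Clang symbol visibility sketch. Build the shared library with
//   -fvisibility=hidden
// so everything is hidden by default, then export only the public API.
#if defined(__GNUC__)
#define MYLIB_API __attribute__((visibility("default")))
#define MYLIB_LOCAL __attribute__((visibility("hidden")))
#else
#define MYLIB_API
#define MYLIB_LOCAL
#endif

namespace {
// Internal linkage: never visible outside this translation unit at all.
int square(int x) { return x * x; }
}  // namespace

// Usable across the library's own source files, but not exported.
MYLIB_LOCAL int sum_of_squares(int a, int b) { return square(a) + square(b); }

// Part of the public interface: stays in the exported symbol table.
MYLIB_API int public_entry(int a, int b) { return sum_of_squares(a, b) + 1; }
```

A side benefit of the explicit markers is that they document which functions are meant to be part of the library’s interface.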

Conclusions

Contrary to popular belief, compiling C++ is not actually all that slow. The STL is slow and most build tools used to compile C++ are slow. However there are faster tools and ways to mitigate the slow parts of the language.

Using them takes a bit of elbow grease, but the benefits are undeniable. Faster build times lead to happier developers, more agility and, eventually, better code.

niemeyer

Rob Pike just wrote an article/talk that is the best background on the origins of Go yet.

It surprises me how much his considerations match my pre-Go world view, and in a sense they give me a satisfying explanation of why I got hooked on the language. I still recall sitting in a hotel years ago with Jamu Kakar while we went through the upcoming C++0x standard (now C++11). We were perplexed that anyone could think it reasonable to put details such as rvalue references and move constructors into the language specification.

Rob also expressed again the initial surprise that developers using languages such as Python and Ruby were more often the ones willing to migrate towards Go, rather than ones using C++, with some reasonable explanations about why that is so. While I agree with his considerations, I see Python going through the same kind of issue that caused C++ to be what it is today.

Consider this excerpt from PEP 0380 as evidence:

If yielding of values is the only concern, this can be performed without much difficulty using a loop such as

for v in g:
    yield v

However, if the subgenerator is to interact properly with the caller in the case of calls to send(), throw() and close(), things become considerably more difficult. As will be seen later, the necessary code is very complicated, and it is tricky to handle all the corner cases correctly.

A new syntax will be proposed to address this issue. In the simplest use cases, it will be equivalent to the above for-loop, but it will also handle the full range of generator behaviour, and allow generator code to be refactored in a simple and straightforward way.

This description has the same DNA that creates the C++ problem Rob talks about. Don’t get me wrong, I’m sure yield from will make a lot of people very happy, and that’s exactly the tricky part. It’s easy and satisfying to please a selection of users, but often that leads to isolated solutions that create new cognitive load and new corner cases that in turn lead to new requirements.

The history of generators in Python is especially telling:

  • PEP 0234 [30-Jan-2001] – Iterators – Accepted
  • PEP 0255 [18-May-2001] – Simple Generators – Accepted
  • PEP 0288 [21-Mar-2002] – Generators Attributes and Exceptions – Withdrawn
  • PEP 0289 [30-Jan-2002] – Generator Expressions – Accepted
  • PEP 0325 [25-Aug-2003] – Resource-Release Support for Generators – Rejected
  • PEP 0342 [10-May-2005] – Coroutines via Enhanced Generators – Accepted
  • PEP 0380 [13-Feb-2009] – Syntax for Delegating to a Subgenerator – Accepted

You see the rabbit hole getting deeper? I’ll clarify it further by rephrasing the previous quote from PEP 0380:

If [feature from PEP 0255] is the only concern, this can be performed without much difficulty using a loop [...] However, if the subgenerator is to interact properly with [changes from PEP 0342] things become considerably more difficult. [So we need feature from PEP 0380.]

Yet, while the language grows to handle self-inflicted micro-problems, the real issue is still not solved. All of these features are simplistic forms of concurrency and communication that don’t satisfy developers, causing community fragmentation.

This happened to C++, to Python, and to many other languages. Go seems somewhat special in that regard: its core development team has an outstanding respect for simplicity, yet dares to solve the difficult problems at their root, while keeping these solutions orthogonal so that they support each other. Less is more, but it is not always straightforward.

Jussi Pakkanen

We all know that compiling C++ is slow.

Fewer people know why, or how to make it faster. Some do: the developers at Remedy, for example, made the engine of Alan Wake compile from scratch in five minutes. The payoff is increased productivity, because the edit-compile-run cycle gets dramatically faster.

There are several ways to speed up your compiles. This post looks at reworking your #includes.

Quite a bit of C++ compilation time is spent parsing headers for STL, Qt and whatever else you may be using. But how long does it actually take?

To find out, I wrote a script to generate C++ source. You can download it here. It generates source files that have some includes and one dummy function. The point is to simulate two different use cases. In the first, each source file includes a random subset of the includes: one file might use std::map and QtCore, another might use Boost’s strings, and so on. In the second, all possible includes are put in a common header which all source files include. This simulates “maximum developer convenience”, where all functions are available in all files without any extra effort.
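The script itself is not reproduced in this post, so the following is only a hypothetical Python sketch of the idea, not the actual generate_code.py. The header list, the file names and the generate() function are all assumptions, and the CMakeLists.txt that the real script must also emit is left out.

```python
import os
import random

# Hypothetical header list; the real script draws on STL, Boost and Qt4 headers.
INCLUDES = ["<map>", "<vector>", "<string>", "<algorithm>",
            "<QtCore>", "<boost/algorithm/string.hpp>"]


def generate(outdir, num_files=100, all_common=False, seed=0):
    """Write num_files C++ sources, each with some includes and one dummy function."""
    rng = random.Random(seed)
    os.makedirs(outdir, exist_ok=True)
    if all_common:
        # "Maximum developer convenience": every header goes into one common header.
        with open(os.path.join(outdir, "common.h"), "w") as f:
            f.write("#pragma once\n")
            f.writelines("#include %s\n" % inc for inc in INCLUDES)
    for i in range(num_files):
        if all_common:
            includes = ['"common.h"']
        else:
            # Each file pulls in only a random subset of the headers.
            includes = rng.sample(INCLUDES, rng.randint(1, len(INCLUDES)))
        with open(os.path.join(outdir, "file%d.cpp" % i), "w") as f:
            f.writelines("#include %s\n" % inc for inc in includes)
            # The dummy function never uses any of the included functionality.
            f.write("int dummy%d() { return %d; }\n" % (i, i))
```

Calling generate("good") and generate("bad", all_common=True) mirrors the two use cases. Either way, every file pays the cost of parsing its headers even though nothing from them is used.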

To generate the test data, we run the following commands:

mkdir good bad
./generate_code.py --with-boost --with-qt4 good
./generate_code.py --with-boost --with-qt4 --all-common bad

Compilation is straightforward:

cd good; cmake .; time make; cd ..
cd bad; cmake .; time make; cd ..

By default the script produces 100 source files. When the includes are kept in individual files, compiling takes roughly a minute. When they are in a common header, it takes three minutes.

Remember: the included STL/Boost/Qt4 functionality is not used in the code. This is just the time spent including and parsing their headers. What this example shows is that you can remove 2 minutes of your build time, just by including C++ headers smartly.

The delay scales linearly. For 300 files the build times are 2 minutes 40 seconds and 7 minutes 58 seconds. That’s over five minutes lost on, effectively, no-ops. The good news is that getting rid of this bloat is relatively easy, though it might take some sweat.

  1. Never include any (internal) header in another header if you can use a forward declaration. Include the header in the implementation file.
  2. Never include system headers (STL, etc) in your headers unless absolutely necessary, such as due to inheritance. If your class uses e.g. std::map internally, hide it with pImpl. If your class API requires these headers, change it so that it doesn’t or use something more lightweight (e.g. std::iterator instead of std::vector).
  3. Never, never, ever include system stuff in your public headers. That slows down not just your own compilation time, but also every single user of your library. The only exception is when your library is a plugin or extension to an existing library and even then your includes need to be minimal.

ThomasVo5

This post explains how to conduct large-scale multi-objective optimization (MOO) experiments with the SHARK machine learning library on clusters running Oracle Grid Engine.

An experiment consists of three phases:

  1. front approximation
  2. performance indicator calculation
  3. result accumulation and statistics calculation

Within this post, I’m going to focus on the first step.

Front Approximation

In this phase, the Pareto front approximations generated by applying multiple multi-objective evolutionary algorithms (MOEAs) to a set of objective functions are recorded.

Here, I assume that we want to evaluate the (µ+1)-MO-CMA-ES relying on the hypervolume indicator on the DTLZ suite of benchmark functions. A ready-to-use command-line application implementing the MO-CMA-ES is bundled with the default Shark installation. The executable is configured via command-line arguments, which can be listed by passing --help:

  --objectiveFunction arg
  --seed arg (=1)
  --storageInterval arg (=100)
  --searchSpaceDimension arg (=10)
  --maxNoEvaluations arg (=50000)
  --timeLimit arg (=1000)
  --fitnessLimit arg (=1e-10)
  --resultDir arg (=.)
  --algorithmConfigFile arg
  --algorithmUsage 
  --defaultAlgorithmUsage 
  --objectiveSpaceDimension arg (=2)
  --reportFitnessFunctions 

That is, to execute the MO-CMA-ES for DTLZ2 with 3 objectives and terminating after 50000 objective function evaluations, the following call is required:

  SteadyStateMOCMAMain --objectiveFunction=DTLZ2 --objectiveSpaceDimension=3 --maxNoEvaluations=50000 

Note that we do not specify the RNG seed explicitly but rely on its default value of 1.

For the scenario considered here, we want to run several independent trials of one specific MOEA and one specific objective function in parallel. To this end, we rely on the array job feature of the grid engine and submit an array of 25 independent trials to the grid engine with the following command:

  qsub -N 'DTLZ2_3' -t 1-25 RunAlgo.sh DTLZ2 /globally/known/path 3

Here, the script RunAlgo.sh is defined as follows:

#!/bin/bash
#$ -S /bin/bash
#$ -o /dev/null

SteadyStateMOCMAMain --seed $SGE_TASK_ID --resultDir=$2 --objectiveFunction=$1 --objectiveSpaceDimension=$3

In summary, the script takes care of actually running the algorithm and of setting the seed to the value of the environment variable $SGE_TASK_ID. The grid engine sets this variable to the index of the task within the array job (1 to 25 here), so every trial gets a distinct seed and the trials are independent. There is one more thing to note: the result directory needs to be accessible from the whole cluster. Normally, your operations team provides a scratch environment that is accessible from every computing node.
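To see what the array job expands to without a cluster, the following hypothetical Python sketch builds the command line that RunAlgo.sh would run for each task. The task_command and array_job helpers are illustrative names and not part of Shark or Grid Engine. The task index plays the role of $SGE_TASK_ID, and -t 1-25 is treated as an inclusive range.

```python
def task_command(task_id, objective, result_dir, num_objectives):
    """Argument list RunAlgo.sh runs for one task (task_id stands in for $SGE_TASK_ID)."""
    return [
        "SteadyStateMOCMAMain",
        "--seed", str(task_id),                                 # distinct seed per trial
        "--resultDir=%s" % result_dir,                          # $2
        "--objectiveFunction=%s" % objective,                   # $1
        "--objectiveSpaceDimension=%d" % num_objectives,        # $3
    ]


def array_job(first, last, *args):
    """Emulate `qsub -t first-last`: one command per task index, both ends inclusive."""
    return [task_command(t, *args) for t in range(first, last + 1)]


if __name__ == "__main__":
    for cmd in array_job(1, 25, "DTLZ2", "/globally/known/path", 3):
        print(" ".join(cmd))
```

Printing the list is a cheap sanity check that each of the 25 trials receives its own seed before anything is submitted.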

That’s it. Wait a few minutes until the experiment completes and stay tuned for the second post that explains how to evaluate the quality of the Pareto-front approximations.


ThomasVo5

Taken from the SHARK website:

SHARK is a modular C++ library for the design and optimization of adaptive systems. It provides methods for linear and nonlinear optimization, in particular evolutionary and gradient-based algorithms, kernel-based learning algorithms and neural networks, and various other machine learning techniques. SHARK serves as a toolbox to support real world applications as well as research in different domains of computational intelligence and machine learning. The sources are compatible with the following platforms: Windows, Solaris, MacOS X, and Linux.

The library has been in active development for over 10 years now and is in use by scientists all over the world. Last year, we, the core SHARK developers, decided that a rewrite of the library was necessary to support future use cases and to provide a solid platform for users and contributors alike. Our goals were simple:

  • Unify and simplify the library structure.
  • Rely on established components wherever feasible.
  • Documentation, documentation and again, documentation
  • Focus on quality.

In this post, I would like to dive a little deeper into the topic of quality and the processes that we established to ensure a constant and high level of quality. We decided to address quality both from a technical (read: testable) and from an API point of view.

In terms of API quality, we want the programming interface to be consistent, convenient to use and easy to extend. Just as we care about the user experience, we want potential developers to find a welcoming and friendly environment. As we are a geographically distributed team of developers and scientists, we decided to go for a pre-commit code review approach implemented with the help of ReviewBoard. Despite the developers’ initial concerns, the review process proved to be one of the most useful tools during the rewrite, and developers quickly came to like the final “Ship It”.

In terms of “technical” quality, we decided to go for continuous integration of all (reviewed) commits to the rewrite branch for all of our supported platforms. With the help of Jenkins and a bunch of virtual machines, we finally realized our idea of continuous integration testing to prevent regressions. Our unit test suite is implemented with the unit testing framework provided by Boost (Boost.Test). Test execution is handled by CTest. Static and dynamic analysis of the library is carried out with the help of cppcheck and valgrind, respectively. Code coverage metrics are calculated with the help of gcov. Finally, we are integrating all of the testing results in the job-specific views of our Jenkins instance, thereby providing developers a single source of information on the state of the library.


Gustavo Niemeyer

About 1 year after development started in Ensemble, today the stars finally aligned just the right way (review queue mostly empty, no other pressing needs, etc) for me to start writing the specification about the repository system we’ve been jointly planning for a long time. This is the system that the Ensemble client will communicate with for discovering which formulas are available, for publishing new formulas, for obtaining formula files for deployment, and so on.

We of course would have liked for this part of the project to have been specified and written a while ago, but unfortunately that wasn’t possible for several reasons. That said, there are also upsides to having an important piece flying around in minds and conversations for such a long time: sitting down to specify the system and describe the inner-working details has been a breeze. Even details such as the namespacing of formulas, which hadn’t been entirely clear in my mind, were simply streamed into the document as the ideas we’ve been evolving finally came together in written form.

One curious detail: this is the first long term project at Canonical that will be developed in Go, rather than Python or C/C++, which are the most used languages for projects within Canonical. Not only that, but we’ll also be using MongoDB for a change, rather than the traditional PostgreSQL, and will also use (you guessed) the mgo driver which I’ve been pushing entirely as a personal project for about 8 months now.

Naturally, with so many moving parts that are new to the company culture, this is still being seen as a closely watched experiment. Still, this makes me highly excited, because when I started developing mgo, the MongoDB driver for Go, my hopes that the Go, MongoDB, and mgo trio would eventually be used at Canonical were very low, precisely because they were all alien to the culture. We only got here after quite a lot of internal debate, experiments, and trust too.

All of that means these are happy times. An important feature in Ensemble is being specified and written, the tools are very exciting, and home-grown software is proving useful.

Awesomeness.

niemeyer

Circular buffers are based on an algorithm well known by any developer who’s got past the “Hello world!” days. They offer a number of key characteristics with wide applicability such as constant and efficient memory use, efficient FIFO semantics, etc.

One feature which is not always desired, though, is the fact that circular buffers traditionally will either overwrite the oldest element or raise an overflow error, since they are generally implemented as a buffer of constant size. This is an unwanted property when, for instance, one is consuming items from the buffer and blindly dropping items is not an option.

This post presents an efficient (and potentially novel) algorithm for implementing circular buffers which preserves most of the key aspects of the traditional version, while also supporting dynamic expansion when the buffer would otherwise have its oldest entry overwritten. It’s not clear if the described approach is novel or not (most of my novel ideas seem to have been written down 40 years ago), so I’ll publish it below and let you decide.

Traditional circular buffers

Before introducing the variant which can actually expand during use, let’s go through a quick review of traditional circular buffers, so that we can then reuse the nomenclature when extending the concept. All the snippets provided in this post are written in Python, as a better alternative to pseudo-code, but the concepts are naturally portable to any other language.

So, the most basic circular buffer needs the buffer itself, its total capacity, and a position where the next write should occur. The following snippet demonstrates the concept in practice:

buf = [None, None, None, None, None]
bufcap = len(buf)
pushi = 0   

for elem in range(7):
    buf[pushi] = elem
    pushi = (pushi + 1) % bufcap
    
print(buf)  # => [5, 6, 2, 3, 4]

In the example above, the first two elements of the series (0 and 1) were overwritten once the pointer wrapped around. That’s the specific feature of circular buffers which the proposal in this post will offer an alternative for.

The snippet below provides a full implementation of the traditional approach, this time including both the pushing and popping logic, and raising an error when an overflow or underflow would occur. Please note that these snippets are not necessarily idiomatic Python. The intention is to highlight the algorithm itself.

class CircBuf(object):

    def __init__(self):
        self.buf = [None, None, None, None, None]
        self.buflen = self.pushi = self.popi = 0
        self.bufcap = len(self.buf)

    def push(self, x):
        # Full when the write position has caught up with the read position.
        assert self.buflen == 0 or self.pushi != self.popi, \
               "Buffer overflow!"
        self.buf[self.pushi] = x
        self.pushi = (self.pushi + 1) % self.bufcap
        self.buflen += 1

    def pop(self):
        assert self.buflen != 0, "Buffer underflow!"
        x = self.buf[self.popi]
        self.buf[self.popi] = None
        self.buflen -= 1
        self.popi = (self.popi + 1) % self.bufcap
        return x

With the basics covered, let’s look at how to extend this algorithm to support dynamic expansion in case of overflows.

Dynamically expanding a circular buffer

The approach consists of imagining that the same buffer can contain both a circular buffer area (referred to as the ring area from here on) and an overflow area, and that it is possible to transform a mixed buffer back into a pure circular buffer again. To clarify what this means, some examples are presented below. The full algorithm will be presented afterwards.

First, imagine that we have an empty buffer with a capacity of 5 elements as per the snippet above, and then the following operations take place:

for i in range(5):
    circbuf.push(i)

circbuf.pop() # => 0
circbuf.pop() # => 1

circbuf.push(5)
circbuf.push(6)

print(circbuf.buf)  # => [5, 6, 2, 3, 4]

At this point we have a full buffer, and with the original implementation an additional push would raise an assertion error. To implement expansion, the algorithm will be changed so that those items will be appended at the end of the buffer. Following the example, pushing two additional elements would behave the following way:

circbuf.push(7)
circbuf.push(8)

print(circbuf.buf)  # => [5, 6, 2, 3, 4, 7, 8]

In that example, elements 7 and 8 are part of the overflow area, and the ring area remains with the same capacity and length of the original buffer. Let’s perform a few additional operations to see how it would behave when items are popped and pushed while the buffer is split:

circbuf.pop() # => 2
circbuf.pop() # => 3
circbuf.push(9)

print(circbuf.buf)  # => [5, 6, None, None, 4, 7, 8, 9]

In this case, even though there are two free slots available in the ring area, the last item pushed was still appended at the overflow area. That’s necessary to preserve the FIFO semantics of the circular buffer, and means that the buffer may expand more than strictly necessary given the space available. In most cases this should be a reasonable trade-off, and should stop happening once the circular buffer size stabilizes to reflect the production vs. consumption pressure (if you have a producer which constantly operates faster than a consumer, though, please look at the literature for plenty of advice on the problem).

The remaining interesting step in that sequence of events is the moment when the ring area capacity is expanded to cover the full allocated buffer again, with the previous overflow area being integrated into the ring area. This will happen when the content of the previous partial ring area is fully consumed, as shown below:

circbuf.pop() # => 4
circbuf.pop() # => 5
circbuf.pop() # => 6
circbuf.push(10)

print(circbuf.buf)  # => [10, None, None, None, None, 7, 8, 9]

At this point, the whole buffer contains just a ring area and the overflow area is again empty, which means it becomes a traditional circular buffer.

Sample algorithm

With some simple modifications to the traditional implementation presented previously, the above semantics may be easily supported. Note how the additional properties do not introduce significant overhead. Of course, this version will incur additional memory allocation to support the buffer expansion, but that’s inherent to the problem being solved.

class ExpandingCircBuf(object):

    def __init__(self):
        self.buf = [None, None, None, None, None]
        self.buflen = self.ringlen = self.pushi = self.popi = 0
        self.bufcap = self.ringcap = len(self.buf)

    def push(self, x):
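        # Append when the ring is full, or when an overflow area already
        # exists (appending then preserves the FIFO order).
        if (self.ringlen == self.ringcap or
                self.ringcap != self.bufcap):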
            self.buf.append(x)
            self.buflen += 1
            self.bufcap += 1
            if self.pushi == 0: # Optimization.
                self.ringlen = self.buflen
                self.ringcap = self.bufcap
        else:
            self.buf[self.pushi] = x
            self.pushi = (self.pushi + 1) % self.ringcap
            self.buflen += 1
            self.ringlen += 1

    def pop(self):
        assert self.buflen != 0, "Buffer underflow!"
        x = self.buf[self.popi]
        self.buf[self.popi] = None
        self.buflen -= 1
        self.ringlen -= 1
        if self.ringlen == 0 and self.buflen != 0:
            self.popi = self.ringcap
            self.pushi = 0
            self.ringlen = self.buflen
            self.ringcap = self.bufcap
        else:
            self.popi = (self.popi + 1) % self.ringcap
        return x

Note that the above algorithm will allocate each element in the list individually, but in some situations it may be better to allocate additional space for the overflow area in advance, to avoid potentially frequent reallocation. When the rate of consumption of elements is about the same as the rate of production, for instance, there are advantages in doubling the amount of allocated memory per expansion. Given the way in which the algorithm works, the previous ring area will be exhausted before the mixed buffer becomes circular again, so with a constant rate of production and an equivalent consumption it will effectively have its size doubled on expansion.

UPDATE: Below is shown a version of the same algorithm which not only allows allocating more than one additional slot at a time during expansion, but also incorporates it in the overflow area immediately so that the allocated space is used optimally.

class ExpandingCircBuf2(object):

    def __init__(self):
        self.buf = []
        self.buflen = self.ringlen = self.pushi = self.popi = 0
        self.bufcap = self.ringcap = len(self.buf)

    def push(self, x):
        if self.ringcap != self.bufcap:
            expandbuf = (self.pushi == 0)
            expandring = False
        elif self.ringcap == self.ringlen:
            expandbuf = True
            expandring = (self.pushi == 0)
        else:
            expandbuf = False
            expandring = False

        if expandbuf:
            self.pushi = self.bufcap
            expansion = [None, None, None]
            self.buf.extend(expansion)
            self.bufcap += len(expansion)
            if expandring:
                self.ringcap = self.bufcap

        self.buf[self.pushi] = x
        self.buflen += 1
        if self.pushi < self.ringcap:
            self.ringlen += 1
        self.pushi = (self.pushi + 1) % self.bufcap

    def pop(self):
        assert self.buflen != 0, "Buffer underflow!"
        x = self.buf[self.popi]
        self.buf[self.popi] = None
        self.buflen -= 1
        self.ringlen -= 1
        if self.ringlen == 0 and self.buflen != 0:
            self.popi = self.ringcap
            self.ringlen = self.buflen
            self.ringcap = self.bufcap
        else:
            self.popi = (self.popi + 1) % self.ringcap
        return x
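The expansion size of three slots in ExpandingCircBuf2 is an arbitrary constant. As a hedged sketch of the doubling policy suggested earlier, the hypothetical DoublingCircBuf below keeps the ExpandingCircBuf2 logic unchanged, except that each expansion grows the allocation by its current size (at least one slot). A small randomized check, also hypothetical, compares it against collections.deque to exercise the FIFO property across splits and merges.

```python
import random
from collections import deque


class DoublingCircBuf(object):

    def __init__(self):
        self.buf = []
        self.buflen = self.ringlen = self.pushi = self.popi = 0
        self.bufcap = self.ringcap = 0

    def push(self, x):
        if self.ringcap != self.bufcap:
            # Split buffer: grow only once the overflow area wrapped to slot 0.
            expandbuf, expandring = (self.pushi == 0), False
        elif self.ringcap == self.ringlen:
            # Full pure ring: grow. If its contents already run in order from
            # slot 0 (pushi == 0), the ring itself can absorb the new space.
            expandbuf, expandring = True, (self.pushi == 0)
        else:
            expandbuf = expandring = False

        if expandbuf:
            self.pushi = self.bufcap
            grow = max(1, self.bufcap)  # doubling instead of a fixed 3 slots
            self.buf.extend([None] * grow)
            self.bufcap += grow
            if expandring:
                self.ringcap = self.bufcap

        self.buf[self.pushi] = x
        self.buflen += 1
        if self.pushi < self.ringcap:
            self.ringlen += 1
        self.pushi = (self.pushi + 1) % self.bufcap

    def pop(self):
        assert self.buflen != 0, "Buffer underflow!"
        x = self.buf[self.popi]
        self.buf[self.popi] = None
        self.buflen -= 1
        self.ringlen -= 1
        if self.ringlen == 0 and self.buflen != 0:
            # Old ring drained: the whole allocation becomes the new ring.
            self.popi = self.ringcap
            self.ringlen = self.buflen
            self.ringcap = self.bufcap
        else:
            self.popi = (self.popi + 1) % self.ringcap
        return x


def check_against_deque(ops=10000, seed=1):
    """Randomly interleave pushes and pops, comparing every pop with a deque."""
    rng = random.Random(seed)
    cb, ref = DoublingCircBuf(), deque()
    for i in range(ops):
        if ref and rng.random() < 0.45:
            assert cb.pop() == ref.popleft()
        else:
            cb.push(i)
            ref.append(i)
    return cb, ref
```

Because pushes slightly outnumber pops in the check, the buffer repeatedly fills, splits and merges, which are exactly the transitions the algorithm has to get right.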

Conclusion

This blog post presented an algorithm which supports the expansion of circular buffers while preserving most of their key characteristics. When not faced with an overflowing buffer, the algorithm should offer very similar performance characteristics to a normal circular buffer, with a few additional instructions and constant space for registers only. When faced with an overflowing buffer, the algorithm maintains the FIFO property, uses contiguous allocated memory to hold both the original circular buffer and the additional elements, and then reuses the full area as part of a new circular buffer in an attempt to find the proper size for the given use case.

Gustavo Niemeyer

ZooKeeper is a clever generic coordination server for distributed systems, and is one of the core pieces of software which facilitate the development of Ensemble (the project for automagic IaaS deployments which we push at Canonical), so it was a natural choice to experiment with.

Gozk is a complete binding for ZooKeeper which makes use of Go’s native features to facilitate the interaction with a ZooKeeper server. To avoid reimplementing the well tested bits of the protocol in an unstable way, Gozk is built on top of the standard C ZooKeeper library.

The experience of integrating ZooKeeper with Go was certainly valuable in itself, and worked as a nice way to learn the details of integrating the Go language with a C library. If you’re interested in learning a bit about Go, ZooKeeper, or other details related to the creation of bindings and asynchronous programming, please fasten your seatbelt now.

Basics of C wrapping in Go

Creating the binding in itself was a pretty interesting experiment already. I have worked on the creation of quite a few bindings and language bridges before, and must say I was pleasantly surprised with the experience of creating the Go binding. With Cgo, the name given to the “foreign function interface” mechanism for C integration, one basically declares a special import statement which causes a pre-processor to look at the comment preceding it. Something similar to this:

// #include <zookeeper.h>
import "C"

The comment doesn’t have to be restricted to a single line, or even to #include statements. The C code contained in the comment will be transparently inserted into a helper C file which is compiled and linked with the final object file, and the given snippet will also be parsed and inclusions processed. On the Go side, that “C” import is treated as if it were a normal Go package, so that the C functions, types, and values are all directly accessible.

As an example, a C function with this prototype:

int zoo_wexists(zhandle_t *zh, const char *path, watcher_fn watcher,
                void *context, struct Stat *stat);

In Go may be used as:

cstat := C.struct_Stat{}
rc, cerr := C.zoo_wexists(zk.handle, cpath, nil, nil, &cstat)

When the C function is used in a context where two result values are requested, as done above, Cgo will save the well-known errno variable after the function has finished executing and will return it wrapped into an os.Errno value (a syscall.Errno in current Go releases).

Also, note how the C struct is declared in a way that can be passed straight to the C function. Interestingly, the memory backing the structure is allocated and tracked by the Go runtime, and will be garbage collected once no more references to it exist within Go. This has to be kept in mind, since the application may crash if a value allocated within Go is stored by a foreign C function and used after all the Go references to it are gone (current cgo rules in fact forbid C code from keeping a copy of a Go pointer after the call returns). The alternative in these cases is to call the usual C functions to get hold of memory for the involved values. That memory won’t be touched by the garbage collector and, of course, must be explicitly freed when no longer necessary. Here is a simple example showing explicit allocation:

cbuffer := (*C.char)(C.malloc(bufferSize))
defer C.free(unsafe.Pointer(cbuffer))

Note the use of the defer statement above. Even when dealing with foreign functionality, it comes in handy. The above call will ensure that the buffer is deallocated right before the current function returns, for instance, so it’s a nice way to ensure no leaks happen, even if in the future the function suddenly gets a new exit point which didn’t consider the allocation of resources.

In terms of typing, Go is more strict than C, and Cgo-based logic will also ensure that the types returned and passed into the foreign C functions are correctly typed, in the same way done for the native types. Note above, for instance, how the call to the free() function has to explicitly convert the value into an unsafe.Pointer, even though in C no casting would be necessary to pass a pointer into a void * parameter.

The unsafe.Pointer is in fact a very special type within Go. Using it, one can convert any pointer type into any other pointer type in an unsafe way (thus the package name), and also back and forth into a uintptr value with the address of the memory referenced by the pointer. For every other type conversion, Go will ensure at compilation time that doing the conversion at runtime is a safe operation.

With all of these resources, including the ability to use common Go syntax and functionality even when dealing with foreign types, values, and function calls, the integration task turns out to be quite a pleasant experience. That said, some of the things may still require some good thinking to get right, as we’ll see shortly.

Watch callbacks and channels

One of the most interesting (and slightly tricky) aspects of mapping the ZooKeeper concepts into Go was the “watch” functionality. ZooKeeper allows one to attach a “watch” to a node so that the server will report back when changes happen to the given node. In the C library, this functionality is exposed via a callback function which is executed once the monitored node aspect is modified.

It would certainly be possible to offer this functionality in Go using a similar mechanism, but Go channels provide a number of advantages for that kind of asynchronous notification: waiting for multiple events via the select statement, synchronous blocking until the event happens, testing if the event is already available, etc.

The tricky bit, though, isn’t the use of channels. That part is quite simple. The tricky detail is that the C callback function is executed asynchronously, in a C thread started by the ZooKeeper library, while the Go application is doing its business elsewhere. At the time of writing, there’s no straightforward way to transfer the execution of this asynchronous C function back into Go land. The solution for this problem was found with some help from the folks at the golang-nuts mailing list, and luckily it’s not that hard to support or understand. That said, this is a good opportunity to get some coffee or your preferred focus-enhancing drink.

The solution works like this: when the ZooKeeper C library gets a watch notification, it executes a C callback function which is inside a Gozk helper file. Rather than transferring control to Go right away, this C function simply appends data about the event onto a queue, and signals a pthread condition variable to notify that an event is available. Then, on the Go side, once the first ZooKeeper connection is initialized, a new goroutine is fired and loops waiting for events to be available. The interesting detail about this loop is that it blocks within a foreign C function waiting for an event to be available, through the signaling of the shared pthread condition variable. On the Go side, this is what the call looks like, just to give a more practical feeling:

// This will block until there's a watch available.
data := C.wait_for_watch()

Then, on the C side, here is the function definition:

watch_data *wait_for_watch() {
    watch_data *data = NULL;
    pthread_mutex_lock(&watch_mutex);
    /* Loop instead of testing once: pthread_cond_wait may wake up spuriously. */
    while (first_watch == NULL)
        pthread_cond_wait(&watch_available, &watch_mutex);
    data = first_watch;
    first_watch = first_watch->next;
    pthread_mutex_unlock(&watch_mutex);
    return data;
}

As you can see, not really a big deal. When that kind of blocking occurs inside a foreign C function, the Go runtime will correctly continue the execution of other goroutines within other operating system threads.
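The same queue-and-condition handoff can be sketched in Python, the language used for the other sketches here. This is a hypothetical analog, not Gozk code. WatchQueue, push_watch and demo are illustrative names, and threading.Condition stands in for the pthread mutex and condition variable pair. The producer plays the role of the C watcher callback, and wait_for_watch blocks the consumer thread until an event is queued.

```python
import threading
from collections import deque


class WatchQueue(object):
    """Producer threads append events; a consumer blocks until one is available."""

    def __init__(self):
        self._events = deque()
        self._available = threading.Condition()  # mutex + condition variable

    def push_watch(self, event):
        # Producer side, like the C callback running on a library thread.
        with self._available:
            self._events.append(event)
            self._available.notify()

    def wait_for_watch(self):
        # wait_for re-checks the predicate after every wakeup, which is the
        # Python counterpart of looping around pthread_cond_wait.
        with self._available:
            self._available.wait_for(lambda: self._events)
            return self._events.popleft()


def demo(events):
    """Deliver events from the calling thread to a blocked consumer thread."""
    queue = WatchQueue()
    received = []

    def consumer():
        for _ in events:
            received.append(queue.wait_for_watch())

    t = threading.Thread(target=consumer)
    t.start()
    for event in events:
        queue.push_watch(event)
    t.join()
    return received
```

demo(["created", "changed", "deleted"]) returns the events in the order they were pushed, whether or not the consumer was already waiting when each one arrived.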

The result of this mechanism is a nice to use interface based on channels, which may be explored in different ways depending on the application needs. Here is a simple example blocking on the event synchronously, for instance:

stat, watch, err := zk.ExistsW("/some/path")
if stat == nil && err == nil {
    event := <-watch
    // Use event ...
}

Concluding

Those were some of the interesting aspects of implementing the ZooKeeper binding. I would like to speak about some additional details, but this post is rather long already, so I'll keep that for a future opportunity. The code is available under the LGPL, so if you're curious about some other aspect, or would like to use ZooKeeper with Go, please move on and check it out!


Just in case people aren’t aware, here’s the coding style guide, freshly updated with C++ bits.

Gustavo Niemeyer

When I started programming in Python long ago, one of the features which really hooked me was the quality interactive interpreter offered with the language implementation. It was (and still is) a fantastic way to experiment with syntax, semantics, modules, and whatnot. So much so that many first-class Python practitioners will happily tell you that the interactive interpreter is used not only as a programming sandbox, but many times as their personal calculator too. This kind of interactive interpreter is also known as a REPL, standing for Read Eval Print Loop, and many languages have pretty advanced choices in that area by now.

After much rejoicing with Python’s REPL, though, and as a normal human being, I’ve started wishing for more. The problem has a few different levels, which are easy to understand.

First, we’re using Python Twisted in Ensemble, one of the projects being pushed at Canonical. Twisted is an event-driven framework, which among other things means it works a lot with closures and callbacks. Having to redefine multi-line functions frequently to drive experiments isn’t exactly fun in a line-based interactive interpreter. Then, some of the languages I’ve started playing with, such as Erlang, have limited REPLs which differ in functionality significantly compared to what may be done in a text file. And finally, other languages I’ve been programming with recently, such as Go, lack a reasonable REPL altogether (there are only unusable hacks around).

Alright, so here is the idea: what if instead of being given an interactive REPL, you were presented with your favorite text editor, and whenever you saved the file, it was executed and the results presented? That’s The Hacking Sandbox, or hsandbox. It supports 11 different programming languages out of the box, and given its nature it should be trivial to support any other language.
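The save-and-run loop fits in a few lines of Python. This is a hypothetical illustration of the idea, not hsandbox’s actual implementation. It assumes that polling the file’s modification time is good enough (a real tool might use editor hooks or OS file notifications instead), and it reruns the file with the current Python interpreter whenever the file changes. run_if_saved and watch are illustrative names.

```python
import os
import subprocess
import sys
import time


def run_if_saved(path, last_mtime, interpreter=sys.executable):
    """Run `path` if it was saved since `last_mtime`.

    Returns (mtime, output); output is None when the file is unchanged.
    """
    mtime = os.stat(path).st_mtime
    if mtime == last_mtime:
        return mtime, None
    result = subprocess.run([interpreter, path], capture_output=True, text=True)
    return mtime, result.stdout + result.stderr


def watch(path, interval=0.5):
    """Poll forever, printing the program's output after every save."""
    last = None
    while True:
        last, output = run_if_saved(path, last)
        if output is not None:
            print(output, end="")
        time.sleep(interval)
```

Running watch("sandbox.py") in one terminal while editing sandbox.py in another gives the edit, save, see-the-output cycle. Supporting another language would mostly mean swapping the interpreter command, or adding a compile step for languages like C or Go.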

Here is a screenshot to clarify the idea:

Note that if you open a sandbox for a language like C or Go, the skeleton of what’s needed to run a program will already be in place, so you just have to “fill the blanks”.

For more details and download information, please check the hsandbox web page.

Gustavo Niemeyer

In my previous post I made an open statement which I’d like to clarify a bit further:

(…) when the rules don’t work for people, the rules should be changed, not the people.

This leaves a lot of room for personal interpretation of what was actually meant, and Tim Hoffman pointed that out nicely with the following question in a comment:

I wonder when the rule is important enough to change the people though. For instance [, if your] development process is oriented to TDD and people don’t write the tests or do the job poorly will you change them then?

This is indeed a nice scenario to explore the idea. If at some point a team claims to be using TDD, but in practice no developer actually writes tests first, the rules are clearly not working. If everyone in the team hates doing TDD, enforcing it most probably won’t show its intended benefits, and that was the heart of my comment. You can’t simply keep the rule as is if no one follows it, unless you don’t really care about the outcome of the rule.

One interesting point, though, is that when you have a high level of influence over the environment people work in, it may be possible to tweak the rules or the processes to adapt to reality, and tweaking the processes may change the way that people feel about the rules as a consequence (arguably, changing people as a side effect).

As a more concrete example, if I found myself in the described scenario, I’d try to understand why TDD is not working, and would discuss with the team how we should change the process so that it starts to work for us somehow. Maybe what would be needed is more discussion to show the value of TDD, and perhaps some pair programming with people who do TDD very well so that the joy of doing it becomes more visible.

In either case, I wouldn’t simply be asking people “Everyone has to do TDD from now on!”; I’d be tweaking the process so that it feels better and more natural to people. Then, if nothing similar works either, well, let’s change the rule. I’d try to use more conventional unit testing or some other system which people do follow more naturally and that presents similar benefits.

Gustavo Niemeyer

For a long time I’ve been an advocate of Python’s notion of controlling access to private and protected members (attributes, methods, etc.) with conventions, by simply naming them like “_name”, with an initial underscore. Even though Python does support “__name” (with a double underscore) for “private” members (this actually mangles the name rather than hiding it), you’ll notice that even this is rarely used in practice, and the largely agreed mantra is that convention should be enough and thus one underscore suffices. This always resonated quite well with me, since I generally prefer to handle situations by agreement rather than enforcement. Well, I’m now changing my opinion that this works well for this purpose, at least in certain situations.
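A small illustration of both conventions, using a hypothetical `Account` class: the single underscore is purely advisory, and the double underscore only renames the attribute to `_Account__audit_log` when it is written inside the class body, so outside code can still reach it.

```python
class Account:
    """Hypothetical class used only to illustrate Python's naming conventions."""

    def __init__(self) -> None:
        self._balance = 100    # single underscore: "private" by convention only
        self.__audit_log = []  # double underscore: stored as _Account__audit_log

    def deposit(self, amount: int) -> None:
        self._balance += amount
        self.__audit_log.append(amount)  # the short name works inside the class


acct = Account()
acct.deposit(50)

# Nothing stops outside code from reaching into the "private" attribute:
print(acct._balance)                  # 150

# The double-underscore attribute is renamed, not hidden:
print(hasattr(acct, "__audit_log"))   # False
print(acct._Account__audit_log)       # [50]
```

Neither form enforces anything; the name mangling is mainly meant to avoid accidental clashes with attributes defined in subclasses.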

This methodology may work quite well in situations where the code scope is within a very controlled environment, with one or more teams which strictly follow a single development guideline, and have the power to refactor the affected code base somewhat easily when the original decisions are too limiting.

Having worked on a few major projects now, some of them libraries used by several teams within the same company or outside it, I now perceive that people very often take shortcuts around these decisions to get their job done quickly. It’s way easier to simply read the code and get to the private guts of a library than to try to get agreement over the right way to do something, or to send a patch with a carefully architected change.

Many people by now are probably thinking: “Well, that’s their problem, isn’t it? If their code base breaks on the next upgrade they’ll be burdened and won’t be able to upgrade cleanly.”, and I can honestly understand this feeling, since I shared it. But, for a number of reasons, I now understand that this isn’t just their problem, it’s very much my problem too.

Most importantly, on any serious software, these problems will usually come back to the implementors, and many times the problem will have a much larger magnitude by then than it had at the time a change could have been done “the right way” in the implementation, because code dependent on the private bits will have settled.

Most people are optimists by nature and believe that the implementation won’t change, but, of course, one of the reasons why private information is made private in the first place is exactly because the implementor believes that having the freedom to change these details in the future is important, and not rarely there’s already a plan of evolution in place for these private pieces, which may include revamping the implementation entirely for scalability or for other goals.

In the best case, the careless people will be burdened by the upgrade and will either ask for support or silently not upgrade, and both cases hurt implementors, because providing support for broken software takes time and energy, and amazingly can even hurt the software’s image. Lack of upgrades also means more ancient versions in the wild to give support for. Besides these, in the worst case scenario, the careless people have enough influence on the affected project to cause as much burden on it as if the private data were public in the first place.

As much as I’m a believer in handling situations by agreement rather than enforcement, I’m also a believer that when the rules don’t work for people, the rules should be changed, not the people. So my position now is that the language-supported access constraints (public, protected, private), as available in languages like Java and C++, are a better alternative when compared to convention as used today in Python, since they provide an additional layer of encouragement for people to not break the rules carelessly, and that helps in the maintenance and reuse of software that has greater visibility.
