Canonical Voices

Posts tagged with 'c++'

abeato

Occasionally I find myself processing input data that arrives as a stream, like data from files or from a socket, but that has a known structure that can be modeled with C types. For instance, let’s say we are receiving from a socket a parcel that consists of a one-byte header and an integer payload. A naive way to handle this is the following (simplified for readability) code snippet:

int main(void)
{
    int fd;
    char *buff;
    struct sockaddr_in addr;
    int vint;
    char vchar;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    buff = malloc(BUFF_SIZE);
    /* Init socket address */
    ...
    connect(fd, (struct sockaddr *) &addr, sizeof(addr));

    read(fd, buff, BUFF_SIZE);

    vchar = buff[0];
    vint  = *(int *) &buff[1];
    /* Do something with extracted data, free resources */
    ...
    return 0;
}

Here we get the raw data with a read() call, we read the first byte, then we read an integer by taking a pointer to the second byte and casting it to a pointer to an integer (for this example we are assuming that the integer inserted in the stream has the same size and endianness as the CPU’s).

There is a big issue with this: the cast to int *, which is undefined behavior according to the C standard 1. Things can go wrong in at least two ways: first due to pointer aliasing rules, second due to type alignment.

Strict pointer aliasing tells the compiler that it can assume that pointers to different types point to different places in memory. This allows some optimizations, like reordering. Therefore, we could be in trouble if, say, we take &buff[1] into a char * and use it to write a byte in that location, as reordering could hit us. So just do not do that. Let’s also hope that we have a compiler that is not completely insane and does not move our read through the int pointer before the read() system call. We could also disable strict aliasing, if we are using GCC, with the option -fno-strict-aliasing, which by the way is something that the Linux kernel does. At any rate, this is a complex subject and I will not dig into it this time.

We will concentrate in this article on how to solve the other problem, that is, how to safely access types that are not stored in memory at their natural alignment.

The C Standard-Compliant Solution

Before moving further, keep in mind that it is always possible to be strictly compliant with the standard and access memory safely without breaking language rules or using compiler- or machine-specific tricks. In the example, we could retrieve vint by doing

    vint  =   buff[1] + (buff[2] << 8)
            + (buff[3] << 16) + (buff[4] << 24);

(supposing stored data is little endian).

The issue here is performance: we are implicitly converting four bytes to integers, then we have to bit-shift three of them, and finally we have to add them up. Note however that this is exactly what we want if the data and the CPU have different endianness.
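
For cases where this pattern repeats, the byte-by-byte assembly can be wrapped in a small helper. The following is a minimal sketch (the name read_le32 is mine, not from the code above); note the casts to unsigned types, which also avoid the sign-extension surprises of shifting plain char values:

#include <stdint.h>

/* Reassemble a 32-bit little-endian value from an arbitrary byte
 * position; strictly conforming C, regardless of alignment. */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t) p[0]
         | ((uint32_t) p[1] << 8)
         | ((uint32_t) p[2] << 16)
         | ((uint32_t) p[3] << 24);
}

With it, the extraction above becomes vint = read_le32((const unsigned char *) &buff[1]).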

Doing Unaligned Memory Accesses

Every machine architecture has a natural alignment for the different data types. This alignment is usually the size of the type; for instance, on 32-bit architectures the alignment for integers is 4, for doubles it is 8, etc. If instances of these types are not stored at memory addresses that are a multiple of their alignment, we are talking about unaligned access. If we try to access unaligned data, either of these can happen:

  • The hardware lets us access it – but always at a performance penalty.
  • An exception is triggered by the CPU. This type of exception is called bus error 2.
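
As a quick way to check the natural alignments on a given platform, C++11’s alignof operator reports them directly; a minimal sketch:

#include <iostream>

int main(void)
{
    // alignof yields the natural alignment of a type on this platform
    std::cout << "int: " << alignof(int) << std::endl
              << "long long: " << alignof(long long) << std::endl
              << "long double: " << alignof(long double) << std::endl;
    return 0;
}

On a typical x86-64 build this prints 4, 8 and 16.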

We might be willing to accept the performance penalty 3, which is mitigated by CPU caches and not that noticeable on certain architectures like x86-64, but we certainly do not want our program to crash. How likely is this? To be honest, it is not something I have seen that often. Therefore, as a first analysis step, I checked how easy it was to get bus errors. To do so, I created the following C++ program, access1.cpp (I could not resist using templates here to reduce the code size):

#include <iostream>
#include <typeinfo>
#include <cstring>

using namespace std;

template <typename T>
void print_unaligned(char *ptr)
{
    T *val = reinterpret_cast<T *>(ptr);

    cout << "Type is \"" << typeid(T).name()
         << "\" with size " << sizeof(T) << endl;
    cout << val << " *val: " << *val << endl;
}

int main(void)
{
    char *mem = new char[128];

    memset(mem, 0, 128);

    print_unaligned<int>(mem);
    print_unaligned<int>(mem + 1);
    print_unaligned<long long>(mem);
    print_unaligned<long long>(mem + 1);
    print_unaligned<long double>(mem);
    print_unaligned<long double>(mem + 1);

    delete[] mem;
    return 0;
}

The program allocates memory using new char[], which, like malloc() in C, is guaranteed to return memory aligned for the strictest fundamental type. After zeroing the memory, we access mem and mem + 1 by casting to different pointer types, knowing that the second address is odd, and therefore unaligned except for char * access.

I compiled the file with g++ on my laptop, ran it, and got

$ g++ access1.cpp -o access1
$ file access1
access1: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=09d0fb19340a10941eef4c3dc4d6eb29383e717d, not stripped
$ ./access1
Type is "i" with size 4
0x16c3c20 *val: 0
Type is "i" with size 4
0x16c3c21 *val: 0
Type is "x" with size 8
0x16c3c20 *val: 0
Type is "x" with size 8
0x16c3c21 *val: 0
Type is "e" with size 16
0x16c3c20 *val: 0
Type is "e" with size 16
0x16c3c21 *val: 0

No error for x86-64. This was expected, as the Intel architecture is known to support unaligned access in hardware, at a performance penalty (which is apparently quite small these days, see 4).

The second try was with an ARM CPU, compiling for arm-32:

$ g++ access1.cpp -o access1
$ file access1
access1: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=8c3c3e7d77fddd5f95d18dbffe37d67edc716a1c, not stripped
$ ./access1
Type is "i" with size 4
0x47b008 *val: 0
Type is "i" with size 4
0x47b009 *val: 0
Type is "x" with size 8
0x47b008 *val: 0
Type is "x" with size 8
Bus error (core dumped)

Now we get what we were looking for, a legitimate bus error, in this case when accessing a long long from an unaligned address. Commenting out the offending line and letting the program run further showed the error also when accessing a long double from mem + 1.

Fixing Unaligned Memory Accesses

After proving that this could be a real problem, at least for some architectures, I tried to find a solution that would let me do unaligned memory accesses in the most generic way. I could not find anything safe that strictly followed the C standard. However, all C/C++ compilers have ways to define packed structures, and that came to the rescue.

Packed structures are intended to minimize the padding introduced by the alignment requirements of the structure members. They are used when minimizing storage is a big concern. But what is interesting for us is that their members can be unaligned due to the packing, so dereferencing them must take that into account. Therefore, if we are accessing a type on a CPU that does not support unaligned access for that type, the compiler must synthesize code that handles this transparently from the point of view of the C program.

To test that this worked as expected, I wrote access2.cpp, which uses GCC attribute __packed__ to define a packed structure:

#include <iostream>
#include <typeinfo>
#include <cstring>

using namespace std;

template <typename T>
struct __attribute__((__packed__)) struct_safe
{
    T val;
};

template <typename T>
void print_unaligned(char *ptr)
{
    struct_safe<T> *safe = reinterpret_cast<struct_safe<T> *>(ptr);

    cout << "Type is \"" << typeid(T).name()
         << "\" with size " << sizeof(T) << endl;
    cout << safe << " safe->val: " << safe->val << endl;
}

int main(void)
{
    char *mem = new char[128];

    memset(mem, 0, 128);

    print_unaligned<int>(mem);
    print_unaligned<int>(mem + 1);
    print_unaligned<long long>(mem);
    print_unaligned<long long>(mem + 1);
    print_unaligned<long double>(mem);
    print_unaligned<long double>(mem + 1);

    delete[] mem;
    return 0;
}

In this case, instead of directly casting to the type, I cast to a pointer to the packed struct and access the type through it.

Compiling and running for x86-64 got the expected result: no error, all worked as before. Then I compiled and ran it in an ARM device:

$ g++ access2.cpp -o access2
$ file access2
access2: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=9a1ee8c2fcd97393a4b53fe12563676d9f2327a3, not stripped
$ ./access2
Type is "i" with size 4
0x391008 safe->val: 0
Type is "i" with size 4
0x391009 safe->val: 0
Type is "x" with size 8
0x391008 safe->val: 0
Type is "x" with size 8
0x391009 safe->val: 0
Type is "e" with size 8
0x391008 safe->val: 0
Type is "e" with size 8
0x391009 safe->val: 0

No bus errors anymore! It worked as expected. To gain some understanding of what was happening behind the curtains, I disassembled the generated ARM binaries. For both access1 and access2, the same instruction was used to load a value after casting to int: LDR, which unsurprisingly loads a 32-bit word into a register. But for the long long, I found that access1 was using LDRD, which loads a double word (8 bytes) from memory, while access2 was using two LDR instructions instead.

This all made a lot of sense, as ARM states that LDR supports access to unaligned data, while LDRD does not 5. Indeed the latter is faster, but it has this restriction. It was also good to check that there was no penalty for using the packed structure for integers: GCC does a good job of discriminating when the CPU really needs unaligned accesses to be handled differently.

GCC cast-align Warning

GCC has a warning that can help identify points in the code where we might be accessing unaligned data, activated with -Wcast-align. It is not among the warnings enabled by -Wall or -Wextra, so we have to add it explicitly if we want it. The warning is only triggered when compiling for architectures that do not support unaligned access for all types, so you will not see it when compiling only for x86.

When triggered, you will see something like

file.c:28:23: warning: cast increases required alignment of target type [-Wcast-align]
   int *my_int_ptr = (int *) &buf[i];
                     ^

Conclusion

The moral of this post is that you need to be very careful when casting pointers to a type different from the original one 6. When you need to do that, think about alignment issues first, and also think about your target architectures. There are programs that we want to run on more than one CPU type, and too often we test only on our reference platform.

Unfortunately the C standard does not give us a standard way of doing efficient access to unaligned data, but most if not all compilers seem to provide ways to do this. If we are using GCC, __attribute__((__packed__)) can help us when we might be doing unaligned accesses. The ARM compiler has a __packed attribute for pointers 7, and I am sure other compilers provide similar machinery. I also recommend activating -Wcast-align if using GCC, which makes it easier to spot alignment issues.
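
Going back to the parcel from the opening example, here is a sketch of how the GCC attribute could be applied there (the struct and its field names are mine, not part of any API):

#include <stdio.h>
#include <string.h>

struct __attribute__((__packed__)) parcel {
    char header;
    int  payload;   /* sits unaligned at offset 1 because of the packing */
};

int main(void)
{
    unsigned char buff[8] = {0};
    int v = 42;

    buff[0] = 1;                    /* simulate the received stream: */
    memcpy(&buff[1], &v, sizeof v); /* header byte, then the integer */

    const struct parcel *p = (const struct parcel *) buff;
    printf("header=%d payload=%d\n", p->header, p->payload);
    return 0;
}

As in access2.cpp, the compiler knows that payload may be unaligned and synthesizes safe loads where the architecture requires them.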

Finally, a word of caution: in most cases you should not do this type of cast. Sometimes you can define structures and read data directly onto them, sometimes you can use unions. Always bear in mind the strict pointer aliasing rules, which can hit back. To summarize, think twice before using the sort of tricks shown in this post, and use them only when really needed.

Read more
niemeyer

Over the last several months there has been noticeable and growing pain associated with the evolving integration tests around snapd, and given the project goal of being a cross-distribution platform, we are very keen on solving this problem appropriately so that stability is guaranteed everywhere.

With that mindset a more focused effort was made over the last few weeks to produce a tool that can get the project out of those problems, and onto a runway of more pleasant stability. Despite the short amount of time, I’m very happy about the Spread project which resulted from this effort.

Spread is not Jenkins or Travis, and is not a language or library either. Spread is a tool that will very conveniently ship your code to one or more systems, in parallel, and then offer the right set of options so you can run whatever you need to run to make sure the logic is working, and drive it all from the local system. That implies you can run Spread inside Travis, Jenkins, or your terminal, in a similar way to how your unit tests work.

Here is a short list of interesting facts about Spread:

  • Full-system tests with on-demand machine allocation.
  • Multi-backend with Linode and LXD (for local runs) out of the box for now.
  • Multi-language since it can run arbitrary remote code.
  • Agent-less and driven via embedded ssh (kudos to Go team).
  • Convenient harness with project+backend+suite+test prepare and restore scripts.
  • Variants feature for test duplication without copy & paste.
  • Great debugging support – add -debug and stop with a shell inside every failure.
  • Reuse of servers – server allocation is fast, but not allocating is faster.
  • Reasonable test outputs with the shell’s +x mode on failures.
  • … and so forth.

This is all well documented, so I’ll just provide one example here to offer a real taste of how the system feels.

This is spread.yaml, put in the project root to define the basics:

project: spread

backends:
    lxd:
        systems:
            - ubuntu-16.04
            - ubuntu-14.04

path: /home/test

prepare: |
    echo Entering project...
restore: |
    echo Leaving project...

suites:
    tests/: 
        summary: Integration tests
        prepare: |
            echo Entering suite...
        restore: |
            echo Leaving suite...

The suite name is also the path under which the tests are found.

Then, this is tests/hello/task.yaml:

summary: Greet the world
prepare: |
    echo "Entering task..."
restore: |
    echo "Leaving task..."
environment:
    FOO/a: one
    FOO/b: two
execute: |
    echo "Hello world!"
    [ $FOO = one ] || exit 1

The outcome should be almost obvious (intended feature :-). The one curious detail here is the FOO/a and FOO/b environment variables. This is how to introduce variants, which means this one test will in fact become two: first with FOO=one, and then with FOO=two. Now consider that such environment variables can be defined at any level – project, backend, suite, and task – and imagine how easy it is to test small variations without any copy & paste. After cascading takes place (project→backend→suite→task) all environment variables using a given variant key will be present at once on the same execution.

Now let’s try to run this configuration, including the -debug flag so we get a shell on the failures. Note how with a single test we get four different jobs, two variants over two systems, with the variant b failing as instructed:

$ spread -debug

2016/06/11 19:09:27 Allocating lxd:ubuntu-14.04...
2016/06/11 19:09:27 Allocating lxd:ubuntu-16.04...
2016/06/11 19:09:41 Waiting for LXD container to have an address...
2016/06/11 19:09:43 Waiting for LXD container to have an address...
2016/06/11 19:09:44 Allocated lxd:ubuntu-14.04.
2016/06/11 19:09:44 Connecting to lxd:ubuntu-14.04...
2016/06/11 19:09:48 Allocated lxd:ubuntu-16.04.
2016/06/11 19:09:48 Connecting to lxd:ubuntu-16.04...
2016/06/11 19:09:52 Connected to lxd:ubuntu-14.04.
2016/06/11 19:09:52 Sending project data to lxd:ubuntu-14.04...
2016/06/11 19:09:53 Connected to lxd:ubuntu-16.04.
2016/06/11 19:09:53 Sending project data to lxd:ubuntu-16.04...

2016/06/11 19:09:54 Error executing lxd:ubuntu-14.04:tests/hello:b :
-----
+ echo Hello world!
Hello world!
+ [ two = one ]
+ exit 1
-----

2016/06/11 19:09:54 Starting shell to debug...

lxd:ubuntu-14.04 ~/tests/hello# echo $FOO
two
lxd:ubuntu-14.04 ~/tests/hello# cat /etc/os-release | grep ^PRETTY
PRETTY_NAME="Ubuntu 14.04.4 LTS"
lxd:ubuntu-14.04 ~/tests/hello# exit
exit

2016/06/11 19:09:55 Error executing lxd:ubuntu-16.04:tests/hello:b :
-----
+ echo Hello world!
Hello world!
+ [ two = one ]
+ exit 1
-----

2016/06/11 19:09:55 Starting shell to debug...

lxd:ubuntu-16.04 ~/tests/hello# echo $FOO
two
lxd:ubuntu-16.04 ~/tests/hello# cat /etc/os-release | grep ^PRETTY
PRETTY_NAME="Ubuntu 16.04 LTS"
lxd:ubuntu-16.04 ~/tests/hello# exit
exit


2016/06/11 19:10:33 Discarding lxd:ubuntu-14.04 (spread-129)...
2016/06/11 19:11:04 Discarding lxd:ubuntu-16.04 (spread-130)...
2016/06/11 19:11:05 Successful tasks
2016/06/11 19:11:05 Aborted tasks: 0
2016/06/11 19:11:05 Failed tasks: 2
    - lxd:ubuntu-14.04:tests/hello:b
    - lxd:ubuntu-16.04:tests/hello:b
error: unsuccessful run

This demonstrates many of the stated goals (parallelism, clarity, convenience, debugging, …) while running on a local system. Running on a remote system is just as easy by using an appropriate backend. The snapd project on GitHub, for example, is hooked up on Travis to run Spread and then ship its tests over to Linode. Here is a real run output with the initial tests being ported, and a basic smoke test.

If you like what you see, by all means please go ahead and make good use of it.

We’re all for more stability and sanity everywhere.

@gniemeyer

Read more
bmichaelsen

They sentenced me to twenty years of boredom
For trying to change the system from within
— Leonard Cohen, I’m your man, First we take Manhattan

Advance warning: This blog post talks about C++ coding style, and given the language’s “expressiveness” (aka a severe infection with TimTowTdi), this is bound to contain significant amounts of bikeshedding and personal opinion/preference. As such, feel invited to ignore all this as the ramblings of a raging lunatic.

Anyone who has observed me spotting a Pimpl in code will know that I am not a fan of this idiom. Its intent is to reduce build times by using a design pattern to move implementation details out of headers — a workaround for C++’s misfeature of, by default, needing a recompile even for changes to implementation details that leave the public interface untouched. Now I personally always thought a pure abstract base class to be a more “native” and less ugly way to tell this to the compiler. However, without real testing, such gut feelings are rarely good advisors in a complex language like C++.

So I did some testing on the real-life performance of a pure abstract base class vs. a Pimpl (each of course in a different compilation unit, to prevent the compiler from optimizing away what we want to measure) — and, for reference, a class with functions that can be completely inlined. These are the three test implementations, inline:

-- header (hxx) --
class InlineClass final
{
	public:
		InlineClass(int nFirst, int nSecond)
			: m_nFirst(nFirst), m_nSecond(nSecond), m_nResult(0)
		{};
		void Add()
			{ m_nResult = m_nFirst + m_nSecond; };
		int GetResult() const
			{ return m_nResult; };
	private:
		const int m_nFirst;
		const int m_nSecond;
		int m_nResult;
};

Pimpl, as suggested by Effective Modern C++ when using C++11, but not C++14:

-- header (hxx) --
#include <memory>
class PimplClass final
{
	public:
		PimplClass(int nFirst, int nSecond);
		~PimplClass();
		void Add();
		int GetResult() const;
	private:
		struct Impl;
		std::unique_ptr<Impl> m_pImpl;
};
-- implementation (cxx) --
#include "pimpl.hxx"
struct PimplClass::Impl
{
	Impl(int nFirst, int nSecond)
		: m_nFirst(nFirst), m_nSecond(nSecond), m_nResult(0)
	{};
	const int m_nFirst;
	const int m_nSecond;
	int m_nResult;
};
PimplClass::PimplClass(int nFirst, int nSecond)
	: m_pImpl(std::unique_ptr<Impl>(new Impl(nFirst, nSecond)))
{}
PimplClass::~PimplClass()
	{}
void PimplClass::Add()
	{ m_pImpl->m_nResult = m_pImpl->m_nFirst + m_pImpl->m_nSecond; }
int PimplClass::GetResult() const
	{ return m_pImpl->m_nResult; }

Pure abstract base class:

-- header (hxx) --
#include <memory>
struct AbcClass
{
	static std::shared_ptr<AbcClass> Create(int nFirst, int nSecond);
	virtual ~AbcClass() {};
	virtual void Add() =0;
	virtual int GetResult() const =0;
};
-- implementation (cxx) --
#include "abc.hxx"
#include <memory>
struct AbcClassImpl final : public AbcClass
{
	AbcClassImpl(int nFirst, int nSecond)
		: m_nFirst(nFirst), m_nSecond(nSecond)
	{};
	virtual void Add() override
		{ m_nResult = m_nFirst + m_nSecond; };
	virtual int GetResult() const override
		{ return m_nResult; };
	const int m_nFirst;
	const int m_nSecond;
	int m_nResult;
};
std::shared_ptr<AbcClass> AbcClass::Create(int nFirst, int nSecond)
	{ return std::shared_ptr<AbcClass>(new AbcClassImpl(nFirst, nSecond)); }

Comparing these we find:

implementation | lines added for GetResult() | source entropy | added source entropy for GetResult() | runtime
inline         | 2                           | 187            | 17                                    | 100%
Pimpl          | 3                           | 316            | 26                                    | 168% (174%)
pure ABC       | 3                           | 295 (273)      | 19 (16)                               | 158%

So the abstract base class has less complex source code (entropy)1, needs less additional entropy when extended, and is still faster in the end on common hardware (Intel i5-4200U) with common compiler optimization switches (-O2)2.

Additionally, in a non-trivial code base you might actually need to use virtual functions for your implementation anyway, as you are deriving from or implementing an existing interface. In the Pimpl case, this means two indirections (resolving the virtual function, and then resolving the m_pImpl pointer inside that function on top of it). In the abstract base class case that’s not happening, and in addition it means you can spare yourself the pure virtual declarations in the *.hxx (the virtual ... =0 ones), as those are already declared in the class derived from. In LibreOffice, this is true for any class implementing UNO interfaces. So the first numbers are actually biased against an abstract base class for real-world code bases — the numbers in parentheses show the results when an interface is already defined elsewhere.
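
To illustrate this double indirection, here is a small hypothetical sketch (ICalc and PimplCalc are made-up names, not taken from any of the code bases mentioned) of a Pimpl class that also has to implement a pre-existing interface:

-- header (hxx) --
#include <memory>
struct ICalc
{
	virtual ~ICalc() {};
	virtual void Add() =0;
};
class PimplCalc final : public ICalc
{
	public:
		PimplCalc();
		~PimplCalc();
		virtual void Add() override; // 1st indirection: virtual dispatch
	private:
		struct Impl;
		std::unique_ptr<Impl> m_pImpl; // 2nd indirection: pointer chase
};

An abstract base class implementation pays only the first of those, as its data members live directly in the derived object.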

So unless the synthetic example used here is some kind of weird corner case, this suggests that abstract base classes are the better alternative to a Pimpl once the class goes beyond being a plain value type with completely inlineable accessor member functions.

Thanks for bearing with me on this rant about one of my personal pet peeves here!

1 entropy is measured as cat abc.[hc]xx|gzip|wc -c or cat pimpl.[hc]xx|sed -e 's/Pimpl/Abc/g'|gzip|wc -c.
2 Here is the code run for that comparison:

constexpr int repeats = 100000;

int pimplrun(long count)
//int abcrun(long count)
{
        std::vector< std::shared_ptr<PimplClass /* AbcClass */ > > vInstances;
        vInstances.reserve(count);
        while(--count)
                vInstances.emplace_back(std::make_shared<PimplClass>(4711, 4711));
                //vInstances.emplace_back(AbcClass::Create(4711, 4711));
        int result(0);
        count = vInstances.size();
        while(--count)
                for(auto pInstance : vInstances)
                {
                        pInstance->Add();
                        result += pInstance->GetResult();
                }
        return result;
}

Instances are stored in shared pointers as anything that a Pimpl is considered for would be “heavy” enough to be handled by reference instead of by value.

Update 1: Out of curiosity, I looked a bit deeper at this with callgrind. This is what I found for running the above (with 1000 repeats) and --cache-sim=yes:

I1 cache: 32768 B, 64 B, 8-way
D1 cache: 32768 B, 64 B, 8-way
LL cache: 3145728 B, 64 B, 12-way

event | inline     | ABC        | Pimpl
Ir    | 23,356,163 | 38,652,092 | 38,620,878
Dr    |  5,066,041 | 14,109,098 | 12,107,992
Dw    |  3,060,033 |  5,094,790 |  5,099,991
I1ir  |         34 |        127 |         29
D1mr  |    499,952 |    253,006 |    999,013
D1mw  |    501,636 |    998,312 |    500,097
ILmr  |         28 |        126 |         24
DLmr  |          2 |        845 |          0
DLmw  |          0 |      1,285 |        250

I don’t know exactly what to derive from that, but what is clear is that this cannot be explained purely by the instruction counts (Ir). So you need --cache-sim=yes, which gives the additional event counts. Actually Pimpl looks slightly better on most stats, so as it is slower in real life, the misses on the first-level data cache (D1mr) might have quite an impact?

Update 2: This post made it to reddit, so I looked into some of the feedback from there. A common suggestion was to use for(auto& pInstance : vInstances) instead of for(auto pInstance : vInstances) in the benchmarking function. This had no significant impact on walltime measurements, nor did it make the callgrind event counts show a clearer picture. I also played around with the order of linked objects to see if it had any impact (via cache locality etc.). While runtime measurements fluctuated quite a bit (even when using the same binary), the order was always the same: inlining quickest, then the abstract base class, and Pimpl slowest.


Read more
bmichaelsen

I would walk 500 miles and I would walk 500 more
The proclaimers, 500 miles

So I recently noted that github reported I have 1337 commits on LibreOffice since I joined Canonical in February 2011. Looking at those stats, it seems I also deleted some net 155,634 lines over that time in the codebase.

LibreOffice commits

Even though I can’t find that mail, I seem to remember that Michael Stahl, when joining the LibreOffice project, proclaimed his goal to be to contribute ‘a net negative lines of code.’1) Now I have not looked into the details of the above stats — they might very likely turn out to be caused by some bulk change. Which would be lame, unless it’s the killing of the old build system, for which I think I can claim some credit. But in general I really love the idea of ‘contributing a net negative number of lines of code’.

So, at the last LibreOffice Hackfest in Cambridge 2), I pushed a set of commits refactoring the UNO bindings of Writer tables. It all started so innocently. I was actually aiming to do something completely different: namely, give the UNO cursors in Writer (SwUnoCrsr) somewhat saner resource management and drag them screaming and kicking out of the 1980ies. However, once in unotbl.cxx, I found more “determined Real Programmer can write FORTRAN programs in any language” and copypasta there than I could bear. I thought: “This UNO stuff has decent test coverage, you could refactor it a bit quickly.”

Of course I was wrong with both sides of that statement: On the one hand, when I started the coverage was 70.1% LOC on that file which is not really as high as I expected. On the other hand, I did not end with “a bit quickly”, rather I went on to refactor away:
dc -e "`git log --author Michaelsen -p dc8697e554417d31501a0d90d731403ede223370^..HEAD sw/source/core/unocore/unotbl.cxx|grep ^+|wc -l` `git log --author Michaelsen -p dc8697e554417d31501a0d90d731403ede223370^..HEAD sw/source/core/unocore/unotbl.cxx|grep ^-|wc -l` - p"
-1015

… a thousand lines. On discovering the lacking test coverage, I quickly added some more tests — bringing coverage to 77.52% LOC at least now.3) And yes, I also silently fixed the one regression I thereby discovered I had introduced, which nobody seemed to have noticed so far. One thing I noticed in this little refactoring spree is that while C++11’s features might look tame compared to more modern programming languages in metrics like avoiding boilerplate, it still outclasses what we had before. Beyond the simplifying refactoring, features like lambdas are really nice for non-interactive (test-driven) debugging, including quickly asserting on the state of variables some 10 stack frames up or down without going into major contortions in test code.

1) By the way, a quick:
dc -e "`git log --author Stahl -p |grep ^+|wc -l` `git log --author Stahl -p |grep ^-|wc -l` - p"
-108686

confirms Michael is more than living up to his personal goals.

2) Speaking of the Hackfest: The other thing I did there was helping/observing Sam Tuke getting set up for his first code contribution. While we made great progress in making this easier than it used to be, we could still be a lot better there. Sadly though, I didn’t see a shortcut or simplification we could implement right away.

3) And along with that brought coverage of unochart.cxx from an abysmal 4.4% LOC to at least 35.31% LOC as collateral damage.

addendum: Note that the writer tables core also increased coverage quite a bit from 54.6% LOC to 65% LOC.


Read more
bmichaelsen

When logic and proportion have fallen sloppy dead
And the white knight is talking backwards
And the red queen’s off with her head
Remember what the dormouse said
Feed your head, feed your head

— Jefferson Airplane, White Rabbit

So, this was intended as a quick and smooth addendum to the “50 ways to fill your vector” post, bringing callgrind into the game and assuring everyone that its instruction counts are a good proxy for the walltime performance of your code. This started out mostly as expected, when measuring the instruction counts in two scenarios:

implementation/cflags | -O2 not inlined | -O3 inlined
A1                    | 2610061438      | 2510061428
A2                    | 2610000025      | 2510000015
A3                    | 2610000025      | 2510000015
B1                    | 3150000009      | 2440000009
B2                    | 3150000009      | 2440000009
B3                    | 3150000009      | 2440000009
C1                    | 3150000009      | 2440000009
C3                    | 3300000009      | 2440000009

The good news here is that this mostly faithfully reproduces some general observations on the timings from the last post on this topic, although the differences are more pronounced in callgrind than in reality:

  • The A implementations are faster than the B and C implementations on -O2 without inlining
  • The A implementations are slower (by a smaller amount) than the B and C implementations on -O3 with inlining

The last post also suggested the expectation that all implementations could — and, with a good compiler, should — have the same code and the same speed when everything is inlined. Apart from the A implementations still differing from the B and C ones, callgrind’s instruction counts suggest this to actually be the case. Letting gcc compile to assembler and comparing the output, one finds:

  • Inline A1-3 compile to the same output on -Os, -O2, -O3 each. There is no difference between -O2 and -O3 for these.
  • Inline B1-3 compile to the same output on -Os, -O2, -O3 each, but they differ between optimization levels.
  • Inline C3 output differs from the others and between optimization levels.
  • Without inlinable constructors, the picture is the same, except that A3 and B3 now differ slightly from their kin as expected.

So indeed most of the implementations generate the same assembler code. However, this is quite a bit at odds with the significant differences in performance measured in the last post; e.g. B1/B2/B3 on -O2 produced widely different walltimes. So, time to test the assumption that running one implementation for a minute produces a reasonably stable result statistically, by doing ten 1-minute runs for each implementation and seeing what the standard deviation is. The following is found for walltimes (no inline constructors):

implementation/cflags | -Os    | -O2    | -O3    | -O3 -march=
A1                    | 80.6 s | 78.9 s | 78.9 s | 79.0 s
A2                    | 78.7 s | 78.1 s | 78.0 s | 79.2 s
A3                    | 80.7 s | 78.9 s | 78.9 s | 78.9 s
B1                    | 84.8 s | 80.8 s | 78.0 s | 78.0 s
B2                    | 84.8 s | 86.0 s | 78.0 s | 78.1 s
B3                    | 84.8 s | 82.3 s | 79.7 s | 79.7 s
C1                    | 84.4 s | 85.4 s | 78.0 s | 78.0 s
C3                    | 86.6 s | 85.7 s | 78.0 s | 78.9 s

And with inlining:

implementation/cflags | -Os    | -O2    | -O3    | -O3 -march=
A1                    | 76.4 s | 74.5 s | 74.7 s | 73.8 s
A2                    | 75.4 s | 73.7 s | 73.8 s | 74.5 s
A3                    | 76.3 s | 74.6 s | 75.5 s | 73.7 s
B1                    | 80.6 s | 77.1 s | 72.7 s | 73.7 s
B2                    | 81.4 s | 78.9 s | 72.0 s | 72.0 s
B3                    | 80.6 s | 78.9 s | 72.8 s | 73.7 s
C1                    | 81.4 s | 78.9 s | 72.0 s | 72.0 s
C3                    | 79.7 s | 80.5 s | 72.9 s | 77.8 s

The standard deviation for all the above values is less than 0.2 seconds. That is … interesting: For example, on -O2 without inlining, B1 and B2 generate the same assembler output, but execute with a very significant difference on hardware (a 5.2 s difference, or more than 25 standard deviations). So how have logic and proportion fallen sloppy dead here? If the same code is executed — admittedly from two different locations in the binary — how can that create such a significant difference in walltime performance, while not being visible at all in callgrind? A wild guess, which I have not confirmed yet, is cache locality: When not inlining constructors, those might be in the CPU cache from one copy of the code in the binary, but not for the other. By the way, this might also hint at the reason the -march= flag (which creates bigger code) seems so ineffective. And it might explain why performance is rather consistent when using inline constructors. If so, the impact of this is certainly interesting. It also suggests that allowing inlining of hotspots, like recently done with the low-level sw::Ring class, produces a much bigger performance improvement on real hardware than the meager results measured with callgrind. And it reinforces the warning made in that post about not falling into the trap of mistaking the map for the territory: callgrind is not a “map in the scale of a mile to the mile”.

Addendum: As said in the previous post, I am still interested in such measurements on other hardware or compilers. All measurements above done with gcc 4.8.3 on Intel i5-4200U@1.6GHz.


Read more
Kyle Nitzsche

AptBrowser QML/C++ App

I've made a QML/C++ app called aptBrowser as an exercise in:

  • QML declarative GUI that drives
  • C++ backend threads

That is, the GUI provides buttons (five) that kick off C++ threads that do the backend work and provide the results back to QML.

So the GUI is always responsive (non-blocking).
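
A minimal sketch of that pattern (hypothetical names, not the actual aptbrowser code): a QObject backend exposed to QML runs the slow apt query through QtConcurrent and signals the result back, so the GUI thread never blocks:

#include <QObject>
#include <QStringList>
#include <QtConcurrent>

class PackageBackend : public QObject
{
    Q_OBJECT
public:
    // Invokable from QML; returns immediately, the work happens off-thread
    Q_INVOKABLE void queryDepends(const QString &package)
    {
        QtConcurrent::run([this, package]() {
            const QStringList result = runAptQuery(package); // slow apt-cache work
            emit dependsReady(result); // delivered to QML via a queued connection
        });
    }

signals:
    void dependsReady(const QStringList &packages);

private:
    QStringList runAptQuery(const QString &package); // defined elsewhere
};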

What aptbrowser does

The user enters a Debian package name (and is told if it is not valid) and taps one of five buttons that do the following:
  • Show the packages this package depends on ("Depends")
  • Show the packages this package recommends ("Recommends")
  • Show the packages that depend on this package ("Parent Depends")
  • Show the packages that recommend this package ("Parent Recommends")
  • Show the apt-cache policy for this package ("Policy")
The data for all but the last ("Policy") are returned as flickable lists of buttons. When you click any one, it becomes the current package, and the GUI and displayed data adjust appropriately.

When you click any of the buttons, the orange indicator square to its left turns purple and starts spinning; when the C++ backend returns data, its indicator turns orange again and stops spinning.

Note that the Parent Depends and Parent Recommends actions can take a long time. This has nothing to do with this app. This is simply how long it takes to first get a package's parents and then, for each, find its type of relationship (depends or recommends) to our package of interest. Querying the apt cache is time consuming.

Where is aptbrowser

Store

Because the app queries the apt cache, it must run unconfined at the moment, and therefore it cannot go into the store.

The click

This is an armhf click package for framework ubuntu-sdk.14.10 (compiled against vivid).

The source

  • bzr branch lp:aptbrowser

    Read more
    bmichaelsen

    “The problem is all inside your head” she said to me
    “The answer is easy if you take it logically”
    — Paul Simon, 50 ways to leave your lover

    So recently I tweaked around with these newfangled C++11 initializer lists and created an EasyHack to use them to initialize property sequences in a readable way. This caused a short exchange on the LibreOffice mailing list, which I assumed had its part in motivating Stephan’s interesting post “On filling a vector”. For all the points being made (also in the quick follow-up on IRC), I wondered how much the theoretical “can use a move constructor” discussed there really meant once the C++ is translated to e.g. GENERIC, then GIMPLE, then amd64 assembler, then to the internal RISC instructions of the CPU — with multiple levels of caching in addition.

    So I quickly wrote the following (thanks so much for C++11 having the nice std::chrono now).

    data.hxx:

    #include <vector>
    struct Data {
        Data();
        Data(int a);
        int m_a;
    };
    void DoSomething(std::vector<Data>&);

    data.cxx:

    #include "data.hxx"
    // noop in different compilation unit to prevent optimizing out what we want to measure
    void DoSomething(std::vector<Data>&) {};
    Data::Data() : m_a(4711) {};
    Data::Data(int a) : m_a(a+4711) {};

    main.cxx:

    #include "data.hxx"
    #include <iostream>
    #include <vector>
    #include <chrono>
    #include <functional>

    void A1(long count) {
        while(--count) {
            std::vector<Data> vec { Data(), Data(), Data() };
            DoSomething(vec);
        }
    }

    void A2(long count) {
        while(--count) {
            std::vector<Data> vec { {}, {}, {} };
            DoSomething(vec);
        }
    }

    void A3(long count) {
        while(--count) {
            std::vector<Data> vec { 0, 0, 0 };
            DoSomething(vec);
        }
    }

    void B1(long count) {
        while(--count) {
            std::vector<Data> vec;
            vec.reserve(3);
            vec.push_back(Data());
            vec.push_back(Data());
            vec.push_back(Data());
            DoSomething(vec);
        }
    }

    void B2(long count) {
        while(--count) {
            std::vector<Data> vec;
            vec.reserve(3);
            vec.push_back({});
            vec.push_back({});
            vec.push_back({});
            DoSomething(vec);
        }
    }

    void B3(long count) {
        while(--count) {
            std::vector<Data> vec;
            vec.reserve(3);
            vec.push_back(0);
            vec.push_back(0);
            vec.push_back(0);
            DoSomething(vec);
        }
    }

    void C1(long count) {
        while(--count) {
            std::vector<Data> vec;
            vec.reserve(3);
            vec.emplace_back(Data());
            vec.emplace_back(Data());
            vec.emplace_back(Data());
            DoSomething(vec);
        }
    }

    void C3(long count) {
        while(--count) {
            std::vector<Data> vec;
            vec.reserve(3);
            vec.emplace_back(0);
            vec.emplace_back(0);
            vec.emplace_back(0);
            DoSomething(vec);
        }
    }

    double benchmark(const char* name, std::function<void (long)> testfunc, const long count) {
        const auto start = std::chrono::system_clock::now();
        testfunc(count);
        const auto end = std::chrono::system_clock::now();
        const std::chrono::duration<double> delta = end-start;
        std::cout << count << " " << name << " iterations took " << delta.count() << " seconds." << std::endl;
        return delta.count();
    }

    int main(int, char**) {
        long count = 10000000;
        while(benchmark("A1", &A1, count) < 60l)
            count <<= 1;
        std::cout << "Going with " << count << " iterations." << std::endl;
        benchmark("A1", &A1, count);
        benchmark("A2", &A2, count);
        benchmark("A3", &A3, count);
        benchmark("B1", &B1, count);
        benchmark("B2", &B2, count);
        benchmark("B3", &B3, count);
        benchmark("C1", &C1, count);
        benchmark("C3", &C3, count);
        return 0;
    }

    Makefile:

    CFLAGS?=-O2
    main: main.o data.o
        g++ -o $@ $^

    %.o: %.cxx data.hxx
        g++ $(CFLAGS) -std=c++11 -o $@ -c $<

    Note the object here is small and trivial to copy, as one would expect from objects passed around as values (expensive-to-copy objects can mostly be passed around with a std::shared_ptr instead). So what did this measure? Here are the results:

    Time for 1280000000 iterations on a Intel i5-4200U@1.6GHz (-march=core-avx2) compiled with gcc 4.8.3 without inline constructors:

    implementation / CFLAGS | -Os     | -O2     | -O3    | -O3 -march=…
    A1                      | 89.1 s  | 79.0 s  | 78.9 s | 78.9 s
    A2                      | 89.1 s  | 78.1 s  | 78.0 s | 80.5 s
    A3                      | 90.0 s  | 78.9 s  | 78.8 s | 79.3 s
    B1                      | 103.6 s | 97.8 s  | 79.0 s | 78.0 s
    B2                      | 99.4 s  | 95.6 s  | 78.5 s | 78.0 s
    B3                      | 107.4 s | 90.9 s  | 79.7 s | 79.9 s
    C1                      | 99.4 s  | 94.4 s  | 78.0 s | 77.9 s
    C3                      | 98.9 s  | 100.7 s | 78.1 s | 81.7 s

    creating a three element vector without inlined constructors
    And, for comparison, the following are the results if one allows the constructors to be inlined.
    Time for 1280000000 iterations on a Intel i5-4200U@1.6GHz (-march=core-avx2) compiled with gcc 4.8.3 with inline constructors:

    implementation / CFLAGS | -Os    | -O2    | -O3    | -O3 -march=…
    A1                      | 85.6 s | 74.7 s | 74.6 s | 74.6 s
    A2                      | 85.3 s | 74.6 s | 73.7 s | 74.5 s
    A3                      | 91.6 s | 73.8 s | 74.4 s | 74.5 s
    B1                      | 93.4 s | 90.2 s | 72.8 s | 72.0 s
    B2                      | 93.7 s | 88.3 s | 72.0 s | 73.7 s
    B3                      | 97.6 s | 88.3 s | 72.8 s | 72.0 s
    C1                      | 93.4 s | 88.3 s | 72.0 s | 73.7 s
    C3                      | 96.2 s | 88.3 s | 71.9 s | 73.7 s

    creating a three element vector without inlined constructors
    Some observations on these measurements:

    • -march=... is at best neutral: The measured times do not change much in general, they only even slightly improve performance in five out of 16 cases, and the two cases with the most significant change in performance (over 3%) are actually hurting the performance. So for the rest of this post, -march=... will be ignored. Sorry gentooers.

      Read more
    bmichaelsen

    “The problem is all inside your head” she said to me
    “The answer is easy if you take it logically”
    — Paul Simon, 50 ways to leave your lover

    So recently I tweaked around with these newfangled C++11 initializer lists and created an EasyHack to use them to initialize property sequences in a readable way. This caused a short exchange on the LibreOffice mailing list, which I assumed had its part in motivating Stephan’s interesting post “On filling a vector”. For all the points being made (also in the quick follow-up on IRC), I wondered how much theoretical considerations like “can use a move constructor” really mean once the C++ is translated to e.g. GENERIC, then GIMPLE, then amd64 assembler, and then to the internal RISC instructions of the CPU, with multiple levels of caching in addition.

    So I quickly wrote the following (thanks so much to C++11 for now having the nice std::chrono).

    data.hxx:

    #include <vector>
    struct Data {
        Data();
        Data(int a);
        int m_a;
    };
    void DoSomething(std::vector<Data>&);

    data.cxx:

    #include "data.hxx"
    // noop in different compilation unit to prevent optimizing out what we want to measure
    void DoSomething(std::vector<Data>&) {}
    Data::Data() : m_a(4711) {}
    Data::Data(int a) : m_a(a+4711) {}

    main.cxx:

    #include "data.hxx"
    #include <iostream>
    #include <vector>
    #include <chrono>
    #include <functional>

    void A1(long count) {
        while(--count) {
            std::vector<Data> vec { Data(), Data(), Data() };
            DoSomething(vec);
        }
    }

    void A2(long count) {
        while(--count) {
            std::vector<Data> vec { {}, {}, {} };
            DoSomething(vec);
        }
    }

    void A3(long count) {
        while(--count) {
            std::vector<Data> vec { 0, 0, 0 };
            DoSomething(vec);
        }
    }

    void B1(long count) {
        while(--count) {
            std::vector<Data> vec;
            vec.reserve(3);
            vec.push_back(Data());
            vec.push_back(Data());
            vec.push_back(Data());
            DoSomething(vec);
        }
    }

    void B2(long count) {
        while(--count) {
            std::vector<Data> vec;
            vec.reserve(3);
            vec.push_back({});
            vec.push_back({});
            vec.push_back({});
            DoSomething(vec);
        }
    }

    void B3(long count) {
        while(--count) {
            std::vector<Data> vec;
            vec.reserve(3);
            vec.push_back(0);
            vec.push_back(0);
            vec.push_back(0);
            DoSomething(vec);
        }
    }

    void C1(long count) {
        while(--count) {
            std::vector<Data> vec;
            vec.reserve(3);
            vec.emplace_back(Data());
            vec.emplace_back(Data());
            vec.emplace_back(Data());
            DoSomething(vec);
        }
    }

    void C3(long count) {
        while(--count) {
            std::vector<Data> vec;
            vec.reserve(3);
            vec.emplace_back(0);
            vec.emplace_back(0);
            vec.emplace_back(0);
            DoSomething(vec);
        }
    }

    double benchmark(const char* name, std::function<void (long)> testfunc, const long count) {
        const auto start = std::chrono::system_clock::now();
        testfunc(count);
        const auto end = std::chrono::system_clock::now();
        const std::chrono::duration<double> delta = end-start;
        std::cout << count << " " << name << " iterations took " << delta.count() << " seconds." << std::endl;
        return delta.count();
    }

    int main(int, char**) {
        long count = 10000000;
        while(benchmark("A1", &A1, count) < 60l)
            count <<= 1;
        std::cout << "Going with " << count << " iterations." << std::endl;
        benchmark("A1", &A1, count);
        benchmark("A2", &A2, count);
        benchmark("A3", &A3, count);
        benchmark("B1", &B1, count);
        benchmark("B2", &B2, count);
        benchmark("B3", &B3, count);
        benchmark("C1", &C1, count);
        benchmark("C3", &C3, count);
        return 0;
    }

    Makefile:

    CFLAGS?=-O2
    main: main.o data.o
        g++ -o $@ $^

    %.o: %.cxx data.hxx
        g++ $(CFLAGS) -std=c++11 -o $@ -c $<

    Note that the object here is small and trivial to copy, as one would expect from objects passed around as values (expensive-to-copy objects can mostly be passed around with a std::shared_ptr instead). So what did this measure? Here are the results:

    Time for 1280000000 iterations on an Intel i5-4200U@1.6GHz (-march=core-avx2) compiled with gcc 4.8.3 without inline constructors:

    implementation / CFLAGS    -Os        -O2        -O3        -O3 -march=…
    A1                         89.1 s     79.0 s     78.9 s     78.9 s
    A2                         89.1 s     78.1 s     78.0 s     80.5 s
    A3                         90.0 s     78.9 s     78.8 s     79.3 s
    B1                         103.6 s    97.8 s     79.0 s     78.0 s
    B2                         99.4 s     95.6 s     78.5 s     78.0 s
    B3                         107.4 s    90.9 s     79.7 s     79.9 s
    C1                         99.4 s     94.4 s     78.0 s     77.9 s
    C3                         98.9 s     100.7 s    78.1 s     81.7 s

    creating a three-element vector without inlined constructors

    And, for comparison, here are the results if one allows the constructors to be inlined.

    Time for 1280000000 iterations on an Intel i5-4200U@1.6GHz (-march=core-avx2) compiled with gcc 4.8.3 with inline constructors:

    implementation / CFLAGS    -Os        -O2        -O3        -O3 -march=…
    A1                         85.6 s     74.7 s     74.6 s     74.6 s
    A2                         85.3 s     74.6 s     73.7 s     74.5 s
    A3                         91.6 s     73.8 s     74.4 s     74.5 s
    B1                         93.4 s     90.2 s     72.8 s     72.0 s
    B2                         93.7 s     88.3 s     72.0 s     73.7 s
    B3                         97.6 s     88.3 s     72.8 s     72.0 s
    C1                         93.4 s     88.3 s     72.0 s     73.7 s
    C3                         96.2 s     88.3 s     71.9 s     73.7 s

    creating a three-element vector with inlined constructors

    Some observations on these measurements:

    • -march=... is at best neutral: the measured times do not change much in general, performance improves only slightly in five out of 16 cases, and the two cases with the most significant change (over 3%) actually get slower. So for the rest of this post, -march=... will be ignored. Sorry, gentooers. ;)
    • There is no silver bullet with regard to the different implementations: A1, A2 and A3 are the fastest implementations when not inlining constructors and using -Os or -O2 (the quickest A* is ~10% faster than the quickest B*/C*). However, when inlining constructors and using -O3, the same implementations are the slowest (by 2.4%).
    • Most common release builds are still done with -O2 these days. For those, using initializer lists (A1/A2/A3) seems to have a significant edge over the alternatives, whether constructors are inlined or not. This is in contrast to the conclusions drawn from “constructor counting” (see the sketch after this list), which assumed these to be slow because of the additional calls needed.
    • Looking at the quickest implementation in each build scenario, and at those within 1.5% of it, A1 and A2 share the title by being in that group five times each.
    • With constructors inlined, everything in the loop except DoSomething() could be inlined. It seems to me that the compiler could, at least in theory, figure out that it is asked the same thing in all cases: reserve space for three ints on the heap, fill them each with 4711, make the ::std::vector<int> data structure on the stack reflect that, and then hand that to the DoSomething() function it knows nothing about. If the compiler figured that out, all implementations would take the same time. This happens neither on -O2 (where times differ by ~18% from quickest to slowest) nor on -O3 (~3.6%).
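
    For reference, “constructor counting” refers to instrumenting the special member functions with counters and comparing the totals across implementations. A minimal sketch of the idea (ours, not taken from the original posts) could look like this:

    struct CountedData {
        static int defaultCtors, intCtors, copyCtors;
        int m_a;
        CountedData() : m_a(4711) { ++defaultCtors; }
        CountedData(int a) : m_a(a+4711) { ++intCtors; }
        CountedData(const CountedData &d) : m_a(d.m_a) { ++copyCtors; }
    };
    // Counters are zero-initialized; print them after running a test variant.
    int CountedData::defaultCtors, CountedData::intCtors, CountedData::copyCtors;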

    One common mantra in application development is “trust the compiler to optimize”. The above observations show a few cracks in the foundations of that, especially if you take into account that this is all on the same version of the same compiler running on the same platform and hardware with the same STL implementation. For huge objects with expensive constructors, the constructor counting approach might still be valid. Then again, those are rarely statically initialized as a bigger bunch into a vector. For the more common scenario of smaller objects with cheap constructors, my tentative conclusion so far would be to go with A1/A2/A3: not so much because they are quickest in the most common build scenarios on my platform, but because their readability is a value of its own, while the performance picture is muddy at best.

    And hey, if you want to run the tests above on other platforms or compilers, I would be interested in results!

    Note: I did these runs for each scenario only once, thus no standard deviation is given. In general, the results seemed rather stable, but as these are wall-clock measurements, one or another might be an outlier. Caveat emptor.


    Read more
    niemeyer

    As detailed in the preliminary release of qml.v1 for Go a couple of weeks ago, my next task was to finish the improvements in its OpenGL API. Good progress has happened since then, and the new API is mostly done and available for experimentation. At the same time, there’s still work to do on polishing edges and on documenting the extensive API. This blog post aims to present the improvements made, their internal design, and also to invite help for finishing the pending details.

    Before diving into the new, let’s first have a quick look at how a Go application using OpenGL might look with qml.v0. This is an excerpt from the old painting example:

    import (
            "gopkg.in/qml.v0"
            "gopkg.in/qml.v0/gl"
    )
    
    func (r *GoRect) Paint(p *qml.Painter) {
            width := gl.Float(r.Int("width"))
            height := gl.Float(r.Int("height"))
            gl.Enable(gl.BLEND)
            // ...
    }
    

    The example imports both the qml and the gl packages, and then defines a Paint method that makes use of the GL types, functions, and constants from the gl package. It looks quite reasonable, but there are a few relevant shortcomings.

    One major issue in the current API is that it offers no means to tell even at a basic level what version of the OpenGL API is being coded against, because the available functions and constants are the complete set extracted from the gl.h header. For example, OpenGL 2.0 has GL_ALPHA and GL_ALPHA4/8/12/16 constants, but OpenGL ES 2.0 has only GL_ALPHA. This simplistic choice was a good start, but comes with a number of undesired side effects:

    • Many trivial errors that should be compile errors fail at runtime instead
    • When the code does work, the developer is not sure about which API version it is targeting
    • Symbols for unsupported API versions may not be available for linking, even if unused

    That last point also provides a hint of another general issue: portability. Every system has particularities for how to load the proper OpenGL API entry points. For example, which libraries should be linked with, where they are in the local system, which entry points they support, etc.

    So this is the stage for the improvements that are happening. Before detailing the solution, let’s have a look at the new painting example in qml.v1, that makes use of the improved API:

    import (
            "gopkg.in/qml.v1"
            "gopkg.in/qml.v1/gl/2.0"
    )
    
    func (r *GoRect) Paint(p *qml.Painter) {
            gl := GL.API(p)
            width := float32(r.Int("width"))
            height := float32(r.Int("height"))
            gl.Enable(GL.BLEND)
            // ...
    }
    

    With the new API, rather than importing a generic gl package, a version-specific gl/2.0 package is imported under the name GL. That choice of package name allows preserving familiar OpenGL terms for both the functions and the constants (gl.Enable and GL.BLEND, for example). Inside the new Paint method, the gl value obtained from GL.API holds only the functions that are defined for the specific OpenGL API version imported, and the constants in the GL package are also constrained to those available in the given version. Any improper references become build time errors.

    To support all the various OpenGL versions and profiles, there are 23 independent packages right now. These packages are of course not being hand-built. Instead, they are generated all at once by a tool that gathers information from various sources. The process can be tersely described as:

    1. A ragel-based parser processes Qt’s qopenglfunctions_*.h header files to collect version-specific functions
    2. The Khronos OpenGL Registry XML is parsed to collect version-specific constants
    3. A number of tweaks defined in the tool’s code are applied to the state
    4. Packages are generated by feeding the state to text templates

    Version-specific functions might also be extracted from the Khronos Registry, but there’s a good reason to use information from the Qt headers instead: Qt already solved the portability issue. It works in several platforms, and if somebody is using QML successfully, it means Qt is already using that system’s OpenGL capabilities. So rather than designing a new mechanism to solve the same problem, the qml package now leverages Qt for resolving all the GL function entry points and the linking against available libraries.

    Going back to the example, it also demonstrates another improvement that comes with the new API: plain types that do not carry further meaning, such as gl.Float and gl.Int, were replaced by their native counterparts, float32 and int32. Richer types such as Enum were preserved, and, as suggested by David Crawshaw, some new types were also introduced to represent entities such as programs, shaders, and buffers. The custom types are all available under the common gl/glbase package that all version-specific packages make use of.

    So this is all working and available for experimentation right now. What is left to do is almost exclusively improving the list of function tweaks, with two goals in mind. These are highlighted below, as they are areas where help would be appreciated, mainly due to the sheer footprint of the API.

    Documentation importing

    There are a few hundred functions to document, but a large number of these are variations of the same function. The previous approach was to simply link to the upstream documentation, but it would be much better to have polished documentation attached to the functions themselves. This is the new documentation for MultMatrixd, for example. For now the documentation is being imported manually, but the final process will likely consist of some automation and some manual polishing.

    Function polishing

    The standard C OpenGL API can often be translated automatically (see BindBuffer or BlendColor), but in other cases the function prototype has to be tweaked to become friendly to Go. The translation tool already has good support for defining most of these tweaks independently from the rest of the machinery. For example, the following logic changes the ShaderSource function from its standard form into something convenient in Go:

    name: "ShaderSource",
    params: paramTweaks{
            "glstring": {rename: "source", retype: "...string"},
            "length":   {remove: true},
            "count":    {remove: true},
    },
    before: `
            count := len(source)
            length := make([]int32, count)
            glstring := make([]unsafe.Pointer, count)
            for i, src := range source {
                    length[i] = int32(len(src))
                    if len(src) > 0 {
                            glstring[i] = *(*unsafe.Pointer)(unsafe.Pointer(&src))
                    } else {
                            glstring[i] = unsafe.Pointer(uintptr(0))
                    }
            }
    `,
    

    Other cases may be much simpler. The MultMatrixd tweak, for instance, simply ensures that the parameter has the proper length, and injects the documentation:

    name: "MultMatrixd",
    before: `
            if len(m) != 16 {
                    panic("parameter m must have length 16 for the 4x4 matrix")
            }
    `,
    doc: `
            multiplies the current matrix with the provided matrix.
            ...
    `,
    

    and as an even simpler example, CreateProgram is tweaked so that it returns a glbase.Program instead of the default uint32.

    name:   "CreateProgram",
    result: "glbase.Program",
    
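
    To give a feel for the calling side after these tweaks, here is a rough sketch of shader setup against the version-specific API (vertexSource is a placeholder variable, and the exact signatures may differ from the released packages):

    // Inside a Paint method, as in the painting example above.
    gl := GL.API(p)

    // CreateProgram hands back a glbase.Program rather than a bare uint32.
    program := gl.CreateProgram()

    // After the tweak above, ShaderSource takes the sources as ...string,
    // so the C-style count and length parameters are gone.
    shader := gl.CreateShader(GL.VERTEX_SHADER)
    gl.ShaderSource(shader, vertexSource)
    gl.CompileShader(shader)
    gl.AttachShader(program, shader)
    gl.LinkProgram(program)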

    That kind of polishing is where contributions would be most appreciated right now. One valid way of doing this is picking a range of functions, importing and polishing their documentation manually, and, while doing that, keeping an eye out for tweaks that should be performed on a function based on its documentation and prototype.

    If you’d like to help somehow, or just ask questions or report your experience with the new API, please join us in the project mailing list.

    Read more
    niemeyer

    As part of the ongoing work on Ubuntu Touch phones, I was invited to contribute a Go package to interface with ubuntuoneauth, a C++ and Qt library that authenticates against Ubuntu One using the system account made available by the phone owner. The details of that library and its use case are not interesting for most people right now, but the work to interface with it is a good example to talk about because, besides the result (uoneauth) being an external and independent Go package that extends the qml package, ubuntuoneauth is not a QML library, but rather a plain Qt library. Some of the callbacks even use traditional C++ types that do not inherit from QObject and have no Qt metadata, so offering that functionality from Go nicely takes a bit more work.

    What follows are some of the highlights of that integration logic, to serve as a reference for similar extensions in the future. Note that if your interest is in creating QML applications with Go, none of this is necessary and the documentation is a better place to start.

    As an initial step, the following examples demonstrate how the original C++ library and the Go package being designed are meant to be used. The first snippet contains the relevant logic taken out of the examples/signing-main.cpp file, tweaked for clarity:

    int main(int argc, char *argv[]) {
        (...)
        UbuntuOne::SigningExample *example = new UbuntuOne::SigningExample(&a);
        QTimer::singleShot(0, example, SLOT(doExample()));
        (...)
    }
    
    SigningExample::SigningExample(QObject *parent) : QObject(parent) {
        QObject::connect(&service, SIGNAL(credentialsFound(const Token&)),
                         this, SLOT(handleCredentialsFound(Token)));
        QObject::connect(&service, SIGNAL(credentialsNotFound()),
                         this, SLOT(handleCredentialsNotFound()));
        (...)
    }
    
    void SigningExample::doExample() {
        service.getCredentials();
    }
    
    void SigningExample::handleCredentialsFound(Token token) {
        QString authHeader = token.signUrl(url, QStringLiteral("GET"));
        (...)
    }
     
    void SigningExample::handleCredentialsNotFound() {
        qDebug() << "No credentials were found.";
    }
    

    The example hooks into various signals in the service, one for each possible outcome, and then calls the service’s getCredentials method to initiate the process. If successful, the credentialsFound signal is emitted with a Token value that is able to sign URLs, returning an HTTP header that can authenticate a request.

    That same process is more straightforward when using the library from Go:

    service := uoneauth.NewService(engine)
    token, err := service.Token()
    if err != nil {
            return err
    }
    signature := token.HeaderSignature("GET", url)
    

    Again, this gets a service, a token from it, and signs a URL, in a “forward” way.

    So the goal is turning the initial C++ workflow into this simpler Go API. A good next step is looking into how the NewService function is implemented:

    func NewService(engine *qml.Engine) *Service {
            s := &Service{reply: make(chan reply, 1)}
    
            qml.RunMain(func() {
                    s.obj = *qml.CommonOf(C.newSSOService(), engine)
            })
    
            runtime.SetFinalizer(s, (*Service).finalize)
    
            s.obj.On("credentialsFound", s.credentialsFound)
            s.obj.On("credentialsNotFound", s.credentialsNotFound)
            s.obj.On("twoFactorAuthRequired", s.twoFactorAuthRequired)
            s.obj.On("requestFailed", s.requestFailed)
            return s
    }
    

    NewService creates the service instance, and then asks the qml package to run some logic in the main Qt thread via RunMain. This is necessary because a number of operations in Qt, including the creation of objects, are associated with the currently running thread. Using RunMain in this case ensures that the creation of the C++ object performed by newSSOService happens in the main Qt thread (the “GUI thread”).

    Then, the address of the C++ UbuntuOne::SSOService type is handed to CommonOf to obtain a Common value that implements all the common logic supported by C++ types that inherit from QObject. This is an unsafe operation as there’s no way for CommonOf to guarantee that the provided address indeed points to a C++ value with a type that inherits from QObject, so the call site must necessarily import the unsafe package to provide the unsafe.Pointer parameter. That’s not a problem in this context, though, since such extension packages are necessarily dealing with unsafe logic in either case.

    The obtained Common value is then assigned to the service’s obj field. In most cases, that value is instead assigned to an anonymous Common field, as done in qml.Window for example. Doing so means qml.Window values implement the qml.Object interface, and may be manipulated as a generic object. For the new Service type, though, the fact that this is a generic object won’t be disclosed for the moment, and instead a simpler API will be offered.

    Following the function logic further, a finalizer is then registered to ensure the C++ value gets deallocated if the developer forgets to Close the service explicitly. When doing that, it’s important to ensure the Close method drops the finalizer when called, not only to facilitate the garbage collection of the object, but also to avoid deallocating the same value twice.
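
    As an illustration, a Close method following that advice might look roughly like the sketch below. Note that deleteSSOService is a hypothetical helper standing in for whatever C function the package actually uses to delete the C++ value:

    func (s *Service) Close() {
            // Drop the finalizer first so the C++ value cannot be deallocated twice.
            runtime.SetFinalizer(s, nil)
            qml.RunMain(func() {
                    // Hypothetical C helper that deletes the underlying C++ object.
                    C.deleteSSOService(unsafe.Pointer(s.obj.Addr()))
            })
    }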

    The next four lines in the function should be straightforward: they register methods of the service to be called when the relevant signals are emitted. Here is the implementation of two of these methods:

    func (s *Service) credentialsFound(token *Token) {
            s.sendReply(reply{token: token})
    }
    
    func (s *Service) credentialsNotFound() {
            s.sendReply(reply{err: ErrNoCreds})
    }
    
    func (s *Service) sendReply(r reply) {
            select {
            case s.reply <- r:
            default:
                    panic("internal error: multiple results received")
            }
    }
    

    Handling the signals consists of just sending the reply over the channel to whoever initiated the request. The select statement in sendReply just ensures that the invariant of having a reply per request is not broken without being noticed, as that would require a slightly different design.

    There’s one more point worth observing in this logic: the token value received as a parameter in credentialsFound was already converted into the local Token type. In most cases, this is unnecessary as the parameter is directly useful as a qml.Object or as another native type (int, etc), but in this case UbuntuOne::Token is a plain C++ type that does not inherit from QObject, so the default signal parameter that would arrive in the Go method has only a type name and the value address.

    Instead of taking the plain value, it is turned into a more useful one by registering a converter with the qml package:

    func convertToken(engine *qml.Engine, obj qml.Object) interface{} {
            // Copy as the one held by obj is passed by reference.
            addr := unsafe.Pointer(obj.Property("plainAddr").(uintptr))
            token := &Token{C.tokenCopy(addr)}
            runtime.SetFinalizer(token, (*Token).finalize)
            return token
    }
    
    func init() {
            qml.RegisterConverter("Token", convertToken)
    }
    

    Given that setup, the Service.Token method may simply call the getCredentials method on the underlying UbuntuOne::SSOService object to fire a request, and block waiting for a reply:

    func (s *Service) Token() (*Token, error) {
            s.mu.Lock()
            qml.RunMain(func() {
                    C.ssoServiceGetCredentials(unsafe.Pointer(s.obj.Addr()))
            })
            reply := <-s.reply
            s.mu.Unlock()
            return reply.token, reply.err
    }
    

    The lock ensures that a second request won’t take place before the first one is done, forcing the correct sequencing of replies. Given the current logic, this isn’t strictly necessary since all requests are equivalent, but this will remain correct even if other methods from SSOService are added to this interface.

    The returned Token value may then be used to sign URLs by simply calling the respective underlying method:

    func (t *Token) HeaderSignature(method, url string) string {
            cmethod, curl := C.CString(method), C.CString(url)
            cheader := C.tokenSignURL(t.addr, cmethod, curl, 0)
            defer freeCStrings(cmethod, curl, cheader)
            return C.GoString(cheader)
    }
    

    No care has to be taken about using qml.RunMain in this case, because UbuntuOne::Token is a simple C++ type that does not interact with the Qt machinery.

    This completes the journey of creating a package that provides access to the ubuntuoneauth library from Go code. In many cases it’s a better idea to simply rewrite the logic in Go, but there will be situations similar to this library, where either rewriting would take more time than reasonable, or perhaps delegating the maintenance and stabilization of the underlying logic to a different team is the best thing to do. In those cases, an approach such as the one exposed here can easily solve the problem.

    Read more
    niemeyer

    As recently announced, my latest endeavor at Canonical is to enable graphical client-side development with the Go language via a new qml Go package that integrates the language with Qt’s QML framework.

    The QML framework solves the problem of designing graphic applications via a language that offers a convenient mix of declarative and procedural features. As a very simple example, if the following QML content is loaded by itself, it will draw a square centered inside a window:

    import QtQuick 2.0
    
    Rectangle {
            width: 640
            height: 280
            color: "black"
    
            Rectangle {
                    width: 250; height: 250
                    anchors.centerIn: parent
                    color: "red"
    
                    MouseArea {
                            anchors.fill: parent
                            onClicked: console.log("Stop poking me!")
                    }
            }
    }
    

    As expected, it would look similar to:

    Centralized Rectangle

    The red rectangle will remain centered as the window is resized, and will print a message every time it is clicked. The clicked signal was hooked into logic implemented in JavaScript in this case, but it might as well have been hooked into custom Go logic with the same ease using the qml package. With very minor changes, it could also be something much more interesting than a red rectangle, such as a modern web browser:

    import QtQuick 2.0
    import QtWebKit 3.0
    
    Rectangle {
            width: 640
            height: 280
            color: "black"
    
            WebView {
                    width: 250; height: 250
                    anchors.centerIn: parent
                    url: "http://golang.org"
            }
    }
    

    The result would be:

    Web Page Rectangle

    It’s worth noting that this wasn’t just a screenshot of a web page, but rather a tiny fully functional web browser accessing a web page, easily embedded into the QML application to satisfy whatever browsey needs the author might have.

    Being able to leverage all these existing building blocks so comfortably is what makes a graphics platform attractive to develop in, and is what inspired the ongoing effort to have that platform working under the Go language. As we can observe in some of the work published, that side of things seems to be going well.

    Now, the next level up is to enable people to create such building blocks without leaving the Go language. The initial step towards that is already committed to the code repository. It enables Go types to be created and seamlessly integrated into the QML language. For example, consider this simple Go type:

    type GoType struct {
            Text string
    }
    
    func (v *GoType) OnTextChanged() {
            fmt.Println("Text changed...")
    }
    

    If this type is made available to QML content via the qml.RegisterTypes function, it may then be used as any other native type. This would work as a QML file, for instance:

    import GoExtensions 1.0
    
    GoType {
            text: "Have you signed up for GopherCon yet?"
    }
    

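    The registration itself could look roughly like this sketch, based on the qml package’s TypeSpec mechanism (exact details may differ in the released API):

    func init() {
            qml.RegisterTypes("GoExtensions", 1, 0, []qml.TypeSpec{{
                    // The QML type name is taken from Init's first parameter type.
                    Init: func(v *GoType, obj qml.Object) {},
            }})
    }
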
    There’s a relevant detail, though: this type has no visible content by itself. This means that displaying something must be done via interactions with other QML items, or via external systems (dbus, for example).

    Solving that problem has been one of my objectives in the last month, and although it’s not yet publicly available, the work is in a good-enough state that I feel comfortable talking about it.

    As described, the main goal is enabling these custom types to paint. This will be achieved by exposing a Go package that offers an OpenGL API, and defining methods that the custom types have to implement for being able to render at appropriate times. Although the details aren’t finalized, the current draft can run familiar OpenGL code similar to the following:

    func (v *GoType) Paint(p *qml.Painter) {
            obj := p.Object()
            width := gl.Float(obj.Int("width"))
            height := gl.Float(obj.Int("height"))
    
            gl.Enable(gl.BLEND)
            gl.BlendFunc(gl.SRC_ALPHA, gl.ONE_MINUS_SRC_ALPHA);
            gl.Color4f(1.0, 1.0, 1.0, 0.8)
            gl.Begin(gl.QUADS)
            gl.Vertex2f(0, 0)
            gl.Vertex2f(width, 0)
            gl.Vertex2f(width, height)
            gl.Vertex2f(0, height)
            gl.End()
    
            gl.LineWidth(2.5)
            gl.Color4f(0.0, 0.0, 0.0, 1.0)
            gl.Begin(gl.LINES)
            gl.Vertex2f(0, 0)
            gl.Vertex2f(width, height)
            gl.Vertex2f(width, 0)
            gl.Vertex2f(0, height)
            gl.End()
    }
    

    One interesting point to realize is that, again, this isn’t just rendering into a lonely OpenGL context. The rendered content goes into a framebuffer that lives within the overall QML scene graph, and has the same integration capabilities as every other QML item.

    In the following animated image, for example, the white square was generated with the above GL code, and was rendered next to other native items with this QML source:

    Go + OpenGL Example

    Note the smooth blending and proper overlapping established in the scene graph based on the ordering of elements in the QML source. The animation was also driven by QML without special handling of the content rendered from Go.

    Another relevant point in terms of the integration with Go is that the GL code demonstrated above is considered old-school these days. The modern version uses fewer calls, making better use of the graphics card memory by transferring, tracking, and handling data such as vertexes in contiguous arrays. This reduces the impact of the cross-context calls that depart and rejoin the Go runtime, and can unlock interesting use cases that the overhead might otherwise prevent.

    In terms of availability of these features, we’re about to enter a holiday season, so they should be polished enough for a first public review at some point in January. Looking forward to it.

    UPDATE (2014-01-27): These features are now publicly available. The painting example demonstrates the exact scenario pictured above.

    UPDATE (2014-02-18): New screencast demonstrates some of the OpenGL features of Go QML.

    Read more
    Jussi Pakkanen

    One of the main ways of reducing code complexity (and thus compile times) in C/C++ is forward declaration. The most basic form of it is this:

    class Foo;

    This tells the compiler that there will be a class called Foo but it does not specify it in more detail. With this declaration you can’t deal with Foo objects themselves but you can form pointers and references to them.

    Typically you would use forward declarations in this manner.

    class Bar;
    
    class Foo {
      void something();
      void method1(Bar *b);
      void method2(Bar &b);
    };

    Correspondingly if you want to pass the objects themselves, you would typically do something like this.

    #include"Bar.h"
    
    class Foo {
      void something();
      void method1(Bar b);
      Bar method2();
    };

    This makes sense because you need to know the binary layout of Bar in order to pass it properly to and from a method. Thus a forward declaration seems not to be enough; you must include the full header, otherwise you can’t use the methods of Foo.

    But what if some class does not use either of the methods that deal with Bars? What if it only calls the method something()? It would still need to parse all of Bar.h (and everything it #includes) even though it never uses Bar objects. This seems inefficient.

    It turns out that including Bar.h is not necessary, and you can instead do this:

    class Bar;
    
    class Foo {
      void something();
      void method1(Bar b);
      Bar method2();
    };

    You can define functions taking or returning full objects with forward declarations just fine. The catch is that those users of Foo that use the Bar methods need to include Bar.h themselves. Correspondingly, those that do not deal with Bar objects themselves never need to include Bar.h, even indirectly. If you ever find out that they do, it is proof that your #includes are not minimal. Fixing these include chains will make your source files more isolated and decrease compile times, sometimes dramatically.

    You only need to #include the full definition of Bar if you need:

    • to use its services (constructors, methods, constants, etc)
    • to know its memory layout

    In practice the latter means that you need to either call or implement a function that takes a Bar object rather than a pointer or reference to it.

    For other uses a forward declaration is sufficient.
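
    To make that concrete, here is a small sketch (file and function names are illustrative, and Foo’s methods are assumed to be public) of two users of Foo, only one of which needs the full definition of Bar:

    // logger.cpp: only calls something(), so Foo.h's forward declaration is enough.
    #include"Foo.h"
    
    void logIt(Foo &f) {
      f.something();
    }
    
    // processor.cpp: passes Bar objects by value, so it must include Bar.h too.
    #include"Foo.h"
    #include"Bar.h"
    
    void process(Foo &f) {
      Bar b = f.method2();
      f.method1(b);
    }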

    Post scriptum

    The discussion above holds even if Foo and Bar are templates, but making template classes as clean can be a lot harder and may in some instances be impossible. You should still try to minimize header includes as much as possible.

    Read more
    Jussi Pakkanen

    With the release of C++11 something quite extraordinary has happened. Its focus on usable libraries, value types and other niceties has turned C++, conceptually, into a scripting language.

    This seems like a weird statement to make, so let’s define exactly what we mean by that. Scripting languages differ from classical compiled languages such as C in the following ways:

    • no need to manually manage memory
    • expressive syntax, complex functionality can be implemented in just a couple of lines of code
    • powerful string manipulation functions
    • large standard library

    As of C++11 all these hold true for C++. Let’s examine this with a simple example. Suppose we want to write a program that reads all lines from a file and writes them in a different file in sorted order. This is classical scripting language territory. In C++11 this code would look something like the following (ignoring error cases such as missing input arguments).

    #include<string>
    #include<vector>
    #include<algorithm>
    #include<fstream>
    
    using namespace std;
    
    int main(int argc, char **argv) {
      ifstream ifile(argv[1]);
      ofstream ofile(argv[2]);
      string line;
      vector<string> data;
      while(getline(ifile, line)) {
        data.push_back(line);
      }
      sort(data.begin(), data.end());
      for(const auto &i : data) {
        ofile << i << std::endl;
      }
      return 0;
    }

    That is some tightly packed code. Ignoring include boilerplate and the like leaves us with roughly ten lines of code. If you were to do this with plain C using only its standard library, merely implementing the getline functionality reliably would take more lines of code. Not to mention it would be tricky to get right.

    Other benefits include:

    • every single line of code is clear, understandable and expressive
    • memory leaks cannot happen
    • it could easily be reworked into a library function
    • smaller memory footprint due to not needing a VM
    • compile time with -O3 is roughly the same as Python VM startup and has to be done only once
    • faster than any non-JITted scripting language

    Now, obviously, this won’t mean that scripting languages will disappear any time soon (you can have my Python when you pry it from my cold, dead hands). What it does do is indicate that C++ is quite usable in fields one traditionally has not expected it to be.

    Read more
    Jussi Pakkanen

    Pimpl is a common idiom in C++. It means hiding the implementation details of a class with a construct that looks like this:

    class pimpl;
    
    class Thing {
    private:
      pimpl *p;
    public:
     ...
    };

    This cuts down on compilation time because you don’t have to #include all the headers required for the implementation of this class. The downside is that p needs to be dynamically allocated in the constructor, which means a call to new. For frequently constructed objects this can be slow and lead to memory fragmentation.

    Getting rid of the allocation

    It turns out that you can get rid of the dynamic allocation with a little trickery. The basic approach is to reserve space in the parent object with, say, a char array. We can then construct the pimpl object there with placement new and delete it by calling the destructor.

    A header file for this kind of a class looks something like this:

    #ifndef PIMPLDEMO_H
    #define PIMPLDEMO_H
    
    #define IMPL_SIZE 24
    
    class PimplDemo {
    private:
      char data[IMPL_SIZE];
    
     public:
      PimplDemo();
      ~PimplDemo();
    
      int getNumber() const;
    };
    
    #endif

    IMPL_SIZE is the size of the pimpl object. It needs to be manually determined. Note that the size may be different on different platforms.
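
    One caveat worth adding to the layout above: a plain char array makes no particular alignment promise, so the buffer should also be declared with an alignment suitable for the hidden type. A sketch of a safer declaration using C++11’s alignas (our addition, not part of the original example):

    #include<cstddef>
    
    class PimplDemo {
    private:
      // std::max_align_t has the strictest fundamental alignment, which is
      // sufficient for a priv object holding a std::vector<int>.
      alignas(std::max_align_t) char data[IMPL_SIZE];
    
    public:
      PimplDemo();
      ~PimplDemo();
    
      int getNumber() const;
    };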

    The corresponding implementation looks like this.

    #include"pimpldemo.h"
    #include<vector>
    
    using namespace std;
    
    class priv {
    public:
      vector<int> foo;
    };
    
    #define P_DEF priv *p = reinterpret_cast<priv*>(data)
    #define P_CONST_DEF const priv *p = reinterpret_cast<const priv*>(data)
    
    PimplDemo::PimplDemo() {
      static_assert(sizeof(priv) == sizeof(data), "Pimpl array has wrong size.");
      P_DEF;
      new(p) priv;
      p->foo.push_back(42); // Just for show.
    }
    
    PimplDemo::~PimplDemo() {
      P_DEF;
      p->~priv();
    }
    
    int PimplDemo::getNumber() const {
      P_CONST_DEF;
      return (int)p->foo.size();
    }

    Here we define two macros that create a variable for accessing the pimpl. At this point we can use it just as if it were defined in the traditional way. Note the static assert that checks, at compile time, that the space we have reserved for the pimpl is the same as what the pimpl actually requires.

    We can test that it works with a sample application.

    #include<cstdio>
    #include<vector>
    #include"pimpldemo.h"
    
    int main(int argc, char **argv) {
      PimplDemo p;
      printf("Should be 1: %d\n", p.getNumber());
      return 0;
    }

    The output is 1, as we would expect. The program is also Valgrind-clean, so it works just the way we want it to.

    When should I use this technique?

    Never!

    Well, ok, never is probably a bit too strong. However, this technique should be used very sparingly. Most of the time the new call is insignificant. The downside of this approach is that it adds complexity to the code. You also have to keep the backing array size up to date as you change the contents of the pimpl.

    You should only use this approach if you have an object in the hot path of your application and you really need to squeeze the last bit of efficiency out of your code. As a rough guide, only about 1 in every 100 classes should ever need this. And do remember to measure the difference before and after. If there is no noticeable improvement, don’t do it.

    Read more
    niemeyer

    Here is a small programming brain teaser for the weekend:

    Assume uf is an unsigned integer with 64 bits that holds the IEEE-754 representation for a binary floating point number of that size.

    The questions are:

    1. How to tell if uf represents an integer number?

    2. How to serialize the absolute value of such an integer number in the minimum number of bytes possible, using big-endian ordering and the 8th bit as a continuation flag? For example, float64(1<<70 + 3<<21) serializes as:

    "\x81\x80\x80\x80\x80\x80\x80\x83\x80\x80\x00"
    

    The background for this problem is that the current draft of the strepr specification mentions that serialization. Some languages, such as Python and Ruby, implement transparent arbitrary precision integers, and that makes implementing the specification easier.

    For example, here is a simple Python interactive session that arrives at the result provided above exploring the native integer representation.

    >>> f = float((1<<70) + (3<<21))
    >>> v = int(f)
    >>> l = [v&0x7f]
    >>> v >>= 7
    >>> while v > 0:
    ...     l.append(0x80 | (v&0x7f))
    ...     v >>= 7
    ... 
    >>> l.reverse()
    >>> "".join("%02x" % i for i in l)
    '8180808080808083808000'
    

    Python makes the procedure simpler because it is internally converting the float into an integer of appropriate precision via standard C functions, and then offering bit operations on the resulting value.

    The suggested brain teaser can be efficiently solved using just the IEEE-754 representation, though, and it’s relatively easy because the problem is being constrained to the integer space.

    A link to an implementation will be provided next week.

    UPDATE: The logic is now available as part of the reference implementation of strepr.

    Read more
    niemeyer

    Note: This is a candidate version of the specification. This note will be removed once v1 is closed, and any changes will be described at the end. Please get in touch if you’re implementing it.

    Introduction

    This specification defines strepr, a stable representation that enables computing hashes and cryptographic signatures out of a defined set of composite values that is commonly found across a number of languages and applications.

    Although the defined representation is a serialization format, it isn’t meant to be used as a traditional one. It may not be seen entirely in memory at once, or written to disk, or sent across the network. Its role is specifically in aiding the generation of hashes and signatures for values that are serialized via other means (JSON, BSON, YAML, HTTP headers or query parameters, configuration files, etc).

    The format is designed with the following principles in mind:

    Understandable — The representation must be easy to understand to increase the chances of it being implemented correctly.

    Portable — The defined logic works properly when the data is being transferred across different platforms and implementations, independently from the choice of protocol and serialization implementation.

    Unambiguous — As a natural requirement for producing stable hashes, there is a single way to process any supported value being held in the native form of the host language.

    Meaning-oriented — The stable representation holds the meaning of the data being transferred, not its type. For example, the number 7 must be represented in the same way whether it’s being held in a float64 or in an uint16.


    Supported values

    The following values are supported:

    • nil: the nil/null/none singleton
    • bool: the true and false singletons
    • string: raw sequence of bytes
    • integers: positive, zero, and negative integer numbers
    • floats: IEEE754 binary floating point numbers
    • list: sequence of values
    • map: associative value→value pairs


    Representation

    nil = 'z'

    The nil/null/none singleton is represented by the single byte 'z' (0x7a).

    bool = 't' / 'f'

    The true and false singletons are represented by the bytes 't' (0x74) and 'f' (0x66), respectively.

    unsigned integer = 'p' <value>

    Positive and zero integers are represented by the byte 'p' (0x70) followed by the variable-length encoding of the number.

    For example, the number 131 is always represented as {0x70, 0x81, 0x03}, independently from the type that holds it in the host language.

    negative integer = 'n' <absolute value>

    Negative integers are represented by the byte 'n' (0x6e) followed by the variable-length encoding of the absolute value of the number.

    For example, the number -131 is always represented as {0x6e, 0x81, 0x03}, independently from the type that holds it in the host language.

    string = 's' <num bytes> <bytes>

    Strings are represented by the byte 's' (0x73) followed by the variable-length encoding of the number of bytes in the string, followed by the specified number of raw bytes. If the string holds a list of Unicode code points, the raw bytes must contain their UTF-8 encoding.

    For example, the string hi is represented as {0x73, 0x02, 'h', 'i'}.

    Due to the complexity involved in Unicode normalization, it is not required for the implementation of this specification. Consequently, Unicode strings that would be equal if normalized may have different stable representations.

    binary float = 'd' <binary64>

    32-bit or 64-bit IEEE754 binary floating point numbers that are not holding integers are represented by the byte 'd' (0x64) followed by the big-endian 64-bit IEEE754 binary floating point encoding of the number.

    There are two exceptions to that rule:

    1. If the floating point value is holding a NaN, it must necessarily be encoded by the following sequence of bytes: {0x64, 0x7f, 0xf8, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}. This ensures all NaN values have a single representation.

    2. If the floating point value is holding an integer number it must instead be encoded as an unsigned or negative integer, as appropriate. Floating point values that hold integer numbers are defined as those where floor(v) == v && abs(v) != ∞.

    For example, the value 1.1 is represented as {0x64, 0x3f, 0xf1, 0x99, 0x99, 0x99, 0x99, 0x99, 0x9a}, but the value 1.0 is represented as {0x70, 0x01}, and -0.0 is represented as {0x70, 0x00}.

    This distinction means all supported numbers have a single representation, independently from the data type used by the host language and serialization format.
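
    For illustration, the integer-or-float decision can be sketched in Go as follows (the function name is ours, not part of the specification; it relies on the standard math package):

    import "math"
    
    // isIntegerFloat reports whether v holds an integer number, following the
    // floor(v) == v && abs(v) != ∞ rule above. NaN fails the floor comparison,
    // so it correctly falls through to the binary float representation.
    func isIntegerFloat(v float64) bool {
            return !math.IsInf(v, 0) && math.Floor(v) == v
    }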

    list = 'l' <num items> [<item> ...]

    Lists of values are represented by the byte 'l' (0x6c), followed by the variable-length encoding of the number of items in the list, followed by the stable representation of each item in the list in the original order.

    For example, the value [131, -131] is represented as {0x6c, 0x02, 0x70, 0x81, 0x03, 0x6e, 0x81, 0x03}.

    map = 'm' <num pairs> [<item key> <item value>  ...]

    Associative maps of values are represented by the byte 'm' (0x6d) followed by the variable-length encoding of the number of pairs in the map, followed by an ordered sequence of the stable representation of each key and value in the map. The pairs must be sorted so that the stable representation of the keys is in ascending lexicographical order. A map must not have multiple keys with the same representation.

    For example, the map {"a": 4, 5: "b"} is always represented as {0x6d, 0x02, 0x70, 0x05, 0x73, 0x01, 'b', 0x73, 0x01, 'a', 0x70, 0x04}.


    Variable-length encoding

    Integers are variable-length encoded so that they can be represented in short space and with unbounded size. In an encoded number, the last byte holds the 7 least significant bits of the unsigned value, and zero as the eighth bit. If there are remaining non-zero bits, the previous byte holds the next 7 bits, and the eighth bit is set to flag the continuation to the next byte. The process continues while there are non-zero bits remaining. The most significant bits end up in the first byte of the encoded value, which must necessarily not be 0x80.

    For example, the number 128 is variable-length encoded as {0x81, 0x00}.
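
    A small Go sketch of that encoding (the helper name is ours; see the reference implementation below for the authoritative logic):

    // encodeUint produces the variable-length encoding described above:
    // 7 bits per byte, most significant bits first, with the eighth bit set
    // on every byte except the last as a continuation flag.
    func encodeUint(v uint64) []byte {
            buf := []byte{byte(v & 0x7f)}
            for v >>= 7; v > 0; v >>= 7 {
                    buf = append([]byte{byte(0x80 | v&0x7f)}, buf...)
            }
            return buf
    }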


    Reference implementation

    A reference implementation is available, including a test suite which should be considered when implementing the specification.


    Changes

    draft1 → draft2

    • Enforce the use of UTF-8 for Unicode strings and explain why normalization is being left out.
    • Enforce a single NaN representation for floats.
    • Explain that map key uniqueness refers to the representation.
    • Don’t claim the specification is easy to implement; floats require attention.
    • Mention reference implementation.

    Read more
    niemeyer

    A few years ago, when I started pondering about the possibility of porting juju to the Go language, one of the first pieces of the puzzle that were put in place was goyaml: a Go package to parse and serialize a yaml document. This was just an experiment and, as a sane route to get started, a Go layer that does all the language-specific handling was written on top of the libyaml C scanner, parser, and serializer library.

    This was a good initial plan, but for a number of reasons the end goal was always to have a pure Go implementation. Having a C layer in a Go program slows down builds significantly due to the time taken to build the C code, makes compiling on other platforms and cross-compiling harder, has certain runtime penalties, and also forces the application to drop the memory safety guarantees offered by Go.

    For these reasons, over the last couple of weeks I took a few hours a day to port the C backend to Go. The total time, considering full time work days, would be equivalent to about a week worth of work.

    The work started on the scanner and parser side of the library. This took most of the time, not only because it encompassed more than half of the code base, but also because the shared logic had to be ported too, and there was a need to understand which patterns were used in the old code and how they would be converted across in a reasonable way.

    The whole scanner and parser plus header files, or around 5000 code lines of C, were ported over in a single shot without intermediate runs. To steer the process in a sane direction, gofmt was called often to reformat the converted code, and then the project was compiled every once in a while to make sure that the pieces were hanging together properly enough.

    It’s worth highlighting how useful gofmt was in that process. The C code was converted in the most convenient way to type it, and then gofmt would quickly put it all together in a familiar form for analysis. Not rarely, it would also point out trivial syntactic issues. A double win.

    After the scanner and parser were finally converted completely, the pre-existing Go unmarshaling logic was shifted to the new pure implementation, and the reading side of the test suite could run as-is. Naturally, though, it didn’t work out of the box.

    To quickly pick up the errors in the new implementation, the C logic and the Go port were put side-by-side to run the same tests, and tracing was introduced in strategic points of the scanner and parser. With that, it was easy to spot where they diverged and pinpoint the human errors.

    It took about two hours to get the full suite to run successfully, with a handful of bugs uncovered. Out of curiosity, the issues were:

    • An improperly dropped parenthesis affected the precedence of an expression
    • A slice was being iterated with copying semantics where a reference was necessary
    • A pointer arithmetic conversion missed the base where there was base+offset addressing
    • An inner scoped variable improperly shadowed the outer scope

    The same process of porting and test-fixing was then repeated on the serializing side of the project, in a much shorter time frame for the reasons cited.

    The resulting code isn’t yet idiomatic Go. There are several signs in it that it was ported over from C: the name conventions, the use of custom solutions for buffering and reader/writer abstractions, the excessive copying of data due to the need to track data ownership so that the simple deallocating destructors don’t double-free, etc. It has also been deoptimized by changes such as the removal of macros (in many cases by inlining them) and the direct expansion of large unions, which causes some core objects to grow significantly.

    At this point, though, it’s easy to gradually move the code base towards the common idiom in small increments and as time permits, and cleaning up those artifacts that were left behind.

    This code will be made public over the next few days via a new goyaml release. Meanwhile, some quick facts about the process and outcome follows.

    Lines of code

    According to cloc, there was a total of 7070 lines of C code in .c and .h files. Of those, 6727 were ported, and the remaining 342 made up 12 functions that were left unconverted as unnecessary right now. Those 6727 lines of C became 5039 lines of Go code in a mostly one-to-one dumb translation.

    That difference comes mainly from garbage collection, lack of forward declarations, standard helpers such as append, range-based for loops, first class slice type with length and capacity, internal OOM handling, and so on.

    Future work can easily increase the difference further by replacing some of the ported logic with more sensible options available in Go, such as standard abstractions for readers and writers, buffered writing support as available in the standard library, etc.

    Code clarity and safety

    In the specific context of the work done, which is of a scanner, parser and serializer, the slice abstraction is responsible for noticeable clarity gains in the code when compared to the equivalent logic based on pointer arithmetic. It also gives a much more comforting guarantee of correctness of the written code due to bounds checking.

    Performance

    While curious, this shouldn’t be taken as a performance comparison between the two languages, as it is comparing a fine-tuned C implementation with something that is worse than a direct one-to-one port: not only has no time at all been spent on preventing waste, but the original logic was deoptimized by changes such as the removal of inlining macros and the expansion of large unions. There are many obvious changes to be done to improve performance.

    With that out of the way, in a simple decoding benchmark the C-backed decoder runs in about 37% of the time taken by the out-of-the-box deoptimized Go port.

    Output size

    The previous goyaml.a Go package file had 1463kb. The new one has 1016kb. This difference includes glue code generated for the integration.

    Considering only the .c and .h files involved in the port, the C object code generated with the standard flags used by the go build tool (-g -O2) sums up to 789kb. The equivalent Go code with the standard settings compiles to 664kb. The 12 functions not ported are also part of that difference, so the difference is pretty much negligible.

    Build time

    Building the 8 .c files alone takes 3.6 seconds with the standard flags used by the go build tool (-g -O2). After the port, building the entire Go project with the standard settings takes 0.3 seconds.

    Mechanical changes

    Many of the mechanical changes were done using regular expressions. Excluding the trivial ones, about a dozen regular expressions were used to swap variable and type names, drop parentheses, place brackets in the right locations, convert function declarations, and so on.
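
    As a purely hypothetical illustration of the kind of substitution involved (not one of the actual expressions used in the port), a single regular expression can strip the parentheses that C requires around if conditions:

    // Rewrites C's "if (v > 0) {" into Go's "if v > 0 {".
    re := regexp.MustCompile(`if \((.*)\) \{`)
    line := re.ReplaceAllString(`if (v > 0) {`, `if $1 {`)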

    Read more