Canonical Voices

Posts tagged with 'performance'

Colin Ian King

Recently I was playing around with CPU loading and trying to estimate the number of compute operations being executed on my machine. In particular, I was interested to see how many instructions per cycle I was achieving and how many stall cycles I was hitting on the more demanding instructions. Fortunately, perf stat allows one to gather detailed processor statistics to measure this.

In my first test, I wanted to see how the Intel rdrand instruction performed with 2 CPUs loaded (each with a hyper-thread):


$ perf stat stress-ng --rdrand 4 -t 60 --times
stress-ng: info: [7762] dispatching hogs: 4 rdrand
stress-ng: info: [7762] successful run completed in 60.00s
stress-ng: info: [7762] for a 60.00s run time:
stress-ng: info: [7762] 240.01s available CPU time
stress-ng: info: [7762] 231.05s user time ( 96.27%)
stress-ng: info: [7762] 0.11s system time ( 0.05%)
stress-ng: info: [7762] 231.16s total time ( 96.31%)

Performance counter stats for 'stress-ng --rdrand 4 -t 60 --times':

231161.945062 task-clock (msec) # 3.852 CPUs utilized
18,450 context-switches # 0.080 K/sec
92 cpu-migrations # 0.000 K/sec
821 page-faults # 0.004 K/sec
667,745,260,420 cycles # 2.889 GHz
646,960,295,083 stalled-cycles-frontend # 96.89% frontend cycles idle
<not supported> stalled-cycles-backend
13,702,533,103 instructions # 0.02 insns per cycle
# 47.21 stalled cycles per insn
6,549,840,185 branches # 28.334 M/sec
2,352,175 branch-misses # 0.04% of all branches

60.006455711 seconds time elapsed

stress-ng's rdrand test performs a 64 bit rdrand read, looping until the data is ready, and repeats this 32 times in an unrolled loop. Perf stat shows that each rdrand + retry sequence consumes about 47 stall cycles on average, suggesting that rdrand is mostly just waiting for the PRNG block to produce random data.
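
For reference, a minimal sketch of that kind of read-and-retry loop (not stress-ng's actual source) looks something like this in GNU C inline assembler; stress-ng unrolls 32 such reads per loop iteration:

#include <stdint.h>

/* Read one 64 bit random value with rdrand, retrying until the
 * carry flag indicates that valid data was returned. */
static inline uint64_t rdrand64(void)
{
        uint64_t v;
        unsigned char ok;

        do {
                __asm__ __volatile__("rdrand %0; setc %1"
                                     : "=r" (v), "=qm" (ok)
                                     :
                                     : "cc");
        } while (!ok);

        return v;
}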

My next experiment was to run the stress-ng ackermann stressor; this performs a lot of recursion, so one should see a large amount of branching.
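
For those unfamiliar with it, the Ackermann function is doubly recursive; a sketch of the kind of function being computed follows (the actual arguments stress-ng uses are not shown here and are an assumption left out):

/* Classic two-argument Ackermann function; it grows explosively and
 * produces deep recursion, hence the heavy branching. */
static unsigned long ackermann(unsigned long m, unsigned long n)
{
        if (m == 0)
                return n + 1;
        if (n == 0)
                return ackermann(m - 1, 1);
        return ackermann(m - 1, ackermann(m, n - 1));
}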

$ perf stat stress-ng --cpu 4 --cpu-method ackermann -t 60 --times
stress-ng: info: [7796] dispatching hogs: 4 cpu
stress-ng: info: [7796] successful run completed in 60.03s
stress-ng: info: [7796] for a 60.03s run time:
stress-ng: info: [7796] 240.12s available CPU time
stress-ng: info: [7796] 226.69s user time ( 94.41%)
stress-ng: info: [7796] 0.26s system time ( 0.11%)
stress-ng: info: [7796] 226.95s total time ( 94.52%)

Performance counter stats for 'stress-ng --cpu 4 --cpu-method ackermann -t 60 --times':

226928.278602 task-clock (msec) # 3.780 CPUs utilized
21,752 context-switches # 0.096 K/sec
127 cpu-migrations # 0.001 K/sec
927 page-faults # 0.004 K/sec
594,117,596,619 cycles # 2.618 GHz
298,809,437,018 stalled-cycles-frontend # 50.29% frontend cycles idle
<not supported> stalled-cycles-backend
845,746,011,976 instructions # 1.42 insns per cycle
# 0.35 stalled cycles per insn
298,414,546,095 branches # 1315.017 M/sec
95,739,331 branch-misses # 0.03% of all branches

60.032115099 seconds time elapsed

So about 35% of the instructions executed are branches, and we're getting about 1.42 instructions per cycle with few stall cycles, so the code is most probably executing inside the instruction cache, which isn't surprising because the test is rather small.

My final experiment was to measure the stall cycles when performing complex long double floating point math operations, again with stress-ng.
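
To give a feel for the kind of arithmetic being exercised, here is an illustrative sketch (not stress-ng's actual inner loop) of repeated complex long double multiplies and divides:

#include <complex.h>

/* Illustrative hot loop: complex multiply and divide on long doubles.
 * The divide in particular is an expensive, high-latency operation. */
static long double complex hot_loop(long double complex a,
                                    long double complex b, int iters)
{
        long double complex sum = 0.0L;
        int i;

        for (i = 0; i < iters; i++) {
                sum += a * b;
                sum += a / b;
        }
        return sum;
}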

$ perf stat stress-ng --cpu 4 --cpu-method clongdouble -t 60 --times
stress-ng: info: [7854] dispatching hogs: 4 cpu
stress-ng: info: [7854] successful run completed in 60.00s
stress-ng: info: [7854] for a 60.00s run time:
stress-ng: info: [7854] 240.00s available CPU time
stress-ng: info: [7854] 225.15s user time ( 93.81%)
stress-ng: info: [7854] 0.44s system time ( 0.18%)
stress-ng: info: [7854] 225.59s total time ( 93.99%)

Performance counter stats for 'stress-ng --cpu 4 --cpu-method clongdouble -t 60 --times':

225578.329426 task-clock (msec) # 3.757 CPUs utilized
38,443 context-switches # 0.170 K/sec
96 cpu-migrations # 0.000 K/sec
845 page-faults # 0.004 K/sec
651,620,307,394 cycles # 2.889 GHz
521,346,311,902 stalled-cycles-frontend # 80.01% frontend cycles idle
<not supported> stalled-cycles-backend
17,079,721,567 instructions # 0.03 insns per cycle
# 30.52 stalled cycles per insn
2,903,757,437 branches # 12.873 M/sec
52,844,177 branch-misses # 1.82% of all branches

60.048819970 seconds time elapsed

The complex math operations take some time to complete, stalling on average around 30 cycles per instruction. Instead of using 4 concurrent processes, I re-ran this using just 2 CPUs, eliminating 2 of the hyperthreads. This resulted in 25.4 stall cycles per instruction, showing that the hyperthreaded processes were stalling because of contention on the floating point units.
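
As a rough example, one way to re-run the test on two separate physical cores is to pin the stressors with taskset; the core numbering here is an assumption and depends on the machine's topology, so check lscpu first:

$ taskset -c 0,2 perf stat stress-ng --cpu 2 --cpu-method clongdouble -t 60 --times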

Perf stat is an incredibly useful tool for examining performance issues at a very low level. It is simple to use, yet provides excellent statistics that allow one to identify issues and fine-tune performance-critical code. Well worth using.

Read more
Jussi Pakkanen

Pimpl is a common idiom in C++. It means hiding the implementation details of a class with a construct that looks like this:

class pimpl;

class Thing {
private:
  pimpl *p;
public:
 ...
};

This cuts down on compilation time because you don’t have to #include all headers required for the implementation of this class. The downside is that p needs to be dynamically allocated in the constructor, which means a call to new. For often constructed objects this can be slow and lead to memory fragmentation.

Getting rid of the allocation

It turns out that you can get rid of the dynamic allocation with a little trickery. The basic approach is to reserve space in the parent object with, say, a char array. We can then construct the pimpl object there with placement new and destroy it by calling its destructor explicitly.

A header file for this kind of a class looks something like this:

#ifndef PIMPLDEMO_H
#define PIMPLDEMO_H

#define IMPL_SIZE 24

class PimplDemo {
private:
  char data[IMPL_SIZE];

public:
  PimplDemo();
  ~PimplDemo();

  int getNumber() const;
};

#endif

IMPL_SIZE is the size of the pimpl object. It needs to be manually determined. Note that the size may be different on different platforms.

The corresponding implementation looks like this.

#include"pimpldemo.h"
#include<vector>

using namespace std;

class priv {
public:
  vector<int> foo;
};

#define P_DEF priv *p = reinterpret_cast<priv*>(data)
#define P_CONST_DEF const priv *p = reinterpret_cast<const priv*>(data)

PimplDemo::PimplDemo() {
  static_assert(sizeof(priv) == sizeof(data), "Pimpl array has wrong size.");
  P_DEF;
  new(p) priv;
  p->foo.push_back(42); // Just for show.
}

PimplDemo::~PimplDemo() {
  P_DEF;
  p->~priv();
}

int PimplDemo::getNumber() const {
  P_CONST_DEF;
  return (int)p->foo.size();
}

Here we define two macros that create a variable for accessing the pimpl. At this point we can use it just as if it were defined in the traditional way. Note the static_assert that checks, at compile time, that the space we have reserved for the pimpl is the same as what the pimpl actually requires.

We can test that it works with a sample application.

#include<cstdio>
#include<vector>
#include"pimpldemo.h"

int main(int argc, char **argv) {
  PimplDemo p;
  printf("Should be 1: %d\n", p.getNumber());
  return 0;
}

The output is 1 as we would expect. The program is also Valgrind clean so it works just the way we want it to.

When should I use this technique?

Never!

Well, ok, never is probably a bit too strong. However this technique should be used very sparingly. Most of the time the new call is insignificant. The downside of this approach is that it adds complexity to the code. You also have to keep the backing array size up to date as you change the contents of the pimpl.

You should only use this approach if you have an object in the hot path of your application and you really need to squeeze the last bit of efficiency out of your code. As a rough guide only about 1 of every 100 classes should ever need this. And do remember to measure the difference before and after. If there is no noticeable improvement, don’t do it.

Read more
Richard Harding

Network graph of the combo loaded JavaScript.

Updated network graph

Back in January a side project was started to update the JavaScript used in Launchpad. Launchpad has been using YUI 3.3.0 for a long time, very successfully; however, recent advances in YUI 3.5 and higher have added some great tools for development that Launchpad can take advantage of. In order to make upgrading our YUI library version easier, Launchpad has moved to using a combo loader for serving out JavaScript.

This means that instead of a single launchpad.js file that can be upwards of 3MB in size, each request builds a list of the JavaScript modules needed for the current page to work, and the combo loader only sends down those modules. This drastically cuts down the JavaScript download size for users. These combo-loaded JavaScript files are also cached for speedy serving to other users of Launchpad.

The combo loader also allows us to specify which YUI version to load via a tweak to the URL. In this way we can easily test new versions of YUI side by side with the current stable version as they come out. This allows Launchpad to keep up with future YUI releases much faster.

We’re excited that today Launchpad has moved from YUI 3.3.0 to 3.5.1 and is now served by the combo loader. This change provides a faster experience for users along with easier maintenance and new JavaScript library features for developers.

We’ve still got more to do though. YUI just released version 3.7 and we aim to push that into production faster than ever before. Please let us know how these changes work for you.

Launchpad also wants to thank the folks over at YUI for continuing the great work on a tool that Launchpad heavily depends on.

Read more
Laura Czajkowski

Parallel testing is live

One of the projects the Launchpad squad (yellow) has been working on during the last cycle is parallel testing; this has now been completed and is in operation. WebOps have today finished setting up parallel testing in buildbot. Buildbot-poll has been updated to know about the new builders, and the developers have confirmed that [testfix] and automatic stable merging etc. work fine. Nothing should have changed except that builds now take 35 minutes rather than 6.5 hours.

If something goes wrong, http://lpbuildbot.canonical.com/waterfall and https://dev.launchpad.net/yellow/ParallelTestingTroubleshooting may be helpful. The buildbot master is still praseodymium, but the slaves are new: sluagh for devel, and radande for db-devel. If you need packages upgraded on the slaves, poke WebOps as before.

If you would like to follow the discussion on this topic, you'll find more on the Launchpad development mailing list.

Read more
Laura Czajkowski

Launchpad Builders update

We have recovered some of the affected builders; more will be coming back online later this week, with the remainder to follow in a few weeks when we have new hardware. Until then, Launchpad builders will be running at reduced build capacity.

We apologise for the inconvenience and we’re sorry for the disruption to your service.

Read more
Colin Ian King

Simple performance test of rdrand

My new Lenovo X230 laptop is equipped with an Intel(R) i5-3210M CPU (2.5 GHz, with 3.1 GHz Turbo) which supports the new Digital Random Number Generator (DRNG) - a high performance entropy and random number generator. The DRNG is read using the new Intel rdrand instruction, which can return 64, 32 or 16 bit random numbers.

The DRNG is described in detail in this article and provides very useful code examples in assembler and C which I used to write a simple and naive test to see how well the rdrand performs on my i5-3210M.

For my test, I simply read 100 million 64 bit random numbers on a single thread. The Intel literature states one can get up to about 70 million rdrand invocations per second on 8 threads, so my simple test is rather naive as it only exercises rdrand on one thread. Over a set of 10 iterations of my test, I'm getting around 40-45 nanoseconds per rdrand, or about 22-25 million rdrands per second, which is really impressive. The test is a mix of assembler and C, and is not totally optimal, so I am sure I can squeeze a little more performance out of it with some extra work.
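
The actual test is in the git repository mentioned below; as a rough illustration only (assumed names, not the original source), a naive single-threaded timing loop using the _rdrand64_step() intrinsic and clock_gettime(), built with gcc -mrdrnd, might look like this:

#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <immintrin.h>

#define N (100000000ULL)        /* 100 million 64 bit reads */

int main(void)
{
        unsigned long long v, sum = 0;
        uint64_t i;
        struct timespec t1, t2;
        double ns;

        clock_gettime(CLOCK_MONOTONIC, &t1);
        for (i = 0; i < N; i++) {
                while (!_rdrand64_step(&v))
                        ;       /* retry until valid data is returned */
                sum += v;       /* keep the value live */
        }
        clock_gettime(CLOCK_MONOTONIC, &t2);

        ns = (t2.tv_sec - t1.tv_sec) * 1e9 + (t2.tv_nsec - t1.tv_nsec);
        printf("%.1f ns per rdrand (checksum %llu)\n", ns / N, sum);
        return 0;
}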

The next test, I suspect, is to see just how random the data is and how well it compares to other software random number generators... but I will tinker with that after my vacation.

Anyhow, for reference, the test can be found here in my git repository.

Read more
Raphaël Badin

Edit 2011-11-15 08:18 UTC: The problem is now fixed and we’ve re-enabled the new menu.

Edit 2011-11-11 13:42 UTC: We’ve temporarily disabled the new menu while we fix some unfortunate side effect.

We’ve just deployed a new, simplified version of the branch menu displayed on the right hand side of personal code pages (e.g. personal page for the Launchpad team). It looks like this:

Old menu

New menu

Calculating the number of branches took way too much time for people/teams with a huge number of branches (e.g. https://code.launchpad.net/~ubuntu-branches), up to the point that they were getting timeouts.

The new design, along with optimisations we’ve made to the database queries, should improve performance for everyone.

Read more
jameinel

The last of my patches is queued up to land, so I figured I’d post an update about the performance improvements I’ve been working on. I’m also just excited about how well it has all come together.

There were essentially 3 changes that mattered for performance on large trees.

  1. Fixing iter_entries_by_dir() to preload the data in Repository-optimal ordering rather than by-request ordering. In large trees this was causing us to thrash and become pathologically slow. In the 70,000-file test tree, thrashing took about 3 minutes; the preloading version takes about 15s. This affected a lot of our commands, though I guess the next two fixes would actually reduce the number of commands affected by this.
  2. Fixing several code paths to use the optimized iter_changes() rather than the generic iter_changes(). The generic path walks both inventories with iter_entries_by_dir() and compares them. Our 2a format Repository can do iter_changes without loading the whole tree. (It internally uses a hash_trie to store the inventory, and so nodes with matching sub-trees can be skipped for comparison.) This generally shows up as something that was taking 15s (to load the whole inventory) dropping to <2s for the improved comparison. (bzr revert and bzr pull were both directly impacted here)
  3. Changing WT.set_parent_trees([one_tree]) to update itself using current_basis.iter_changes(one_tree), rather than setting the state from scratch. This basically adds another case where we can avoid reading the whole inventory state again, which is another 15s to <2s sort of change. This only showed up after fixing (2), because once the tree is loaded, the other actions are generally pretty quick. (bzr up, bzr pull)

This is the chart I put together for “whats-new-in-2.4.txt”. bzr-2.3.2 will have fix (1), but not (2) or (3), to give a feel for how much of an impact different fixes have had.

    bzr-2.3.1  bzr-2.3.2  bzr-2.4  action
        3m39s      1m08s    1m03s  bzr co --lightweight
          38s         8s       2s  bzr revert (in a clean tree)
        4m47s      3m56s      15s  bzr merge
        4m45s        20s       3s  bzr pull
        4m58s      3m00s       2s  bzr up
        9m33s        21s      19s  bzr uncommit (including a merge)
        4m44s        17s       2s  bzr uncommit (simple commit)

So yes, some operations that were taking almost 5 minutes have now dropped down to taking <3s.

You won’t see such a dramatic improvement for smaller trees, though most cases will see a pleasant improvement. Here is a short list for the 'Launchpad' tree (with ~8k items).

    bzr-2.3.1   bzr-2.4     action
    5.3s        5.2s        bzr co --lightweight
    0.9s        0.3s        bzr revert
    1.4s        0.4s        bzr pull
    3.9s        3.7s        bzr uncommit (with merge)
    0.9s        0.3s        bzr uncommit (without merge)

Anyway, I’m quite happy about how much better bzr-2.4 will be in large trees.


Read more
Curtis Hovey

Users of Launchpad Answers will see that asking a question, editing it, or posting a comment to it is faster. Email about question changes is sent a few minutes later. Many bugs relating to question emails were fixed as we moved the work of sending emails to the new process.

Users and answer contacts saw slow pages or timeout errors when working with questions in large projects. Simple actions like asking a question or providing an answer would fail. It was common to see errors converting bugs into questions. A few weeks ago, we saw that 8 of the top 10 kinds of timeouts involved questions; though this ratio was caused in part by the tremendous work of making other parts of Launchpad faster.

The root cause of the slow question pages was sending email to all the subscribers before showing the next page. The solution was to queue the event to notify subscribers and send the emails later. While updating the code, there were many opportunities to fix related Answers bugs. I am particularly pleased with the changes to the rules for creating a question. There were four lines of code, and while I intended to fix one line, I realised there was a bug related to each line of code. In a matter of minutes I had fixed four bugs. The most obvious change you will see is that question emails now state that you received the email because you asked the question, where previously you were merely described as a subscriber.

Read more
Matthew Revell

Last week I asked what you’d prefer: faster-loading pages that may have slightly inaccurate bug counts, or slower-loading pages where the bug counts were always guaranteed to be accurate.

Well, here’s the result:

85.9% of respondents wanted faster pages

85.9% of the people who responded to the question said they wanted faster pages, even if it meant that some of the bug counts might be a little inaccurate.

I asked a similar question on Facebook and 88% of the people who replied said they’d rather have faster pages.

We’re not going to do anything just yet. First, we want to do some more research. If we did implement this, though, it’s likely to affect only those projects with private bugs and, even then, show up only in side-boxes such as this:

Bug information box

There’s more in the full discussion on the launchpad-dev list.

Read more
Martin Pool

A nice mail from Martitza this morning:

Hi.  Some of you may recall that I’ve provided measurements of some slow operations in both bzr itself and bzr-explorer based on bzrlib 2.1 and earlier.  I don’t recall doing any testing on bzr 2.2.0 specifically, but now that I’m on bzr 2.2.1 bzr (and bzr-explorer) are noticeably faster and even more pleasant to use now.   So far it seems that each of the add/commit/status operations on sets of 10K+ files (500MB+ branches) takes no more than two or three minutes each.  This is twice as fast as some of my previous measurements on the same or smaller branches.  Also, memory consumption is much reduced, which I suspect helps a lot.  I even have one of these large repos hosted on an old 512MB Pentium 3 machine with no problem today.  This is wonderful!  I hesitate to name the people I know made this possible, because I do not want to leave out the many more people whose contributions I don’t know.  Please accept my thanks to the whole team.


Read more
Martin Pool

Three tips from Leonard’s lightning talk in Prague about writing faster Launchpadlib API clients:

1. Use the latest launchpadlib. It gets faster from one release to the next. (The versions in the current Ubuntu release should be fine; otherwise run from the branch or the latest tarball.)

2. Profile:

    import httplib2
    httplib2.debuglevel = 1

will show each http request and response, so that you can see what’s taking time.

3. Fetch objects only once:

Don’t do this:

    if bug.person is not None:
        print bug.person.name

Instead, do this:

    p = bug.person
    if p is not None:
        print p.name

In the first case, the client may fetch the Person object twice. (We may fix this in future.)

Read more
Martin Pool

Parth announced bzr-grep 0.2.0. Amongst other things, there are performance enhancements, about which Eli says:

Thanks, this looks great. I just tried it on Windows XP searching for a fixed string in the Emacs repository — took 28 seconds with a cold cache and only 7.5 with a warm cache, which is impressive.

By contrast, “grep -F -R” (with suitable --exclude patterns, to prevent it from searching binary files and inside .bzr) took about 12.5 seconds with a warm cache.


Read more