Canonical Voices

Posts tagged with 'performance'

Colin Ian King

Over the last few weeks I have been toying with the idea of adding more performance monitoring to stress-ng so one can see how much a stress test impacts on the CPU. The obvious choice to get such low level data is via Linux perf events using perf_event_open(2).

The man page for perf_event_open() provides plenty of information to get perf working from userspace, however, I was a bit stumped when I used several hardware perf events and then ran out of hardware Perf Monitoring Units (PMUs) resulting in some strange event counter readings. I discovered that when one runs out of PMUs, perf will multiplex event counting and so the perf counters need to be scaled by multiplying by PERF_FORMAT_TOTAL_TIME_ENABLED and divided by PERF_FORMAT_TOTAL_TIME_RUNNING.

Once I had figured this out, it was relatively plain sailing to get perf working in stress-ng.  So stress-ng V0.04.04 now supports the --perf option that just enables perf monitoring on each stress test being run, it is as simple as that. For multiple instances of a stress test, stress-ng will sum all the perf counters of each processes running the stress-test to provide an overall total.

The following example will run the stress-ng cache stress test.  The first run enables cache flushing and so fetches of data will cause cache misses.  The second run has no cache flushing and hence has far lower cache miss rate.

Note how the cache-flushing not only causes a far higher cache miss rate, but also reduces the effective number of instructions per cycle being executed and hence reduces the throughput (as one would expect).  With cache-flushing enabled I was seeing only 17.53 bogo ops per second compared to the 35.97 bogo ops per second with cache-flushing disabled.

The perf stats are enlightening. I still find it incredible that my laptop has so much computing power.  Some of the more compute bound stressors (such as the stress-ng bitops cpu stressor) are hitting over 20 billion instructions per second on my machine, which is rather impressive.  It seems that gcc optimization and the x86 superscaler micro-ops are working efficiently with some of these stress tests.

My hope is that the integrated perf monitoring in stress-ng will be instructive when comparing results on different processor architectures across the range of stress-ng stress tests.

Read more
Colin Ian King

Recently I was playing around with CPU loading and was trying to estimate the number of compute operations being executed on my machine.  In particular, I was interested to see how many instructions per cycle and stall cycles I was hitting on the more demanding instructions.   Fortunately, perf stat allows one to get detailed processor statistics to measure this.

In my first test, I wanted to see how the Intel rdrand instruction performed with 2 CPUs loaded (each with a hyper-thread):

$ perf stat stress-ng --rdrand 4 -t 60 --times
stress-ng: info: [7762] dispatching hogs: 4 rdrand
stress-ng: info: [7762] successful run completed in 60.00s
stress-ng: info: [7762] for a 60.00s run time:
stress-ng: info: [7762] 240.01s available CPU time
stress-ng: info: [7762] 231.05s user time ( 96.27%)
stress-ng: info: [7762] 0.11s system time ( 0.05%)
stress-ng: info: [7762] 231.16s total time ( 96.31%)

Performance counter stats for 'stress-ng --rdrand 4 -t 60 --times':

231161.945062 task-clock (msec) # 3.852 CPUs utilized
18,450 context-switches # 0.080 K/sec
92 cpu-migrations # 0.000 K/sec
821 page-faults # 0.004 K/sec
667,745,260,420 cycles # 2.889 GHz
646,960,295,083 stalled-cycles-frontend # 96.89% frontend cycles idle
13,702,533,103 instructions # 0.02 insns per cycle
# 47.21 stalled cycles per insn
6,549,840,185 branches # 28.334 M/sec
2,352,175 branch-misses # 0.04% of all branches

60.006455711 seconds time elapsed

stress-ng's rdrand test just performs a 64 bit rdrand read and loops on this until the data is ready, and performs this 32 times in an unrolled loop.  Perf stat shows that each rdrand + loop sequence on average consumes about 47 stall cycles showing that rdrand is probably just waiting for the PRNG block to produce random data.

My next experiment was to run the stress-ng ackermann stressor; this performs a lot of recursion, hence one should see a predominantly large amount of branching.

$ perf stat stress-ng --cpu 4 --cpu-method ackermann -t 60 --times
stress-ng: info: [7796] dispatching hogs: 4 cpu
stress-ng: info: [7796] successful run completed in 60.03s
stress-ng: info: [7796] for a 60.03s run time:
stress-ng: info: [7796] 240.12s available CPU time
stress-ng: info: [7796] 226.69s user time ( 94.41%)
stress-ng: info: [7796] 0.26s system time ( 0.11%)
stress-ng: info: [7796] 226.95s total time ( 94.52%)

Performance counter stats for 'stress-ng --cpu 4 --cpu-method ackermann -t 60 --times':

226928.278602 task-clock (msec) # 3.780 CPUs utilized
21,752 context-switches # 0.096 K/sec
127 cpu-migrations # 0.001 K/sec
927 page-faults # 0.004 K/sec
594,117,596,619 cycles # 2.618 GHz
298,809,437,018 stalled-cycles-frontend # 50.29% frontend cycles idle
845,746,011,976 instructions # 1.42 insns per cycle
# 0.35 stalled cycles per insn
298,414,546,095 branches # 1315.017 M/sec
95,739,331 branch-misses # 0.03% of all branches

60.032115099 seconds time elapsed about 35% of the time is used in branching and we're getting  about 1.42 instructions per cycle and no many stall cycles, so the code is most probably executing inside the instruction cache, which isn't surprising because the test is rather small.

My final experiment was to measure the stall cycles when performing complex long double floating point math operations, again with stress-ng.

$ perf stat stress-ng --cpu 4 --cpu-method clongdouble -t 60 --times
stress-ng: info: [7854] dispatching hogs: 4 cpu
stress-ng: info: [7854] successful run completed in 60.00s
stress-ng: info: [7854] for a 60.00s run time:
stress-ng: info: [7854] 240.00s available CPU time
stress-ng: info: [7854] 225.15s user time ( 93.81%)
stress-ng: info: [7854] 0.44s system time ( 0.18%)
stress-ng: info: [7854] 225.59s total time ( 93.99%)

Performance counter stats for 'stress-ng --cpu 4 --cpu-method clongdouble -t 60 --times':

225578.329426 task-clock (msec) # 3.757 CPUs utilized
38,443 context-switches # 0.170 K/sec
96 cpu-migrations # 0.000 K/sec
845 page-faults # 0.004 K/sec
651,620,307,394 cycles # 2.889 GHz
521,346,311,902 stalled-cycles-frontend # 80.01% frontend cycles idle
17,079,721,567 instructions # 0.03 insns per cycle
# 30.52 stalled cycles per insn
2,903,757,437 branches # 12.873 M/sec
52,844,177 branch-misses # 1.82% of all branches

60.048819970 seconds time elapsed

The complex math operations take some time to complete, stalling on average over 35 cycles per op.  Instead of using 4 concurrent processes, I re-ran this using just the two CPUs and eliminating 2 of the hyperthreads.  This resulted in 25.4 stall cycles per instruction showing that hyperthreaded processes are stalling because of contention on the floating point units.

Perf stat is an incredibly useful tool for examining performance issues at a very low level.   It is simple to use and yet provides excellent stats to allow one to identify issues and fine tune performance critical code.  Well worth using.

Read more
Colin Ian King

Simple performance test of rdrand

My new Lenovo X230 laptop is equipped with a Intel(R) i5-3210M CPU (2.5 GHz, with 3.1 GHz Turbo) which supports the new Digital Random Number Generator (DRNG) - a high performance entropy and random number generator.  The DNRG is read using the new Intel rdrand instruction which can return 64, 32 or 16 bit random numbers.

The DRNG is described in detail in this article and provides very useful code examples in assembler and C which I used to write a simple and naive test to see how well the rdrand performs on my i5-3210M.

For my test, I simply read 100 million 64 bit random numbers on a single thread. The Intel literature states one can get up to about 70 million rdrand invocations per second on 8 threads, so my simple test is rather naive as it only exercises rdrand on one thread.  For a set of 10 iterations on my test, I'm getting around 40-45 nanoseconds per rdrand, or about 22-25 million rdrands per second, which is really impressive.   The test is a mix of assembler and C, and is not totally optimal, so I am sure I can squeeze a little more performance out with some extra work.

The next test I suspect is to see just random the data is and to see how well it compares to other software random number generators... but I will tinker with that after my vacation.

Anyhow, for reference, the test can be found here in my git repository.

Read more