Canonical Voices

Colin Ian King

I got bitten this week by the clone() system call returning -EINVAL on aarch64 for code that worked fine on x86.  After re-reading the manual several times and looking at my code, I resorted to adding debug into the kernel to track down where the -EINVAL was occurring.

The answer to my issue is in arch/arm64/kernel/process.c, copy_thread():

if (stack_start) {
        if (is_compat_thread(task_thread_info(p)))
                childregs->compat_sp = stack_start;
        /* 16-byte aligned stack mandatory on AArch64 */
        else if (stack_start & 15)
                return -EINVAL;
        else
                childregs->sp = stack_start;
}

Ahah! The stack being passed into clone() has to be 16-byte aligned.  With this simple fix to my code, clone() worked.  Pity this was not in the documentation.
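For illustration, here is a minimal sketch of a clone() call with a correctly aligned stack (the stack size, flags and child function are arbitrary choices for the example, not taken from my original code):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

static int child_fn(void *arg)
{
    printf("child running\n");
    return 0;
}

int main(void)
{
    const size_t stack_size = 64 * 1024;    /* a multiple of 16 */
    /* aligned_alloc guarantees a 16-byte aligned base; since the size
       is a multiple of 16, the top of the stack is also aligned */
    char *stack = aligned_alloc(16, stack_size);

    if (!stack)
        exit(EXIT_FAILURE);
    /* the stack grows down, so pass the (16-byte aligned) top */
    int pid = clone(child_fn, stack + stack_size, SIGCHLD, NULL);
    if (pid < 0)
        exit(EXIT_FAILURE);
    (void)waitpid(pid, NULL, 0);
    free(stack);
    return 0;
}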

Read more
Colin Ian King

A frequently used incorrect realloc() idiom

While running static analysis on a lot of C source code, I keep finding a common incorrect programming idiom used to handle realloc() allocation failures, where a NULL is returned and the code bails out with some kind of exit failure status, something like the following:

ptr = realloc(ptr, new_size);
if (!ptr)
    return -ENOMEM; /* Failed, no memory! */

However, when realloc() fails it returns NULL and the original object remains unchanged and thus it is not freed.  So the above code leaks the memory pointed to by ptr if realloc() returns NULL.

This may be a moot point, since the error handling paths normally abort the program because we cannot proceed any further when out of memory.  However, there are occasions in code where ENOMEM may not be fatal; for example, the program may reallocate smaller buffers and retry, or free up space on the heap and retry.

A more correct programming idiom for realloc() perhaps should be:

tmp = realloc(ptr, new_size);
if (!tmp) {
    free(ptr);
    return -ENOMEM; /* Failed, no memory! */
}
ptr = tmp;

..which is not aesthetically pleasing, but does the trick of freeing the memory before we return.
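For the non-fatal ENOMEM case mentioned earlier, the same idiom extends naturally to a retry loop.  The following is just a sketch of the idea (the function name and the halving strategy are illustrative; a real version would also need to report the final size back to the caller):

#include <stdlib.h>

/* try to resize ptr, halving the request until it succeeds or
   drops below min_size; frees ptr and returns NULL on failure */
void *realloc_retry(void *ptr, size_t new_size, size_t min_size)
{
    while (new_size && new_size >= min_size) {
        void *tmp = realloc(ptr, new_size);

        if (tmp)
            return tmp;    /* success, possibly smaller than requested */
        new_size /= 2;     /* ENOMEM, try a smaller allocation */
    }
    free(ptr);             /* give up, but don't leak the original */
    return NULL;
}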

Anyhow, it is something to bear in mind next time one uses realloc().

Read more
Colin Ian King

ZFS quick start reference guide

Dustin Kirkland recently announced that ZFS is officially supported by Canonical for Ubuntu Xenial 16.04.  We've written a short reference guide to help with getting started and with understanding the ZFS terminology.  This touches on the basics of setting up ZFS pools as well as creating and using ZFS file systems.

If you are new to ZFS, I recommend having a look at the reference guide to get you started.

Read more
Colin Ian King

New "top" mode in eventstat

I wrote eventstat a few years ago to track wakeup events that keep a machine from being fully idle.  For Ubuntu Xenial Xerus 16.04 I've added a 'top'-like mode (enabled using the -T option).
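To try it out, run something like the following (the 10 second sample interval is just an illustrative choice; consult the eventstat man page for the exact usage):

sudo eventstat -T 10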

By widening the terminal one can see more of the Task, Init Function and Callback text, which is useful as these details can be rather lengthy.

Anyhow, just a minor feature change, but hopefully a useful one.

Read more
Colin Ian King

One issue when running parallel processes is contention of shared resources such as the Last Level Cache (aka LLC or L3 cache).  For example, a server may be running a set of Virtual Machines with processes that are memory and cache intensive, hence producing a large amount of cache activity.  This can impact the other VMs and is known as the "Noisy Neighbour" problem.

Fortunately the next generation Intel processors allow one to monitor and also fine tune cache allocation using Intel Cache Monitoring Technology (CMT) and Cache Allocation Technology (CAT).

Intel kindly loaned me a 12-thread development machine with CMT and CAT support to experiment with this technology using the Intel pqos tool.  For my experiment, I installed Ubuntu Xenial Server on the machine, then installed KVM and a VM instance of Ubuntu Xenial Server.  I then loaded the instance using stress-ng running a memory bandwidth stressor:

stress-ng --stream 1 -v --stream-l3-size 16M
..which allocates 16MB in 4 buffers and performs various reads, computes and writes on these, hence causing a "noisy neighbour".

Using pqos, one can monitor the cache and memory activity:
sudo apt-get install intel-cmt-cat
sudo modprobe msr
sudo pqos -r
TIME 2016-02-04 10:25:06
CORE    IPC     MISSES   LLC[KB]   MBL[MB/s]   MBR[MB/s]
   0   0.59    168259k    9144.0     12195.0        0.0
   1   1.33       107k       0.0         3.3        0.0
   2   0.20         2k       0.0         0.0        0.0
   3   0.70       104k       0.0         2.0        0.0
   4   0.86        23k       0.0         0.7        0.0
   5   0.38        42k      24.0         1.5        0.0
   6   0.12         2k       0.0         0.0        0.0
   7   0.24        48k       0.0         3.0        0.0
   8   0.61        26k       0.0         1.6        0.0
   9   0.37        11k     144.0         0.9        0.0
  10   0.48         1k       0.0         0.0        0.0
  11   0.45         2k       0.0         0.0        0.0
Now to run a stress-ng stream stressor on the host and see the performance while the noisy neighbour is also running:
stress-ng --stream 4 --stream-l3-size 2M --perf --metrics-brief -t 60
stress-ng: info: [2195] dispatching hogs: 4 stream
stress-ng: info: [2196] stress-ng-stream: stressor loosely based on a variant of the STREAM benchmark code
stress-ng: info: [2196] stress-ng-stream: do NOT submit any of these results to the STREAM benchmark results
stress-ng: info: [2196] stress-ng-stream: Using L3 CPU cache size of 2048K
stress-ng: info: [2196] stress-ng-stream: memory rate: 1842.22 MB/sec, 736.89 Mflop/sec (instance 0)
stress-ng: info: [2198] stress-ng-stream: memory rate: 1847.88 MB/sec, 739.15 Mflop/sec (instance 2)
stress-ng: info: [2199] stress-ng-stream: memory rate: 1833.89 MB/sec, 733.56 Mflop/sec (instance 3)
stress-ng: info: [2197] stress-ng-stream: memory rate: 1847.16 MB/sec, 738.86 Mflop/sec (instance 1)
stress-ng: info: [2195] successful run completed in 60.01s (1 min, 0.01 secs)
stress-ng: info: [2195] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
stress-ng: info: [2195] (secs) (secs) (secs) (real time) (usr+sys time)
stress-ng: info: [2195] stream 22101 60.01 239.93 0.04 368.31 92.10
stress-ng: info: [2195] stream:
stress-ng: info: [2195] 547,520,600,744 CPU Cycles 9.12 B/sec
stress-ng: info: [2195] 69,959,954,760 Instructions 1.17 B/sec (0.128 instr. per cycle)
stress-ng: info: [2195] 11,066,905,620 Cache References 0.18 B/sec
stress-ng: info: [2195] 11,065,068,064 Cache Misses 0.18 B/sec (99.98%)
stress-ng: info: [2195] 8,759,154,716 Branch Instructions 0.15 B/sec
stress-ng: info: [2195] 2,205,904 Branch Misses 36.76 K/sec ( 0.03%)
stress-ng: info: [2195] 23,856,890,232 Bus Cycles 0.40 B/sec
stress-ng: info: [2195] 477,143,689,444 Total Cycles 7.95 B/sec
stress-ng: info: [2195] 36 Page Faults Minor 0.60 sec
stress-ng: info: [2195] 0 Page Faults Major 0.00 sec
stress-ng: info: [2195] 96 Context Switches 1.60 sec
stress-ng: info: [2195] 0 CPU Migrations 0.00 sec
stress-ng: info: [2195] 0 Alignment Faults 0.00 sec
.. so about 1842 MB/sec memory rate and 736 Mflop/sec per CPU across 4 CPUs.  And pqos shows the cache/memory activity as:
sudo pqos -r
TIME 2016-02-04 10:35:27
CORE    IPC     MISSES   LLC[KB]   MBL[MB/s]   MBR[MB/s]
   0   0.14     43060k    1104.0      2487.9        0.0
   1   0.12   3981523k    2616.0      2893.8        0.0
   2   0.26       320k      48.0        18.0        0.0
   3   0.12   3980489k    1800.0      2572.2        0.0
   4   0.12   3979094k    1728.0      2870.3        0.0
   5   0.12   3970996k    2112.0      2734.5        0.0
   6   0.04        20k       0.0         0.3        0.0
   7   0.04        29k       0.0         1.9        0.0
   8   0.09       143k       0.0         5.9        0.0
   9   0.15         0k       0.0         0.0        0.0
  10   0.07         2k       0.0         0.0        0.0
  11   0.13         0k       0.0         0.0        0.0
Using pqos again, we can find out how much LLC cache the processor has:
sudo pqos -v
NOTE: Mixed use of MSR and kernel interfaces to manage
CAT or CMT & MBM may lead to unexpected behavior.
INFO: Monitoring capability detected
INFO: CPUID.0x7.0: CAT supported
INFO: CAT details: CDP support=0, CDP on=0, #COS=16, #ways=12, ways contention bit-mask 0xc00
INFO: LLC cache size 9437184 bytes, 12 ways
INFO: LLC cache way size 786432 bytes
INFO: L3CA capability detected
INFO: Detected PID API (perf) support for LLC Occupancy
INFO: Detected PID API (perf) support for Instructions/Cycle
INFO: Detected PID API (perf) support for LLC Misses
ERROR: IPC and/or LLC miss performance counters already in use!
Use -r option to start monitoring anyway.
Monitoring start error on core(s) 5, status 6
So this CPU has 12 cache "ways", each of 786432 bytes (768K).  One or more "Class of Service" (COS) types can be defined that use one or more of these ways.  One uses a bitmap, with each bit representing a way, to indicate how the ways are to be used by a COS.  For example, to use all 12 ways on my example machine, the bitmap is 0xfff (111111111111).  A way can be exclusively mapped to a COS, shared, or not used at all.  Note that the ways in the bitmap must be contiguously allocated, so a mask such as 0xf3f (111100111111) is invalid and cannot be used.

In my experiment, I want to create 2 COS types.  The first COS will have just 1 cache way assigned to it; CPU 0 will be bound to this COS and the VM instance pinned to CPU 0.  The second COS will have the other 11 cache ways assigned to it, and all the other CPUs can use this COS.

So, create COS #1 with just 1 way of cache, and bind CPU 0 to this COS, and pin the VM to CPU 0:
sudo pqos -e llc:1=0x0001
sudo pqos -a llc:1=0
sudo taskset -apc 0 $(pidof qemu-system-x86_64)
And create COS #2, with 11 ways of cache and bind CPUs 1-11 to this COS:
sudo pqos -e "llc:2=0x0ffe"
sudo pqos -a "llc:2=1-11"
And let's see the new configuration:
sudo pqos  -s
NOTE: Mixed use of MSR and kernel interfaces to manage
CAT or CMT & MBM may lead to unexpected behavior.
L3CA COS definitions for Socket 0:
L3CA COS0 => MASK 0xfff
L3CA COS1 => MASK 0x1
L3CA COS2 => MASK 0xffe
L3CA COS3 => MASK 0xfff
L3CA COS4 => MASK 0xfff
L3CA COS5 => MASK 0xfff
L3CA COS6 => MASK 0xfff
L3CA COS7 => MASK 0xfff
L3CA COS8 => MASK 0xfff
L3CA COS9 => MASK 0xfff
L3CA COS10 => MASK 0xfff
L3CA COS11 => MASK 0xfff
L3CA COS12 => MASK 0xfff
L3CA COS13 => MASK 0xfff
L3CA COS14 => MASK 0xfff
L3CA COS15 => MASK 0xfff
Core information for socket 0:
Core 0 => COS1, RMID0
Core 1 => COS2, RMID0
Core 2 => COS2, RMID0
Core 3 => COS2, RMID0
Core 4 => COS2, RMID0
Core 5 => COS2, RMID0
Core 6 => COS2, RMID0
Core 7 => COS2, RMID0
Core 8 => COS2, RMID0
Core 9 => COS2, RMID0
Core 10 => COS2, RMID0
Core 11 => COS2, RMID0
..showing Core 0 bound to COS1 and Cores 1-11 bound to COS2, with COS1 having 1 cache way and COS2 the remaining 11 cache ways.
Now re-run the stream stressor and see if the VM has less impact on the L3 cache:
stress-ng --stream 4 --stream-l3-size 1M --perf --metrics-brief -t 60
stress-ng: info: [2232] dispatching hogs: 4 stream
stress-ng: info: [2233] stress-ng-stream: stressor loosely based on a variant of the STREAM benchmark code
stress-ng: info: [2233] stress-ng-stream: do NOT submit any of these results to the STREAM benchmark results
stress-ng: info: [2233] stress-ng-stream: Using L3 CPU cache size of 1024K
stress-ng: info: [2235] stress-ng-stream: memory rate: 2616.90 MB/sec, 1046.76 Mflop/sec (instance 2)
stress-ng: info: [2233] stress-ng-stream: memory rate: 2562.97 MB/sec, 1025.19 Mflop/sec (instance 0)
stress-ng: info: [2234] stress-ng-stream: memory rate: 2541.10 MB/sec, 1016.44 Mflop/sec (instance 1)
stress-ng: info: [2236] stress-ng-stream: memory rate: 2652.02 MB/sec, 1060.81 Mflop/sec (instance 3)
stress-ng: info: [2232] successful run completed in 60.00s (1 min, 0.00 secs)
stress-ng: info: [2232] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
stress-ng: info: [2232] (secs) (secs) (secs) (real time) (usr+sys time)
stress-ng: info: [2232] stream 62223 60.00 239.97 0.00 1037.01 259.29
stress-ng: info: [2232] stream:
stress-ng: info: [2232] 547,364,185,528 CPU Cycles 9.12 B/sec
stress-ng: info: [2232] 97,037,047,444 Instructions 1.62 B/sec (0.177 instr. per cycle)
stress-ng: info: [2232] 14,396,274,512 Cache References 0.24 B/sec
stress-ng: info: [2232] 14,390,808,440 Cache Misses 0.24 B/sec (99.96%)
stress-ng: info: [2232] 12,144,372,800 Branch Instructions 0.20 B/sec
stress-ng: info: [2232] 1,732,264 Branch Misses 28.87 K/sec ( 0.01%)
stress-ng: info: [2232] 23,856,388,872 Bus Cycles 0.40 B/sec
stress-ng: info: [2232] 477,136,188,248 Total Cycles 7.95 B/sec
stress-ng: info: [2232] 44 Page Faults Minor 0.73 sec
stress-ng: info: [2232] 0 Page Faults Major 0.00 sec
stress-ng: info: [2232] 72 Context Switches 1.20 sec
stress-ng: info: [2232] 0 CPU Migrations 0.00 sec
stress-ng: info: [2232] 0 Alignment Faults 0.00 sec
Now with the noisy neighbour VM constrained to use just 1 way of L3 cache, the stream stressor on the host can achieve about 2592 MB/sec and about 1030 Mflop/sec per CPU across 4 CPUs.

This is a relatively simple example.  With the ability to monitor cache and memory bandwidth activity, one can carefully tune a system to make best use of the limited L3 cache resource and maximise throughput where needed.

There are many applications where Intel CMT/CAT can be useful, for example fine tuning containers or VM instances, or pinning user space networking buffers to cache ways in DPDK for improved throughput.

Read more
Colin Ian King

Pagemon improvements

Over the past month I've been finding the odd moments [1] to add some small improvements and fix a few bugs in pagemon (a tool to monitor process memory).  The original code went from a sketchy proof-of-concept prototype to a somewhat more usable tool in a few weeks, so my main concern recently has been to clean up the code and make it more efficient.

With the use of tools such as valgrind's cachegrind and perf I was able to work on some of the code hot-spots [2] and reduce the CPU utilisation from ~50-60% down to 5-9% on my laptop, so it's definitely more machine friendly now.  In addition I've added the following small features:

  • Now one can specify the name of a process to monitor as well as the PID.  This also allows one to run pagemon on itself(!), which is a bit meta.
  • Perf events showing Page Faults and Kernel Page Allocates and Frees, toggled on/off with the 'p' key.
  • Improved and snappier clean up and exit when a monitored process exits.
  • Far more efficient page map reading and rendering.
  • Out of Memory (OOM) scores added to the VM statistics window.
  • Process activity (busy, sleeping, etc.) added to the VM statistics window.
  • Zoom mode min/max with '[' (min) and ']' (max) keys.
  • Close pop-up windows with key 'c'.
  • Improved handling of rapid map expansion and shrinking.
  • Jump to end of map using 'End' key.
  • Improved the man page.
I've tried to keep the tool small and focused and I don't want feature bloat to make it unwieldy and overly complex.  "Do one job, and do it well" is the philosophy behind pagemon.  At just 1500 lines of C, it is as complex as I want it to be for now.

Version 0.01.08 should be hitting the Ubuntu 16.04 Xenial Xerus archive in the next 24 hours or so.  I also have the latest version in my PPA (ppa:colin-king/pagemon) built for Trusty, Vivid, Wily and Xenial.


Pagemon is useful for spotting unexpected memory activity and it is just interesting watching the behaviour of memory-hungry processes such as web browsers and Virtual Machines.

Notes:
[1] Mainly very late at night when I can't sleep (but that's another story...).  The git log says it all.
[2] Reading in /proc/$PID/maps and efficiently reading per page data from /proc/$PID/pagemap

Read more
Colin Ian King

Forcing out bugs with stress-ng

stress-ng logo
Over the past few months I've been adding several new stress tests and a lot more stressor options to stress-ng for Ubuntu 16.04 Xenial Xerus.  I try to track new system calls and features landing in the kernel and where appropriate add a stress test to try and force out bugs.

Stress-ng has found various kernel bugs, such as CVE-2015-1333 and LP:#1526811, as well as bugs in user space (for example, daemons crashing) when memory pressure is very high.  Simple abusive tricks, such as aggressively trying to allocate every free page in memory, are useful in finding drivers that don't necessarily check for memory allocation failures.  For example, today I was caught out when a USB ethernet dongle driver didn't check for a null pointer due to an allocation failure and stress-ng ended up triggering a kernel oops (fortunately, this bug was fixed in a recent kernel).
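As an illustration, a memory pressure run of that kind can be kicked off with something like the following (the choice of stressors here is just an example; specifying 0 instances runs one stressor per online CPU):

stress-ng --brk 0 --stack 0 --bigheap 0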

The underlying philosophy for stress-ng is "use and abuse standard Linux interfaces and see how far we can push them to destruction".  I'm pretty sure there are plenty of creative folk out there who can dream up dastardly ways to make stress-ng even more stressy, so contributions are always warmly accepted!  I have a mirrored copy of the git repository on github to make it easy for developers to get their hands on the code.

We've been using stress-ng on ARM based SoC kernels to force out bugs and this has been useful in finding areas where non-swap based systems break. You really don't want your kernel oopsing or processes segfaulting when an IoT device has run low on memory.

My original intent for stress-ng was just to make a system run hot and force thermal overruns. However, I soon discovered it is useful to force kernel bugs out by attempting to (pathologically) thrash most of the system calls.  I've also added perf stats to stress-ng to track performance of standard stress scenarios over kernel versions to get an early warning of any potential performance regressions.  So stress-ng is a bit of a mixed bag of stress tests and performance measuring goodness.

When I get some free time I hope to run stress-ng against a GCOV instrumented kernel and see how much test coverage I get. I suspect there is a lot of core kernel functionality still not being touched by stress-ng.

I've also tried to make stress-ng portable, so it can build fine on GNU/Hurd and Debian kFreeBSD (with the Linux specific tests not built in, of course). It also contains some architecture specific features, such as handling the data and instruction cache, as well as the x86 rdrand instruction and cache line locking. If there are any ARM specific features that can be stressed I'd like to know and perhaps implement stressors for them.

Anyhow, I believe stress-ng is almost feature complete for Ubuntu Xenial; however, I expect it to grow in features over time since there is always new functionality landing in the Linux kernel that needs to be thrash tested.

Read more
Colin Ian King

While looking at some code in the Linux kernel this morning I spotted a few FIXME comments, and that got me wondering just how many there are in the source code.  After a quick grep I found nearly 4200 in v4.4-rc8, and that got me thinking about other similar comment tags, such as TODO, and how their numbers have been changing over time.
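The quick grep was something along these lines (the tree path is illustrative):

grep -r "FIXME" --include="*.[ch]" linux-4.4-rc8 | wc -l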


So the trends are certainly upwards, but then again, so is the size of the kernel source.

Note: Data gathered using sloccount on the lines of C in the kernel source.

Using the sloccount data I then calculated the number of FIXMEs and TODOs per 1000 lines of code to see what the underlying trend is.

So FIXMEs are actually dropping relative to the size of the kernel, whereas TODOs are increasing.

Of course, these statistics are somewhat bogus because they depend on kernel developers adding and removing FIXMEs and TODOs in a consistent manner; however, it is interesting to see how many such comments exist and hence how much work has been tagged as "to be done later".  I wonder how this compares to other large open source projects.

Read more
Colin Ian King

While developing stress-ng I wanted to be able to see if the various memory stressors were touching memory in the way I had anticipated.  While digging around in the Linux documentation I discovered the very useful soft-dirty bit on Page Table Entries (PTEs) that gets set when a page is written to.  The mechanism for checking the soft-dirty bit is described in Documentation/vm/soft-dirty.txt; one needs to:

  1. Clear the soft-dirty bits on the PTEs on a chosen process by writing "4" to /proc/$PID/clear_refs
  2. Wait a while for some page activity to occur
  3. Read the soft-dirty bits on the PTEs to see which pages got written to.
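As a rough sketch of steps 1 and 3 (my illustration, not the pagemon code; it assumes 4K pages and relies on each pagemap entry being 64 bits, with bit 55 being the soft-dirty bit):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* clear the soft-dirty bits of all the PTEs belonging to pid */
static int clear_soft_dirty(pid_t pid)
{
    char path[64];

    snprintf(path, sizeof(path), "/proc/%d/clear_refs", (int)pid);
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t ret = write(fd, "4", 1);
    close(fd);
    return (ret == 1) ? 0 : -1;
}

/* return 1 if the page containing addr has been written to since the
   soft-dirty bits were last cleared, 0 if not, -1 on error */
static int page_soft_dirty(pid_t pid, uintptr_t addr)
{
    char path[64];
    uint64_t entry;

    snprintf(path, sizeof(path), "/proc/%d/pagemap", (int)pid);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    off_t offset = (off_t)(addr / 4096) * sizeof(entry);
    ssize_t ret = pread(fd, &entry, sizeof(entry), offset);
    close(fd);
    if (ret != (ssize_t)sizeof(entry))
        return -1;
    return (int)((entry >> 55) & 1);    /* bit 55 = soft-dirty */
}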
Not too tricky, so how about using this neat feature? While on a rather long and dull flight over the Atlantic back in August, I hacked up a very crude ncurses based tool to continually check the PTEs of a given process and display the soft-dirty activity in real time.  During this Christmas break I picked this code up and re-worked it into a more polished tool.  One can scroll up/down the memory maps and also select a page and view the contents changing in real time.  The tool identifies the type of memory mapping a page belongs to, so one can easily scan through memory looking at pages belonging to data, code, heap, stack, anonymous mappings or even swapped-out pages.

Running it on X, compiz, firefox or thunderbird is quite instructive as one can see a lot of page activity on the large heap allocations.  The ability to see pages getting swapped out when memory pressure is high is also rather useful.

Page view of Xorg
Memory view of stack
The code is still early development quality (so expect some buglets!) and I need to optimise it in a lot of places, but for now it works well enough to be a fairly interesting tool. I've currently got a package built for Ubuntu Xenial in ppa:colin-king/pagemon and the source can be cloned from http://kernel.ubuntu.com/git/cking/pagemon.git/

So, to install on Xenial, currently one needs to do:

sudo add-apt-repository ppa:colin-king/pagemon
sudo apt-get update
sudo apt-get install pagemon

I may be adding a few more features in the next few weeks, and then getting the tool into Ubuntu and Debian.

As an example, to run it on Xorg it is invoked as:

sudo pagemon -p $(pidof Xorg)

Unfortunately sudo is required to allow one to dig so intrusively into a running process. For more details on how to use pagemon, consult the pagemon man page, or press "h" or "?" while running pagemon.

Read more
Colin Ian King

The other day I needed to incorporate a large blob of binary data in a C program. One simple way is to use xxd, for example, on the binary data in file "blob", one can do:

xxd --include blob 

unsigned char blob[] = {
0xc8, 0xe5, 0x54, 0xee, 0x8f, 0xd7, 0x9f, 0x18, 0x9a, 0x63, 0x87, 0xbb,
0x12, 0xe4, 0x04, 0x0f, 0xa7, 0xb6, 0x16, 0xd0, 0x70, 0x06, 0xbc, 0x57,
0x4b, 0xaf, 0xae, 0xa2, 0xf2, 0x6b, 0xf4, 0xc6, 0xb1, 0xaa, 0x93, 0xf2,
0x12, 0x39, 0x19, 0xee, 0x7c, 0x59, 0x03, 0x81, 0xae, 0xd3, 0x28, 0x89,
0x05, 0x7c, 0x4e, 0x8b, 0xe5, 0x98, 0x35, 0xe8, 0xab, 0x2c, 0x7b, 0xd7,
0xf9, 0x2e, 0xba, 0x01, 0xd4, 0xd9, 0x2e, 0x86, 0xb8, 0xef, 0x41, 0xf8,
0x8e, 0x10, 0x36, 0x46, 0x82, 0xc4, 0x38, 0x17, 0x2e, 0x1c, 0xc9, 0x1f,
0x3d, 0x1c, 0x51, 0x0b, 0xc9, 0x5f, 0xa7, 0xa4, 0xdc, 0x95, 0x35, 0xaa,
0xdb, 0x51, 0xf6, 0x75, 0x52, 0xc3, 0x4e, 0x92, 0x27, 0x01, 0x69, 0x4c,
0xc1, 0xf0, 0x70, 0x32, 0xf2, 0xb1, 0x87, 0x69, 0xb4, 0xf3, 0x7f, 0x3b,
0x53, 0xfd, 0xc9, 0xd7, 0x8b, 0xc3, 0x08, 0x8f
};
unsigned int blob_len = 128;

..and redirecting the output from xxd into a C source file and compiling it makes this simple and easy to do.

However, for large binary blobs, the C source can be huge, so an alternative way is to use the linker ld as follows:

ld -s -r -b binary -o blob.o blob  

...and this generates the blob.o object file. To reference the data in a program one needs to determine the symbol names of the start, the end and perhaps the length too. One can use objdump to find these as follows:

 objdump -t blob.o  
blob.o: file format elf64-x86-64
SYMBOL TABLE:
0000000000000000 l d .data 0000000000000000 .data
0000000000000080 g .data 0000000000000000 _binary_blob_end
0000000000000000 g .data 0000000000000000 _binary_blob_start
0000000000000080 g *ABS* 0000000000000000 _binary_blob_size

To access the data in C, use something like the following:

cat test.c

#include <stdio.h>

int main(void)
{
    /* symbols created by ld; only their addresses are meaningful */
    extern const char _binary_blob_start[], _binary_blob_end[];
    const char *start = _binary_blob_start, *end = _binary_blob_end;

    printf("Data: %p..%p (%zu bytes)\n",
        (const void *)start, (const void *)end,
        (size_t)(end - start));
    return 0;
}

...and link and run as follows:

gcc test.c blob.o -o test
./test
Data: 0x601038..0x6010b8 (128 bytes)

So for large blobs, I personally favour using ld to do the hard work for me, since I don't need another tool (such as xxd) and it removes the need to convert a blob into C source and then compile it.

Read more
Colin Ian King

Firmware Test Suite, 15.12.00

The Canonical Hardware Enablement Team and I are continuing the work to enhance the Firmware Test Suite (fwts) on a regular monthly cadence.  The latest changes in FWTS 15.12.00 include the following new features and changes:

  • ACPI: ASPT (System Performance Tuning Table)
  • Update ACPICA to version 20151124 
  • Boot path sync with UEFI specification 2.5 adding:
    • SD device path 
    • Bluetooth device path
    • Wireless device path
    • Ramdisk device path
  • Mixed tests and test category options, e.g. fwts --uefitests klog cpufreq will run all the UEFI tests as well as the klog and cpufreq tests
  • A new --log-level option that allows one to log tests that fail at a specified level or higher, e.g. fwts --log-level high will just show high and critical test failures.
  • The acpidump table dump pseudo-test is now aligned with the ACPICA table dumping (disassembly) engine.
  • Various bug fixes.
It is also worth mentioning that the UEFI Board of Directors recommends FWTS as the ACPI v5.1 Self-Certification Test (SCT). This is exciting news and we welcome this decision for FWTS to be recognised in this way.

We are also very grateful for the community contributions to FWTS; this buy-in from the community is appreciated and makes FWTS a better tool to support different architectures and systems.

As ever, with new releases, please consult the release notes.

Read more
Colin Ian King

Another seasonal obfuscated C program

During an idle moment while on vacation I was reading the paper "Reliable Two-Dimensional Graphing Methods for Mathematical Formulae with Two Free Variables" by Jeff Tupper and stumbled upon a rather amusing inequality at the end of section 12.  In tribute to this most excellent graphing formula, I felt inspired to use the same concept in my Christmas 2015 obfuscated C offering.

tupper.c

I cheated a little by also using a Makefile, but I hope this adds to the magic of the resulting code.  To make the program more fun I thought I'd use a lot of confusing logic operator names in the code and mix in some incorrect Roman numeral constants too.  I could have obfuscated the code more and made it smaller, but life is too short; I will leave that as an exercise for the reader.

The source is available in my Christmas Obfuscated C git repository if you want to try it out:

git clone https://github.com/ColinIanKing/christmas-obfuscated-C.git
cd christmas-obfuscated-C/2015
make
./tupper | less

Enjoy!

Read more
Colin Ian King

Using PR_SET_PDEATHSIG to reap child processes

The prctl() system call provides a rather useful PR_SET_PDEATHSIG option that allows a signal to be sent to a child process when the parent unexpectedly dies. A quick and dirty mechanism is to trigger the SIGHUP or SIGKILL signal to kill the child immediately, or, perhaps more elegantly, to invoke a resource tidy-up before exiting.

In the trivial example below, we use the SIGUSR1 signal to inform the child that the parent has died. I know printf() should not be used in a signal handler; it just makes the example simpler.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
#include <sys/prctl.h>
#include <err.h>

void sigusr1_handler(int dummy)
{
    printf("Parent died, child now exiting\n");
    exit(0);
}

int main(void)
{
    pid_t pid;

    pid = fork();
    if (pid < 0)
        err(1, "fork failed");
    if (pid == 0) {
        /* Child */
        if (signal(SIGUSR1, sigusr1_handler) == SIG_ERR)
            err(1, "signal failed");
        if (prctl(PR_SET_PDEATHSIG, SIGUSR1) < 0)
            err(1, "prctl failed");

        for (;;)
            sleep(60);
    }
    if (pid > 0) {
        /* Parent */
        sleep(5);
        printf("Parent exiting...\n");
    }

    return 0;
}

..the child process sits in an infinite loop, performing 60 second sleeps.  The parent sleeps for 5 seconds and then exits.  The child is then sent a SIGUSR1 signal and the handler exits.  In practice the signal handler would be used to trigger a more sophisticated clean-up of resources if required.
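One caveat worth noting (my addition, not covered by the example above): if the parent dies in the window between fork() and the child's prctl() call, no signal will be delivered.  A common defensive pattern is to re-check the parent straight after setting the death signal, along these lines; the check assumes the orphaned child is re-parented to init (PID 1), which is the common case but not guaranteed (e.g. with subreapers):

if (prctl(PR_SET_PDEATHSIG, SIGUSR1) < 0)
    err(1, "prctl failed");
/* parent may have died before prctl() took effect */
if (getppid() == 1)
    raise(SIGUSR1);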

Anyhow, this is a useful Linux feature that seems to be overlooked.

Read more
Colin Ian King

The Intel Platform Shared Resource Monitoring features were introduced in the Intel Xeon E5v3 processor family. These new features provide a mechanism to measure platform shared resources, such as L3 cache occupancy via Cache Monitoring Technology (CMT) and memory bandwidth utilisation via Memory Bandwidth Monitoring (MBM).

Intel have written a Platform Quality of Service Tool (pqos) to use these monitoring features and I've packaged this up for Ubuntu 16.04 Xenial Xerus.

To install, use:

sudo apt-get install intel-cmt-cat

The tool requires access to the Intel MSRs, so one also has to load the msr module if it is not already loaded:

sudo modprobe msr

To see the Last Level Cache (llc) utilisation on a system, listing the most used first, use:

sudo pqos -T

pqos running on a 48 thread Xeon based server

The -p option allows one to specify specific monitoring events for specific process IDs. Event types can be Last Level Cache occupancy (llc), Local Memory Bandwidth (mbl) and Remote Memory Bandwidth (mbr).  For example, on a Xeon E5-2680 I have just the Last Level Cache monitoring capability, so let's view the llc for stress-ng while running some VM stressor tests:

sudo pqos -T -p llc:$(pidof stress-ng | tr ' ' ',')

pqos showing equally shared cache between two stressor processes

Cache and Memory Bandwidth monitoring is especially useful to examine the impact of memory/cache hogging processes (such as VM instances).  pqos allows one to identify these processes simply and effectively.

Future Intel Xeon processors will provide capabilities to configure cache resources for specific classes of service using Intel Cache Allocation Technology (CAT).  The pqos tool allows one to modify the CAT settings; however, not having access to a CPU with these capabilities, I was unable to experiment with this feature.  I refer you to the pqos manual for more details on this useful feature.  The beauty of CAT is that it allows one to tweak and fine-tune the cache allocation for specific demanding use cases.  Given that the cache is a shared resource that can be impacted by badly behaving processes, the ability to tune the cache behaviour is potentially a big performance win.

For more details of these features, see the Intel 64 and IA-32 Architectures Software Developer's Manual, section 17.15 "Platform Shared Resource Monitoring: Cache Monitoring Technology" and section 17.16 "Platform Shared Resource Control: Cache Allocation Technology".

Read more
Colin Ian King

Firmware Test Suite in active development

Another month passes and another release of the Firmware Test Suite is being prepared.  The tool has been growing in functionality (and size!) over time, so I thought I would look at some statistics to see any trends.

There has been a steady growth in the number of authors sending patches to the Firmware Test Suite.  Community contributions to a project are a sign that we have buy-in from different parties, so I'm pleased to see contributions from Intel, Linaro and Red Hat.  Patches are always welcome; send them to fwts-devel@ubuntu.com for review and inclusion into the project.

The number of commits is one metric to see if the project is growing healthily. We're adding about 35 patches a month, about 3/4 of which is added functionality, the rest are fixes and general code maintenance.

One more meaningless but interesting metric is code size. I used sloccount to count the lines of C in the project.  We're seeing ~2200 lines of code being added per month, mainly through added test functionality.

Kudos to the Canonical Hardware Enablement firmware folk for wrangling the patches and preparing each FWTS release.

Read more
Colin Ian King

A useful feature on modern x86 CPUs is the Running Average Power Limit (RAPL) interface that allows one to monitor System on Chip (SoC) power consumption.  Combine this data with the ability to accurately measure CPU cycles and instructions via perf and we can get a rough estimate of the energy consumed to perform a single operation on the CPU.
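As an aside, the RAPL energy counters can also be read directly via the kernel's powercap sysfs interface, e.g. for the first package domain (the exact domain naming varies by CPU):

cat /sys/class/powercap/intel-rapl:0/energy_uj

..which reports the cumulative energy consumed by that domain in microjoules.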

power-calibrate is a simple tool that I hacked up to perform some synthetic loading of the processor, gather the RAPL and CPU stats, and use simple linear regression to compute some power related metrics.

In the example below, I run power-calibrate on an Intel i5-3210M (2 cores, 4 threads) with each test run taking 10 seconds (-r 10), using the RAPL interface to measure power (-R) and gathering 11 samples (-s 11) on CPU threads 1..4:

power-calibrate -r 10 -R  -s 11
CPU load User Sys Idle Run Ctxt/s IRQ/s Ops/s Cycl/s Inst/s Watts
0% x 1 0.1 0.1 99.8 1.0 181.6 61.1 0.0 2.5K 380.2 2.485
0% x 2 0.0 1.0 98.9 1.2 161.8 63.8 0.0 5.7K 0.8K 2.366
0% x 3 0.1 1.3 98.5 1.1 204.2 75.2 0.0 7.6K 1.9K 2.518
0% x 4 0.1 0.1 99.9 1.0 124.7 44.9 0.0 11.4K 2.7K 2.167
10% x 1 2.4 0.2 97.4 1.5 203.8 104.9 21.3M 123.1M 297.8M 2.636
10% x 2 5.1 0.0 94.9 1.3 185.0 137.1 42.0M 243.0M 0.6B 2.754
10% x 3 7.5 0.2 92.3 1.2 275.3 190.3 58.1M 386.9M 0.8B 3.058
10% x 4 10.0 0.1 89.9 1.9 213.5 206.1 64.5M 486.1M 0.9B 2.826
20% x 1 5.0 0.1 94.9 1.0 288.8 170.0 69.6M 403.0M 1.0B 3.283
20% x 2 10.0 0.1 89.9 1.6 310.2 248.7 96.4M 0.8B 1.3B 3.248
20% x 3 14.6 0.4 85.0 1.7 640.8 450.4 238.9M 1.7B 3.3B 5.234
20% x 4 20.0 0.2 79.8 2.1 633.4 514.6 270.5M 2.1B 3.8B 4.736
30% x 1 7.5 0.2 92.3 1.4 444.3 278.7 149.9M 0.9B 2.1B 4.631
30% x 2 14.8 1.2 84.0 1.2 541.5 418.1 200.4M 1.7B 2.8B 4.617
30% x 3 22.6 1.5 75.9 2.2 960.9 694.3 365.8M 2.6B 5.1B 7.080
30% x 4 30.0 0.2 69.8 2.4 959.2 774.8 421.1M 3.4B 5.9B 5.940
40% x 1 9.7 0.3 90.0 1.7 551.6 356.8 201.6M 1.2B 2.8B 5.498
40% x 2 19.9 0.3 79.8 1.4 668.0 539.4 288.0M 2.4B 4.0B 5.604
40% x 3 29.8 0.5 69.7 1.8 1124.5 851.8 481.4M 3.5B 6.7B 7.918
40% x 4 40.3 0.5 59.2 2.3 1186.4 1006.7 0.6B 4.6B 7.7B 6.982
50% x 1 12.1 0.4 87.4 1.7 536.4 378.6 193.1M 1.1B 2.7B 4.793
50% x 2 24.4 0.4 75.2 2.2 816.2 668.2 362.6M 3.0B 5.1B 6.493
50% x 3 35.8 0.5 63.7 3.1 1300.2 1004.6 0.6B 4.2B 8.2B 8.800
50% x 4 49.4 0.7 49.9 3.8 1455.2 1240.0 0.7B 5.7B 9.6B 8.130
60% x 1 14.5 0.4 85.1 1.8 735.0 502.7 295.7M 1.7B 4.1B 6.927
60% x 2 29.4 1.3 69.4 2.0 917.5 759.4 397.2M 3.3B 5.6B 6.791
60% x 3 44.1 1.7 54.2 3.1 1615.4 1243.6 0.7B 5.1B 9.9B 10.056
60% x 4 58.5 0.7 40.8 4.0 1728.1 1456.6 0.8B 6.8B 11.5B 9.226
70% x 1 16.8 0.3 82.9 1.9 841.8 579.5 349.3M 2.0B 4.9B 7.856
70% x 2 34.1 0.8 65.0 2.8 966.0 845.2 439.4M 3.7B 6.2B 6.800
70% x 3 49.7 0.5 49.8 3.5 1834.5 1401.2 0.8B 5.9B 11.8B 11.113
70% x 4 68.1 0.6 31.4 4.7 1771.3 1572.3 0.8B 7.0B 11.8B 8.809
80% x 1 18.9 0.4 80.7 1.9 871.9 613.0 357.1M 2.1B 5.0B 7.276
80% x 2 38.6 0.3 61.0 2.8 1268.6 1029.0 0.6B 4.8B 8.2B 9.253
80% x 3 58.8 0.3 40.8 3.5 2061.7 1623.3 1.0B 6.8B 13.6B 11.967
80% x 4 78.6 0.5 20.9 4.0 2356.3 1983.7 1.1B 9.0B 16.0B 12.047
90% x 1 21.8 0.3 78.0 2.0 1054.5 737.9 459.3M 2.6B 6.4B 9.613
90% x 2 44.2 1.2 54.7 2.7 1439.5 1174.7 0.7B 5.4B 9.2B 10.001
90% x 3 66.2 1.4 32.4 3.9 2326.2 1822.3 1.1B 7.6B 15.0B 12.579
90% x 4 88.5 0.2 11.4 4.8 2627.8 2219.1 1.3B 10.2B 17.8B 12.832
100% x 1 25.1 0.0 74.8 2.0 135.8 314.0 0.5B 3.1B 7.5B 10.278
100% x 2 50.0 0.0 50.0 3.0 91.9 560.4 0.7B 6.2B 10.4B 10.470
100% x 3 75.1 0.1 24.8 4.0 120.2 824.1 1.2B 8.7B 16.8B 13.028
100% x 4 100.0 0.0 0.0 5.0 76.8 1054.8 1.4B 11.6B 19.5B 13.156

For 4 CPUs (of a 4 CPU system):
Power (Watts) = (% CPU load * 1.176217e-01) + 3.461561
1% CPU load is about 117.62 mW
Coefficient of determination R^2 = 0.809961 (good)

Energy (Watt-seconds) = (bogo op * 8.465141e-09) + 3.201355
1 bogo op is about 8.47 nWs
Coefficient of determination R^2 = 0.911274 (strong)

Energy (Watt-seconds) = (CPU cycle * 1.026249e-09) + 3.542463
1 CPU cycle is about 1.03 nWs
Coefficient of determination R^2 = 0.841894 (good)

Energy (Watt-seconds) = (CPU instruction * 6.044204e-10) + 3.201433
1 CPU instruction is about 0.60 nWs
Coefficient of determination R^2 = 0.911272 (strong)

The results at the end are estimates based on the gathered samples.  The samples are compared to the computed linear regression coefficients using the coefficient of determination (R^2); a value of 1 is a perfect linear fit, and values less than 1 indicate a poorer fit.
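For reference, the standard definition of the coefficient of determination over observed values y_i, fitted values f_i and mean y_bar is:

R^2 = 1 - (sum of (y_i - f_i)^2) / (sum of (y_i - y_bar)^2)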

For more accurate results, increase the run time (-r option) and also increase the number of samples (-s option).

Power-calibrate is available in Ubuntu Wily 15.10.  It is just an academic toy for getting some power estimates and may be useful for comparing compute vs power metrics across different x86 CPUs.  I've not been able to verify how accurate it really is, so I am interested to see how it works across a range of systems.

Read more
Colin Ian King

NumaTop: A NUMA system monitoring tool

NumaTop is a useful tool developed by Intel for monitoring runtime memory locality and analysing processes on Non-Uniform Memory Access (NUMA) systems.  NumaTop can identify potential NUMA related performance bottlenecks and hence help one to re-balance memory/CPU allocations to maximise the potential of a NUMA system.

Initial "Top" like process view

One can select specific processes and drill down into characteristics such as memory latencies or call chains to see where code is hot.

Observing a specific process..
..and observing memory latencies
Observing per Node CPU and memory statistics
The tool uses perf to collect deeper system statistics, hence it needs to be run with root privileges, and it will only run on NUMA systems. I've recently packaged NumaTop and it is now available in Ubuntu Wily 15.10; the source is available on github.

Read more
Colin Ian King

light-weight process stats with cpustat

A while ago I was working on identifying busy processes on small Ubuntu devices and required a tool that could look at per-process stats (from /proc/$pid/stat) in a fast and efficient way with minimal overhead.  There are plenty of tools such as "top" and "atop" that can show per-process CPU utilisation stats, but most of these aren't useful on really slow low-power devices as they consume several tens of megacycles collecting and displaying the results.

I developed cpustat to be compact and efficient, as well as provide enough stats to allow me to easily identify CPU sucking processes.   To optimise the code, I used tools such as perf to identify code hotspots as well as valgrind's cachegrind to identify poorly designed cache inefficient data structures.

The majority of the savings were in the parsing of data from /proc - originally I used simple fscanf() style parsing; over several optimisation rounds I ended up with hand-crafted numeric and string scanning parsers that saved several hundred thousand cycles per iteration.
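To give a flavour of what hand-crafted numeric parsing means here (an illustrative sketch, not the actual cpustat code):

#include <stdint.h>

/* parse an unsigned decimal number, advancing *str past the digits;
   roughly equivalent to sscanf() with %llu but with far less
   per-call overhead */
static inline uint64_t parse_u64(const char **str)
{
    const char *s = *str;
    uint64_t v = 0;

    while (*s >= '0' && *s <= '9')
        v = v * 10 + (uint64_t)(*s++ - '0');
    *str = s;    /* leave the pointer at the first non-digit */
    return v;
}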

I also made some optimisations by tweaking the hash table sizes to match the input data more appropriately.  Also, by careful re-use of heap allocations, I was able to reduce malloc()/free() calls and save some heap management overhead.

Some very frequent string look-ups were replaced with hash lookups and frequently accessed data was duplicated rather than referenced indirectly to keep data local to reduce cache stalls and hence speed up data comparison lookup time.

The source has been statically checked by CoverityScan, cppcheck and also clang's scan-build to check for bugs introduced in the optimisation steps.

Example of cpustat
cpustat is now available in Ubuntu 15.10 Wily Werewolf. Visit the cpustat project page for more details.

Read more
Colin Ian King

Tweaking the thermald configuration file

The Intel Thermal daemon (aka thermald) actively monitors thermal sensors and will modify cooling controls to try to keep the hardware cool.  By default, thermald runs in a "zero-configuration" mode and attempts to use the available CPU Digital Thermal Sensor(s) (DTS) to sense the temperature, using the P-state driver, Running Average Power Limit (RAPL), PowerClamp and cpufreq to control cooling.

Some systems may not work well in the default mode; perhaps the machine just runs too hot and one would like to tweak the settings to kick in passive or active cooling at a lower temperature than the default configuration. Thermald has a configuration file, /etc/thermald/thermal-conf.xml, that allows fine tuning: essentially one declares the thermal sensors on the machine and a set of thermal zone controls that read these sensors and tell thermald the policy for controlling cooling when specific temperature thresholds are crossed.

As an example, I've picked on an old Acer Aspire One (AMD C-60). Let's see the sensors for this machine:

find /sys/class/hwmon/* -exec echo -n "{}: " \; -exec cat {}/name \;
/sys/class/hwmon/hwmon0: radeon
/sys/class/hwmon/hwmon1: k10temp
One can use tools such as sensors (from the lm-sensors package) to get an idea of the high and critical trip points for these:
$ sudo apt-get install lm-sensors
$ sensors
radeon-pci-0008
Adapter: PCI adapter
temp1: +60.0°C (crit = +120.0°C, hyst = +90.0°C)

k10temp-pci-00c3
Adapter: PCI adapter
temp1: +60.5°C (high = +70.0°C)
(crit = +115.0°C, hyst = +107.5°C)

So, in this simple example, I will just use the CPU sensor k10temp (from /sys/class/hwmon/hwmon1) as my thermald CPU temperature sensor. Next, I need to define a policy on what to do when this sensor reaches a specific high temperature threshold. In this example, I want to trigger passive (non-fan) cooling by adjusting the CPU frequency using cpufreq and also the ACPI processor sysfs cooling controls when we reach 90 degrees C. I require thermald to run both cooling methods together in parallel, with 60% of the influence coming from cpufreq and 40% from the ACPI processor cooling controls. My thermald config file for this is as follows:
<ThermalConfiguration>
  <Platform>
    <Name>Aspire One</Name>
    <ProductName>*</ProductName>
    <Preference>QUIET</Preference>
    <ThermalSensors>
      <ThermalSensor>
        <Type>CPU_TEMP</Type>
        <Path>/sys/class/hwmon/hwmon1/temp1_input</Path>
        <AsyncCapable>0</AsyncCapable>
      </ThermalSensor>
    </ThermalSensors>
    <ThermalZones>
      <ThermalZone>
        <Type>cpu package</Type>
        <TripPoints>
          <TripPoint>
            <SensorType>CPU_TEMP</SensorType>
            <Temperature>90000</Temperature>
            <type>passive</type>
            <ControlType>PARALLEL</ControlType>
            <CoolingDevice>
              <index>1</index>
              <type>cpufreq</type>
              <influence>60</influence>
              <SamplingPeriod>1</SamplingPeriod>
            </CoolingDevice>
            <CoolingDevice>
              <index>2</index>
              <type>Processor</type>
              <influence>40</influence>
              <SamplingPeriod>1</SamplingPeriod>
            </CoolingDevice>
          </TripPoint>
        </TripPoints>
      </ThermalZone>
    </ThermalZones>
  </Platform>
</ThermalConfiguration>
One can observe this working by starting thermald in verbose debug mode:
$ sudo thermald --no-daemon --loglevel=debug
It is worth exercising the machine (I use stress-ng --cpu 0) to ramp up the load and temperature and observe how thermald behaves. Once one is happy with the results, one can then start thermald using:
$ sudo systemctl start thermald
More examples can be found in the thermald manual page:
$ man thermal-conf.xml 

Read more
Colin Ian King

static code analysis (revisited)

A while ago I was extolling the virtues of static analysis tools such as cppcheck, smatch and CoverityScan for C and C++ projects.  I've recently added the clang analyser scan-build to this armoury, which has been most helpful in finding even more obscure bugs that the previous three did not catch.

Using scan-build is very simple indeed: install clang and then in your source tree just build your project with scan-build, e.g. for a project built by make, use:

scan-build make
..and at the end of a build one will see a summary message:
scan-build make
scan-build: 366 bugs found.
scan-build: Run 'scan-view /tmp/scan-build-2015-09-08-094505-16657-1' 
to examine bug reports.
scan-build: The analyzer encountered problems on some source files.
scan-build: Preprocessed versions of these sources were deposited in 
'/tmp/scan-build-2015-09-08-094505-16657-1/failures'.
scan-build: Please consider submitting a bug report using these files:
scan-build:   http://clang-analyzer.llvm.org/filing_bugs.html

..and running scan-view will show the issues found.  For an example of the kind of results scan-build can find, I ran it against a systemd build (head commit 4df0514d299e349ce1d0649209155b9e83a23539).

As one can see, scan-build is a powerful and easy-to-use open-source static analyser.  I heartily recommend using it on every C and C++ project.

Read more