Canonical Voices

Colin Ian King

The stress-ng logo
The latest release of stress-ng contains a mechanism to measure latencies via a cyclic latency test.  Essentially this is just a loop that cycles around performing high precision sleeps and measures the extra latency incurred by each sleep over the requested sleep time.  This loop runs with either the Round-Robin (rr) or First-In-First-Out (fifo) real time scheduling policies.
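
To make this concrete, here is a minimal sketch (in C) of such a cyclic measurement loop; this is not the actual stress-ng implementation, just an illustration of the technique using clock_nanosleep() and clock_gettime():

 #define _GNU_SOURCE
 #include <sched.h>
 #include <stdio.h>
 #include <time.h>

 #define NS_PER_SEC 1000000000LL

 static long long ts_ns(const struct timespec *ts)
 {
 	return (long long)ts->tv_sec * NS_PER_SEC + ts->tv_nsec;
 }

 int main(void)
 {
 	const long delay_ns = 20000;	/* requested sleep per iteration */
 	struct sched_param param = { .sched_priority = 99 };
 	struct timespec req = { 0, delay_ns }, t1, t2;
 	int i;

 	/* use real time FIFO scheduling; this requires root */
 	if (sched_setscheduler(0, SCHED_FIFO, &param) < 0)
 		perror("sched_setscheduler");

 	for (i = 0; i < 10; i++) {
 		clock_gettime(CLOCK_REALTIME, &t1);
 		clock_nanosleep(CLOCK_REALTIME, 0, &req, NULL);
 		clock_gettime(CLOCK_REALTIME, &t2);
 		/* latency = actual elapsed time minus the requested sleep */
 		printf("latency: %lld ns\n", ts_ns(&t2) - ts_ns(&t1) - delay_ns);
 	}
 	return 0;
 }

Collect enough of these per-iteration latencies and one has the raw samples from which the statistics described below can be computed.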

The cyclic test can be configured to specify the sleep time (in nanoseconds), the scheduling type (rr or fifo),  the scheduling priority (1 to 100) and also the sleep method (explained later).

The first 10,000 latency measurements are used to compute various latency statistics:
  • mean latency (aka the 'average')
  • modal latency (the most 'popular' latency)
  • minimum latency
  • maximum latency
  • standard deviation
  • latency percentiles (25%, 50%, 75%, 90%, 95.40%, 99.0%, 99.5%, 99.9% and 99.99%)
  • latency distribution (enabled with the --cyclic-dist option)
The latency percentiles indicate the latency below which a given percentage of the samples fall.  For example, the 99% percentile for the 10,000 samples is the latency at or below which 9,900 of the samples lie.
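
For illustration, computing such a percentile from raw samples is straightforward; a small sketch (not necessarily how stress-ng implements it) is to sort the samples and index into the sorted array:

 #include <stdlib.h>

 static int cmp_ll(const void *a, const void *b)
 {
 	const long long x = *(const long long *)a;
 	const long long y = *(const long long *)b;

 	return (x > y) - (x < y);
 }

 /* return the latency at or below which pct% of the n samples fall */
 static long long percentile(long long *samples, size_t n, double pct)
 {
 	size_t idx = (size_t)((pct / 100.0) * (double)n);

 	qsort(samples, n, sizeof(*samples), cmp_ll);
 	return samples[idx ? idx - 1 : 0];
 }

For 10,000 samples and pct = 99.0 this returns the 9,900th smallest sample, matching the description above.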

The latency distribution is shown when the --cyclic-dist option is used; one has to specify the distribution interval in nanoseconds and up to the first 100 values in the distribution are output.

For an idle machine, one can invoke just the cyclic measurements with stress-ng as follows:

 sudo stress-ng --cyclic 1 --cyclic-policy fifo \
--cyclic-prio 100 --cyclic-method clock_ns \
--cyclic-sleep 20000 --cyclic-dist 1000 -t 5
stress-ng: info: [27594] dispatching hogs: 1 cyclic
stress-ng: info: [27595] stress-ng-cyclic: sched SCHED_FIFO: 20000 ns delay, 10000 samples
stress-ng: info: [27595] stress-ng-cyclic: mean: 5242.86 ns, mode: 4880 ns
stress-ng: info: [27595] stress-ng-cyclic: min: 3050 ns, max: 44818 ns, std.dev. 1142.92
stress-ng: info: [27595] stress-ng-cyclic: latency percentiles:
stress-ng: info: [27595] stress-ng-cyclic: 25.00%: 4881 ns
stress-ng: info: [27595] stress-ng-cyclic: 50.00%: 5191 ns
stress-ng: info: [27595] stress-ng-cyclic: 75.00%: 5261 ns
stress-ng: info: [27595] stress-ng-cyclic: 90.00%: 5368 ns
stress-ng: info: [27595] stress-ng-cyclic: 95.40%: 6857 ns
stress-ng: info: [27595] stress-ng-cyclic: 99.00%: 8942 ns
stress-ng: info: [27595] stress-ng-cyclic: 99.50%: 9821 ns
stress-ng: info: [27595] stress-ng-cyclic: 99.90%: 22210 ns
stress-ng: info: [27595] stress-ng-cyclic: 99.99%: 36074 ns
stress-ng: info: [27595] stress-ng-cyclic: latency distribution (1000 ns intervals):
stress-ng: info: [27595] stress-ng-cyclic: latency (ns) frequency
stress-ng: info: [27595] stress-ng-cyclic: 0 0
stress-ng: info: [27595] stress-ng-cyclic: 1000 0
stress-ng: info: [27595] stress-ng-cyclic: 2000 0
stress-ng: info: [27595] stress-ng-cyclic: 3000 82
stress-ng: info: [27595] stress-ng-cyclic: 4000 3342
stress-ng: info: [27595] stress-ng-cyclic: 5000 5974
stress-ng: info: [27595] stress-ng-cyclic: 6000 197
stress-ng: info: [27595] stress-ng-cyclic: 7000 209
stress-ng: info: [27595] stress-ng-cyclic: 8000 100
stress-ng: info: [27595] stress-ng-cyclic: 9000 50
stress-ng: info: [27595] stress-ng-cyclic: 10000 10
stress-ng: info: [27595] stress-ng-cyclic: 11000 9
stress-ng: info: [27595] stress-ng-cyclic: 12000 2
stress-ng: info: [27595] stress-ng-cyclic: 13000 2
stress-ng: info: [27595] stress-ng-cyclic: 14000 1
stress-ng: info: [27595] stress-ng-cyclic: 15000 9
stress-ng: info: [27595] stress-ng-cyclic: 16000 1
stress-ng: info: [27595] stress-ng-cyclic: 17000 1
stress-ng: info: [27595] stress-ng-cyclic: 18000 0
stress-ng: info: [27595] stress-ng-cyclic: 19000 0
stress-ng: info: [27595] stress-ng-cyclic: 20000 0
stress-ng: info: [27595] stress-ng-cyclic: 21000 1
stress-ng: info: [27595] stress-ng-cyclic: 22000 1
stress-ng: info: [27595] stress-ng-cyclic: 23000 0
stress-ng: info: [27595] stress-ng-cyclic: 24000 1
stress-ng: info: [27595] stress-ng-cyclic: 25000 2
stress-ng: info: [27595] stress-ng-cyclic: 26000 0
stress-ng: info: [27595] stress-ng-cyclic: 27000 1
stress-ng: info: [27595] stress-ng-cyclic: 28000 1
stress-ng: info: [27595] stress-ng-cyclic: 29000 2
stress-ng: info: [27595] stress-ng-cyclic: 30000 0
stress-ng: info: [27595] stress-ng-cyclic: 31000 0
stress-ng: info: [27595] stress-ng-cyclic: 32000 0
stress-ng: info: [27595] stress-ng-cyclic: 33000 0
stress-ng: info: [27595] stress-ng-cyclic: 34000 0
stress-ng: info: [27595] stress-ng-cyclic: 35000 0
stress-ng: info: [27595] stress-ng-cyclic: 36000 1
stress-ng: info: [27595] stress-ng-cyclic: 37000 0
stress-ng: info: [27595] stress-ng-cyclic: 38000 0
stress-ng: info: [27595] stress-ng-cyclic: 39000 0
stress-ng: info: [27595] stress-ng-cyclic: 40000 0
stress-ng: info: [27595] stress-ng-cyclic: 41000 0
stress-ng: info: [27595] stress-ng-cyclic: 42000 0
stress-ng: info: [27595] stress-ng-cyclic: 43000 0
stress-ng: info: [27595] stress-ng-cyclic: 44000 1
stress-ng: info: [27594] successful run completed in 5.00s


Note that stress-ng needs to be invoked using sudo to enable the Real Time FIFO scheduling for the cyclic measurements.

The above example uses the following options:

  • --cyclic 1
    • starts one instance of the cyclic measurements (1 is always recommended)
  • --cyclic-policy fifo 
    • use the real time First-In-First-Out scheduling for the cyclic measurements
  • --cyclic-prio 100 
    • use the maximum scheduling priority  
  • --cyclic-method clock_ns
    • use the clock_nanosleep(2) system call to perform the high precision duration sleep
  • --cyclic-sleep 20000 
    • sleep for 20000 nanoseconds per cyclic iteration
  • --cyclic-dist 1000 
    • enable latency distribution statistics with an interval of 1000 nanoseconds between each data point.
  • -t 5
    • run for just 5 seconds
From the run above, we can see that 99.5% of latencies were less than 9821 nanoseconds and that most latencies clustered around the 4880 nanosecond modal point. The distribution data shows some clustering around the 5000 nanosecond point, with the samples tailing off into a fairly long tail.

Now for the interesting part. Since stress-ng is packed with many different stressors we can run these while performing the cyclic measurements, for example, we can tell stress-ng to run *all* the virtual memory related stress tests and see how this affects the latency distribution using the following:

 sudo stress-ng --cyclic 1 --cyclic-policy fifo \  
--cyclic-prio 100 --cyclic-method clock_ns \
--cyclic-sleep 20000 --cyclic-dist 1000 \
--class vm --all 1 -t 60s

..the above invokes all of the vm class stressors at the same time (with just one instance of each stressor) for 60 seconds.

The --cyclic-method option specifies the sleep method used on each of the 10,000 cyclic iterations.  The default (and recommended) method is clock_ns, which uses the high precision sleep.  The available cyclic sleep methods are:
  • clock_ns (use the clock_nanosleep(2) high precision sleep)
  • posix_ns (use the POSIX nanosleep(2) sleep)
  • itimer (use a high precision clock timer and pause to wait for a signal to measure latency)
  • poll (busy spin-wait on clock_gettime() to eat cycles for a delay)
All the delay mechanisms use the CLOCK_REALTIME system clock for timing.
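
For comparison, the poll method boils down to something like the following sketch: busy spinning on clock_gettime() until the requested delay has elapsed, which burns CPU cycles but avoids scheduler wakeup latency:

 #include <time.h>

 static void poll_delay(const long long delay_ns)
 {
 	struct timespec ts;
 	long long start, now;

 	clock_gettime(CLOCK_REALTIME, &ts);
 	start = (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
 	do {
 		clock_gettime(CLOCK_REALTIME, &ts);
 		now = (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
 	} while (now - start < delay_ns);
 }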

I hope this is plenty of cyclic measurement functionality to get some useful latency benchmarks against various kernel components when using some or a mix of the stress-ng stressors.  Let me know if I am missing some other cyclic measurement options and I can see if I can add them in.

Keep stressing and measuring those systems!

Read more
Colin Ian King

What is new in FWTS 17.05.00?

Version 17.05.00 of the Firmware Test Suite was released this week as part of  the regular end-of-month release cadence. So what is new in this release?

  • Alex Hung has been busy bringing the SMBIOS tests in-sync with the SMBIOS 3.1.1 standard
  • IBM provided some OPAL (OpenPower Abstraction Layer) Firmware tests:
    • Reserved memory DT validation tests
    • Power management DT Validation tests
  • The first fwts snap was created
  •  Over 40 bugs were fixed
As ever, we are grateful for all the community contributions to FWTS.  The full release details are available from the fwts-devel mailing list.

I expect that the next upcoming ACPICA release will be integrated into the 17.06.00 FWTS release next month.

Read more
Colin Ian King

The Firmware Test Suite (FWTS) has an easy to use text based front-end that is primarily used by the FWTS Live-CD image but it can also be used in the Ubuntu terminal.

To install and run the front-end use:

 sudo apt-get install fwts-frontend  
sudo fwts-frontend-text

..and one should see a menu of options:


In this demonstration, the "All Batch Tests" option has been selected:


Tests will be run one by one and a progress bar shows the progress of each test. Some tests run very quickly, others can take several minutes depending on the hardware configuration (such as number of processors).

Once the tests are all complete, the following dialogue box is displayed:


The test has saved several files into the directory /fwts/15052017/1748/ and selecting Yes one can view the results log in a scroll-box:


Exiting this, the FWTS frontend dialog is displayed:


Press enter to exit (note that the Poweroff option is just for the fwts Live-CD image version of fwts-frontend).

The tool dumps various logs, for example, the above run generated:

 ls -alt /fwts/15052017/1748/  
total 1388
drwxr-xr-x 5 root root 4096 May 15 18:09 ..
drwxr-xr-x 2 root root 4096 May 15 17:49 .
-rw-r--r-- 1 root root 358666 May 15 17:49 acpidump.log
-rw-r--r-- 1 root root 3808 May 15 17:49 cpuinfo.log
-rw-r--r-- 1 root root 22238 May 15 17:49 lspci.log
-rw-r--r-- 1 root root 19136 May 15 17:49 dmidecode.log
-rw-r--r-- 1 root root 79323 May 15 17:49 dmesg.log
-rw-r--r-- 1 root root 311 May 15 17:49 README.txt
-rw-r--r-- 1 root root 631370 May 15 17:49 results.html
-rw-r--r-- 1 root root 281371 May 15 17:49 results.log

acpidump.log is a dump of the ACPI tables in a format compatible with the ACPICA acpidump tool.  The results.log file is a copy of the results generated by FWTS and results.html is an HTML formatted version of the log.

Read more
Colin Ian King

Simple job scripting in stress-ng 0.08.00

The latest release of stress-ng 0.08.00 contains a new job scripting feature. Jobs allow one to bundle up a set of stress options into a script rather than cram them all onto the command line.  One can now also run multiple invocations of a stressor with the latest version of stress-ng, and combined with job scripts this gives a powerful way of running more complex stress tests.

The job script commands are essentially the stress-ng long options without the need for the '--' option characters.  One option per line is allowed.

For example:

 $ stress-ng --verbose --tz --timeout 60s --cpu 1 --matrix 1 --icache 1 

would become:

 $ cat example.job  
verbose
tz
timeout 60
cpu 1
matrix 1
icache 1

One can also add comments using the # character prefix.   By default the stressors will be run in parallel, but one can use the "run sequential" command in the job script to run the stressors sequentially.
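
To show how mechanical the translation is, here is a toy C sketch (not the parser stress-ng actually uses) that reads a job file and prints a roughly equivalent command line; note that job-only commands such as "run sequential" have no direct command line counterpart:

 #include <stdio.h>
 #include <string.h>

 int main(int argc, char **argv)
 {
 	char line[1024];
 	FILE *fp;

 	if (argc != 2 || !(fp = fopen(argv[1], "r"))) {
 		fprintf(stderr, "usage: %s file.job\n", argv[0]);
 		return 1;
 	}
 	printf("stress-ng");
 	while (fgets(line, sizeof(line), fp)) {
 		char *ptr = strchr(line, '#');
 		char *opt, *arg;

 		if (ptr)
 			*ptr = '\0';		/* strip # comments */
 		opt = strtok(line, " \t\n");
 		if (!opt)
 			continue;		/* skip blank lines */
 		/* restore the '--' prefix; "run" has no real -- equivalent */
 		printf(" --%s", opt);
 		while ((arg = strtok(NULL, " \t\n")))
 			printf(" %s", arg);
 	}
 	printf("\n");
 	fclose(fp);
 	return 0;
 }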

The following script runs the mmap stressor multiple times using more memory on each run:

 $ cat mmap.job  
run sequential # one job at a time
timeout 2m # run for 2 minutes
verbose # verbose output
#
# run 4 invocations and increase memory each time
#
mmap 1
mmap-bytes 25%
mmap 1
mmap-bytes 50%
mmap 1
mmap-bytes 75%
mmap 1
mmap-bytes 100%

Some of the stress-ng stressors have various "methods" that allow one to modify the way the stressor behaves.  The following example shows how job scripts can be used to exercise a system using different stressor methods:

 $ cat /usr/share/stress-ng/example-jobs/matrix-methods.job   
#
# hot-cpu class stressors:
# various options have been commented out, one can remove the
# preceding comment to enable these options if required.
#
# run the following tests in parallel or sequentially
#
run sequential
# run parallel
#
# verbose
# show all debug, warnings and normal information output.
#
verbose
#
# run each of the tests for 60 seconds
# stop stress test after N seconds. One can also specify the units
# of time in seconds, minutes, hours, days or years with the
# suffix s, m, h, d or y.
#
timeout 1m
# tz
# collect temperatures from the available thermal zones on the
# machine (Linux only). Some devices may have one or more thermal
# zones, whereas others may have none.
tz
#
# matrix stressor with examples of all the methods allowed
#
# start N workers that perform various matrix operations on float‐
# ing point values. By default, this will exercise all the matrix
# stress methods one by one. One can specify a specific matrix
# stress method with the --matrix-method option.
#
#
# Method Description
# all iterate over all the below matrix stress methods
# add add two N × N matrices
# copy copy one N × N matrix to another
# div divide an N × N matrix by a scalar
# hadamard Hadamard product of two N × N matrices
# frobenius Frobenius product of two N × N matrices
# mean arithmetic mean of two N × N matrices
# mult multiply an N × N matrix by a scalar
# prod product of two N × N matrices
# sub subtract one N × N matrix from another N × N matrix
# trans transpose an N × N matrix
#
matrix 0
matrix-method all
matrix 0
matrix-method add
matrix 0
matrix-method copy
matrix 0
matrix-method div
matrix 0
matrix-method frobenius
matrix 0
matrix-method hadamard
matrix 0
matrix-method mean
matrix 0
matrix-method mult
matrix 0
matrix-method prod
matrix 0
matrix-method sub
matrix 0
matrix-method trans

Various example job scripts can be found in /usr/share/stress-ng/example-jobs; one can use these as a base for writing more complex job scripts.  The example jobs have all the options commented (using the text from the stress-ng manual) to make it easier to see how each stressor can be run.

Version 0.08.00 landed in Ubuntu 17.10 Artful Aardvark and is available as a snap; I've also got backports in ppa:colin-king/white for older releases of Ubuntu.

Read more
Colin Ian King

Tracking CoverityScan issues on Linux-next

Over the past 6 months I've been running static analysis on linux-next with CoverityScan on a regular basis (to find new issues and fix some of them) as well as keeping a record of the defect count.


Since the beginning of September over 2000 defects have been eliminated by a host of upstream developers and the steady downward trend of outstanding issues is good to see.  A proportion of the outstanding defects are false positives or issues where the code is being overly zealous, for example, bounds checking where some conditions can never happen. Considering there are millions of lines of code, the defect rate is about average for such a large project.

I plan to keep the static analysis running long term and I'll try and post stats every 6 months or so to see how things are progressing.

Read more
Colin Ian King

The BPF Compiler Collection (BCC) is a toolkit for building kernel tracing tools that leverage the functionality provided by the Linux extended Berkeley Packet Filters (BPF).

BCC allows one to write BPF programs with front-ends in Python or Lua, with the kernel instrumentation written in C.  The instrumentation code is compiled into sandboxed eBPF byte code and executed in the kernel.

The BCC github project README file provides an excellent overview and description of BCC and the various available BCC tools.  Building BCC from scratch can be a bit time consuming; however, the good news is that the BCC tools are now available as a snap, so BCC can be quickly and easily installed just using:

 sudo snap install --devmode bcc  

There are currently over 50 BCC tools in the snap, so let's have a quick look at a few:

cachetop allows one to view the top page cache hit/miss statistics. To run this use:

 sudo bcc.cachetop  



The funccount tool allows one to count the number of times specific functions get called.  For example, to see how many kernel functions with the name starting with "do_" get called per second one can use:

 sudo bcc.funccount "do_*" -i 1  


To see how to use all the options in this tool, use the -h option:

 sudo bcc.funccount -h  

I've found the funccount tool to be especially useful to check on kernel activity by checking on hits on specific function names.

The slabratetop tool is useful to see the active kernel SLAB/SLUB memory allocation rates:

 sudo bcc.slabratetop  


If you want to see which process is opening specific files, one can snoop on open system calls using the opensnoop tool:

 sudo bcc.opensnoop -T


Hopefully this will give you a taste of the useful tools that are available in BCC (I have barely scratched the surface in this article).  I recommend installing the snap and giving it a try.

As it stands, BCC provides a useful mechanism to develop BPF tracing tools and I look forward to regularly updating the BCC snap as more tools are added to BCC. Kudos to Brendan Gregg for BCC!

Read more
Colin Ian King

Kernel printk statements

The kernel contains tens of thousands of statements that may print various errors, warnings and debug/information messages to the kernel log.  Unsurprisingly, as the kernel grows in size, so does the quantity of these messages.  I've been scraping the kernel source for the various kernel printk style statements and macros and scanning these for typos and spelling mistakes; to make this easier I hacked up kernelscan (a quick and dirty parser) that finds literal strings in the kernel source for spell checking.

Using kernelscan, I've gathered some statistics for the number of kernel print statements for various kernel releases:


As one can see, we have over 200,000 messages in the 4.9 kernel(!).  Given the kernel's growth, this seems to roughly correlate with the kernel source size:



So how many lines of code in the kernel do we have per kernel printk messages over time?


..showing that the trend is towards more lines of code per printk statement over time.  I didn't differentiate between the different types of printk message, so it is hard to see any deeper trends in what kinds of messages are being logged more or less frequently with each release; for example, perhaps fewer debug messages are landing in the kernel nowadays.

I find it quite amazing that the kernel contains quite so many printk messages; it would be useful to see just how many of these are actually in a production kernel. I suspect quite a large number are for driver debugging and may be conditionally omitted at build time.

Read more
Colin Ian King

Another year passes and once more I have another seasonal obfuscated C program.  I was caught short on free time this year to heavily obfuscate the code which is a shame. However, this year I worked a bit harder at animating the output, so hopefully that will make up for lack of obfuscation.

The source is available on github to eyeball.  I've had criticism on previous years that it is hard to figure out the structure of my obfuscated code, so this year I made sure that the if statements were easier to see and hence understand the flow of the code.

This year I've snapped up all my seasonal obfuscated C programs and put them into the snap store as the christmas-obfuscated-c snap.

Below is a video of the program running; it is all ASCII art and one can re-size the window while it is running.


Unlike previous years, I have the pre-obfuscated version of the code available in the git repository at commit c98376187908b2cf8c4d007445b023db67c68691 so hopefully you can see the original hacky C source.

Have a great Christmas and a most excellent New Year. 

Read more
Colin Ian King

stress-ng 0.07.07 released

stress-ng is a tool that I have been developing on-and-off for a few years. It is designed to stress kernels to force out bugs, to stress CPU and memory, and it also provides some performance benchmarking metrics.

stress-ng is now entering the mature phase of development; however, there is always scope to add new stressors and generally improve the tool.  I've just released version 0.07.07 for the Ubuntu Zesty 17.04 release and it contains a few additional features:

  • SIGUSR2 sent to stress-ng will dump out the current system load and memory statistics
  • Sched policy stress tests for different scheduler configurations
  • Add a missing --sockfd-port option
And various bug fixes:
  • Fixed up some minor memory leaks
  • Missing counter stats on bind-mount, fp-error, personality and resources stressors
  • Fix the --fiemap-bytes option
  • Fix up build warnings with various compilers and static analyzers
The major change to stress-ng over the past month was an internal re-working of system call and GNU features to abstract these into a shim layer, reducing the number of conditional #ifdef build paths in the code. This simplifies portability, so the code now builds more easily across a range of systems and with various versions of gcc and clang, and it fixes some issues on older kernels too.  This also makes the code faster to statically analyze with cppcheck.

For more details, visit the stress-ng project page or the quick help guide.

Read more
Colin Ian King

Over the past month I've been hitting excessive thermal heating on my laptop and kidle_inject has been kicking in to try and stop the CPU from overheating (melting!).  A quick double-check with older kernels showed me that this issue was not a thermal/performance regression caused by software; instead it was time to clean my laptop and renew the thermal paste.

After some quick research, I found that Arctic MX-4 Thermal Compound provided an excellent thermal conductivity rating of 8.5W/mK, so I ordered a 4g sample as well as a can of pressurized gas cleaner to clean out dust.

The X230 has an excellent hardware maintenance manual, and following the instructions I stripped the laptop right down so I could pop the heat pipe contacts off the CPU and GPU.  I carefully cleaned off the old dry and cracked thermal paste and applied about 0.2g of MX-4 thermal compound to the CPU and GPU and re-seated the heat pipe.  With the pressurized gas I cleaned out the fan and airways to maximize airflow over the heatpipe.   The entire procedure took about an hour to complete and for once I didn't have any screws left over after re-assembly!

I normally take photos of the position of components during the strip down of a laptop for reference, in case I cannot figure out exactly how parts are meant to fit during the re-assembly phase.  In this case, the X230 maintenance manual is sufficiently detailed that I didn't take any photos this time.

I'm glad to report that my X230 is no longer overheating. Heat is being effectively pumped away from the CPU and GPU and one can feel the additional heat being pushed out of the laptop.  Once again I can fully max out the CPU and GPU without passive thermal cooling mechanisms kicking into action, so I've now got 100% of my CPU performance back again; as good as new!

Now and again I see laptop overheating bugs being filed in Launchpad.  While some are legitimate issues with broken software, I do wonder if the majority of issues with older laptops are simply due to the accumulation of dust and/or old and damaged thermal paste.

Read more
Colin Ian King

Scanning the Linux kernel for error messages

The Linux kernel contains lots of error/warning/information messages; over 130,000 in the current 4.7 kernel.  One of the tests in the Firmware Test Suite (FWTS) is to find BIOS/ACPI/UEFI related kernel error messages in the kernel log and try to provide some helpful advice on each error message since some can be very cryptic to the untrained eye.

The FWTS kernel error log database is currently approaching 800 entries and I have been slowly working through another 800 or so more relevant and recently added messages.  Needless to say, this is taking a while to complete.  The hardest part was finding relevant error messages in the kernel as they appear in different forms (e.g. printk(), dev_err(), ACPI_ERROR() etc).

In order to scrape the Linux kernel source for relevant error messages I hacked up the kernelscan parser to find error messages and dump these to stdout.  kernelscan can scan 43,000 source files (17,900,000 lines of source) in under 10 seconds on my Lenovo X230 laptop, so it is relatively fast.

I also have been using kernelscan to find spelling mistakes in kernel messages and I've been punting trivial fixes upstream to fix these.  These mistakes are small and petty, but I find it a little irksome when I see the kernel emit a message that contains a typo or spelling mistake - it just looks a bit unprofessional.

I've created a kernelscan snap (which was really easy and fast to do using snapcraft), so it is now available in Ubuntu.  The source code is also available from the kernel team git web at http://kernel.ubuntu.com/git/cking/kernelscan.git/

The code is designed to only parse kernel source, and it is a very rough and ready parser designed for speed;  fundamentally, it is a big quick hack.  When I get a few spare minutes I will try and see if there is any correlation between the number of error messages and the size of the kernel over the various releases.
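
For a flavour of what such a rough and ready parser does, the following heavily simplified sketch pulls double-quoted string literals out of a C source file; the real kernelscan does considerably more (handling the various printk style macros, comments and so on):

 #include <stdio.h>

 int main(int argc, char **argv)
 {
 	FILE *fp;
 	int c, in_str = 0;

 	if (argc != 2 || !(fp = fopen(argv[1], "r"))) {
 		fprintf(stderr, "usage: %s file.c\n", argv[0]);
 		return 1;
 	}
 	while ((c = getc(fp)) != EOF) {
 		if (!in_str) {
 			if (c == '"')		/* literal starts */
 				in_str = 1;
 		} else if (c == '\\') {
 			putchar(c);		/* keep escaped chars */
 			if ((c = getc(fp)) != EOF)
 				putchar(c);
 		} else if (c == '"') {
 			putchar('\n');		/* literal ends */
 			in_str = 0;
 		} else {
 			putchar(c);
 		}
 	}
 	fclose(fp);
 	return 0;
 }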

Read more
Colin Ian King

What's new in stress-ng 0.06.07?

Since my last blog post about stress-ng, I've pushed out several more small releases that incorporate new features and (as ever) a bunch more bug fixes.  I've been eyeballing gcov kernel coverage stats to find more regions in the kernel for stress-ng to exercise.  Also, testing on a range of hardware (arm64, s390x, etc.) and a range of kernels has eked out some bugs and helped me to improve stress-ng.  So what's new?

New stressors:

  • ioprio  - exercises ioprio_get(2) and ioprio_set(2) (I/O scheduling classes and priorities)
  • opcode - generates random object code and executes it, generating and catching illegal instructions, bus errors, segmentation faults, traps and floating point errors.
  • stackmmap - allocates a 2MB stack that is memory mapped onto a temporary file. A recursive function works down the stack and flushes dirty stack pages back to the memory mapped file using msync(2) until the end of the stack is reached (stack overflow). This exercises dirty page and stack exception handling.
  • madvise - applies random madvise(2) advise settings on pages of a 4MB file backed shared memory mapping.
  • pty - exercise pseudo terminal operations.
  • chown - trivial chown(2) file ownership exerciser.
  • seal - fcntl(2) file SEALing exerciser.
  • locka - POSIX advisory locking exerciser.
  • lockofd - fcntl(2) F_OFD_SETLK/GETLK open file description lock exerciser.
Improved stressors:
  • msg: add in IPC_INFO, MSG_INFO, MSG_STAT msgctl calls
  • vecmath: add more ops to make vecmath more demanding
  • socket: add --sock-type socket type option, e.g. stream or seqpacket
  • shm and shm-sysv: add msync'ing on the shm regions
  • memfd: add hole punching
  • mremap: add MAP_FIXED remappings
  • shm: sync, expand, shrink shm regions
  • dup: use dup2(2)
  • seek: add SEEK_CUR, SEEK_END seek options
  • utime: exercise UTIME_NOW and UTIME_OMIT settings
  • userfaultfd: add zero page handling
  • cache:  use cacheflush() on systems that provide this syscall
  • key:  add request_key system call
  • nice: add some randomness to the delay to unsync nicenesses changes
If any new features land in Linux 4.8 I may add stressors for them, but for now I suspect that's about it for the big changes to stress-ng for the Ubuntu Yakkety 16.10 release.

Read more
Colin Ian King

Recently I've been adding a few more features to stress-ng to improve kernel code coverage.   I'm currently using a kernel built with gcov enabled and using the most excellent lcov tool to collate the coverage data and produce some rather useful coverage charts.

With a gcov enabled kernel, gathering coverage stats is a trivial process with lcov:

 sudo apt-get install lcov  
sudo lcov --zerocounters
stress-ng --seq 0 -t 60
sudo lcov -c -o kernel.info
sudo genhtml -o html kernel.info

..and the html output appears in the html directory.

In the latest 0.06.00 release of stress-ng, the following new features have been introduced:

  • af-alg stressor, added skciphers and rngs
  • new Translation Lookaside Buffer (TLB) shootdown stressor
  • new /dev/full stressor
  • hdd stressor now works through all the different hdd options if --maximize is used
  • wider procfs stressing
  • added more keyctl commands to the key stressor
  • new msync stressor, exercise msync of mmap'd memory back to file and from file back to memory.
  • Real Time Clock (RTC) stressor (via /dev/rtc and /proc/driver/rtc)
  • taskset option, allowing one to run stressors on specific CPUs (affinity setting)
  • inotify stressor now also exercises the FIONREAD ioctl()
  • and some bug fixes found when testing stress-ng on various architectures.
The --taskset option allows one to keep stress-ng stressors bound to specific CPUs, for example, to run 5 CPU stressors tied to CPUs 1, 3, 5, 6 and 7:

 stress-ng --taskset 1,3,5-7 --cpu 5  

..thanks to Jim Rowan (Intel) for the CPU affinity ideas.

stress-ng 0.06.00 will be landing in Ubuntu Yakkety soon, and is also available in my power utilities PPA ppa:colin-king/white

Read more
Colin Ian King

I got bitten this week by the clone() system call returning -EINVAL on aarch64 for code that worked fine on x86.  After re-reading the manual several times and looking at my code, I resorted to shoving debug into the kernel to track down where the -EINVAL was occurring.

The answer to my issue is in arch/arm64/kernel/process.c, copy_thread():

 if (stack_start) {
 	if (is_compat_thread(task_thread_info(p)))
 		childregs->compat_sp = stack_start;
 	/* 16-byte aligned stack mandatory on AArch64 */
 	else if (stack_start & 15)
 		return -EINVAL;
 	else
 		childregs->sp = stack_start;
 }

Ahah! The stack being passed into clone() has to be 16-byte aligned.  With this simple fix to my code, clone() worked.   Pity this was not in the documentation.
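
For reference, below is a minimal sketch of allocating a correctly aligned stack for clone(); posix_memalign() guarantees a 16-byte aligned base, and a stack size that is a multiple of 16 keeps the (downward growing) stack top aligned too:

 #define _GNU_SOURCE
 #include <sched.h>
 #include <signal.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <sys/wait.h>
 #include <unistd.h>

 #define STACK_SIZE (64 * 1024)	/* a multiple of 16 */

 static int child_fn(void *arg)
 {
 	(void)arg;
 	return 0;
 }

 int main(void)
 {
 	char *stack;
 	pid_t pid;

 	/* 16-byte aligned stack base, mandatory on aarch64 */
 	if (posix_memalign((void **)&stack, 16, STACK_SIZE) != 0) {
 		perror("posix_memalign");
 		exit(EXIT_FAILURE);
 	}
 	/* the stack grows down, so pass the (still aligned) top */
 	pid = clone(child_fn, stack + STACK_SIZE, SIGCHLD, NULL);
 	if (pid < 0) {
 		perror("clone");
 		exit(EXIT_FAILURE);
 	}
 	waitpid(pid, NULL, 0);
 	free(stack);
 	return 0;
 }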

Read more
Colin Ian King

A frequently used incorrect realloc() idiom

While running static analysis on a lot of C source code, I keep on finding a common incorrect programming idiom used to handle realloc() allocation failures, where NULL is returned and the code bails out with some kind of exit failure status, something like the following:

 ptr = realloc(ptr, new_size);
 if (!ptr)
 	return -ENOMEM;	/* Failed, no memory! */

However, when realloc() fails it returns NULL and the original object remains unchanged and thus it is not freed.  So the above code leaks the memory pointed to by ptr if realloc() returns NULL.

This may be a moot point, since the error handling paths normally abort the program because we are out of memory and can't proceed any further.  However, there are occasions in code where ENOMEM may not be fatal; for example, the program may reallocate smaller buffers and retry or free up space on the heap and retry.

A more correct programming idiom for realloc() perhaps should be:

 tmp = realloc(ptr, new_size);
 if (!tmp) {
 	free(ptr);
 	return -ENOMEM;	/* Failed, no memory! */
 }
 ptr = tmp;

..which is not aesthetically pleasing, but does the trick of freeing the memory before we return.
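
One way to avoid repeating this dance at every call site is to wrap the idiom in a small helper, sketched below; note that it always frees the original block on failure, so it is not suitable for the reallocate-smaller-and-retry case mentioned above:

 #include <errno.h>
 #include <stdlib.h>

 /* realloc that never leaks: on failure the old block is freed,
    errno is set to ENOMEM and NULL is returned */
 static void *realloc_or_free(void *ptr, size_t new_size)
 {
 	void *tmp = realloc(ptr, new_size);

 	if (!tmp) {
 		free(ptr);
 		errno = ENOMEM;
 		return NULL;
 	}
 	return tmp;
 }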

Anyhow, it is something to bear in mind next time one uses realloc().

Read more
Colin Ian King

ZFS quick start reference guide

Dustin Kirkland recently announced ZFS being officially supported by Canonical for Ubuntu Xenial 16.04.  We've written a short reference guide to help with getting started and understanding the ZFS terminology.  This touches on the basics of setting up ZFS pools as well as creating and using ZFS file systems.

If you are new to ZFS, I recommend having a look at the reference guide to get you started.

Read more
Colin Ian King

New "top" mode in eventstat

I wrote eventstat a few years ago to track wakeup events that keep a machine from being fully idle.  For Ubuntu Xenial Xerus 16.04 I've added a 'top' like mode (enabled using the -T option).

By widening the terminal one can see more of the Task, Init Function and Callback text, which is useful as these details can be rather lengthy.

Anyhow, just a minor feature change, but hopefully a useful one.

Read more
Colin Ian King

One issue when running parallel processes is contention on shared resources such as the Last Level Cache (aka LLC or L3 Cache).  For example, a server may be running a set of Virtual Machines with processes that are memory and cache intensive, hence producing a large amount of cache activity. This can impact the other VMs and is known as the "Noisy Neighbour" problem.

Fortunately the next generation Intel processors allow one to monitor and also fine tune cache allocation using Intel Cache Monitoring Technology (CMT) and Cache Allocation Technology (CAT).

Intel kindly loaned me a 12 thread development machine with CMT and CAT support to experiment with this technology using the Intel pqos tool.   For my experiment, I installed Ubuntu Xenial Server on the machine. I then installed KVM and a VM instance of Ubuntu Xenial Server.   I then loaded the instance using stress-ng running a memory bandwidth stressor:

 stress-ng --stream 1 -v --stream-l3-size 16M  
..which allocates 16MB in 4 buffers and performs various read/compute and writes to these, hence causing a "noisy neighbour".
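
The stream stressor is loosely based on the STREAM benchmark, so the memory traffic it generates is along the lines of this simplified sketch (the buffer sizes here are illustrative; stress-ng sizes them from the --stream-l3-size option):

 #include <stdio.h>
 #include <stdlib.h>

 #define N (4 * 1024 * 1024 / sizeof(double))	/* 4 buffers of 4MB = 16MB */

 int main(void)
 {
 	double *a = calloc(N, sizeof(double));
 	double *b = calloc(N, sizeof(double));
 	double *c = calloc(N, sizeof(double));
 	double *d = calloc(N, sizeof(double));
 	size_t i;

 	if (!a || !b || !c || !d)
 		return 1;
 	for (i = 0; i < N; i++)
 		a[i] = (double)i;

 	/* classic STREAM style kernels: copy, scale, add, triad */
 	for (i = 0; i < N; i++)
 		c[i] = a[i];			/* copy */
 	for (i = 0; i < N; i++)
 		b[i] = 3.0 * c[i];		/* scale */
 	for (i = 0; i < N; i++)
 		c[i] = a[i] + b[i];		/* add */
 	for (i = 0; i < N; i++)
 		d[i] = a[i] + 3.0 * b[i];	/* triad */

 	printf("%f\n", d[N - 1]);	/* keep the compiler honest */
 	free(a);
 	free(b);
 	free(c);
 	free(d);
 	return 0;
 }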

Using pqos,  one can monitor and see the cache/memory activity:
sudo apt-get install intel-cmt-cat
sudo modprobe msr
sudo pqos -r
TIME 2016-02-04 10:25:06
CORE IPC MISSES LLC[KB] MBL[MB/s] MBR[MB/s]
0 0.59 168259k 9144.0 12195.0 0.0
1 1.33 107k 0.0 3.3 0.0
2 0.20 2k 0.0 0.0 0.0
3 0.70 104k 0.0 2.0 0.0
4 0.86 23k 0.0 0.7 0.0
5 0.38 42k 24.0 1.5 0.0
6 0.12 2k 0.0 0.0 0.0
7 0.24 48k 0.0 3.0 0.0
8 0.61 26k 0.0 1.6 0.0
9 0.37 11k 144.0 0.9 0.0
10 0.48 1k 0.0 0.0 0.0
11 0.45 2k 0.0 0.0 0.0
Now to run a stress-ng stream stressor on the host and see the performance while the noisy neighbour is also running:
stress-ng --stream 4 --stream-l3-size 2M --perf --metrics-brief -t 60
stress-ng: info: [2195] dispatching hogs: 4 stream
stress-ng: info: [2196] stress-ng-stream: stressor loosely based on a variant of the STREAM benchmark code
stress-ng: info: [2196] stress-ng-stream: do NOT submit any of these results to the STREAM benchmark results
stress-ng: info: [2196] stress-ng-stream: Using L3 CPU cache size of 2048K
stress-ng: info: [2196] stress-ng-stream: memory rate: 1842.22 MB/sec, 736.89 Mflop/sec (instance 0)
stress-ng: info: [2198] stress-ng-stream: memory rate: 1847.88 MB/sec, 739.15 Mflop/sec (instance 2)
stress-ng: info: [2199] stress-ng-stream: memory rate: 1833.89 MB/sec, 733.56 Mflop/sec (instance 3)
stress-ng: info: [2197] stress-ng-stream: memory rate: 1847.16 MB/sec, 738.86 Mflop/sec (instance 1)
stress-ng: info: [2195] successful run completed in 60.01s (1 min, 0.01 secs)
stress-ng: info: [2195] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
stress-ng: info: [2195] (secs) (secs) (secs) (real time) (usr+sys time)
stress-ng: info: [2195] stream 22101 60.01 239.93 0.04 368.31 92.10
stress-ng: info: [2195] stream:
stress-ng: info: [2195] 547,520,600,744 CPU Cycles 9.12 B/sec
stress-ng: info: [2195] 69,959,954,760 Instructions 1.17 B/sec (0.128 instr. per cycle)
stress-ng: info: [2195] 11,066,905,620 Cache References 0.18 B/sec
stress-ng: info: [2195] 11,065,068,064 Cache Misses 0.18 B/sec (99.98%)
stress-ng: info: [2195] 8,759,154,716 Branch Instructions 0.15 B/sec
stress-ng: info: [2195] 2,205,904 Branch Misses 36.76 K/sec ( 0.03%)
stress-ng: info: [2195] 23,856,890,232 Bus Cycles 0.40 B/sec
stress-ng: info: [2195] 477,143,689,444 Total Cycles 7.95 B/sec
stress-ng: info: [2195] 36 Page Faults Minor 0.60 sec
stress-ng: info: [2195] 0 Page Faults Major 0.00 sec
stress-ng: info: [2195] 96 Context Switches 1.60 sec
stress-ng: info: [2195] 0 CPU Migrations 0.00 sec
stress-ng: info: [2195] 0 Alignment Faults 0.00 sec
.. so about 1842 MB/sec memory rate and 736 Mflop/sec per CPU across 4 CPUs.  And pqos shows the cache/memory activity as:
sudo pqos -r
TIME 2016-02-04 10:35:27
CORE IPC MISSES LLC[KB] MBL[MB/s] MBR[MB/s]
0 0.14 43060k 1104.0 2487.9 0.0
1 0.12 3981523k 2616.0 2893.8 0.0
2 0.26 320k 48.0 18.0 0.0
3 0.12 3980489k 1800.0 2572.2 0.0
4 0.12 3979094k 1728.0 2870.3 0.0
5 0.12 3970996k 2112.0 2734.5 0.0
6 0.04 20k 0.0 0.3 0.0
7 0.04 29k 0.0 1.9 0.0
8 0.09 143k 0.0 5.9 0.0
9 0.15 0k 0.0 0.0 0.0
10 0.07 2k 0.0 0.0 0.0
11 0.13 0k 0.0 0.0 0.0
Using pqos again, we can find out how much LLC cache the processor has:
sudo pqos -v
NOTE: Mixed use of MSR and kernel interfaces to manage
CAT or CMT & MBM may lead to unexpected behavior.
INFO: Monitoring capability detected
INFO: CPUID.0x7.0: CAT supported
INFO: CAT details: CDP support=0, CDP on=0, #COS=16, #ways=12, ways contention bit-mask 0xc00
INFO: LLC cache size 9437184 bytes, 12 ways
INFO: LLC cache way size 786432 bytes
INFO: L3CA capability detected
INFO: Detected PID API (perf) support for LLC Occupancy
INFO: Detected PID API (perf) support for Instructions/Cycle
INFO: Detected PID API (perf) support for LLC Misses
ERROR: IPC and/or LLC miss performance counters already in use!
Use -r option to start monitoring anyway.
Monitoring start error on core(s) 5, status 6
So this CPU has 12 cache "ways", each of 786432 bytes (768K).  One or more "Class of Service" (COS) types can be defined that can use one or more of these ways.  One uses a bitmap, with each bit representing a way, to indicate how the ways are to be used by a COS.  For example, to use all 12 ways on my example machine, the bit map is 0xfff (111111111111).   A way can be exclusively mapped to a COS, shared, or not used at all.   Note that the ways in the bitmap must be contiguously allocated, so a mask such as 0xf3f (111100111111) is invalid and cannot be used.
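
The contiguity requirement is easy to verify with a little bit-twiddling; a quick sketch of such a check:

 #include <stdbool.h>
 #include <stdio.h>

 /* set bits must be contiguous: filling in all the bits below the
    mask and adding 1 must step completely over the mask */
 static bool ways_contiguous(const unsigned int mask)
 {
 	if (mask == 0)
 		return false;
 	return (((mask | (mask - 1)) + 1) & mask) == 0;
 }

 int main(void)
 {
 	printf("0xfff: %d\n", ways_contiguous(0xfff));	/* 1, valid */
 	printf("0xffe: %d\n", ways_contiguous(0xffe));	/* 1, valid */
 	printf("0xf3f: %d\n", ways_contiguous(0xf3f));	/* 0, invalid */
 	return 0;
 }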

In my experiment, I want to create 2 COS types. The first COS will have just 1 cache way assigned to it, with CPU 0 bound to this COS and the VM instance pinned to CPU 0.  The second COS will have the other 11 cache ways assigned to it, and all the other CPUs can use this COS.

So, create COS #1 with just 1 way of cache, and bind CPU 0 to this COS, and pin the VM to CPU 0:
sudo pqos -e llc:1=0x0001
sudo pqos -a llc:1=0
sudo taskset -apc 0 $(pidof qemu-system-x86_64)
And create COS #2, with 11 ways of cache and bind CPUs 1-11 to this COS:
sudo pqos -e "llc:2=0x0ffe"
sudo pqos -a "llc:2=1-11"
And let's see the new configuration:
sudo pqos  -s
NOTE: Mixed use of MSR and kernel interfaces to manage
CAT or CMT & MBM may lead to unexpected behavior.
L3CA COS definitions for Socket 0:
L3CA COS0 => MASK 0xfff
L3CA COS1 => MASK 0x1
L3CA COS2 => MASK 0xffe
L3CA COS3 => MASK 0xfff
L3CA COS4 => MASK 0xfff
L3CA COS5 => MASK 0xfff
L3CA COS6 => MASK 0xfff
L3CA COS7 => MASK 0xfff
L3CA COS8 => MASK 0xfff
L3CA COS9 => MASK 0xfff
L3CA COS10 => MASK 0xfff
L3CA COS11 => MASK 0xfff
L3CA COS12 => MASK 0xfff
L3CA COS13 => MASK 0xfff
L3CA COS14 => MASK 0xfff
L3CA COS15 => MASK 0xfff
Core information for socket 0:
Core 0 => COS1, RMID0
Core 1 => COS2, RMID0
Core 2 => COS2, RMID0
Core 3 => COS2, RMID0
Core 4 => COS2, RMID0
Core 5 => COS2, RMID0
Core 6 => COS2, RMID0
Core 7 => COS2, RMID0
Core 8 => COS2, RMID0
Core 9 => COS2, RMID0
Core 10 => COS2, RMID0
Core 11 => COS2, RMID0
..showing Core 0 bound to COS1 and Cores 1-11 bound to COS2, with COS1 having 1 cache way and COS2 having the remaining 11 cache ways.
Now re-run the stream stressor and see if the VM has less impact on the LLC:
stress-ng --stream 4 --stream-l3-size 1M --perf --metrics-brief -t 60
stress-ng: info: [2232] dispatching hogs: 4 stream
stress-ng: info: [2233] stress-ng-stream: stressor loosely based on a variant of the STREAM benchmark code
stress-ng: info: [2233] stress-ng-stream: do NOT submit any of these results to the STREAM benchmark results
stress-ng: info: [2233] stress-ng-stream: Using L3 CPU cache size of 1024K
stress-ng: info: [2235] stress-ng-stream: memory rate: 2616.90 MB/sec, 1046.76 Mflop/sec (instance 2)
stress-ng: info: [2233] stress-ng-stream: memory rate: 2562.97 MB/sec, 1025.19 Mflop/sec (instance 0)
stress-ng: info: [2234] stress-ng-stream: memory rate: 2541.10 MB/sec, 1016.44 Mflop/sec (instance 1)
stress-ng: info: [2236] stress-ng-stream: memory rate: 2652.02 MB/sec, 1060.81 Mflop/sec (instance 3)
stress-ng: info: [2232] successful run completed in 60.00s (1 min, 0.00 secs)
stress-ng: info: [2232] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
stress-ng: info: [2232] (secs) (secs) (secs) (real time) (usr+sys time)
stress-ng: info: [2232] stream 62223 60.00 239.97 0.00 1037.01 259.29
stress-ng: info: [2232] stream:
stress-ng: info: [2232] 547,364,185,528 CPU Cycles 9.12 B/sec
stress-ng: info: [2232] 97,037,047,444 Instructions 1.62 B/sec (0.177 instr. per cycle)
stress-ng: info: [2232] 14,396,274,512 Cache References 0.24 B/sec
stress-ng: info: [2232] 14,390,808,440 Cache Misses 0.24 B/sec (99.96%)
stress-ng: info: [2232] 12,144,372,800 Branch Instructions 0.20 B/sec
stress-ng: info: [2232] 1,732,264 Branch Misses 28.87 K/sec ( 0.01%)
stress-ng: info: [2232] 23,856,388,872 Bus Cycles 0.40 B/sec
stress-ng: info: [2232] 477,136,188,248 Total Cycles 7.95 B/sec
stress-ng: info: [2232] 44 Page Faults Minor 0.73 sec
stress-ng: info: [2232] 0 Page Faults Major 0.00 sec
stress-ng: info: [2232] 72 Context Switches 1.20 sec
stress-ng: info: [2232] 0 CPU Migrations 0.00 sec
stress-ng: info: [2232] 0 Alignment Faults 0.00 sec
Now with the noisy neighbour VM constrained to use just 1 way of the LLC, the stream stressor on the host can achieve about 2592 MB/sec and about 1030 Mflop/sec per CPU across 4 CPUs.

This is a relatively simple example.  With the ability to monitor cache and memory bandwidth activity, one can carefully tune a system to make the best use of the limited LLC resource and maximise throughput where needed.

There are many applications where Intel CMT/CAT can be useful, for example fine tuning containers or VM instances, or pinning user space networking buffers to cache ways in DPDK for improved throughput.

Read more
Colin Ian King

Pagemon improvements

Over the past month I've been finding the odd moments [1] to add some small improvements to pagemon (a tool to monitor process memory) and to fix a few bugs.  The original code went from a sketchy proof of concept prototype to a somewhat more usable tool in a few weeks, so my main concern recently was to clean up the code and make it more efficient.

With the use of tools such as valgrind's cachegrind and perf I was able to work on some of the code hot-spots [2] and reduce the CPU utilisation from ~50-60% down to 5-9% on my laptop, so it's definitely more machine friendly now.  In addition I've added the following small features:

  • Now one can specify the name of a process to monitor as well as the PID.  This also allows one to run pagemon on itself(!), which is a bit meta.
  • Perf events showing Page Faults and Kernel Page Allocates and Frees, toggled on/off with the 'p' key.
  • Improved and snappier clean up and exit when a monitored process exits.
  • Far more efficient page map reading and rendering.
  • Out of Memory (OOM) scores added to VM statistics window.
  • Process activity (busy, sleeping, etc) added to the VM statistics window.
  • Zoom mode min/max with '[' (min) and ']' (max) keys.
  • Close pop-up windows with key 'c'.
  • Improved handling of rapid map expansion and shrinking.
  • Jump to end of map using 'End' key.
  • Improve the man page.
I've tried to keep the tool small and focused and I don't want feature bloat to make it unwieldy and overly complex.  "Do one job, and do it well" is the philosophy behind pagemon. At just 1500 lines of C, it is as complex as I want it to be for now.

Version 0.01.08 should be hitting the Ubuntu 16.04 Xenial Xerus archive in the next 24 hours or so.  I also have the latest version in my PPA (ppa:colin-king/pagemon) built for Trusty, Vivid, Wily and Xenial.


Pagemon is useful for spotting unexpected memory activity and it is just interesting watching the behaviour of memory hungry processes such as web-browsers and Virtual Machines.

Notes:
[1] Mainly very late at night when I can't sleep (but that's another story...).  The git log says it all.
[2] Reading in /proc/$PID/maps and efficiently reading per page data from /proc/$PID/pagemap

Read more
Colin Ian King

Forcing out bugs with stress-ng

stress-ng logo
Over the past few months I've been adding several new stress tests and a lot more stressor options to stress-ng for Ubuntu 16.04 Xenial Xerus.  I try to track new system calls and features landing in the kernel and where appropriate add a stress test to try and force out bugs.

Stress-ng has found various kernel bugs, such as CVE-2015-1333 and LP:#1526811, as well as bugs in user space (for example, daemons crashing) when memory pressure is very high.  Simple abusive tricks, such as aggressively trying to allocate every free page in memory, are useful in finding drivers that don't necessarily check for memory allocation failures.  For example, today I was caught out when a USB ethernet dongle driver didn't check for a null pointer due to an allocation failure and stress-ng ended up triggering a kernel oops (fortunately, this bug was fixed in a recent kernel).
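
The page grabbing trick mentioned above is conceptually as simple as the following sketch; in practice the OOM killer will probably reap the process before mmap() ever fails, which is all part of the stress:

 #include <stdio.h>
 #include <string.h>
 #include <sys/mman.h>

 int main(void)
 {
 	const size_t chunk = 4 * 1024 * 1024;
 	size_t total = 0;
 	void *ptr;

 	for (;;) {
 		ptr = mmap(NULL, chunk, PROT_READ | PROT_WRITE,
 			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
 		if (ptr == MAP_FAILED)
 			break;
 		/* touch every page so the memory is really allocated */
 		memset(ptr, 0xff, chunk);
 		total += chunk;
 	}
 	printf("allocated %zu MB before failing\n", total >> 20);
 	return 0;
 }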

The underlying philosophy for stress-ng is "use and abuse standard Linux interfaces and see how far we can push them to destruction".  I'm pretty sure there are plenty of creative folk out there who can dream up dastardly ways to make stress-ng even more stressy, so contributions are always warmly accepted!  I have a mirrored copy of the git repository on github to make it easy for developers to get their hands on the code.

We've been using stress-ng on ARM based SoC kernels to force out bugs and this has been useful in finding areas where non-swap based systems break. You really don't want your kernel oopsing or processes segfaulting when an IoT device has run low on memory.

My original intent for stress-ng was just to make a system run hot and force thermal overruns. However, I soon discovered it is useful to force kernel bugs out by attempting to (pathologically) thrash most of the system calls.  I've also added perf stats to stress-ng to track performance of standard stress scenarios over kernel versions to get an early warning of any potential performance regressions.  So stress-ng is a bit of a mixed bag of stress tests and performance measuring goodness.

When I get some free time I hope to run stress-ng against a GCOV instrumented kernel to see how much test coverage I get. I suspect there is a lot of core kernel functionality still not being touched by stress-ng.

I've also tried to make stress-ng portable, so it can build fine on GNU/Hurd and Debian kFreeBSD (with Linux specific tests not built-in of course). It also contains some architecture specific features, such as handling the data and instruction cache as well as the x86 rdrand instruction and cache line locking. If there are any ARM specific features that can be stressed I'd like to know and perhaps implement stressors for them.

Anyhow, I believe stress-ng is almost feature complete for Ubuntu Xenial; however, I expect it to grow in features over time since there is always new functionality landing in the Linux kernel that needs to be thrash tested.

Read more