Canonical Voices

Colin Ian King

Linux I/O Schedulers

The Linux kernel I/O schedulers attempt to balance the need to get the best possible I/O performance while also trying to ensure the I/O requests are "fairly" shared among the I/O consumers.  There are several I/O schedulers in Linux; each tries to solve the I/O scheduling issues using different mechanisms and heuristics, and each has its own set of strengths and weaknesses.

For traditional spinning media it makes sense to try and order I/O operations so that they are close together to reduce read/write head movement and hence decrease latency.  However, this reordering means that some I/O requests may get delayed, and the usual solution is to schedule these delayed requests after a specific time.   Faster non-volatile memory devices can generally handle random I/O requests very easily and hence do not require reordering.
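As a rough illustration of the reorder-but-do-not-starve idea, here is a toy model in Python. It is not how any kernel I/O scheduler is implemented; the `Request` type, the `pick_next` helper and the 500 ms expiry are all made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Request:
    sector: int        # where on the disk the request lands
    submitted_ms: int  # when the request was queued


def pick_next(pending, head_sector, now_ms, expiry_ms=500):
    """Pick the next request to service from a non-empty pending list.

    Requests that have waited longer than expiry_ms are served first
    (oldest first); otherwise the request closest to the head wins.
    """
    expired = [r for r in pending if now_ms - r.submitted_ms > expiry_ms]
    if expired:
        return min(expired, key=lambda r: r.submitted_ms)
    return min(pending, key=lambda r: abs(r.sector - head_sector))


pending = [Request(9000, 0), Request(120, 400), Request(90, 450)]

# Head at sector 110 at t=460ms: nothing has expired, nearest sector wins.
near = pick_next(pending, head_sector=110, now_ms=460)

# At t=600ms the request for sector 9000 has waited 600ms, so it is
# served even though it is far away from the head.
late = pick_next(pending, head_sector=110, now_ms=600)

print(near.sector, late.sector)
```

The trade-off is visible even in this toy: servicing the expired request costs a long seek, which is the price paid for bounding how long any one request can be delayed.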

Balancing fairness is also an interesting issue.  A greedy I/O consumer should not block other I/O consumers and there are various heuristics used to determine the fair sharing of I/O.  Generally, the more complex and "fairer" the solution the more compute is required, so selecting a very fair I/O scheduler with a fast I/O device and a slow CPU may not necessarily perform as well as a simpler I/O scheduler.

Finally, the types of I/O patterns on the I/O devices influence the I/O scheduler choice, for example, mixed random read/writes vs mainly sequential reads and occasional random writes.

Because of the mix of requirements, there is no such thing as a perfect all-round I/O scheduler.  The defaults are chosen to be a good choice for the general user; however, they may not match everyone's needs.  To clarify the choices, the Ubuntu Kernel Team has provided a Wiki page describing the choices and how to select and tune the various I/O schedulers.  Caveat emptor applies: these are just guidelines and should be used as a starting point for finding the best I/O scheduler for your particular needs.
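The Wiki page is the place to go for the details. As a small aside, Linux lists the schedulers offered for a block device in /sys/block/<device>/queue/scheduler, with the active one shown in square brackets. The sketch below parses that format; the helper names and the sample line are only for illustration.

```python
def parse_scheduler_line(line):
    """Return (available, active) from the text of a queue/scheduler file.

    The active scheduler is the name wrapped in square brackets.
    """
    available = []
    active = None
    for name in line.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return available, active


def read_scheduler(device, sysfs_root="/sys/block"):
    """Read and parse the scheduler list for a device such as 'sda'."""
    with open(f"{sysfs_root}/{device}/queue/scheduler") as f:
        return parse_scheduler_line(f.read())


print(parse_scheduler_line("noop deadline [cfq]"))
```

Writing one of the listed names back to the same file (as root) selects that scheduler for the device; which names appear depends on the kernel version and configuration.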

Colin Ian King

New features in Forkstat

Forkstat is a simple utility I wrote a while ago that can trace process activity using the rather useful Linux NETLINK_CONNECTOR API.   Recently I have added two extra features that may be of interest:

1.  Improved output using some UTF-8 glyphs.  These are used to show process parent/child relationships and various process events, such as termination, core dumping and renaming.   Use the new -g (glyph) option to enable this mode. For example:


In the above example, the program "wobble" was started and forked off a child process.  The parent then renamed itself to wibble (indicated by a turning arrow). The child then segfaulted and generated a core dump (indicated by a skull and crossbones), triggering apport to investigate the crash.  After this, we observe NetworkManager creating a thread that runs for a very short period of time.   This kind of activity is normally impossible to spot while running conventional tools such as ps or top.

2. By default, forkstat will show the process name using the contents of /proc/$PID/cmdline.  The new -c option allows one to instead use the 16 character task "comm" field, and this can be helpful for spotting process name changes on PROC_EVENT_COMM events.
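To see the difference between the two names outside of forkstat, both can be read straight from /proc. This is a minimal Python sketch, not forkstat's own code; `split_cmdline` and `process_names` are names invented for the example.

```python
def split_cmdline(raw):
    """Split the NUL-separated bytes of /proc/<pid>/cmdline into arguments."""
    return [arg.decode(errors="replace") for arg in raw.split(b"\0") if arg]


def process_names(pid, proc_root="/proc"):
    """Return (comm, argv) for a process on a Linux system.

    comm is the short task name that a process can change at run time;
    argv is the full command line that the process was started with.
    """
    with open(f"{proc_root}/{pid}/comm") as f:
        comm = f.read().rstrip("\n")
    with open(f"{proc_root}/{pid}/cmdline", "rb") as f:
        argv = split_cmdline(f.read())
    return comm, argv
```

On a Linux machine, `process_names(os.getpid())` (after `import os`) returns the short name and command line of the running interpreter.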

These are small changes, but I think they make forkstat more useful.  The updated forkstat will be available in Ubuntu 19.04 "Disco Dingo".

Colin Ian King

High-level tracing with bpftrace

Bpftrace is a new high-level tracing language for Linux using the extended Berkeley packet filter (eBPF).  It is a very powerful and flexible tracing front-end that enables systems to be analyzed much like DTrace.

The bpftrace tool is now installable as a snap. From the command line one can install it and enable it to use system tracing as follows:

sudo snap install bpftrace
sudo snap connect bpftrace:system-trace

To illustrate the power of bpftrace, here are some simple one-liners:

# trace openat() system calls
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%d %s %s\n", pid, comm, str(args->filename)); }'
Attaching 1 probe...
1080 irqbalance /proc/interrupts
1080 irqbalance /proc/stat
2255 dmesg /etc/ld.so.cache
2255 dmesg /lib/x86_64-linux-gnu/libtinfo.so.5
2255 dmesg /lib/x86_64-linux-gnu/librt.so.1
2255 dmesg /lib/x86_64-linux-gnu/libc.so.6
2255 dmesg /lib/x86_64-linux-gnu/libpthread.so.0
2255 dmesg /usr/lib/locale/locale-archive
2255 dmesg /lib/terminfo/l/linux
2255 dmesg /home/king/.config/terminal-colors.d
2255 dmesg /etc/terminal-colors.d
2255 dmesg /dev/kmsg
2255 dmesg /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache

# count system calls using tracepoints:
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }'
@[tracepoint:syscalls:sys_enter_getsockname]: 1
@[tracepoint:syscalls:sys_enter_kill]: 1
@[tracepoint:syscalls:sys_enter_prctl]: 1
@[tracepoint:syscalls:sys_enter_epoll_wait]: 1
@[tracepoint:syscalls:sys_enter_signalfd4]: 2
@[tracepoint:syscalls:sys_enter_utimensat]: 2
@[tracepoint:syscalls:sys_enter_set_robust_list]: 2
@[tracepoint:syscalls:sys_enter_poll]: 2
@[tracepoint:syscalls:sys_enter_socket]: 3
@[tracepoint:syscalls:sys_enter_getrandom]: 3
@[tracepoint:syscalls:sys_enter_setsockopt]: 3
...

Note that it is recommended to use bpftrace with Linux 4.9 or higher.

The bpftrace github project page has an excellent README guide with some worked examples and is a very good place to start.  There is also a very useful reference guide and one-liner tutorial too.

If you have any useful bpftrace one-liners, it would be great to share them. This is an amazingly powerful tool, and it will be interesting to see how it gets used.

Colin Ian King

Static Analysis Trends on Linux Next

I've been running static analysis using CoverityScan on linux-next for 2 years with the aim of finding bugs (and trying to fix some) before they are merged into Linux.  I have also been gathering the defect count data and tracking the defect trends:

As one can see from above, CoverityScan has found a considerable number of defects and these are being steadily fixed by the Linux developer community.  The encouraging fact is that the outstanding issues are reducing over time. Some of the spikes in the data are because of changes in the analysis that I'm running (e.g. getting more coverage), but even so, one can see a definite downward trend in the total defects in the kernel.

With static analysis, some of these reported defects are false positives or corner cases that are in fact impossible to occur in real life and I am slowly working through these and annotating them so they don't get reported in the defect count.

It should also be noted that over these two years the kernel has grown from around 14.6 million to 17.1 million lines of code, so the defect density has dropped from 1 defect in every ~2100 lines to 1 defect in every ~3000 lines.  All in all, it is a remarkable improvement for such a large and complex codebase that is growing at such a rate.
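The totals implied by those rounded figures can be worked out with a couple of lines of Python. These are back-of-the-envelope numbers derived from the approximate values quoted above, not counts read off the CoverityScan data.

```python
def implied_defects(lines_of_code, lines_per_defect):
    """Total defects implied by a code size and a '1 defect per N lines' rate."""
    return lines_of_code / lines_per_defect


before = implied_defects(14_600_000, 2100)
after = implied_defects(17_100_000, 3000)

print(round(before), round(after))  # 6952 5700
```

So although the codebase grew by roughly 17%, the implied number of outstanding defects fell by roughly 18%.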

Colin Ian King

The low-latency kernel offering with Ubuntu provides a kernel tuned for low-latency environments using low-latency kernel configuration options.  The x86 kernels by default run with the Intel-Pstate CPU scheduler set to run with the powersave scaling governor biased towards power efficiency.

While power efficiency is fine for most use-cases, it can introduce latencies because the CPU can be running at a low frequency to save power, and switching from a deep C state when idle to a higher C state when servicing an event can also increase latencies.

In a somewhat contrived experiment, I rigged up an i7-3770 to collect latency timings of clock_nanosleep() wake-ups with timer event coalescing disabled (timer_slack set to zero) over 60 seconds across a range of CPU scheduler and governor settings on a 4.15 low-latency kernel.  This can be achieved using stress-ng, for example:

sudo stress-ng --cyclic 1 --cyclic-dist 100 --cyclic-sleep=10000 --cpu 1 -l 0 -v \
--cyclic-policy rr --cyclic-method clock_ns --cpu 0 -t 60 --timer-slack 0

..the above runs a cyclic measurement collecting latency counts in 100ns buckets with a clock_nanosleep() wakeup interval of 10,000 nanoseconds, a zero % load CPU stressor and timer slack set to 0 nanoseconds.  This dumps latency distribution stats that can be plotted to see where the modal latency points occur and the latency characteristics of the CPU scheduler.

I also used powerstat to measure the power consumed by the CPU package over a 60 second interval.  Measurements for the Intel-Pstate CPU scheduler [performance, powersave] and the ACPI CPU scheduler (intel_pstate=disable) [performance, powersave, conservative and ondemand] were taken for 1,000,000 down to 10,000 nanosecond timer delays.

1,000,000 nanosecond timer delays (1 millisecond)

Strangely the powersave Intel-Pstate is using the most power (not what I expected).

The ACPI CPU scheduler in performance mode has the best latency distribution followed by the Intel-Pstate CPU scheduler also in performance mode.

100,000 nanosecond timer delays (100 microseconds)

Note that Intel-Pstate performance consumes the most power...
...and also has the most responsive low-latency distribution.

10,000 nanosecond timer delays (10 microseconds)

In this scenario, the ACPI CPU scheduler in performance mode was consuming the most power and had the best latency distribution.

It is clear that the best latency responses occur when a CPU scheduler is running in performance mode and this consumes a little more power than other CPU scheduler modes.  However, it is not clear which CPU scheduler (Intel-Pstate or ACPI) is best in specific use-cases.

The conclusion is rather obvious, but needs to be stated.  For best low-latency response, set the CPU governor to the performance mode at the cost of higher power consumption.  Depending on the use-case, the extra power cost is probably worth the improved latency response.
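On Linux the governor is exposed per CPU through the cpufreq sysfs files at /sys/devices/system/cpu/cpuN/cpufreq/scaling_governor. The Python sketch below reads and sets it; the helper names are invented for the example, and the demonstration runs against a throw-away fake directory tree so that nothing on the host is changed. Pointing it at the real sysfs path requires root.

```python
import glob
import os
import tempfile

CPU_SYSFS = "/sys/devices/system/cpu"


def governor_paths(root=CPU_SYSFS):
    """Find the scaling_governor file for every CPU under root."""
    pattern = os.path.join(root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    return sorted(glob.glob(pattern))


def read_governors(root=CPU_SYSFS):
    """Return a dict such as {'cpu0': 'powersave', ...}."""
    governors = {}
    for path in governor_paths(root):
        cpu = path.split(os.sep)[-3]
        with open(path) as f:
            governors[cpu] = f.read().strip()
    return governors


def set_governor(name, root=CPU_SYSFS):
    """Write the governor name for every CPU (needs root on a real system)."""
    for path in governor_paths(root):
        with open(path, "w") as f:
            f.write(name)


# Demonstrate against a fake tree rather than the real /sys.
with tempfile.TemporaryDirectory() as fake_root:
    for cpu in ("cpu0", "cpu1"):
        cpufreq_dir = os.path.join(fake_root, cpu, "cpufreq")
        os.makedirs(cpufreq_dir)
        with open(os.path.join(cpufreq_dir, "scaling_governor"), "w") as f:
            f.write("powersave\n")
    before = read_governors(fake_root)
    set_governor("performance", fake_root)
    after = read_governors(fake_root)

print(before)
print(after)
```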

As mentioned earlier, this is a somewhat contrived experiment; only one CPU was being exercised with a predictable timer wakeup.  A more interesting test would be with data handling, such as incoming packet handling over ethernet at different rates; I will probably experiment with that if and when I get more time.  Since this was a synthetic test using stress-ng, it does not represent real world low-latency scenarios.  However, it may be worth exploring CPU scheduler settings to tune a low-latency configuration rather than relying on the default CPU scheduler setting.

Colin Ian King

Kernel Commits with "Fixes" tag

Over the past 5 years there has been a steady increase in the number of kernel bug fix commits that use the "Fixes" tag.  Kernel developers use this annotation on a commit to reference an older commit that originally introduced the bug, which is obviously very useful for bug tracking purposes. What is interesting is that there has been a steady take-up of developers using this annotation:

With the 4.15 release, 1859 of the 16223 commits (11.5%) were tagged as "Fixes", so that's a fair amount of work going into bug fixing.  I suspect there are more commits that are bug fixes but aren't using the "Fixes" tag, so it's hard to tell for certain how many commits are fixes without doing deeper analysis.  Probably over time this tag will be widely adopted for all bug fixes, the trend line will level out, and we will have a better idea of the proportion of commits per release that are devoted to fixing issues.  Let's see how this looks in another 5 years' time. I'll keep you posted!
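For anyone wanting to reproduce this kind of count, the tag is a line in the commit message of the form `Fixes: <abbreviated commit id> ("subject")`. Below is a minimal Python sketch that spots it; the regular expression is my own loose approximation and the two commit messages are invented examples, not real kernel commits.

```python
import re

# A "Fixes:" line followed by an abbreviated (or full) hexadecimal commit id.
FIXES_RE = re.compile(r"^Fixes:\s*[0-9a-f]{7,40}\b", re.MULTILINE | re.IGNORECASE)


def has_fixes_tag(message):
    """True if a commit message carries a Fixes: tag."""
    return FIXES_RE.search(message) is not None


def fixes_percentage(tagged, total):
    """Percentage of commits in a release that carry the tag."""
    return 100.0 * tagged / total


# Hypothetical commit messages, for illustration only.
messages = [
    'foo: fix null pointer dereference\n\nFixes: 0123456789ab ("foo: add bar")\n',
    "foo: add new baz feature\n\nThis fixes nothing, it is new code.\n",
]

tagged = sum(has_fixes_tag(m) for m in messages)
print(tagged, "of", len(messages), "sample commits are tagged")
print(round(fixes_percentage(1859, 16223), 1))  # 11.5
```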

Colin Ian King

Linux Kernel Module Growth

The Linux kernel grows at an amazing pace; each kernel release adds more functionality, more drivers and hence more kernel modules.  I recently wondered what the trend was for kernel module growth per release, so I performed module builds on kernels v2.6.24 through to v4.16-rc2 for x86-64 to get a better idea of growth rates:


..as one can see, the rate of growth is relatively linear with about 89 modules being added to each kernel release, which is not surprising as the size of the kernel is growing at a fairly linear rate too.  It is interesting to see that the number of modules has easily more than tripled in the 10 years between v2.6.24 and v4.16-rc2,  with a rate of about 470 new modules per year. At this rate, Linux will see the 10,000th module land in around the year 2025.
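The two rates quoted above can be cross-checked against each other with a little arithmetic. This only uses the rounded figures from the text, so treat the results as approximate.

```python
modules_per_release = 89
modules_per_year = 470

# How many kernel releases per year do the two rates imply?
releases_per_year = modules_per_year / modules_per_release

# How many modules does the yearly rate add over the 10 year span?
added_in_decade = modules_per_year * 10

print(f"{releases_per_year:.1f} releases/year, ~{added_in_decade} modules added in 10 years")
```

Roughly five releases a year is in line with the kernel's release cadence of one release every couple of months or so, which suggests the two rates are consistent.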

Colin Ian King

stress-ng V0.09.15

It has been a while since my last post about stress-ng so I thought it would be useful to provide an update on the changes since V0.08.09.

I have been focusing on making stress-ng more portable so it can build with various versions of clang and gcc as well as run against a wide range of kernels.   The portability shims and config detection added to stress-ng allow it to build and run on a wide range of Linux systems, as well as GNU/HURD, Minix, Debian kFreeBSD, various BSD systems, OpenIndiana and OS X.

Enabling stress-ng to work on a wide range of architectures and kernels with a range of compiler versions has helped me to find and fix various corner case bugs.  Also, static analysis with a variety of tools has helped to drive up the code quality. As ever, I thoroughly recommend using static analysis tools on any project to find bugs.

Since V0.08.09 I've added the following stressors:

  • inode-flags - exercise inode flags using the FS_IOC_GETFLAGS/FS_IOC_SETFLAGS ioctls, see ioctl_iflags(2) for more details
  • sockdiag - exercise the Linux sock_diag netlink socket diagnostics
  • branch - exercise branch prediction
  • swap - exercise adding and removing variously sized swap partitions
  • ioport - exercise I/O port read/writes to try and cause CPU I/O bus delays
  • hrtimers - high resolution timer stressor
  • physpage - exercise the lookup of a physical page address and page count of a virtual page
  • mmapaddr - mmap pages to randomly unused VM addresses and exercise mincore and segfault handling
  • funccall - exercise function calling with a range of function arguments types and sizes, for benchmarking stack/CPU/cache and compiler.
  • tree - BSD tree (red/black and splay) stressor, good for exercising memory/cache
  • rawdev - exercise raw block device I/O reads
  • revio - reverse file offset random writes, causes lots of fragmentation and hence many file extents
  • mmap-fixed - stress fixed address mmaps, with a wide range of VM addresses
  • enosys - exercise a wide range of random system call numbers that are not wired up, hence generating ENOSYS errors
  • sigpipe - stress SIGPIPE signal generation and handling
  • vm-addr - exercise a wide range of VM addresses for fixed address mmaps with thorough address bit patterns stressing

Stress-ng has nearly 200 stressors and many of these have various stress methods that can be selected to perform specific stress testing.  These are all documented in the manual.  I've also updated the stress-ng project page with various links to academic papers and presentations that have used stress-ng in various ways to stress computer systems.  It is useful to find out how stress-ng is being used so that I can shape this tool in the future.

As ever, patches for fixes and improvements are always appreciated.  Keep on stressing!

Colin Ian King

Static analysis on the Linux kernel

There is a wealth of powerful static analysis tools available nowadays for analyzing C source code. These tools help to find bugs in code by just analyzing the source code without actually having to execute the code.   Over the past year or so I have been running the following static analysis tools on linux-next every weekday to find kernel bugs:

Typically each tool can take 10-25+ hours of compute time to analyze the kernel source; fortunately I have a large server at hand to do this.  The automated analysis creates an Ubuntu server VM, installs the required static analysis tools, clones linux-next and then runs the analysis.  The VMs are configured to minimize write activity to the host and run with 48 threads and plenty of memory to try to speed up the analysis process.

At the end of each run, the output from the previous run is diff'd against the new output and generates a list of new and fixed issues.  I then manually wade through these and try to fix some of the low hanging fruit when I can find free time to do so.
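The comparison step boils down to two set differences. Here is a minimal Python sketch of the idea; it is not the actual tooling, and the issue strings are invented. In practice the matching is harder because line numbers shift between runs, which this sketch sidesteps by leaving them out of the issue text.

```python
def diff_issues(previous, current):
    """Return (new, fixed) issues between two analysis runs."""
    prev, cur = set(previous), set(current)
    new = sorted(cur - prev)      # reported now, not reported before
    fixed = sorted(prev - cur)    # reported before, gone now
    return new, fixed


# Hypothetical issue lists from two consecutive runs.
previous_run = [
    "drivers/foo/a.c: resource leak",
    "fs/bar/b.c: uninitialized variable",
]
current_run = [
    "fs/bar/b.c: uninitialized variable",
    "net/baz/c.c: dead code",
]

new, fixed = diff_issues(previous_run, current_run)
print("new:", new)
print("fixed:", fixed)
```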

I've been gathering statistics from the CoverityScan builds for the past 12 months tracking the number of defects found, outstanding issues and number of defects eliminated:

As one can see, there are a lot of defects getting fixed by the Linux developers and the overall trend of outstanding issues is downwards, which is good to see.  The defect rate in linux-next is currently 0.46 issues per 1000 lines (out of over 13 million lines that are being scanned). A typical defect rate for a project this size is 0.5 issues per 1000 lines.  Some of these issues are false positives or very minor / insignificant issues that will not cause any run time issues at all, so don't be too alarmed by the statistics.

Using a range of static analysis tools is useful because each one has its own strengths and weaknesses.  For example, smatch and sparse are designed for sanity checking the kernel source, so they have some smarts that detect kernel specific semantic issues.  CoverityScan is a commercial product; however, they allow open source projects the size of the Linux kernel to be built daily, the web based bug tracking tool is very easy to use, and CoverityScan does manage to reliably find bugs that other tools can't reach.  Cppcheck is useful as it scans all the code paths by forcibly trying all the #ifdef'd variations of code, which is useful on the more obscure CONFIG mixes.

Finally, I use clang's scan-build and the latest version of gcc to try and find the more typical warnings found by the static analysis built into modern open source compilers.

The more typical issues being found by static analysis are ones that don't generally appear at run time, such as in corner cases like error handling code paths, resource leaks or resource failure conditions, uninitialized variables or dead code paths.

My intention is to continue this process of daily checking and I hope to report back next September to review the CoverityScan trends for another year.

Colin Ian King

The latest release of stress-ng V0.08.09 incorporates new stressors and a handful of bug fixes. So what is new in this release?

  • memrate stressor to exercise and measure memory read/write throughput
  • matrix yx option to swap order of matrix operations
  • matrix stressor size can now be 8192 x 8192 in size
  • radixsort stressor (using the BSD library radixsort) to exercise CPU and memory
  • improved job script parsing and error reporting
  • faster termination of rmap stressor (this was slow inside VMs)
  • icache stressor now calls cacheflush()
  • anonymous memory mappings are now private allowing hugepage madvise
  • fcntl stressor exercises the 4.13 kernel F_GET_FILE_RW_HINT and F_SET_FILE_RW_HINT
  • stream and vm stressors have new madvise options

The new memrate stressor performs 64/32/16/8 bit reads and writes to a large memory region.  It will attempt to get some statistics on the memory bandwidth for these simple reads and writes.  One can also specify the read/write rates in terms of MB/sec using the --memrate-rd-mbs and --memrate-wr-mbs options, for example:

stress-ng --memrate 1 --memrate-bytes 1G \
--memrate-rd-mbs 1000 --memrate-wr-mbs 2000 -t 60
stress-ng: info: [22880] dispatching hogs: 1 memrate
stress-ng: info: [22881] stress-ng-memrate: write64: 1998.96 MB/sec
stress-ng: info: [22881] stress-ng-memrate: read64: 998.61 MB/sec
stress-ng: info: [22881] stress-ng-memrate: write32: 1999.68 MB/sec
stress-ng: info: [22881] stress-ng-memrate: read32: 998.80 MB/sec
stress-ng: info: [22881] stress-ng-memrate: write16: 1999.39 MB/sec
stress-ng: info: [22881] stress-ng-memrate: read16: 999.66 MB/sec
stress-ng: info: [22881] stress-ng-memrate: write8: 1841.04 MB/sec
stress-ng: info: [22881] stress-ng-memrate: read8: 999.94 MB/sec
stress-ng: info: [22880] successful run completed in 60.00s (1 min, 0.00 secs)

...the memrate stressor will attempt to limit the memory rates but due to scheduling jitter and other memory activity it may not be 100% accurate.  By careful setting of the size of the memory being exercised with the --memrate-bytes option one can exercise the L1/L2/L3 caches and/or the entire memory.

By default, the matrix stressor will perform matrix operations with optimal memory access patterns.  The new --matrix-yx option will instead perform matrix operations in a y, x rather than an x, y matrix order, causing more cache stalls on larger matrices.  This can be useful for exercising cache misses.
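Why the traversal order matters is easiest to see by listing the memory offsets that each order touches. The Python sketch below assumes a square matrix stored row by row in one flat block of memory; it illustrates the access pattern only and is not taken from the stress-ng source.

```python
def access_offsets(n, yx=False):
    """Flat offsets touched when walking an n x n row-major matrix.

    With yx=False the inner loop walks along a row (stride of 1 element);
    with yx=True the inner loop walks down a column (stride of n elements).
    """
    offsets = []
    for i in range(n):
        for j in range(n):
            row, col = (j, i) if yx else (i, j)
            offsets.append(row * n + col)
    return offsets


print(access_offsets(3))           # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(access_offsets(3, yx=True))  # [0, 3, 6, 1, 4, 7, 2, 5, 8]
```

With a stride of 1 each cache line that is fetched gets fully used before moving on, whereas with a stride of n on a large matrix nearly every access lands on a different cache line.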

To complement the heapsort, mergesort and qsort memory/CPU exercising sort stressors I've added the BSD library radixsort stressor to exercise sorting of hundreds of thousands of small text strings.

Finally, while exercising various hugepage kernel configuration options I was inspired to make stress-ng's mmaps work better with hugepage madvise hints, so where possible all anonymous memory mappings are now private to allow hugepage madvise to work.  The stream and vm stressors also have new madvise options to allow one to choose hugepage, nohugepage or normal hints.

No big changes as per normal, just small incremental improvements to this all purpose stress tool.

Colin Ian King

New features in forkstat V0.02.00

Forkstat is a tiny utility I wrote a while ago to monitor process activity via the process events connector. Recently I was sent a patch from Philipp Gesang to add a new -l option to switch to line buffered output to reduce the delay on output when redirecting stdout, which is a useful addition to the tool.   During some spare time I looked at the original code and noticed that I had overlooked some of the lesser used process event types:
  • STAT_PTRC - ptrace attach/detach events
  • STAT_UID - UID (and GID) change events
  • STAT_SID - SID change events
..so I've now added support for these events too.
    I've also added some extra per-process information on each event. The new -x "extra info" option will now also display the UID of the process and where possible the TTY it is associated with.  This allows one to easily detect who is responsible for generating the process events.

    The following example shows forkstat being used to detect when a process is being traced using ptrace:

    sudo ./forkstat -x -e ptrce
    Time Event PID UID TTY Info Duration Process
    11:42:31 ptrce 17376 0 pts/15 attach strace -p 17350
    11:42:31 ptrce 17350 1000 pts/13 attach top
    11:42:37 ptrce 17350 1000 pts/13 detach

    Process 17376 runs strace on process 17350 (top). We can see the ptrace attach event on the process and then, a few seconds later, the detach event.  We can see that the strace was being run from pts/15 by root.   Using forkstat we can now snoop on users who are snooping on other users' processes.

    I use forkstat mainly to capture busy process fork/exec/exit activity that tools such as ps and top cannot see because of the very short duration of some processes or threads. Sometimes processes are created so rapidly that one needs to run forkstat with a high priority to capture all the events, and so the new -r option will run forkstat with a high real time scheduling priority to try and capture all the events.

    These new features landed in forkstat V0.02.00 for Ubuntu 17.10 Aardvark.

    Colin Ian King

    The latest release of stress-ng contains a mechanism to measure latencies via a cyclic latency test.  Essentially this is just a loop that cycles around performing high precision sleeps and measures the latency (the extra overhead) taken to perform each sleep compared to the expected time.  This loop runs with either the Round-Robin (rr) or First-In-First-Out (fifo) real time scheduling policy.

    The cyclic test can be configured to specify the sleep time (in nanoseconds), the scheduling type (rr or fifo),  the scheduling priority (1 to 100) and also the sleep method (explained later).
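The shape of such a measurement loop can be sketched in a few lines of Python. This is only an illustration of the idea and not the stress-ng implementation: it uses Python's `time.sleep()` and the monotonic clock, it does not switch to a real time scheduling policy, and the interpreter overhead means the numbers will be far worse than those from the real tool.

```python
import time


def cyclic_latencies(sleep_ns, samples):
    """Sleep repeatedly and record how much longer each sleep took
    than was asked for, in nanoseconds."""
    latencies = []
    for _ in range(samples):
        start = time.monotonic_ns()
        time.sleep(sleep_ns / 1e9)
        elapsed = time.monotonic_ns() - start
        latencies.append(elapsed - sleep_ns)
    return latencies


latencies = cyclic_latencies(sleep_ns=20_000, samples=100)
print("mean extra latency: %.0f ns" % (sum(latencies) / len(latencies)))
```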

    The first 10,000 latency measurements are used to compute various latency statistics:
    • mean latency (aka the 'average')
    • modal latency (the most 'popular' latency)
    • minimum latency
    • maximum latency
    • standard deviation
    • latency percentiles (25%, 50%, 75%, 90%, 95.40%, 99.0%, 99.5%, 99.9% and 99.99%)
    • latency distribution (enabled with the --cyclic-dist option)

    The latency percentiles indicate the latency at or below which a given percentage of the samples fall.  For example, the 99% percentile for the 10,000 samples is the latency that 9,900 of the samples are equal to or below.
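That percentile definition, along with the mean and mode, can be written down directly. The Python sketch below follows the definition given in the text (a "nearest rank" style percentile); stress-ng itself may compute its statistics differently, and the sample data here is synthetic.

```python
import math
from collections import Counter


def percentile(samples, pct):
    """Smallest sample such that at least pct% of the samples are <= it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(samples):
    """Mean, mode, minimum and maximum of a non-empty list of latencies."""
    return {
        "mean": sum(samples) / len(samples),
        "mode": Counter(samples).most_common(1)[0][0],
        "min": min(samples),
        "max": max(samples),
    }


# 10,000 synthetic latencies: 1, 2, ... 10000 nanoseconds.
samples = list(range(1, 10001))
p99 = percentile(samples, 99)
print(p99)  # 9900: exactly 9,900 of the samples are <= 9900

print(summarize([5, 5, 7, 9]))
```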

    The latency distribution is shown when the --cyclic-dist option is used; one has to specify the distribution interval in nanoseconds and up to the first 100 values in the distribution are output.
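The distribution is essentially a histogram with a fixed bucket width. A minimal Python sketch of that bucketing follows; it assumes each bucket is labelled with its lower bound, which is consistent with the sample run in this post but is my reading of the output rather than something taken from the stress-ng source.

```python
def latency_distribution(samples_ns, interval_ns, max_buckets=100):
    """Count samples per interval_ns-wide bucket, starting from zero.

    Returns a list of (bucket_lower_bound, count) pairs, capped at
    max_buckets entries. samples_ns must not be empty.
    """
    counts = {}
    for sample in samples_ns:
        bucket = (sample // interval_ns) * interval_ns
        counts[bucket] = counts.get(bucket, 0) + 1
    top = max(counts)
    buckets = range(0, top + interval_ns, interval_ns)
    return [(b, counts.get(b, 0)) for b in buckets][:max_buckets]


dist = latency_distribution([3050, 4880, 4881, 5191], interval_ns=1000)
for bucket, count in dist:
    print(bucket, count)
```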

    For an idle machine, one can invoke just the cyclic measurements with stress-ng as follows:

    sudo stress-ng --cyclic 1 --cyclic-policy fifo \
    --cyclic-prio 100 --cyclic-method clock_ns \
    --cyclic-sleep 20000 --cyclic-dist 1000 -t 5
    stress-ng: info: [27594] dispatching hogs: 1 cyclic
    stress-ng: info: [27595] stress-ng-cyclic: sched SCHED_FIFO: 20000 ns delay, 10000 samples
    stress-ng: info: [27595] stress-ng-cyclic: mean: 5242.86 ns, mode: 4880 ns
    stress-ng: info: [27595] stress-ng-cyclic: min: 3050 ns, max: 44818 ns, std.dev. 1142.92
    stress-ng: info: [27595] stress-ng-cyclic: latency percentiles:
    stress-ng: info: [27595] stress-ng-cyclic: 25.00%: 4881 us
    stress-ng: info: [27595] stress-ng-cyclic: 50.00%: 5191 us
    stress-ng: info: [27595] stress-ng-cyclic: 75.00%: 5261 us
    stress-ng: info: [27595] stress-ng-cyclic: 90.00%: 5368 us
    stress-ng: info: [27595] stress-ng-cyclic: 95.40%: 6857 us
    stress-ng: info: [27595] stress-ng-cyclic: 99.00%: 8942 us
    stress-ng: info: [27595] stress-ng-cyclic: 99.50%: 9821 us
    stress-ng: info: [27595] stress-ng-cyclic: 99.90%: 22210 us
    stress-ng: info: [27595] stress-ng-cyclic: 99.99%: 36074 us
    stress-ng: info: [27595] stress-ng-cyclic: latency distribution (1000 us intervals):
    stress-ng: info: [27595] stress-ng-cyclic: latency (us) frequency
    stress-ng: info: [27595] stress-ng-cyclic: 0 0
    stress-ng: info: [27595] stress-ng-cyclic: 1000 0
    stress-ng: info: [27595] stress-ng-cyclic: 2000 0
    stress-ng: info: [27595] stress-ng-cyclic: 3000 82
    stress-ng: info: [27595] stress-ng-cyclic: 4000 3342
    stress-ng: info: [27595] stress-ng-cyclic: 5000 5974
    stress-ng: info: [27595] stress-ng-cyclic: 6000 197
    stress-ng: info: [27595] stress-ng-cyclic: 7000 209
    stress-ng: info: [27595] stress-ng-cyclic: 8000 100
    stress-ng: info: [27595] stress-ng-cyclic: 9000 50
    stress-ng: info: [27595] stress-ng-cyclic: 10000 10
    stress-ng: info: [27595] stress-ng-cyclic: 11000 9
    stress-ng: info: [27595] stress-ng-cyclic: 12000 2
    stress-ng: info: [27595] stress-ng-cyclic: 13000 2
    stress-ng: info: [27595] stress-ng-cyclic: 14000 1
    stress-ng: info: [27595] stress-ng-cyclic: 15000 9
    stress-ng: info: [27595] stress-ng-cyclic: 16000 1
    stress-ng: info: [27595] stress-ng-cyclic: 17000 1
    stress-ng: info: [27595] stress-ng-cyclic: 18000 0
    stress-ng: info: [27595] stress-ng-cyclic: 19000 0
    stress-ng: info: [27595] stress-ng-cyclic: 20000 0
    stress-ng: info: [27595] stress-ng-cyclic: 21000 1
    stress-ng: info: [27595] stress-ng-cyclic: 22000 1
    stress-ng: info: [27595] stress-ng-cyclic: 23000 0
    stress-ng: info: [27595] stress-ng-cyclic: 24000 1
    stress-ng: info: [27595] stress-ng-cyclic: 25000 2
    stress-ng: info: [27595] stress-ng-cyclic: 26000 0
    stress-ng: info: [27595] stress-ng-cyclic: 27000 1
    stress-ng: info: [27595] stress-ng-cyclic: 28000 1
    stress-ng: info: [27595] stress-ng-cyclic: 29000 2
    stress-ng: info: [27595] stress-ng-cyclic: 30000 0
    stress-ng: info: [27595] stress-ng-cyclic: 31000 0
    stress-ng: info: [27595] stress-ng-cyclic: 32000 0
    stress-ng: info: [27595] stress-ng-cyclic: 33000 0
    stress-ng: info: [27595] stress-ng-cyclic: 34000 0
    stress-ng: info: [27595] stress-ng-cyclic: 35000 0
    stress-ng: info: [27595] stress-ng-cyclic: 36000 1
    stress-ng: info: [27595] stress-ng-cyclic: 37000 0
    stress-ng: info: [27595] stress-ng-cyclic: 38000 0
    stress-ng: info: [27595] stress-ng-cyclic: 39000 0
    stress-ng: info: [27595] stress-ng-cyclic: 40000 0
    stress-ng: info: [27595] stress-ng-cyclic: 41000 0
    stress-ng: info: [27595] stress-ng-cyclic: 42000 0
    stress-ng: info: [27595] stress-ng-cyclic: 43000 0
    stress-ng: info: [27595] stress-ng-cyclic: 44000 1
    stress-ng: info: [27594] successful run completed in 5.00s


    Note that stress-ng needs to be invoked using sudo to enable the Real Time FIFO scheduling for the cyclic measurements.

    The above example uses the following options:

    • --cyclic 1
      • starts one instance of the cyclic measurements (1 is always recommended)
    • --cyclic-policy fifo 
      • use the real time First-In-First-Out scheduling for the cyclic measurements
    • --cyclic-prio 100 
      • use the maximum scheduling priority  
    • --cyclic-method clock_ns
      • use the clock_nanosleep(2) system call to perform the high precision duration sleep
    • --cyclic-sleep 20000 
      • sleep for 20000 nanoseconds per cyclic iteration
    • --cyclic-dist 1000 
      • enable latency distribution statistics with an interval of 1000 nanoseconds between each data point.
    • -t 5
      • run for just 5 seconds

    From the run above, we can see that 99.5% of latencies were no more than 9821 nanoseconds and most clustered around the 4880 nanosecond modal point. The distribution data shows that there is some clustering around the 5000 nanosecond point and the samples tail off with a bit of a long tail.

    Now for the interesting part. Since stress-ng is packed with many different stressors we can run these while performing the cyclic measurements, for example, we can tell stress-ng to run *all* the virtual memory related stress tests and see how this affects the latency distribution using the following:

    sudo stress-ng --cyclic 1 --cyclic-policy fifo \
    --cyclic-prio 100 --cyclic-method clock_ns \
    --cyclic-sleep 20000 --cyclic-dist 1000 \
    --class vm --all 1 -t 60s

    ..the above invokes all the vm class of stressors to run all at the same time (with just one instance of each stressor) for 60 seconds.

    The --cyclic-method specifies the delay used on each of the 10,000 cyclic iterations used.  The default (and recommended method) is clock_ns, using the high precision delay.  The available cyclic delay methods are:
    • clock_ns (use the clock_nanosleep() sleep)
    • posix_ns (use the POSIX nanosleep() sleep)
    • itimer (use a high precision clock timer and pause to wait for a signal to measure latency)
    • poll (busy spin-wait on clock_gettime() to eat cycles for a delay)
    All the delay mechanisms use the CLOCK_REALTIME system clock for timing.

    I hope this is plenty of cyclic measurement functionality to get some useful latency benchmarks against various kernel components when using some or a mix of the stress-ng stressors.  Let me know if I am missing some other cyclic measurement options and I can see if I can add them in.

    Keep stressing and measuring those systems!

    Colin Ian King

    What is new in FWTS 17.05.00?

    Version 17.05.00 of the Firmware Test Suite was released this week as part of  the regular end-of-month release cadence. So what is new in this release?

    • Alex Hung has been busy bringing the SMBIOS tests in-sync with the SMBIOS 3.1.1 standard
    • IBM provided some OPAL (OpenPower Abstraction Layer) Firmware tests:
      • Reserved memory DT validation tests
      • Power management DT validation tests
    • The first fwts snap was created
    • Over 40 bugs were fixed
    As ever, we are grateful for all the community contributions to FWTS.  The full release details are available from the fwts-devel mailing list.

    I expect that the next upcoming ACPICA release will be integrated into the 17.06.00 FWTS release next month.

    Colin Ian King

    The Firmware Test Suite (FWTS) has an easy to use text based front-end that is primarily used by the FWTS Live-CD image but it can also be used in the Ubuntu terminal.

    To install and run the front-end use:

     sudo apt-get install fwts-frontend  
    sudo fwts-frontend-text

    ..and one should see a menu of options:


    In this demonstration, the "All Batch Tests" option has been selected:


    Tests will be run one by one and a progress bar shows the progress of each test. Some tests run very quickly, others can take several minutes depending on the hardware configuration (such as number of processors).

    Once the tests are all complete, the following dialogue box is displayed:


    The test has saved several files into the directory /fwts/15052017/1748/ and by selecting Yes one can view the results log in a scroll-box:


    Exiting this, the FWTS frontend dialog is displayed:


    Press enter to exit (note that the Poweroff option is just for the fwts Live-CD image version of fwts-frontend).

    The tool dumps various logs, for example, the above run generated:

     ls -alt /fwts/15052017/1748/  
    total 1388
    drwxr-xr-x 5 root root 4096 May 15 18:09 ..
    drwxr-xr-x 2 root root 4096 May 15 17:49 .
    -rw-r--r-- 1 root root 358666 May 15 17:49 acpidump.log
    -rw-r--r-- 1 root root 3808 May 15 17:49 cpuinfo.log
    -rw-r--r-- 1 root root 22238 May 15 17:49 lspci.log
    -rw-r--r-- 1 root root 19136 May 15 17:49 dmidecode.log
    -rw-r--r-- 1 root root 79323 May 15 17:49 dmesg.log
    -rw-r--r-- 1 root root 311 May 15 17:49 README.txt
    -rw-r--r-- 1 root root 631370 May 15 17:49 results.html
    -rw-r--r-- 1 root root 281371 May 15 17:49 results.log

    acpidump.log is a dump of the ACPI tables in a format compatible with the ACPICA acpidump tool.  The results.log file is a copy of the results generated by FWTS and results.html is an HTML formatted version of the log.

    Colin Ian King

    Simple job scripting in stress-ng 0.08.00

    The latest release of stress-ng, 0.08.00, contains a new job scripting feature. Jobs allow one to bundle up a set of stress options into a script rather than cram them all onto the command line.  One can now also run multiple invocations of a stressor with the latest version of stress-ng, and combined with job scripts we now have a powerful way of running more complex stress tests.

    The job script commands are essentially the stress-ng long options without the need for the '--' option characters.  One option per line is allowed.

    For example:

     $ stress-ng --verbose --tz --timeout 60 --cpu 1 --matrix 1 --icache 1

    would become:

     $ cat example.job
    verbose
    tz
    timeout 60
    cpu 1
    matrix 1
    icache 1

    One can also add comments using the # character prefix.   By default the stressors will be run in parallel, but one can use the "run sequential" command in the job script to run the stressors sequentially.
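
    The mapping between job script lines and long options can be sketched in a few lines of Python. This is only an illustration of the convention described above, not the parser that stress-ng uses; the function name job_to_args is made up, and the sketch deliberately skips "run" lines since those are job-control commands rather than stressor options.

```python
import shlex


def job_to_args(job_text):
    """Convert simple job script lines into stress-ng style long options."""
    args = []
    for raw_line in job_text.splitlines():
        # Drop anything after a '#' comment character, then trim whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        words = shlex.split(line)
        if words[0] == "run":
            # "run sequential" / "run parallel" control how the job is run;
            # this sketch does not translate them.
            continue
        args.append("--" + words[0])  # option name gains the '--' prefix
        args.extend(words[1:])        # option values are passed through
    return args


if __name__ == "__main__":
    example = "verbose\ntz\ntimeout 60\ncpu 1\nmatrix 1\nicache 1\n"
    print("stress-ng " + " ".join(job_to_args(example)))
```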

    The following script runs the mmap stressor multiple times using more memory on each run:

     $ cat mmap.job  
    run sequential # one job at a time
    timeout 2m # run for 2 minutes
    verbose # verbose output
    #
    # run 4 invocations and increase memory each time
    #
    mmap 1
    mmap-bytes 25%
    mmap 1
    mmap-bytes 50%
    mmap 1
    mmap-bytes 75%
    mmap 1
    mmap-bytes 100%

    Some of the stress-ng stressors have various "methods" that allow one to modify the way the stressor behaves.  The following example shows how job scripts can be used to exercise a system using different stressor methods:

     $ cat /usr/share/stress-ng/example-jobs/matrix-methods.job   
    #
    # hot-cpu class stressors:
    # various options have been commented out, one can remove the
    # preceding comment to enable these options if required.
    #
    # run the following tests in parallel or sequentially
    #
    run sequential
    # run parallel
    #
    # verbose
    # show all debug, warnings and normal information output.
    #
    verbose
    #
    # run each of the tests for 60 seconds
    # stop stress test after N seconds. One can also specify the units
    # of time in seconds, minutes, hours, days or years with the suffix
    # s, m, h, d or y.
    #
    timeout 1m
    # tz
    # collect temperatures from the available thermal zones on the
    # machine (Linux only). Some devices may have one or more thermal
    # zones, where as others may have none.
    tz
    #
    # matrix stressor with examples of all the methods allowed
    #
    # start N workers that perform various matrix operations on floating
    # point values. By default, this will exercise all the matrix
    # stress methods one by one. One can specify a specific matrix
    # stress method with the --matrix-method option.
    #
    #
    # Method      Description
    # all         iterate over all the below matrix stress methods
    # add         add two N × N matrices
    # copy        copy one N × N matrix to another
    # div         divide an N × N matrix by a scalar
    # hadamard    Hadamard product of two N × N matrices
    # frobenius   Frobenius product of two N × N matrices
    # mean        arithmetic mean of two N × N matrices
    # mult        multiply an N × N matrix by a scalar
    # prod        product of two N × N matrices
    # sub         subtract one N × N matrix from another N × N matrix
    # trans       transpose an N × N matrix
    #
    matrix 0
    matrix-method all
    matrix 0
    matrix-method add
    matrix 0
    matrix-method copy
    matrix 0
    matrix-method div
    matrix 0
    matrix-method frobenius
    matrix 0
    matrix-method hadamard
    matrix 0
    matrix-method mean
    matrix 0
    matrix-method mult
    matrix 0
    matrix-method prod
    matrix 0
    matrix-method sub
    matrix 0
    matrix-method trans
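
    Since the body of a job like this is just the same pair of lines repeated once per method, it can also be generated. The Python sketch below produces an uncommented equivalent of the job above; the function name matrix_methods_job is made up for this illustration and the method list is simply copied from the listing, so it may not match every stress-ng version.

```python
MATRIX_METHODS = [
    "all", "add", "copy", "div", "frobenius", "hadamard",
    "mean", "mult", "prod", "sub", "trans",
]


def matrix_methods_job(methods=MATRIX_METHODS, timeout="1m", sequential=True):
    """Build the text of a job script that runs each matrix method in turn."""
    lines = [
        "run sequential" if sequential else "run parallel",
        "verbose",
        f"timeout {timeout}",
        "tz",
    ]
    for method in methods:
        lines.append("matrix 0")                  # same instance count as the listing
        lines.append(f"matrix-method {method}")   # method for that invocation
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(matrix_methods_job(), end="")
```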

    Various example job scripts can be found in /usr/share/stress-ng/example-jobs; one can use these as a base for writing more complex stress tests.  The example jobs have all the options commented (using the text from the stress-ng manual) to make it easier to see how each stressor can be run.

    Version 0.08.00 landed in Ubuntu 17.10 Artful Aardvark and is also available as a snap. I've got backports in ppa:colin-king/white for older releases of Ubuntu.

    Colin Ian King

    Tracking CoverityScan issues on Linux-next

    Over the past 6 months I've been running static analysis on linux-next with CoverityScan on a regular basis (to find new issues and fix some of them) as well as keeping a record of the defect count.


    Since the beginning of September over 2000 defects have been eliminated by a host of upstream developers and the steady downward trend of outstanding issues is good to see.  A proportion of the outstanding defects are false positives or issues where the code is being overly zealous, for example, bounds checking where some conditions can never happen. Considering there are millions of lines of code, the defect rate is about average for such a large project.

    I plan to keep the static analysis running long term and I'll try and post stats every 6 months or so to see how things are progressing.

    Colin Ian King

    The BPF Compiler Collection (BCC) is a toolkit for building kernel tracing tools that leverage the functionality provided by the Linux extended Berkeley Packet Filters (BPF).

    BCC allows one to write BPF programs with front-ends in Python or Lua with kernel instrumentation written in C.  The instrumentation code is built into sandboxed eBPF byte code and is executed in the kernel.
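
    As a rough illustration of that split, the sketch below keeps the kernel instrumentation as a small C program in a string and drives it from a Python front-end. It assumes the bcc Python bindings are importable by the system Python (which is not the same thing as having the tools installed) and that it is run as root; the names BPF_PROGRAM, hello and trace_clone are made up for this example.

```python
# Kernel side: a tiny C function that BCC compiles to eBPF byte code.
BPF_PROGRAM = r"""
int hello(void *ctx)
{
    bpf_trace_printk("clone() called\n");
    return 0;
}
"""


def trace_clone():
    """Attach the C program to the clone system call and print trace output."""
    # Imported lazily so the module loads even where bcc is not installed.
    from bcc import BPF

    bpf = BPF(text=BPF_PROGRAM)
    # Resolve the kernel's symbol name for the clone syscall, then attach a kprobe.
    bpf.attach_kprobe(event=bpf.get_syscall_fnname("clone"), fn_name="hello")
    print("Tracing clone()... press Ctrl-C to stop")
    try:
        bpf.trace_print()  # stream lines from the kernel trace pipe
    except KeyboardInterrupt:
        pass
```

    Calling trace_clone() from a root Python session should print a line each time a process is created via clone(), although the exact output format depends on the kernel and BCC versions in use.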

    The BCC github project README file provides an excellent overview and description of BCC and the various available BCC tools.  Building BCC from scratch can be a bit time consuming; however, the good news is that the BCC tools are now available as a snap, so BCC can be quickly and easily installed using:

     sudo snap install --devmode bcc  

    There are currently over 50 BCC tools in the snap, so let's have a quick look at a few:

    cachetop allows one to view the top page cache hit/miss statistics. To run this use:

     sudo bcc.cachetop  



    The funccount tool allows one to count the number of times specific functions get called.  For example, to see how many kernel functions with the name starting with "do_" get called per second one can use:

     sudo bcc.funccount "do_*" -i 1  


    To see how to use all the options in this tool, use the -h option:

     sudo bcc.funccount -h  

    I've found the funccount tool to be especially useful to check on kernel activity by checking on hits on specific function names.

    The slabratetop tool is useful to see the active kernel SLAB/SLUB memory allocation rates:

     sudo bcc.slabratetop  


    If one wants to see which process is opening specific files, one can snoop on open system calls using the opensnoop tool:

     sudo bcc.opensnoop -T


    Hopefully this will give you a taste of the useful tools that are available in BCC (I have barely scratched the surface in this article).  I recommend installing the snap and giving it a try.

    As it stands, BCC provides a useful mechanism to develop BPF tracing tools and I look forward to regularly updating the BCC snap as more tools are added to BCC. Kudos to Brendan Gregg for BCC!

    Colin Ian King

    Kernel printk statements

    The kernel contains tens of thousands of statements that may print various errors, warnings and debug/information messages to the kernel log.  Unsurprisingly, as the kernel grows in size, so does the quantity of these messages.  I've been scraping the kernel source for various kernel printk style statements and macros and scanning these for typos and spelling mistakes. To make this easier I hacked up kernelscan (a quick and dirty parser) that helps me find literal strings from the kernel for spell checking.
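
    To give a feel for what this kind of scraping involves, here is a much cruder Python sketch that pulls string literals out of printk style calls using regular expressions. It is not kernelscan and does not parse C properly; the names printk_messages and call_text and the list of matched function names are assumptions chosen for this illustration.

```python
import re

LEVELS = "emerg|alert|crit|err|warn|notice|info|debug|dbg"
# Matches printk( as well as pr_<level>( and dev_<level>( style helpers.
PRINT_CALL = re.compile(rf'\b(?:printk|(?:pr|dev)_(?:{LEVELS}))\s*\(')
# A C string literal, allowing backslash escapes inside it.
STRING_LITERAL = re.compile(r'"((?:[^"\\]|\\.)*)"')


def call_text(source, start):
    """Return source from start up to the first ';' that is outside a string."""
    in_string = False
    i = start
    while i < len(source):
        ch = source[i]
        if in_string and ch == "\\":
            i += 2  # skip the escaped character
            continue
        if ch == '"':
            in_string = not in_string
        elif ch == ";" and not in_string:
            break
        i += 1
    return source[start:i]


def printk_messages(source):
    """Collect the literal text of each printk style call in C source."""
    messages = []
    for match in PRINT_CALL.finditer(source):
        literals = STRING_LITERAL.findall(call_text(source, match.end()))
        if literals:
            # Adjacent C string literals are concatenated by the compiler.
            # This also joins any string arguments, which is a known crudeness.
            messages.append("".join(literals))
    return messages


if __name__ == "__main__":
    sample = r'''
    pr_info("device initialised; ready\n");
    printk(KERN_ERR "failed to allocate "
           "memroy for %s\n", name);
    '''
    for message in printk_messages(sample):
        print(message)
```

    The extracted strings could then be fed to a spell checker; the deliberate "memroy" typo in the sample is the kind of thing one would hope to catch.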

    Using kernelscan, I've gathered some statistics for the number of kernel print statements for various kernel releases:


    As one can see, we have over 200,000 messages in the 4.9 kernel(!).  Given the kernel growth, we can see this seems to roughly correlate with the kernel source size:



    So how many lines of code in the kernel do we have per kernel printk messages over time?


    ..showing that the trend is to have more lines of code per printk statement over time.  I didn't differentiate between different types of printk message, so it is hard to see any deeper trends on what kinds of messages are being logged more or less frequently over each release; for example, perhaps there are fewer debug messages landing in the kernel nowadays.

    I find it quite amazing that the kernel contains quite so many printk messages; it would be useful to see just how many of these are actually in a production kernel. I suspect quite a large number are for driver debugging and may be conditionally omitted at build time.

    Colin Ian King

    Another year passes and once more I have another seasonal obfuscated C program.  I was caught short on free time this year to heavily obfuscate the code which is a shame. However, this year I worked a bit harder at animating the output, so hopefully that will make up for lack of obfuscation.

    The source is available on github to eyeball.  I've had criticism in previous years that it is hard to figure out the structure of my obfuscated code, so this year I made sure that the if statements were easier to see, making the flow of the code easier to understand.

    This year I've snapped up all my seasonal obfuscated C programs and put them into the snap store as the christmas-obfuscated-c snap.

    Below is a video of the program running; it is all ASCII art and one can re-size the window while it is running.


    Unlike previous years, I have the pre-obfuscated version of the code available in the git repository at commit c98376187908b2cf8c4d007445b023db67c68691 so hopefully you can see the original hacky C source.

    Have a great Christmas and a most excellent New Year. 

    Colin Ian King

    stress-ng 0.07.07 released

    stress-ng is a tool that I have been developing on-and-off for a few years. It is designed to stress kernels to force out bugs, stress CPU and memory and also contains some performance benchmarking metrics too.

    stress-ng is now entering the maturity part of the development phase; however, there is always scope to add new stressors and generally improve the tool.   I've just released version 0.07.07 for the Ubuntu Zesty 17.04 release and it contains a few additional features:

    • SIGUSR2 sent to stress-ng will dump out the current system load and memory statistics
    • Sched policy stress tests for different scheduler configurations
    • Add a missing --sockfd-port option
    And various bug fixes:
    • Fixed up some minor memory leaks
    • Missing counter stats on bind-mount, fp-error, personality and resources stressors
    • Fix the --fiemap-bytes option
    • Fix up build warnings with various compilers and static analyzers
    The major change to stress-ng over the past month was an internal re-working of system call and GNU features to abstract these into a shim layer, reducing the number of build-conditional #ifdef paths around the code. This simplifies portability, so the code now builds more easily across a range of systems and with various versions of gcc and clang, and it fixes some issues on older kernels too.   It also makes the code faster to statically analyze with cppcheck.

    For more details, visit the stress-ng project page or the quick help guide.
