Canonical Voices

Posts tagged with 'analysis'

Colin Ian King

Pragmatic Graphing

Over the past few days I have been analysing various issues and also doing some background research, so I have been collecting some rather large sets of data to process.   Normally I filter, re-format and process the data using a bunch of simple tools such as awk, tr, cut, sort, uniq and grep to get the data into some form where it can be plotted using gnuplot. 

The UNIX philosophy of piping together a bunch of tools to produce the final output normally works fine; however, graphing the data with gnuplot always ends up with me digging around in the online gnuplot documentation or reading old gnuplot files to remind myself exactly how to plot the data just the way I want.   This is fine for occasions where I gather lots of identical logs and want to compare results from multiple tests: the investment in time to automate this with gnuplot is well worth the hassle.   However, sometimes I just have a handful of samples and want to plot a graph, then quickly re-jig the data and perhaps calculate some statistical information such as trend lines.  In this case, I fall back to shoving the samples into LibreOffice Calc and slamming out some quick graphs.

This makes me choke a bit.  Using LibreOffice Calc starts to make me feel like I'm an accountant rather than a software engineer.  However, once I have swallowed my pride I have come to the conclusion that one has to be pragmatic and use the right tool for the job.  To turn around small amounts of data quickly, LibreOffice Calc does seem to be quite useful.   For processing huge datasets and automated graph plotting, gnuplot does the trick (as long as I can remember how to use it).   I am a command line junkie and really don't like using GUI based power tools, but there does seem to be a place where I can mix the two quite happily.

Colin Ian King

Benford's Law with real world data.

If one has a large enough real-life source of data (such as the size of files in the file system) and looks at the distribution of the first digit of these values, one will find something that at first glance is rather surprising.  The leading digit 1 appears about 30% of the time, and as the digits increase their frequency drops, until we reach 9, which appears only about 5% of the time.   This seemingly curious frequency distribution is commonly known as Benford's law, or the first digit law.

The probability P of digit d can be expressed as follows:

P(d) = log10(1 + 1 / d)

..where d is any integer value 1 to 9 inclusive. So for each leading digit in the data, the distribution works out to be about:

1: 30.1%,  2: 17.6%,  3: 12.5%,  4: 9.7%,  5: 7.9%,  6: 6.7%,  7: 5.8%,  8: 5.1%,  9: 4.6%

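The formula is simple enough to check directly; a minimal Python sketch (the function name `benford_p` is my own):

```python
import math

def benford_p(d: int) -> float:
    """Probability that digit d (1-9) appears as the leading digit
    under Benford's law: P(d) = log10(1 + 1/d)."""
    return math.log10(1 + 1 / d)

# Print the expected distribution for each leading digit.
for d in range(1, 10):
    print(f"{d}: {benford_p(d):.1%}")
```

Note that the nine probabilities telescope to log10(10) = 1, so the distribution sums to exactly 100% as it must.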
But how does this hold up with some "real world" data? Can it really be true?  Well, for my first experiment, I analysed the leading digit of all the source files in the current Linux source tree and compared that to Benford's Law:
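A tally like this can be sketched in a few lines of Python (this is my own reconstruction, not the author's actual tooling, which was likely a shell pipeline; `size_digit_distribution` is a made-up name):

```python
import os
from collections import Counter

def leading_digit(n: int) -> int:
    """Strip trailing digits until only the leading one remains."""
    while n >= 10:
        n //= 10
    return n

def size_digit_distribution(root: str) -> Counter:
    """Tally the leading digit of every non-empty file's size under root."""
    counts = Counter()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            try:
                size = os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue  # broken symlink, permission problem, etc.
            if size > 0:
                counts[leading_digit(size)] += 1
    return counts
```

Dividing each count by the total and comparing against the formula above is all the experiment needs.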

So, this is convincing enough.  How about something more exotic?  For my second experiment I counted up the number of comments starting with /* in each of the C source files in the Linux source tree, and again looked at the distribution of the leading digits.  I was hazarding a guess that there is a reasonable number of comments in each file (knowing the way some code is commented, this may be pushing my luck).  Anyhow, the data generated produces a distribution that obeys Benford's Law too:
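The per-file comment count could be gathered with a sketch like this (again my own rough reconstruction; like a simple grep, it makes no attempt to skip /* sequences inside string literals):

```python
import os

def count_block_comments(text: str) -> int:
    """Count occurrences of the '/*' comment opener in C source text."""
    return text.count("/*")

def comment_counts(root: str):
    """Yield the per-file '/*' count for every .c file under root."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".c"):
                with open(os.path.join(dirpath, name), errors="replace") as f:
                    yield count_block_comments(f.read())
```

Feeding the yielded counts through the same leading-digit tally as before gives the distribution.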

Well, that certainly shows that Kernel developers are sprinkling enough comments in the Kernel source to be statistically meaningful.  Whether the comments themselves are meaningful is another matter...

How about one more test?  This time I gathered the length of every executable in /usr/bin and plotted the distribution of the leading digits from this data:

..this data set has far fewer files to analyse, so the distribution deviates a little, but the trend is still rather good.

As mentioned earlier, one has to have a large set of data for this to work well.  Interesting this may be, but what kind of practical use is it?   It can be applied to accountancy: if one has a large enough set of data in the accounts and the leading digits of the data do not fit Benford's Law, then maybe one should suspect that somebody has been fiddling the books.  Humans are rather poor at making up lots of "random" values that don't skew Benford's Law.

One more interesting fact is that it applies even if one rescales the data.  For example, if you are looking at accounts in terms of £ sterling and convert them into US dollars or Albanian Lek, the rescaled data still obeys Benford's Law.  Thus if I re-ran my tests and analysed the size of files not in bytes but in 512-byte blocks, it would still produce a leading digit distribution that obeyed Benford's Law.  Nice.
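This scale invariance is easy to demonstrate numerically. A sketch using log-uniform samples, which follow Benford's law exactly in the limit (the data set and scale factors here are synthetic stand-ins, not the file-size data from the experiments above):

```python
import math
import random

def leading_digit(x: float) -> int:
    """First significant digit of a positive number."""
    return int(10 ** (math.log10(x) % 1))

random.seed(42)
# Log-uniform samples obey Benford's law, so they stand in for real data here.
data = [10 ** random.uniform(0, 6) for _ in range(100_000)]

for scale in (1.0, 1.8, 512.0):  # e.g. bytes versus 512-byte blocks
    ones = sum(1 for x in data if leading_digit(scale * x) == 1)
    print(f"scale {scale:6.1f}: leading 1s = {ones / len(data):.1%}")
```

Whatever the scale factor, the fraction of leading 1s stays close to the predicted 30.1%.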

How can we apply this in computing? Perhaps we could use it to detect tampering with the sizes of a large set of files.  Who knows?  I am sure somebody can think of a useful way to use it.   I just find it all rather fascinating.

Jussi Pakkanen

The conventional wisdom in build systems is that GNU Autotools is the one true established standard and other ones are used only rarely.

But is this really true?

I created a script that downloads all original source packages from Ubuntu’s main pool. If there were multiple versions of the same project, only the newest was chosen. Then I created a second script that goes through those packages and checks what build system they actually use. Here’s the breakdown:

CMake:           348     9%
Autofoo:        1618    45%
SCons:            10     0%
Ant:             149     4%
Maven:            41     1%
Distutil:        313     8%
Waf:               8     0%
Perl:            341     9%
Make(ish):       351     9%
Customconf:       45     1%
Unknown:         361    10%

Here Make(ish) means packages that don’t have any other build system, but do have a makefile. This usually indicates building via custom makefiles. Correspondingly customconf is for projects that don’t have any other build system, but have a configure file. This is usually a handwritten shell or Python file.
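The classification presumably amounts to looking for each build system's marker file, falling back to Make(ish) and Customconf when nothing else matches. A hypothetical sketch (the marker file names and precedence are my guesses, not the author's actual script):

```python
# Ordered checks: the first marker found wins, so generated files like
# 'configure' or 'Makefile' don't shadow the build system that made them.
MARKERS = [
    ("CMake",    "CMakeLists.txt"),
    ("Autofoo",  "configure.ac"),
    ("Autofoo",  "configure.in"),
    ("SCons",    "SConstruct"),
    ("Ant",      "build.xml"),
    ("Maven",    "pom.xml"),
    ("Distutil", "setup.py"),
    ("Waf",      "wscript"),
    ("Perl",     "Makefile.PL"),
]

def classify(filenames: set[str]) -> str:
    """Guess a package's build system from its top-level file names."""
    for system, marker in MARKERS:
        if marker in filenames:
            return system
    if "Makefile" in filenames or "makefile" in filenames:
        return "Make(ish)"
    if "configure" in filenames:
        return "Customconf"
    return "Unknown"
```

For example, a package shipping both CMakeLists.txt and a generated Makefile would be counted as CMake, while a bare handwritten Makefile falls through to Make(ish).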

This data is skewed by the fact that the pool is just a jumble of packages. It would be interesting to run this analysis separately for Precise, Oneiric, etc. to see the progression over time. For truly interesting results you would run it against the whole of Debian.

The relative popularity of CMake and Autotools is roughly 20/80. This shows that Autotools is not the sole dominant player for C/C++ that it once was. It’s still far and away the most popular, though.

The unknown set contains stuff such as Rake builds. I simply did not have time to add them all. It also has a lot of fonts, which makes sense, since you don’t really build those in the traditional sense.

The scripts can be downloaded here. A word of warning: to run the analysis you need to download 13 GB of source. Don’t do it just for the heck of it. The parser script does not download anything, it just produces a list of URLs. Download the packages with wget -i.

Some orig packages are compressed with xz, which Python’s tarfile module can’t handle. You have to repack them yourself prior to running the analysis script.
