Canonical Voices

Colin Ian King

Pragmatic Graphing

Over the past few days I have been analysing various issues and also doing some background research, so I have been collecting some rather large sets of data to process.   Normally I filter, re-format and process the data using a bunch of simple tools such as awk, tr, cut, sort, uniq and grep to get the data into some form where it can be plotted using gnuplot. 
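
For example, a quick pipeline for knocking raw data into a plottable form might look something like this (the log file name and column numbers are made up purely for illustration):

    # Drop comment lines from a hypothetical samples.log, pull out two
    # columns of interest and sort numerically, ready for gnuplot:
    grep -v '^#' samples.log | awk '{ print $2, $5 }' | sort -n > data.dat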

The UNIX philosophy of piping together a bunch of tools to produce the final output normally works fine; however, graphing the data with gnuplot always ends up with me digging around in the online gnuplot documentation or reading old gnuplot files to remind myself exactly how to plot the data just the way I want. This is fine for occasions where I gather lots of identical logs and want to compare results from multiple tests; the investment in time to automate this with gnuplot is well worth the hassle. However, sometimes I just have a handful of samples and want to plot a graph, then quickly re-jig the data and perhaps calculate some statistical information such as trend lines. In this case, I fall back to shoving the samples into LibreOffice Calc and slamming out some quick graphs.
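
As a reminder to myself as much as anything, a minimal gnuplot recipe for plotting such data with a fitted trend line looks roughly like this (the file names and titles are again illustrative):

    # Plot data.dat as points plus a least-squares linear trend line:
    gnuplot <<'EOF'
    set terminal png size 800,600
    set output 'data.png'
    f(x) = m * x + c
    fit f(x) 'data.dat' using 1:2 via m, c
    plot 'data.dat' using 1:2 with points title 'samples', \
         f(x) with lines title 'trend'
    EOF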

This makes me choke a bit.  Using LibreOffice Calc starts to make me feel like I'm an accountant rather than a software engineer.  However, once I have swallowed my pride I have come to the conclusion that one has to be pragmatic and use the right tool for the job.  To turn around small amounts of data quickly, LibreOffice Calc does seem to be quite useful.   For processing huge datasets and automated graph plotting, gnuplot does the trick (as long as I can remember how to use it).   I am a command line junkie and really don't like using GUI based power tools, but there does seem to be a place where I can mix the two quite happily.

Colin Ian King

Benford's Law with real world data.

If one takes a large enough source of real life data (such as the sizes of files in the file system) and looks at the distribution of the first digit of these values, one finds something that at first glance is rather surprising. The leading digit 1 appears about 30% of the time, and the frequencies fall as the digits increase, until we reach 9, which appears only about 5% of the time. This seemingly curious frequency distribution is commonly known as Benford's law, or the first digit law.

The probability P of leading digit d can be expressed as follows:

P(d) = log10(1 + 1 / d)

...where d is any integer from 1 to 9 inclusive. So for each leading digit in the data, the distribution works out to be about:

    Digit   Probability
      1        0.301
      2        0.176
      3        0.125
      4        0.097
      5        0.079
      6        0.067
      7        0.058
      8        0.051
      9        0.046
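
Since log10(x) is just log(x)/log(10), the table above can be reproduced with a one-liner:

    # Print the Benford probability P(d) = log10(1 + 1/d) for d = 1..9:
    awk 'BEGIN { for (d = 1; d <= 9; d++) printf("%d  %.3f\n", d, log(1 + 1/d) / log(10)) }'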

But how does this hold up with some "real world" data? Can it really be true? Well, for my first experiment, I analysed the leading digit of the sizes of all the source files in the current Linux source tree and compared the distribution to Benford's Law:
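
A pipeline along these lines does the counting (this is a sketch rather than the exact commands, and assumes GNU find, run from the top of the source tree):

    # Leading-digit frequencies of the file sizes in the tree (a sketch):
    find . -type f -printf '%s\n' |   # file size in bytes, one per line
        grep -v '^0' |                # skip empty files (no leading digit)
        cut -c1 |                     # keep just the first digit
        sort | uniq -c                # count the occurrences of each digit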

[Graph: leading digit distribution of file sizes in the Linux source tree compared with Benford's Law]

So, this is convincing enough. How about something more exotic? For my second experiment I counted the number of comments starting with /* in each of the C source files in the Linux source tree, and again looked at the distribution of the leading digits. I was hazarding a guess that there is a reasonable number of comments in each file (knowing the way some code is commented, this may be pushing my luck). Anyhow, the data generated also produces a distribution that obeys Benford's Law:
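
Again, a sketch of the counting; grep -c counts the lines containing /* in each file, which is close enough for this purpose:

    # Leading digit of the per-file comment counts in the C sources (a sketch):
    find . -name '*.c' -print0 |
        xargs -0 grep -c '/\*' |                   # emits "file:count" lines
        awk -F: '$NF > 0 { print substr($NF, 1, 1) }' |
        sort | uniq -c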

[Graph: leading digit distribution of per-file /* comment counts in the Linux C sources compared with Benford's Law]

Well, that certainly shows that Kernel developers are sprinkling enough comments into the Kernel source to be statistically meaningful. Whether the comments themselves are meaningful is another matter...

How about one more test?  This time I gathered the length of every executable in /usr/bin and plotted the distribution of the leading digits from this data:
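
The same pipeline works here too, just pointed at /usr/bin (again a sketch assuming GNU find):

    # Leading digits of the sizes of the files in /usr/bin:
    find /usr/bin -maxdepth 1 -type f -printf '%s\n' | cut -c1 | sort | uniq -c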

[Graph: leading digit distribution of the sizes of executables in /usr/bin compared with Benford's Law]

...this data set has far fewer files to analyse, so the distribution deviates a little, but the trend is still rather good.

As mentioned earlier, one has to have a large set of data for this to work well. Interesting as this may be, what kind of practical use is it? It can be applied to accountancy: if one has a large enough set of data in the accounts and the leading digits do not fit Benford's Law, then maybe one should suspect that somebody has been fiddling the books. Humans are rather poor at making up lots of "random" values that don't skew Benford's Law.

One more interesting fact is that the law still applies even if one rescales the data. For example, if you are looking at accounts in terms of £ sterling and convert them into US dollars or Albanian Lek, the rescaled data still obeys Benford's Law. (Intuitively this is because Benford's Law amounts to the fractional parts of the logarithms of the data being uniformly distributed, and multiplying by a constant only shifts those logarithms by a fixed amount.) Thus if I re-ran my tests and analysed the sizes of files not in bytes but in 512 byte blocks, the data would still produce a leading digit distribution that obeys Benford's Law. Nice.
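
The 512 byte block variant is a one-line change to the earlier sketch, integer-dividing each size and skipping files that round down to zero blocks:

    # Leading digits of file sizes measured in 512 byte blocks (a sketch):
    find . -type f -printf '%s\n' |
        awk '{ n = int($1 / 512); if (n > 0) print substr(n, 1, 1) }' |
        sort | uniq -c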

How can we apply this in computing? Perhaps we could use it to detect tampering with the sizes of a large set of files.  Who knows?  I am sure somebody can think of a useful way to use it.   I just find it all rather fascinating.
