If one has a large enough real life source of data (such as the size of files in the file system) and look at the distribution of the first digit of these values then one will find something that at first glance is rather surprising.  The leading digit 1 appears about 30% of the time and as the digits increase to 9 their frequency drops until we reach 9, which only appears about 5% of the time.   This seemingly curious frequency distribution is commonly known as Benford's law or the first digit law.

The probability P of digit d can be expresses as follows:

P(d) = log10(1 + 1 / d)

..where d is any integer value 1 to 9 inclusive. So for each leading digit in the data, the distribution works out to be about:

   Digit   
  Probability
1
0.301
2
0.176
3
0.125
4
0.097
5
0.079
6
0.067
7
0.058
8
0.051
9
0.046

But how does this hold up with some "real world" data? Can it really be true?  Well, for my first experiment, I analysed the leading digit of all the source files in the current Linux source tree and compared that to Benford's Law:


So, this is convincing enough.  How about something more exotic?  For my second experiment I counted up the number of comments in each file that start with /* in just the C source files in the Linux source tree and again looked at the distribution of the leading digits.  I was hazarding a guess that there are a reasonable amount of comments in each file (knowing the way some code is commented this may be pushing my luck).  Anyhow, the data generated also produces a distribution that obeys Benford's Law too:


Well, that certainly shows that Kernel developers are sprinkling enough comments in the Kernel source to be statistically meaningful.  If the comments themselves are meaningful is another matter...

How about one more test?  This time I gathered the length of every executable in /usr/bin and plotted the distribution of the leading digits from this data:


..this data set has far less files to analyse, so the distribution deviates a little, but the trend is still rather good.

As mentioned earlier, one has to have a large set of data for this too work well.  Interesting this may be, but what kind of practical use is it?   It can be applied to accountancy - if one has a large enough set of data in the accounts and the leading digits of the data do not fit Benford's Law then maybe one should suspect that somebody has been fiddling the books.  Humans are rather poor at making up lots of "random" values that don't skew Benford's Law.

One more interesting fact is that it applies even if one rescales the data.  For example, if you are looking at accounts in terms of £ sterling and covert it into US dollars or Albanian Lek the rescaled data still obeys Benford's Law.  Thus if re-ran my tests and didn't analyse the size of files in bytes but instead used size in 512 byte blocks it still would produce a leading digit distribution that obeyed Benford's Law.  Nice.

How can we apply this in computing? Perhaps we could use it to detect tampering with the sizes of a large set of files.  Who knows?  I am sure somebody can think of a useful way to use it.   I just find it all rather fascinating.

Read more