Canonical Voices

Posts tagged with 'stats'

Colin Ian King

Benford's Law with real world data.

If one has a large enough real-life source of data (such as the sizes of files in a file system) and looks at the distribution of the first digit of these values, one finds something that at first glance is rather surprising. The leading digit 1 appears about 30% of the time, and the frequency drops as the digits increase until we reach 9, which appears only about 5% of the time. This seemingly curious frequency distribution is commonly known as Benford's Law or the first-digit law.

The probability P of a leading digit d can be expressed as follows:

P(d) = log10(1 + 1 / d)

...where d is any integer from 1 to 9 inclusive. So for each leading digit in the data, the probability works out to be about:

   Digit   Probability
   1       0.301
   2       0.176
   3       0.125
   4       0.097
   5       0.079
   6       0.067
   7       0.058
   8       0.051
   9       0.046
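
As a quick check, the table can be reproduced directly from the formula. This is only a minimal sketch; the function name benford_probability is made up for illustration and is not from the original post.

```python
import math


def benford_probability(d: int) -> float:
    """Return P(d) = log10(1 + 1/d) for a leading digit d in 1..9."""
    if not 1 <= d <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / d)


# Print the digit/probability table to three decimal places.
for d in range(1, 10):
    print(f"{d}  {benford_probability(d):.3f}")
```

The nine probabilities sum to 1, since the sum of the logarithms collapses to log10(10).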

But how does this hold up with some "real world" data? Can it really be true?  Well, for my first experiment, I analysed the leading digit of the sizes of all the source files in the current Linux source tree and compared that to Benford's Law:


So, this is convincing enough.  How about something more exotic?  For my second experiment I counted the number of comments that start with /* in each of the C source files in the Linux source tree and again looked at the distribution of the leading digits.  I was hazarding a guess that there is a reasonable number of comments in each file (knowing the way some code is commented, this may be pushing my luck).  Anyhow, the data generated also produces a distribution that obeys Benford's Law:


Well, that certainly shows that kernel developers are sprinkling enough comments in the kernel source to be statistically meaningful.  Whether the comments themselves are meaningful is another matter...
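
The comment tally from the second experiment could be approximated along these lines. This is a rough sketch and not the script actually used for the post: it simply counts occurrences of /* in every file ending in .c, so it will also count a /* that happens to sit inside a string literal. The function names are invented for illustration.

```python
import os


def count_comment_openers(path: str) -> int:
    """Count occurrences of '/*' in one file (a crude proxy for comments)."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return handle.read().count("/*")


def comment_counts(root: str):
    """Yield the '/*' count for every .c file found under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".c"):
                yield count_comment_openers(os.path.join(dirpath, name))
```

Files with a count of zero have no leading digit, so they would need to be dropped before looking at the distribution.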

How about one more test?  This time I gathered the length of every executable in /usr/bin and plotted the distribution of the leading digits from this data:


...this data set has far fewer files to analyse, so the distribution deviates a little, but the trend is still rather good.
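
For anyone wanting to repeat the file-size experiments, here is a minimal sketch of how the leading-digit frequencies might be gathered. It is not the code used for the post; the helper names are hypothetical, and the choice to skip symbolic links and empty files is an assumption.

```python
import os
from collections import Counter


def leading_digit(n: int) -> int:
    """First decimal digit of a positive integer."""
    return int(str(n)[0])


def file_sizes(root: str):
    """Yield the size in bytes of every regular file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Assumption: skip symlinks so each file is counted once.
            if os.path.isfile(path) and not os.path.islink(path):
                yield os.path.getsize(path)


def leading_digit_frequencies(values):
    """Map each digit 1..9 to its share of the positive values."""
    counts = Counter(leading_digit(v) for v in values if v > 0)
    total = sum(counts.values())
    return {d: (counts[d] / total if total else 0.0) for d in range(1, 10)}
```

Calling leading_digit_frequencies(file_sizes("/usr/bin")) gives nine fractions that can be set beside the table of Benford probabilities.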

As mentioned earlier, one has to have a large set of data for this to work well.  Interesting this may be, but what kind of practical use is it?   It can be applied to accountancy: if one has a large enough set of data in the accounts and the leading digits of the data do not fit Benford's Law, then maybe one should suspect that somebody has been fiddling the books.  Humans are rather poor at making up lots of "random" values that don't skew Benford's Law.
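
One way to put a number on "do not fit Benford's Law" is a Pearson chi-squared test on the leading-digit counts. This is only a sketch of one common approach, not something taken from the post; the 5% significance level and the function names are assumptions made for illustration.

```python
import math

BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Chi-squared critical value for 8 degrees of freedom at the 5% level
# (assumed threshold; pick whatever significance level suits the audit).
CHI2_CRITICAL_8DOF_5PCT = 15.507


def benford_chi_squared(counts: dict) -> float:
    """Pearson chi-squared statistic of observed digit counts vs Benford."""
    total = sum(counts.get(d, 0) for d in range(1, 10))
    if total == 0:
        raise ValueError("no observations")
    return sum(
        (counts.get(d, 0) - total * p) ** 2 / (total * p)
        for d, p in BENFORD.items()
    )


def looks_suspicious(counts: dict) -> bool:
    """True if the counts deviate from Benford more than chance suggests."""
    return benford_chi_squared(counts) > CHI2_CRITICAL_8DOF_5PCT
```

A set of made-up figures where every leading digit is equally common would be flagged, while counts close to the table above would pass. As with the law itself, the test only means something on a large data set.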

One more interesting fact is that it applies even if one rescales the data.  For example, if you are looking at accounts in terms of £ sterling and convert them into US dollars or Albanian Lek, the rescaled data still obeys Benford's Law.  Thus if I re-ran my tests and didn't analyse the size of files in bytes but instead used the size in 512 byte blocks, it would still produce a leading digit distribution that obeyed Benford's Law.  Nice.
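
The scale invariance is easy to try out. The sketch below uses the first 1000 powers of two as a stand-in data set (powers of two are known to follow Benford's Law) and multiplies them by an arbitrary factor of 7 to play the part of a change of units. Both the data set and the factor are assumptions chosen to keep the example deterministic; they are not from the post.

```python
import math
from collections import Counter


def leading_digit(n: int) -> int:
    """First decimal digit of a positive integer."""
    return int(str(n)[0])


def digit_frequencies(values):
    """Map each digit 1..9 to its share of the values."""
    counts = Counter(leading_digit(v) for v in values)
    total = sum(counts.values())
    return {d: counts[d] / total for d in range(1, 10)}


def max_deviation_from_benford(freqs) -> float:
    """Largest absolute gap between observed and Benford frequencies."""
    return max(abs(freqs[d] - math.log10(1 + 1 / d)) for d in range(1, 10))


powers = [2 ** n for n in range(1, 1001)]    # stand-in data set
rescaled = [7 * value for value in powers]   # arbitrary "change of units"

print(max_deviation_from_benford(digit_frequencies(powers)))
print(max_deviation_from_benford(digit_frequencies(rescaled)))
```

Both printed deviations should be small, which is the point: rescaling shuffles which values carry which leading digit, but the overall distribution stays put.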

How can we apply this in computing? Perhaps we could use it to detect tampering with the sizes of a large set of files.  Who knows?  I am sure somebody can think of a useful way to use it.   I just find it all rather fascinating.

rvr

Fernando Tricas always has interesting things to say. In a recent post he talks about The life of links and digital content (Spanish):

«We tend to assume that digital [content] is forever. But anyone who accumulates enough information also knows that sometimes it's difficult to find, in other cases it breaks and, of course, there is a non-zero probability that things go wrong when it is hosted by third-party services. It is an old topic here; remember Will we have all this information in the future? The topic is in the news again in light of the article A Year After the Egyptian Revolution, 10% of Its Social Media Documentation Is Already Gone».

In the comments, Anónima said: «Given a time t and an interval Δt, the larger Δt is, the more likely it is that any information from time t-Δt that you want to find is gone». This sounded like a statement worth checking, so I decided to run an experiment with my del.icio.us bookmarks.

In delicious.com/rvr I have archived around 4000 links since 2004. So I downloaded the backup file, an HTML file with all the links and their metadata (date, title, tags). I wrote a Python script to process this file: it goes through the links and saves the current status of each one (whether the link is alive or not). A second script then processed the statuses to generate the statistics. These are the results:

[Screenshot, 2012-04-12 01:02:39]

As can be seen, there is a correlation between the age of a link and the probability of it being dead. As for the 10% figure cited for the Egyptian revolution, in the case of my delicious bookmarks we must go back three years (to 2009) to reach it. Going back six years, a quarter of the links are now defunct. Of course, the sample is very small and shouldn't be taken as representative. It would be interesting to compare it with other accounts and to extend the time span: how many links are still alive after 10 or 15 years? Is it the same with information stored in other media? Are all these dead links resting in peace on some forgotten Google cache disk?

I imagine that sometime in the future, librarians will begin to worry not only about digitizing documents from the remote past, but also about preserving those of the present.

In case you are interested, the code that generates this data is available at github.com/vrruiz/delicious-death-links. The spreadsheet is also available in Google Docs.
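
For readers who just want the gist, the two steps described earlier (check each link, then aggregate by age) could look roughly like the sketch below. It is not the code from the repository. It assumes the backup file follows the usual browser bookmark export layout, with each link in an A tag carrying HREF and ADD_DATE (seconds since 1970) attributes, and it treats any failed fetch as a dead link. All names are made up for illustration.

```python
import http.client
import urllib.request
from collections import defaultdict
from datetime import datetime, timezone
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect (url, add_date) pairs from <a href=... add_date=...> tags."""

    def __init__(self):
        super().__init__()
        self.bookmarks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        added = attributes.get("add_date")
        if href and added and added.isdigit():
            self.bookmarks.append((href, int(added)))


def parse_bookmarks(html_text: str):
    """Return a list of (url, add_date_epoch_seconds) from the backup."""
    parser = BookmarkParser()
    parser.feed(html_text)
    return parser.bookmarks


def is_alive(url: str, timeout: float = 10.0,
             opener=urllib.request.urlopen) -> bool:
    """True if the URL can be fetched; HTTP errors and timeouts count as dead."""
    try:
        with opener(url, timeout=timeout):
            return True
    except (OSError, ValueError, http.client.HTTPException):
        return False


def dead_fraction_by_year(bookmarks, check=is_alive):
    """Map each bookmarking year to the fraction of its links now dead."""
    totals = defaultdict(int)
    dead = defaultdict(int)
    for url, added in bookmarks:
        year = datetime.fromtimestamp(added, tz=timezone.utc).year
        totals[year] += 1
        if not check(url):
            dead[year] += 1
    return {year: dead[year] / totals[year] for year in sorted(totals)}
```

A link that redirects to a parking page would still count as alive here, so the real death rate is probably somewhat higher than a check like this reports.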

David

Quoting the Ubuntu philosophy, one of our core values is to provide the ability for every computer user to use Ubuntu in their language of choice. This in turn is made possible by an army of volunteer translators who, throughout the development cycle and beyond, tirelessly put their translation skills to work in an outstanding feat to make a full operating system accessible to millions.

As we’re ramping up to the Ubuntu 11.10 release in a few days’ time, there’s another important milestone for ensuring Ubuntu is available in as many languages as possible: the translations deadline on the 6th of October.

Up until now, and considering the 80% coverage cut-off, Ubuntu 11.10, the Oneiric Ocelot, is translated into 38 languages, led by the Slovenian team’s heroic effort to become the #1 team in the ranking.

Making Oneiric the best translated Ubuntu release ever

Last cycle Ubuntu was fully translated in 43 languages. I think this cycle we should be able to aim for more, and I’m confident that with everyone’s help we could reach the 50 fully translated languages mark.

There are a few languages that are very close to reaching the 80% translation level:

Basque, Latvian, Hebrew, Uyghur, Albanian, Estonian, Bengali, Punjabi

And others, currently in the 60% to 70% range, which might need an extra push to reach 80%:

Serbian Latin, Hindi, Indonesian, Tamil, Thai, Telugu, Slovak, Arabic, Belarusian, Gujarati

So if you speak any of these or other languages, here’s what you can do to help yours reach the 80% level and make it to the list of supported languages:

  1. Go to the Ubuntu 11.10 translation statistics page
  2. Click on your language to find out which packages need attention
  3. Find those packages in the list of Ubuntu translations
  4. Translate them!
    • You’ll want to contact the translation team for your language or check out their documentation to ensure you’re using a consistent terminology
    • They’ll also help you get started with translations and answer your questions

Note: the translations statistics are updated daily at 12:00 UTC.

More on translations

And now for something different

If there is any web guru out there who’d like to lend a hand, help with the CSS and the JS code for the stats page would be greatly appreciated.

One cool thing I’d like to do, for instance, is to let translators, once they’ve clicked on their language, click on a package that needs attention and be taken to the corresponding Launchpad Translations page. This only needs the corresponding rows in the table to be linkified, which is something I’ve been struggling with and which I’m sure would be a five-minute job for an experienced web developer.

So if you want to help translators with your web skills, drop a comment here or feel free to submit a bzr branch. Thanks!

Looking forward to the best translated Ubuntu release ever! :-)

