One of the biggest mysteries in astrophysics is dark matter. Dark matter cannot be seen: it neither emits nor reflects light. But we infer its existence because dark matter has mass and alters the paths of stars and galaxies. Cablegate has its own dark matter.

According to WikiLeaks, Cablegate comprises 251,287 communications. But what is the real volume of cables between the embassies and the Secretary of State? Can we estimate it? Yes, and in a simple way: the methodology explained below gives an estimate of the total number of communications between the embassies and the Secretary of State.

These are the results.

The dark matter of the Embassies.

[Chart: 20101224cablegate-darkmatter.001]

Between 2005 and 2009, more than 400,000 non-leaked cables can be identified. In this case the uncertainty is larger than for a single embassy, because of the small number of released cables. The sum increased by 50% in just one week.

Curiously, the average size of the 1,800 published cables is 12 KB. If this average is representative of the whole set, something I doubt, the total size of the 250,000 messages would be about 3 GB (250,000 × 12 KB).

Secretary of State.

In addition to the embassies' communications, Cablegate has some cables from the Secretary of State. These messages are often quite interesting, because they request information from the embassies or send them instructions (e.g. 09STATE106750).

[Chart: 20101224cablegate-darkmatter.002]

For 2005 and 2006 there is no released cable, so the sum cannot be estimated. But between 2007 and 2009 the volume of cables sent by the Secretary of State is remarkable (so large that I wondered whether the record number was really an ordinal number rather than a more sophisticated identifier). Compare this graph with the one for the embassies. 2007 shows more cables from the Secretary than from all the embassies combined, but beware: this trend could be reversed with better data.

These results are available in Google Docs.

Madrid Embassy.

This is the chart for the Madrid Embassy, which ranks seventh in the number of leaked cables.

[Chart: 20101224cablegate-darkmatter.003]

Between 2004 and 2009, the existence of at least 17,000 dispatches sent from Madrid can be deduced. In the same period there are just 3,500 leaked cables. The graph shows the breakdown by year. A high percentage of the 2007 cables were leaked; the opposite is true for 2004 and 2005. Also, the number of communications decreases progressively (why? Maybe other networks are used instead of SIPRNet). The complete table is available in Google Docs.

Cablegate Dark Matter Howto

The Guardian published a text file with the dates, sources and tags of the 250,000 diplomatic cables included in Cablegate. The content of these messages is being released slowly. (Using these short descriptions, I analyzed the messages related to Spain, tagged SP, and suggested the existence of communications related to the 2004 Madrid bombings and the Spanish Internet Law. Later, El País published these cables, confirming the suspicions.)

The methodology for inferring the volume of communications is quite simple. Each cable has an identifier. For example, 04MADRID893 summarizes the Madrid bombings of March 11th, 2004. This identifier can be broken into three parts:

  • 04: Year of the cable (2004).
  • MADRID: Origin (the Embassy in Madrid).
  • 893: Record number?
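
The split above can be sketched in a few lines of Python. This is only an illustration, not the code from the cablegate-sp repository: the name `parse_cable_id`, the `CableId` type and the regular expression are my own assumptions about the layout (two digits, upper-case letters, then digits). The year is kept as the two digits written in the identifier rather than being expanded to four.

```python
import re
from typing import NamedTuple

# Assumed layout: two-digit year, upper-case origin, decimal record number.
CABLE_ID = re.compile(r"^(\d{2})([A-Z]+)(\d+)$")


class CableId(NamedTuple):
    year: str    # two-digit year as written in the identifier, e.g. "04"
    origin: str  # e.g. "MADRID" or "STATE"
    record: int  # e.g. 893


def parse_cable_id(cable_id: str) -> CableId:
    """Split an identifier such as '04MADRID893' into its three parts."""
    match = CABLE_ID.match(cable_id.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized cable id: {cable_id!r}")
    year, origin, record = match.groups()
    return CableId(year, origin, int(record))


print(parse_cable_id("04MADRID893"))
print(parse_cable_id("09STATE106750"))
```

Identifiers that do not follow the assumed layout raise a `ValueError` instead of being silently misread.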

What is that record number? Let's investigate. Some cables were sent from the Madrid Embassy in December 2004, such as 04MADRID4887 (dated December 29, 2004). Its record number is "4887". Another message, sent in February, has the ID 04MADRID527, record number "527". Looking at other cables dated in January, it seems clear that the record number starts at 1 and goes up, one by one, through the year. The record number is a simple ordinal value. Thanks to this rule, and by reading the last Madrid Embassy cables of December 2004, we know it sent about 4,900 cables that year alone.

Ideally, the last cable of the year from each embassy would be available, but the Cablegate data is not complete. Only a fraction of the leaked messages has been published so far, and the last cables of the year may not be part of Cablegate anyway. The highest record number available therefore gives only a lower bound but, as the charts show, the method still allows an approximation.

The code used for the calculations is available on GitHub (cablegate-sp) under a BSD license.
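
A minimal sketch of this rule, assuming the identifiers are available as plain strings: for each origin and year, keep the highest record number seen. The function name `yearly_volume` is hypothetical, and this is not the code from the repository.

```python
from collections import defaultdict


def yearly_volume(cable_ids):
    """Map (year, origin) to the highest record number seen.

    If record numbers start at 1 and grow by one per cable, the highest
    number seen is a lower bound on the cables sent by that origin that year.
    """
    highest = defaultdict(int)
    for cable_id in cable_ids:
        # Assumed layout: 2-digit year, origin letters, then the record number.
        year, rest = cable_id[:2], cable_id[2:]
        origin = rest.rstrip("0123456789")
        record = int(rest[len(origin):])
        key = (year, origin)
        highest[key] = max(highest[key], record)
    return dict(highest)


# Identifiers quoted in this post.
sample = ["04MADRID527", "04MADRID893", "04MADRID4887", "09STATE106750"]
print(yearly_volume(sample))
```

With the four identifiers quoted in this post, Madrid 2004 comes out at 4887, in line with the figure of about 4,900 given above.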
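
The "dark matter" then follows by subtraction: the highest record number seen, minus the number of distinct cables actually available. The sketch below makes that explicit for one origin and year; `dark_matter` is a hypothetical helper of mine, and the input is just the three 2004 Madrid record numbers quoted above, not a full data set.

```python
def dark_matter(records_seen):
    """Return (estimated_total, available, missing) for one origin and year.

    estimated_total is the highest record number seen, a lower bound on the
    real total under the assumption that numbering starts at 1.
    """
    distinct = set(records_seen)
    if not distinct:
        raise ValueError("at least one record number is needed")
    estimated_total = max(distinct)
    available = len(distinct)
    return estimated_total, available, estimated_total - available


# Record numbers of 04MADRID527, 04MADRID893 and 04MADRID4887.
print(dark_matter([527, 893, 4887]))
```

Because the true last record number of the year may be higher than anything available, the third value is itself a lower bound on the missing cables.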

Out of sight, out of mind.

One month after the first cable release, only two thousand messages have been published. At this rate it will take a decade to release all the Cablegate content. Maybe not all messages are as relevant as those released so far (e.g. boring messages about visas). But if WikiLeaks has raised such a stir with just 2,000 cables, I cannot imagine what other secrets remain among the thousands still unpublished (although top-secret cables use other networks).

Anyway, I'm sure there is still a lot of data mining to be done on the cables.

PS (December 30th, 2010): Ricardo Estalmán linked to this entry on Wikipedia about the German tank problem during World War II:

«Suppose one is an Allied intelligence analyst during World War II, and one has some serial numbers of captured German tanks. Further, assume that the tanks are numbered sequentially from 1 to N. How does one estimate the total number of tanks?»

The Cablegate case is quite similar. I will update the estimate with the formula cited in that article as soon as possible (Xmas days!).
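
For reference, here is a sketch of one estimator commonly quoted for the German tank problem: with k distinct serial numbers observed and m the largest of them, the estimate is m + m/k − 1. Whether this is the right formula for the cables is an assumption on my part. It presumes the observed numbers are a uniform random sample drawn without replacement, which released cables need not be. The function name is hypothetical.

```python
def german_tank_estimate(serials):
    """Estimate N from serial numbers sampled from 1..N.

    Uses m + m/k - 1, where m is the largest serial observed and k is the
    number of distinct serials. Assumes a uniform sample without replacement.
    """
    distinct = set(serials)
    if not distinct:
        raise ValueError("at least one serial number is needed")
    k = len(distinct)
    m = max(distinct)
    return m + m / k - 1


# Textbook-style sample: four serials with a maximum of 60.
print(german_tank_estimate([19, 40, 42, 60]))
```

Compared with simply taking the highest record number, this estimator adds the average gap between observed numbers, so it corrects upward when few cables are available.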

