Wrangling dirty data with messytables.

22 October 2012

One of the largest data collection projects we have done so far has been the consolidation of the UK’s departmental expenditure. Over 370 different government entities have published a total of more than 7000 spreadsheets. Many of those have obviously been hand-crafted or at least manually processed. Our goal was to consolidate the contained information into a single spreadsheet, discarding all the eccentricities included by the individual publishers.

messytables is a simple Python library that tries to extract tabular contents from spreadsheet documents created by human editors. Often, even files released as CSV or Excel are still not easy to parse programmatically. Some people like to start off spreadsheets with a title column or some metadata, while others use inappropriate formats to represent numbers or dates.

The tool offers a set of functions that help to make parsing data easier:

A header detector tries to determine which row in a spreadsheet contains the actual header definitions (as opposed to any trailing content).
Type detection attempts to guess the data type for each column, including a wide range of commonly used date formats.
Streaming support means that extremely large tables can be processed without loading all of the data into memory.
And, of course, it reads a range of spreadsheet types, from trusty CSV to Excel and even OpenOffice formats.
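
To make these ideas concrete, here is a minimal sketch of header detection, type guessing and streaming, using only the Python standard library. It is not how messytables implements them: the sample data, the function names (guess_header_offset, parse_cell, guess_types, read_table) and the heuristics are assumptions made up for this illustration.

```python
import csv
import io
from collections import Counter
from datetime import datetime
from itertools import chain, islice

# Hypothetical sample: a title line and an empty line precede the real header.
MESSY_CSV = """Departmental spending over 25k,,
,,
Date,Supplier,Amount
01/04/2012,Acme Ltd,"1,200.50"
15/04/2012,Widget Co,300.00
"""

# Assumed date formats for this sketch; a real tool would try many more.
DATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d")


def guess_header_offset(sample):
    """Return the index of the first row with the most common non-empty width."""
    widths = [sum(1 for cell in row if cell.strip()) for row in sample]
    modal_width = Counter(w for w in widths if w).most_common(1)[0][0]
    return widths.index(modal_width)


def parse_cell(value):
    """Try integer, decimal, then date; fall back to the stripped string."""
    text = value.strip()
    number = text.replace(",", "")  # drop thousands separators
    for cast in (int, float):
        try:
            return cast(number)
        except ValueError:
            pass
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            pass
    return text


def guess_types(rows, width):
    """Pick, per column, the most frequent type name among non-empty cells."""
    votes = [Counter() for _ in range(width)]
    for row in rows:
        for i, cell in enumerate(row[:width]):
            if cell.strip():
                votes[i][type(parse_cell(cell)).__name__] += 1
    return [v.most_common(1)[0][0] if v else "str" for v in votes]


def read_table(fileobj, sample_size=50):
    """Lazily yield one dict per data row, skipping everything above the header."""
    reader = csv.reader(fileobj)
    sample = list(islice(reader, sample_size))  # only the sample is buffered
    offset = guess_header_offset(sample)
    headers = [h.strip() for h in sample[offset]]
    # Continue with the rest of the stream after the buffered sample.
    for row in chain(sample[offset + 1:], reader):
        if any(cell.strip() for cell in row):
            yield dict(zip(headers, map(parse_cell, row)))
```

Calling list(read_table(io.StringIO(MESSY_CSV))) yields two dictionaries keyed by Date, Supplier and Amount, with the title and empty lines skipped. Because read_table is a generator that buffers only the initial sample, the same call works on a large file opened from disk without holding the whole table in memory.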

We’ve since also started using messytables to load data into the data API of CKAN, where it serves as the ETL layer for the datastore and the related ReclineJS previews.

If you’re interested, check out the messytables documentation and the uk25k scripts which use it to gather UK government finance.

Of course, messytables is not a cure-all: it is only useful for reading data. For other tasks, other tools are a better fit:

tablib, for example, has a fantastic API that makes writing, analyzing and converting data a breeze.
csvkit has a set of command line utilities that should be pre-installed on any computer.

But when it comes to tables that are a complete mess: give it a try!
