Keeping stock: investigative data warehouses

03 August 2015

In this post I want to describe a design for investigative databases. Unlike the tooling that I’ve been working on for influence mapping projects, this approach is intended to be simple, reliable and extensible.

The basic idea is to make sure that all data sources are loaded as tables in a shared, relational database. This includes both large data sources (e.g. company registries) and small snippets of data with only a few lines. And that’s basically it, too: no user interface, no complex data modeling, no cloud hosting.
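As a rough illustration of how little is needed to get a source into the store, here is a minimal sketch in Python. It assumes SQLite as the shared database (the post does not name a specific engine), and the table name `mz_licences`, the helper `load_csv_as_table` and the sample rows are all hypothetical.

```python
import csv
import io
import sqlite3


def quote_ident(name):
    """Quote an SQL identifier so arbitrary source column names survive."""
    return '"{}"'.format(name.replace('"', '""'))


def load_csv_as_table(conn, table, csv_file):
    """Load one source file into its own table, keeping every column as text."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join("{} TEXT".format(quote_ident(col)) for col in header)
    conn.execute("CREATE TABLE {} ({})".format(quote_ident(table), columns))
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        "INSERT INTO {} VALUES ({})".format(quote_ident(table), placeholders),
        reader,
    )
    conn.commit()


# Hypothetical sample data standing in for a real source file.
SAMPLE = "company,country\nAcme Mining Lda,MZ\nExample Gas SA,MZ\n"

conn = sqlite3.connect(":memory:")  # assumption: SQLite as the shared store
load_csv_as_table(conn, "mz_licences", io.StringIO(SAMPLE))
rows = conn.execute(
    'SELECT company FROM "mz_licences" ORDER BY company'
).fetchall()
```

Loading every column as text is a deliberate simplification here: type casting can happen later, in derived columns, once it is clear which fields matter.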

While such a database isn’t of direct use to journalists, it can serve as a workbench for data scrapers and developers who want to explore the data. Analysts can also quickly generate reports for journalists using ad-hoc queries. This way we can begin to analyze the data before investing time into fancy visualizations and interfaces.

Even better: the database you’re building will stick around as you enter new investigations, and you will be able to quickly try out a fuzzy join and see if any of the companies in this week’s dataset also turn up in a past list (or a large database of permanent value). In this way, the data store can become a first step towards a journalistic memory, a long-term archive of relevant knowledge.
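A fuzzy join of this kind might look like the sketch below. It uses Python's standard `difflib` for string similarity and assumes SQLite as the store. The table names `this_week` and `registry`, the helpers `normalize` and `fuzzy_join`, the 0.85 threshold and the sample company names are all made up for illustration.

```python
import sqlite3
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case and replace punctuation with spaces, then collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def fuzzy_join(conn, left, right, threshold=0.85):
    """Return (left_name, right_name, score) for pairs scoring at or above threshold.

    `left` and `right` are (table, column) tuples. Every pair is compared,
    so this is only reasonable for small tables.
    """
    query = 'SELECT "{}" FROM "{}"'
    left_names = [r[0] for r in conn.execute(query.format(left[1], left[0]))]
    right_names = [r[0] for r in conn.execute(query.format(right[1], right[0]))]
    matches = []
    for a in left_names:
        for b in right_names:
            score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
            if score >= threshold:
                matches.append((a, b, round(score, 2)))
    return matches


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE this_week (name TEXT);
    CREATE TABLE registry (company TEXT);
    INSERT INTO this_week VALUES ('ACME Mining, Lda.'), ('Unrelated Fishing Co');
    INSERT INTO registry VALUES ('Acme Mining Lda'), ('Example Gas SA');
""")
matches = fuzzy_join(conn, ("this_week", "name"), ("registry", "company"))
```

For larger tables, comparing all pairs becomes too slow. A database-side approach such as trigram matching would be the more likely choice, but the idea stays the same.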

I first used this pattern for data sites, such as OpenInterests.eu. While the resulting sites contained a lot of data, when asked an analytical question, I would often prefer to query the staging database rather than use the web site. With the exploratory Mozambique extractives project, I finally began to realize that such a raw data warehouse could actually be the primary output of a project, rather than just a useful step on the path there.

Figure: A variety of data sources relevant to exploring extractive industries in Mozambique.

While it is far removed from the hipster universe of civic data, I believe that the world of enterprise data warehousing has a lot to teach us. I first learned about business intelligence while working on OpenSpending: a financial dataset should be subject to strict data governance and exist in a normalized form in a database, prepared for analysis.

But while data warehousing principles dictate that the database be perfectly clean, they also talk about operational data stores, interim databases in which data is loaded, cleaned and enriched in a fairly ad-hoc manner. This might offer a better metaphor for what I’m proposing: a very large and evolving workbench of data.

Some basic rules for such an investigative data warehouse might be:

Keep all the source files (whether they are HTML pages, PDF documents, or API call results) in a public cache that you can link to. One of the most surprising results of the Siyazana project has been how quickly link rot rendered the detailed sourcing of all data on the site nearly useless.
Keep Makefiles for fetching, cleaning and loading the data into the data store. The point is to have a well-documented workflow that makes it easy to understand the source of a particular table and to re-create the whole database if needed.
Load a full source dataset into one or multiple tables, but never into a table that holds information from another source (even if the schema is similar or identical).
Keep the original data in each table, and add derived or cleaned information in extra tables or columns. If you want to delete entries, use a flag instead of actually deleting rows.
Do not integrate data without necessity. Don’t try to make sure that all of your tables join up perfectly before you ever load them. Load the source form and then add linkages and join tables as you learn that you actually have a need for them.
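The last three rules can be sketched in a few SQL statements, run here through Python's `sqlite3` module. This is only an illustration under assumptions: SQLite stands in for the shared database, and the tables `src_registry`, `src_leak` and `link_registry_leak`, their columns and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: SQLite as the data store
conn.executescript("""
    -- One table per source, even though the two schemas are identical.
    CREATE TABLE src_registry (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE src_leak (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO src_registry (name) VALUES ('Acme Mining Lda'), ('test row');
    INSERT INTO src_leak (name) VALUES ('ACME MINING LDA ');

    -- Cleaned values go into an extra column; the original stays untouched.
    ALTER TABLE src_registry ADD COLUMN name_clean TEXT;
    ALTER TABLE src_leak ADD COLUMN name_clean TEXT;
    UPDATE src_registry SET name_clean = lower(trim(name));
    UPDATE src_leak SET name_clean = lower(trim(name));

    -- Flag unwanted rows instead of deleting them.
    ALTER TABLE src_registry ADD COLUMN is_deleted INTEGER NOT NULL DEFAULT 0;
    UPDATE src_registry SET is_deleted = 1 WHERE name = 'test row';

    -- A linkage table, added only once the need for it became clear.
    CREATE TABLE link_registry_leak (registry_id INTEGER, leak_id INTEGER);
    INSERT INTO link_registry_leak
        SELECT r.id, l.id
        FROM src_registry r
        JOIN src_leak l ON r.name_clean = l.name_clean
        WHERE r.is_deleted = 0;
""")

# The link table joins the two sources; original names are still intact.
linked = conn.execute("""
    SELECT r.name, l.name
    FROM link_registry_leak k
    JOIN src_registry r ON r.id = k.registry_id
    JOIN src_leak l ON l.id = k.leak_id
""").fetchall()

# Flagged rows remain in the table and can be restored or audited later.
total_rows = conn.execute("SELECT count(*) FROM src_registry").fetchone()[0]
```

Because nothing is overwritten or removed, any cleaning step can be revisited later without re-fetching the source.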

These are obviously very basic rules, but they’ve proven useful in some projects now, where they’ve helped to create a valuable database beyond the scope of a single inquiry.
