ReGENESIS: German statistics as raw data

08 August 2013

A choropleth map to indicate the availability of high-quality, machine readable statistical data in ReGENESIS.

One of the first tasks I was given by Spiegel Online was to make a set of simple maps to display basic statistics about Germany - things like population, unemployment or insolvencies. As Germany’s statistical data are collected in a system called GENESIS, I thought that this would be trivial. I’d just have to write a script to grab the tables once a month, convert them to JSON and thus update the maps.

Unfortunately, while the GENESIS interface offers downloads, they are both hard to access (through an arcane and untested SOAP interface) and hard to parse. Essentially, the tables are reports which have been laid out manually, and getting a predictable data series out of them pretty much requires you to write a bespoke parser for each table.

So I decided to solve this issue for others as well and make ReGENESIS, a service and toolkit to provide clean and well-structured data from the German statistical services.

This was inspired by some great examples of similar projects in other countries: Census.IRE.org provides a lot of structured data around the US census, and the CensusReporter project is now thinking this through a lot further. I’m also really impressed by the work that Brian and the @csvsoundsystem have been doing on treasury.io, a convenient data source with a ScraperWiki-based SQL query endpoint and client bindings for a variety of languages.

ReGENESIS is powered by a collection of Python scripts available on GitHub. The scripts will first scrape bulk data exports from the official site and store them locally. These are then processed and loaded into a database, retaining a rich set of metadata as well as the actual observations. Then, the database contents are dumped to CSV file extracts, two for each dataset:

A researcher’s version with human-readable column names that make it easy to use in a spreadsheet program for manual analysis.
A raw version with more detail and machine-friendly column names, easier to parse for further processing.
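
The two extracts could be produced along these lines. This is a minimal sketch, not the actual ReGENESIS code: the table layout, the column names, the LABELS mapping and the sample row are all invented for illustration, and an in-memory SQLite database stands in for the real one.

```python
import csv
import io
import sqlite3

# Hypothetical mapping from machine-friendly column names to readable labels.
# Columns without a label only appear in the raw extract.
LABELS = {
    "region_id": "Region code",
    "region_name": "Region",
    "year": "Year",
    "pop_total": "Population (total)",
}

# Hypothetical raw layout: everything above plus a quality flag.
RAW_COLUMNS = ["region_id", "region_name", "year", "pop_total", "pop_total_quality"]


def dump_csv(conn, table, columns, labels=None):
    """Return the given columns of a table as CSV text.

    If labels is given, the header row uses the readable labels;
    otherwise it uses the column names as they are.
    Table and column names are trusted here (sketch only).
    """
    header = [labels.get(c, c) for c in columns] if labels else list(columns)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    query = "SELECT {} FROM {}".format(", ".join(columns), table)
    writer.writerows(conn.execute(query))
    return out.getvalue()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE population (region_id TEXT, region_name TEXT, "
    "year INTEGER, pop_total INTEGER, pop_total_quality TEXT)"
)
# Invented sample observation, not real data.
conn.execute("INSERT INTO population VALUES ('R01', 'Example region', 2011, 1000, 'e')")

raw_csv = dump_csv(conn, "population", RAW_COLUMNS)
researcher_csv = dump_csv(conn, "population", list(LABELS), labels=LABELS)
```

Keeping the label mapping separate from the query means both files come from the same rows, and only the header and the column selection differ.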

Flask then helps render a simple user interface to flat files to represent the metadata. Finally, the entire site is uploaded to Amazon S3 so that no server is required to serve any of the content. This makes ReGENESIS easy to maintain: all I need to do is run the extractors once a week to make sure that we’re offering the latest data.

Not really related, but that TV show was a lot of fun.

What’s next?

Obviously, ReGENESIS is in a very early prototype stage and a lot of the use cases and usability haven’t really been ironed out yet. Beyond that, there are plenty of ideas for the future.

Go federal: At the moment, I’m only importing data from the Regionalstatistik portal which publishes statistics from state level authorities. The much larger GENESIS database operated by the federal statistical office has its bulk export function locked down and requires a EUR 500 annual subscription. Maybe this could be an opportunity for an open data kickstarter?

Have an API: ReGENESIS holds some fairly large tables, and in order to pull them into interactive graphics or other client applications it would be nice to serve filtered and aggregated versions instead of the full data. I’m somewhat reluctant to run a server for this (something like Stefan Urbanek’s cubes), but most of the hosted data API tools I’ve checked out so far are either too expensive or very limited in terms of capacity.
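
To make "filtered and aggregated" a bit more concrete, here is a rough sketch of the kind of operation such an API would perform on request. It does not reflect cubes or any hosted service; the aggregate function, the field names and the numbers are invented for illustration.

```python
from collections import defaultdict


def aggregate(rows, group_by, measure, filters=None):
    """Sum `measure` per value of `group_by`, over rows matching all filters."""
    filters = filters or {}
    totals = defaultdict(float)
    for row in rows:
        if all(row.get(key) == value for key, value in filters.items()):
            totals[row[group_by]] += row[measure]
    return dict(totals)


# Invented observations: one row per district and year.
rows = [
    {"state": "State X", "district": "A", "year": 2011, "insolvencies": 10},
    {"state": "State X", "district": "B", "year": 2011, "insolvencies": 5},
    {"state": "State X", "district": "A", "year": 2012, "insolvencies": 7},
    {"state": "State Y", "district": "C", "year": 2011, "insolvencies": 3},
]

# Filter to one year, then roll districts up to state level.
per_state_2011 = aggregate(rows, "state", "insolvencies", {"year": 2011})
```

A client drawing a state-level chart would then receive two numbers instead of the full district table.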

Rank notifications: when I pitched ReGENESIS at a data journalism meetup earlier this year, one request was to ease access to local statistics for reporters at regional papers. This could, for example, be done through email alerts which notify journalists when the relative rank of their region on any of the major statistics changes significantly.
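
The core of such an alert could look roughly like this. It is only a sketch of the idea: the function names, the threshold of two places as the meaning of "significant", and the figures are all assumptions, and sending the actual email is left out.

```python
def ranks(values):
    """Map each region to its rank, where 1 is the highest value.

    Ties are broken by the order in which regions appear in `values`.
    """
    ordered = sorted(values, key=values.get, reverse=True)
    return {region: position + 1 for position, region in enumerate(ordered)}


def rank_changes(previous, current, threshold=2):
    """Return {region: (old_rank, new_rank)} for regions whose rank
    moved by at least `threshold` places between two periods."""
    old, new = ranks(previous), ranks(current)
    return {
        region: (old[region], new[region])
        for region in new
        if region in old and abs(new[region] - old[region]) >= threshold
    }


# Invented unemployment rates for four regions in two periods.
previous = {"A": 9.1, "B": 7.4, "C": 6.0, "D": 5.2}
current = {"A": 5.0, "B": 7.6, "C": 6.1, "D": 6.0}

alerts = rank_changes(previous, current)
```

Here only region A would trigger a notification, since it drops from first to fourth place while the others move by a single place.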

Map it out: just before this release, I was contacted by Felix, one of the StadtLandCode grantees. He’s been working on getting regional statistics for a while and has done a lot of mapping work to generate customized maps from the data. As ReGENESIS gives him the data in the form he needs, we’ve agreed to cooperate on integrating his GeoJSON map layers with the service.

Of course, I can’t do all of this on my own. That’s why I’m releasing this early: for you to get on board now and to try it out, to contribute your use cases and, of course, your code!

