datawi.re: when the data mountain comes to you

06 November 2013

For a few months now, Annabel and I have been working on datawi.re, an effort to create a better way for journalists to keep track of data feeds.

If This Then News, a prototype we hacked on during the OpenNews introduction in January.

The tool would create a simple way for users to list people, companies and institutions that they want to track across different sources of data. Such sources could include government feeds for procurement, spending or public journals, or other public data such as corporate ownership information.

One benefit of such a tool would be to make more public data accessible to end-users at the time when it first appears online by feeding the latest scraper output directly into datawi.re. Any data source could be hooked up to datawi.re, allowing for connections between data feeds to be revealed across a broadening repertoire of data. It would also encourage journalists to become more systematic in creating sets of people and organisations that are of interest to them.

Unfortunately, this is not a launch post. While the datawi.re prototype is functional, it is also clear that - on a fairly fundamental level - it doesn’t yet do the right thing.

The first mistake we made with datawi.re, I believe, was its choice of metaphor: datawi.re would provide an interface similar to that of Twitter, allowing users to ‘follow’ people and organizations across data feeds. The data itself was templated into short text snippets to prevent the service from bombarding users with tabular nonsense.
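To make the templating idea concrete, here is a minimal sketch using Python's standard `string.Template`. The template text, the field names and the sample record are assumptions made up for illustration; they are not taken from the datawi.re codebase.

```python
from string import Template

# Hypothetical template for a procurement feed; the field names are assumptions.
CONTRACT_TEMPLATE = Template(
    "$buyer awarded a contract worth $amount $currency to $supplier"
)


def render_snippet(template, record):
    """Turn one tabular record into a short, readable message.

    safe_substitute leaves a placeholder such as $supplier in place
    when the record lacks that field, instead of raising an error.
    """
    return template.safe_substitute(record)


# Made-up sample record, for illustration only.
record = {"buyer": "Example Ministry", "supplier": "G4S", "amount": 500, "currency": "EUR"}
print(render_snippet(CONTRACT_TEMPLATE, record))
```

Each data source would need its own template, but nothing beyond that has to be known about the source's schema.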

The datawi.re interface showing matches for a tracking search on G4S, a security contractor.

This model has a number of advantages: fully streamed messages don’t require datawi.re to know much about the structure and semantics of each data source. Essentially it would just grep through the data and then render matching data records in the user’s browser. It was important to me that datawi.re should not become another data catalogue or data warehouse, and using a streaming metaphor seemed like a convenient way of avoiding deep interaction with the data.
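The "grep" approach can be sketched in a few lines of Python. This is an illustration under simple assumptions (records arrive as dictionaries, tracked entities are plain names, matching is a case-insensitive substring test), not the actual datawi.re matching code.

```python
import json


def flatten_text(record):
    """Serialise the whole record to one lowercase string.

    Keys and values are both included, so the matcher needs no
    knowledge of which column holds a name.
    """
    return json.dumps(record, ensure_ascii=False).lower()


def match_record(record, tracked):
    """Return the tracked names that occur anywhere in the record."""
    haystack = flatten_text(record)
    return [name for name in tracked if name.lower() in haystack]


# Made-up sample data, for illustration only.
tracked = ["G4S", "Example Corp"]
record = {"supplier": "G4S Secure Solutions", "amount": 500}
print(match_record(record, tracked))
```

A substring test like this will also match names embedded in longer words, which is one reason a real service would need a per-source matching heuristic.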

On the other hand, the stream nature of datawi.re also meant that the service could not easily perform any aggregate analysis on the data - such as trend and outlier detection. A government contract for 500 EUR would be shown with the same weight as one for 500 million EUR. Essentially, datawi.re would be able to detect the topic of a data item, but without seeing any context or having an understanding of the data, it would inevitably fail to estimate its significance.
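A small sketch shows why context matters. Assuming the service kept a history of contract amounts for a source (which a pure stream does not), even a crude ratio against the median would separate the two contracts. The function name and the sample history are hypothetical.

```python
from statistics import median


def significance(amount, history):
    """Ratio of an amount to the median of previously seen amounts.

    Returns None when there is no usable history, which is the
    situation a purely streaming service is always in.
    """
    if not history:
        return None
    typical = median(history)
    if typical == 0:
        return None
    return amount / typical


# Made-up history of earlier contract values in EUR.
history = [400, 500, 650, 800, 1200]
print(significance(500, history))          # below 1: an unremarkable contract
print(significance(500_000_000, history))  # far above 1: worth a closer look
print(significance(500, []))               # None: no context, no estimate
```

This is only one possible heuristic; proper trend or outlier detection would require the service to understand each dataset in far more depth.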

Another issue with the stream is the difficulty that users have in discovering entities to track. They don’t have good visibility on the existing data, as matches only show up for existing search words. In the full data stream, there are simply too many options, and identifying the overlaps between what a user is interested in and what can reasonably be surfaced in a dataset is a hard task. In fact, it is probably the core activity of much of journalism.

One final issue for datawi.re is the lack of data infrastructure to support it. While the site itself doesn’t need to know too much about the data sources feeding into it, each source needs to be set up individually to submit data to datawi.re. Still, we don’t know very much about whether a data source is still active or in need of maintenance. It would be great if datawi.re didn’t have to build all of this infrastructure but could plug into an existing system, such as Max Ogden’s dat.
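The missing monitoring could start out as something very simple. The sketch below assumes the service records the time of each source's last submission and flags any source that has been silent for longer than a chosen threshold. The source names and the seven-day default are assumptions, and this is not how datawi.re or dat actually track sources.

```python
from datetime import datetime, timedelta


def stale_sources(last_seen, now, max_age=timedelta(days=7)):
    """Return the names of sources whose last submission is older than max_age."""
    return sorted(
        name for name, seen in last_seen.items() if now - seen > max_age
    )


# Made-up submission log, for illustration only.
last_seen = {
    "procurement-feed": datetime(2013, 11, 5, 9, 0),
    "spending-feed": datetime(2013, 9, 1, 12, 0),
}
print(stale_sources(last_seen, now=datetime(2013, 11, 6, 9, 0)))
```

A silent source is not necessarily broken, since some feeds publish rarely, so the threshold would have to be set per source.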

All of this is not to say that datawi.re is dead, but I think the project needs to be significantly re-engineered to overcome these issues. To that end, I’m hoping to go back to the drawing board and collect a larger set of specific use cases - each including the specific dataset, search criteria and matching heuristic for a data notification service that is actually able to surface meaningful matches.

Of course, datawi.re is available as an open source codebase for everyone to play with, and I’m keen to hear any ideas that people may have for the direction in which it should be carried next.
