dataissues.org - public issue tracking for data defects

10 July 2012

On June 21st, the Knight News Challenge Round on Data ended. The day before, Rufus, Ross and I sat down to write out some ideas that we’d been discussing for a while. While we submitted proposals for Grano and DataProtocols, we decided to hold back on this idea for another round. Still, sharing is caring.

1. What do you propose to do? [20 words]

We’ll create a web service where data wranglers and consumers can log errors arising from processing, viewing or using data.

2. How will your project make data more useful? [50 words]

All data has errors. While data quality is often talked about, common practice for data apps is to offer little more than half a paragraph on the ‘about’ page. We want to build a service that is useful to data wranglers, but can also serve as documentation for end-users and as a basis for further discussion.

3. How is your project different from what already exists? [30 words]

Error reporting for software is either done as task tickets (e.g. github.com) or by capturing raw application output (e.g. exceptional.io). For data, we want to combine these two approaches to let users group recurring errors into issues that can then be discussed and fixed.
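
As a rough illustration of what grouping recurring errors into issues could look like, here is a minimal Python sketch. The report fields (dataset, column, message) and the function names are assumptions made up for this example; they do not describe the actual dataissues prototype.

```python
from collections import defaultdict


def fingerprint(report):
    """Build a grouping key from a raw error report.

    Assumption: two reports belong to the same issue when they share
    a dataset, a column and an error message.
    """
    return (report["dataset"], report["column"], report["message"])


def group_into_issues(reports):
    """Collect raw error reports into issues keyed by their fingerprint."""
    issues = defaultdict(list)
    for report in reports:
        issues[fingerprint(report)].append(report)
    return dict(issues)


# Hypothetical reports, as a processing pipeline might emit them.
reports = [
    {"dataset": "uk-spending", "column": "amount", "message": "not a number", "row": 12},
    {"dataset": "uk-spending", "column": "amount", "message": "not a number", "row": 40},
    {"dataset": "uk-spending", "column": "date", "message": "missing value", "row": 7},
]

issues = group_into_issues(reports)
for key, members in issues.items():
    print(key, len(members))
```

Each resulting issue could then carry its own discussion thread, while the individual reports remain attached as evidence.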

4. Why will it work? [100 words]

While data processing workflows differ from dataset to dataset, the types of errors that occur are often quite similar and can be stored in a shared service. This is immediately useful when doing data work, especially in scheduled, unsupervised processes, and it also provides an activity log for other people to see.

We’ll create both an easy-to-use online validation tool to check spreadsheets against a certain schema and an API with client libraries that can be integrated into existing processing pipelines. The reported issues can be full-out errors, but also probes that highlight implausible values.
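
To make the distinction between full-out errors and probes concrete, here is a minimal Python sketch of checking a spreadsheet (exported as CSV) against a schema. The schema layout, the column names and the plausibility threshold are all assumptions for illustration, not a description of the planned service or its API.

```python
import csv
import io

# Hypothetical schema: column name -> (parser, optional plausibility probe).
SCHEMA = {
    "date": (str, None),
    "amount": (float, lambda value: 0 <= value < 1_000_000_000),
}


def validate(csv_text, schema=SCHEMA):
    """Return a list of issues found when checking CSV text against a schema."""
    issues = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Row 1 is the header, so data rows are numbered from 2.
    for row_number, row in enumerate(reader, start=2):
        for column, (parse, probe) in schema.items():
            raw = row.get(column)
            if raw is None or raw.strip() == "":
                issues.append({"row": row_number, "column": column,
                               "level": "error", "message": "missing value"})
                continue
            try:
                value = parse(raw)
            except ValueError:
                issues.append({"row": row_number, "column": column,
                               "level": "error",
                               "message": "cannot parse as " + parse.__name__})
                continue
            # A probe flags values that parse fine but look implausible.
            if probe is not None and not probe(value):
                issues.append({"row": row_number, "column": column,
                               "level": "probe", "message": "implausible value"})
    return issues


sample = "date,amount\n2012-01-01,100.5\n2012-01-02,abc\n2012-01-03,-5\n,10\n"
found = validate(sample)
for issue in found:
    print(issue)
```

A client library could send each of these issue records to the shared service instead of printing them, so that scheduled pipelines report problems without anyone watching.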

5. Who is working on it? [100 words]

The Open Knowledge Foundation is…

6. What part of the project have you already built? [100 words]

We’ve got extensive experience working with dataset metadata from DataHub.io and have produced a number of complex data processing pipelines (e.g. one for UK spending data that merges over 5000 spreadsheets in different formats). These clearly show the need for better reporting. We have built several ad-hoc solutions, but we know that this is a major area that is inadequately addressed in our work and in that of others. We already have a basic prototype and can build a first increment quickly.

7. How would you use News Challenge funds? [50 words]

We’ll build it! We’ll develop a full version of this service iteratively, then test and promote it. We plan to work together with civic data projects as early adopters to get quick feedback and adapt the service to suit their needs.

8. How would you sustain the project after the funding expires? [50 words]

This is well suited to a SaaS freemium model in which heavy and/or professional users who need to report large numbers of errors and generate complex reports pay a subscription fee. In addition, as open-source software, the project can be re-used and extended by others.

If you think this is a good idea, help us hack on it and contribute patches to the dataissues repository!
