Grano advanced queries, and Linked Data

01 September 2014

At its current stage of development, grano has achieved some level of maturity as an influence mapping toolkit. We’ve got a great workflow for importing raw data, and we’re running a number of different sites off the backend. A new web interface, to make the application more accessible to non-technical users, is in the works.

So it’s time to look at the next big challenges: building out the way in which grano lets journalists ask complex questions of their data, and improving the handling of source and quality metadata. As I’ve pointed out on IJNet, both of these are essential: having a big bunch of integrated data is cool, but only when you have a tool that lets you interrogate those relationships does it become a real journalistic asset.


ConnectedAfrica is in its prototype stage, collecting information about political connections in South Africa.

Unfortunately, the questions that journalists might want to ask against influence networks are often recursive: “show me all the family members of politicians who work for organisations that receive government contracts”, “show me what connects these people to each other”. These are exactly the type of queries that make relational databases cry.
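To make the “what connects these people” question concrete, here is a minimal sketch of how it might look against a relational store. It is not grano’s actual query code: the `relation` table, its columns, the entity names and the function names are all hypothetical, and SQLite stands in for the real database.

```python
import sqlite3

# Recursive common table expression: start at one entity and keep joining
# the relation table onto the frontier until the depth limit is reached.
# The path string doubles as a visited list, so a walk never revisits a node.
# Assumption: entity names do not contain the '/' separator.
PATHS_SQL = """
WITH RECURSIVE walk(node, path, depth) AS (
    SELECT :start, '/' || :start || '/', 0
    UNION ALL
    SELECT r.target, walk.path || r.target || '/', walk.depth + 1
    FROM walk
    JOIN relation AS r ON r.source = walk.node
    WHERE walk.depth < :max_depth
      AND instr(walk.path, '/' || r.target || '/') = 0
)
SELECT path FROM walk WHERE node = :goal ORDER BY depth, path
"""


def build_demo_db():
    """Create an in-memory database with a tiny, made-up influence network."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE relation (source TEXT, target TEXT)")
    links = [("alice", "acme"), ("bob", "acme"),
             ("bob", "carol"), ("carol", "ministry")]
    # Store every link in both directions so it can be walked either way.
    conn.executemany("INSERT INTO relation VALUES (?, ?)",
                     links + [(b, a) for a, b in links])
    return conn


def find_connections(conn, start, goal, max_depth=4):
    """Return every cycle-free path from start to goal, shortest first."""
    params = {"start": start, "goal": goal, "max_depth": max_depth}
    rows = conn.execute(PATHS_SQL, params).fetchall()
    return [row[0].strip("/").split("/") for row in rows]
```

Calling `find_connections(build_demo_db(), "alice", "carol")` walks from one person to the other through the company they share with a third. Every additional hop is another join against the full relation table, which hints at why this kind of question gets expensive quickly on a larger dataset.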

After producing a proof-of-concept query tool based on grano’s relational backend, it became clear that a more flexible approach was needed: queries would easily take seconds, and some would eat up all server memory. At this point, there are two choices: use additional backends to satisfy different types of queries, or migrate to another data model entirely.

Thankfully, the amazing Jun Matsushita was thinking through influence mapping technology choices at the same time, and we had some interesting discussions on graph databases on GitHub. He convinced me to have another look at RDF/Linked Data as a storage mechanism. Unlike, for example, Neo4J, RDF has a variety of ways for attaching provenance to individual statements - a must-have for applications that integrate data from a wide range of sources to find evidence for misconduct and corruption.

At the same time, there is still a total lack of mature (Python) tooling around linked data. This was true when I first experimented with the stuff in 2010, and many libraries haven’t received a single commit since then. Documentation for routine tasks is non-existent, and the standard Python RDF toolkit, rdflib, doesn’t actually connect to the vast majority of available triple stores.

It is clear that nobody is using Python and RDF to build web applications. Ironically, people on public-lod seem to believe that the main challenge to broader adoption of linked data is producing the stuff. In truth, your problems really start when you have RDF data and need to store, query and export it in anything other than Java.

Even so, I managed to get another proof-of-concept of my experimental query API implemented, this time running on top of an RDF-converted copy of the data. The surprise: using Apache Fuseki’s backend was even slower than a SQL database for graph traversal. Queries would easily take two or three times as long as their relational equivalents. I’m sure that I haven’t tuned my queries very well, but performance seems to be degrading proportionally to the size of the dataset. And while other triple stores may be somewhat faster, it’s becoming clear to me that this route isn’t leading anywhere. What’s left is a set of open questions:

Who out there is still building Python tools that use RDF and can talk about making it work?
How does one connect to triple stores like Virtuoso and Stardog from Python in a meaningful way, without having to write a complete binding for even the most trivial operations?
What are good ways to EXPLAIN ANALYZE SPARQL queries and to get a sense of where their complexity is?
If triple stores are slow when aggregating data and doing distinct counts, how can you get multiple solutions for a query, but limit the number of entities they relate to?

So, it is now time to bring grano into split brain mode: use the existing relational database to store provenance and data quality information, while keeping a simplified data model in Neo4J to enable the types of recursive queries that our users need. As always, I need lots of help to hack out the new Neo4J query API: not only to make sure data is synchronised between the two data stores, but also to experiment with query endpoints that might enable interesting journalistic questions and cool data visualisations.
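A rough sketch of that split, under loud assumptions: SQLite stands in for the relational store, a plain in-memory adjacency map stands in for Neo4J, and every table, column and function name here is hypothetical. The point is only the division of labour: the graph side carries nothing but relation ids, and provenance is looked up afterwards in the relational side.

```python
import sqlite3
from collections import defaultdict, deque


def create_store():
    """Relational side: full relation records, including their source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE relation ("
        " id INTEGER PRIMARY KEY, source TEXT, target TEXT,"
        " type TEXT, source_url TEXT)"
    )
    return conn


def load_demo(conn):
    """Insert a few made-up relations with placeholder source URLs."""
    conn.executemany(
        "INSERT INTO relation VALUES (?, ?, ?, ?, ?)",
        [
            (1, "alice", "acme", "employee", "https://example.org/registry"),
            (2, "bob", "acme", "director", "https://example.org/filing"),
            (3, "bob", "carol", "family", "https://example.org/article"),
        ],
    )


def project_graph(conn):
    """Graph side: undirected adjacency lists that only keep relation ids."""
    graph = defaultdict(list)
    query = "SELECT id, source, target FROM relation"
    for rel_id, source, target in conn.execute(query):
        graph[source].append((target, rel_id))
        graph[target].append((source, rel_id))
    return graph


def shortest_connection(graph, start, goal):
    """Breadth-first search; returns the relation ids on a shortest path."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, rel_ids = queue.popleft()
        if node == goal:
            return rel_ids
        for neighbour, rel_id in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, rel_ids + [rel_id]))
    return None  # the two entities are not connected


def with_provenance(conn, rel_ids):
    """Look the path's relations up again in the relational store."""
    query = ("SELECT source, type, target, source_url"
             " FROM relation WHERE id = ?")
    return [conn.execute(query, (rel_id,)).fetchone() for rel_id in rel_ids]
```

In this toy version, synchronisation is simply a matter of calling `project_graph` again after every import. Keeping a real graph database up to date as the relational data changes is the harder part, and is not addressed by the sketch.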
