Notes on Data Catalogue Federation

Over the past six months, I've been involved in the LOD2 project's use case for open government data, an effort to prototype a data catalogue federation platform for data from within the European Union. On May 3/4, OKF will be running a workshop on the same topic in Edinburgh. As I won't be able to attend, here are some notes on requirements and technical alternatives, perhaps as a "scene setter" for the meeting.

Purpose, Objectives

The interest in exchanging catalogue metadata can be explained through various use cases, some of which include:
  • The aggregation of metadata across hierarchies, for example from regional government catalogues towards state and trans-national (EU) levels of administration. This can also be applied within government, where several department catalogues might be aggregated into a common database.
  • Domain-specific catalogues might cover only specific types of content or data format, e.g. data on scientific, financial or legal documents. Still, they might want to retrieve metadata from more generic catalogues or provide updated dataset descriptions upstream.
  • Another common scenario is the interface between institutionally operated catalogues (such as government sites) and those operated from within a civic context: while government has much better information on updates of data and similar source information, the community may be able to contribute better knowledge on re-use opportunities, examples or fully established workflows. As it is not easy for government to allow the direct and wiki-like enrichment of metadata, this enrichment could be done in an external, community-driven catalogue (such as CKAN.net) and then fed back into the main catalogue through a guided process.
  • Aside from the many use cases for full metadata exchange, the centralization of access points in the form of search federation can help users find the right information without first having to find the right search box (see the sketch below).
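To make the search federation idea concrete, here is a minimal sketch of a front end that fans a query out to several catalogues and pools the results. The endpoint URLs and the response shape (a CKAN-style "results" list) are assumptions for illustration; each real catalogue has its own API and response format.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Hypothetical search endpoints; real catalogues each expose their own
# API paths and response formats.
CATALOGUES = {
    "ckan.net": "http://ckan.net/api/search/package",
    "data.gov.uk": "http://data.gov.uk/api/search/package",
}

def federated_search(query):
    """Fan a query out to several catalogues and pool the results."""
    results = {}
    for name, endpoint in CATALOGUES.items():
        url = endpoint + "?" + urlencode({"q": query})
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                payload = json.load(response)
        except OSError:
            continue  # skip catalogues that are unreachable
        # Assumes a CKAN-style response with a list of package names
        # under the "results" key; other catalogues will differ.
        results[name] = payload.get("results", [])
    return results

print(federated_search("spending"))
```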
These use cases motivate the exchange of metadata, in order to allow widespread re-use of metadata, make specific capacities of different catalogues available to each other and to guarantee up-to-date information in data catalogues. The following is a somewhat random list of issues that need addressing for such exchange and federation to yield useful results.

Scope and basic concepts of catalogues

As with almost any other technology, various people expect and implement data catalogues to do many different and often mutually exclusive things. Any kind of exchange mechanism will have to bridge at least some of these gaps:
  • The first aspect is the granularity of data referenced from within the catalogue. A particularly vivid example is bibliographic information, where one might reference the whole database or each individual work. The same is true of statistical databases and, in particular, geospatial information: in the case of the US data.gov portal, 3000-odd regular datasets sit alongside some 290000 pieces of geographic information. Trying to define a dataset (in Richard Cyganiak's words: "a set of data") is probably not helpful; marking packages as large, medium or small would perhaps be more practical.
  • A second aspect is the scope of the involved sites. While we're talking about data catalogues (as opposed to repositories) for the most part, this distinction is soft and often not of great help. Many catalogues (such as Data Publica, Socrata) include the data itself, some even standardize on a specific format. Including those also means looking at actual data stores which provide metadata, i.e. sites like Talis, FreeBase, Google Public Data and other statistics databases.
  • Exchanges will have to be aware of licensing, both of the datasets referenced or contained within the catalogue and of the metadata itself. A government catalogue may only contain open data and release its metadata into the public domain, while a data mart may also contain commercial data and claim database rights on its index. While I'm not sure we need to support machine-readable assessments of licence compatibility, having markers for some key pieces of information (BY, SA, NC, (C), PD) would be useful (see the sketch after this list).
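To illustrate the granularity and licensing markers discussed above, here is a minimal sketch of a catalogue entry carrying both. The field names ("size_class", the license flag lists) are illustrative assumptions, not part of any existing schema.

```python
# Illustrative catalogue entry; field names are assumptions, not a standard.
dataset_entry = {
    "name": "regional-budget-2010",
    "title": "Regional Budget 2010",
    # Coarse granularity marker instead of a strict dataset definition:
    # "large" (whole database), "medium", or "small" (single record set).
    "size_class": "medium",
    # Licensing of the data itself, expressed as coarse flags.
    "data_license_flags": ["BY", "SA"],
    # Licensing of the metadata record, which may differ from the data.
    "metadata_license_flags": ["PD"],
}

def is_openly_reusable(entry):
    """Reject entries carrying non-commercial or all-rights-reserved flags."""
    return not {"NC", "(C)"} & set(entry["data_license_flags"])

assert is_openly_reusable(dataset_entry)
```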
Metadata Formats

JSON, HTML, XML/OKFN, XML/GMC, XML/DC, RDF/DC, RDF/DCat, MARC.

Exchange and Harvesting Mechanisms

Push or pull? OAI-PMH, CSW et al., RSS/Atom, RDFa, SDMX, DVCS (Git, Mercurial, Bazaar), SPARQL or specific (RESTish or RPC-type/SOAP) interfaces. The best choice at the moment is possibly the Atom Publishing Protocol, as it is widely understood, implemented and tested; a minimal sketch follows.
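Since AtomPub looks like the front-runner, here is a minimal sketch of a dataset record wrapped as an Atom entry, as it might be POSTed to an AtomPub collection. The convention of embedding the full record as a base64-encoded JSON payload is an assumption for illustration, not an established profile; the identifier URL is a placeholder.

```python
import base64
import json
from xml.etree import ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)

def dataset_to_atom_entry(dataset):
    """Wrap a catalogue record in an Atom entry for AtomPub publishing."""
    entry = ET.Element("{%s}entry" % ATOM)
    ET.SubElement(entry, "{%s}id" % ATOM).text = dataset["id"]
    ET.SubElement(entry, "{%s}title" % ATOM).text = dataset["title"]
    ET.SubElement(entry, "{%s}updated" % ATOM).text = dataset["updated"]
    # Atom (RFC 4287) requires inline non-XML, non-text media types to be
    # base64-encoded; here the full record travels as a JSON payload.
    content = ET.SubElement(entry, "{%s}content" % ATOM,
                            type="application/json")
    content.text = base64.b64encode(
        json.dumps(dataset).encode("utf-8")).decode("ascii")
    return ET.tostring(entry, encoding="unicode")

print(dataset_to_atom_entry({
    "id": "http://example.org/dataset/regional-budget-2010",
    "title": "Regional Budget 2010",
    "updated": "2011-04-20T12:00:00Z",
}))
```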
Distribution of Changes

Given both an exchange format and a mechanism for harvesting or pushing metadata, it must become possible to merge divergent metadata. I also think it's helpful to tackle the question of metadata provenance in this context, rather than as an isolated and theoretical concept. This involves:
  • Establishing and documenting institutional workflows for the processing of external or internal changes to metadata in a distributed environment.
  • Matching equivalent or related datasets from different sources, merging them or expressing relationships between them, and deciding on a threshold for the distinction (see the sketch after this list).
  • Exposing and merging differentials to data catalogue maintainers through a user interface to allow manual or guided reconciliation in a reproducible way.
  • Tracking technical metadata provenance across different systems, some of which may have their own concepts of versioning and tracking; recording the process of exchange and chains of sources.
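As a concrete illustration of the matching bullet above, here is a minimal sketch that compares record titles and applies thresholds to distinguish equivalent, related and distinct datasets. The threshold values and the reliance on titles alone are placeholder assumptions; real matching would also draw on identifiers, URLs and checksums.

```python
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.85    # above this, treat as the same dataset
RELATED_THRESHOLD = 0.60  # above this, treat as related but not identical

def title_similarity(a, b):
    """Crude string similarity on normalised titles."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def classify_pair(record_a, record_b):
    """Decide whether two catalogue records describe the same dataset,
    related datasets, or unrelated ones."""
    score = title_similarity(record_a["title"], record_b["title"])
    if score >= MATCH_THRESHOLD:
        return "equivalent"
    if score >= RELATED_THRESHOLD:
        return "related"
    return "distinct"

print(classify_pair(
    {"title": "Regional Budget 2010"},
    {"title": "regional budget 2010 (CSV)"},
))
```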
Alignment of Metadata

Once a basic architecture for federation is available, more effort can be invested into the creation of common metadata contents. Challenges here include:
  • Expressing advanced dataset relations and types such as time series and derived datasets (see the sketch after this list).
  • Using common tagging schemes and taxonomies: EUROVOC might be useful for basic concepts, complemented by per-domain vocabularies. We should also give thought to how this can benefit from the kind of work being done on SEMIC. In LOD2, we've basically decided to focus on three content domains: energy-related datasets, financial spending and budgetary data, and legislative documents.
  • Geo-spatial relationships, scope and context.
  • Institutional origin (incl. URIs for public bodies, companies) and paths of publication, including responsible persons.
  • Handling of languages and translations.
  • Licensing and usage rights criteria, see above.
  • For outbound references, advanced URL schemes, data format conventions and specifications for external services such as SPARQL endpoints, REST/SOAP/XML-RPC APIs etc.
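To illustrate a few of these challenges, here is a minimal sketch using rdflib and the DCAT vocabulary: a derived dataset pointing at its source, temporal coverage for a time-series slice, and a URI for the publishing body. The property choices (dct:source, dct:temporal, dct:publisher) are plausible conventions rather than a settled profile, and the URIs are placeholders.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

DCAT = Namespace("http://www.w3.org/ns/dcat#")
DCT = Namespace("http://purl.org/dc/terms/")

g = Graph()
g.bind("dcat", DCAT)
g.bind("dct", DCT)

source = URIRef("http://example.org/dataset/budget-2010-raw")
derived = URIRef("http://example.org/dataset/budget-2010-per-capita")

for ds in (source, derived):
    g.add((ds, RDF.type, DCAT.Dataset))

# Derivation expressed with Dublin Core; dct:source is one plausible
# choice, a dedicated provenance vocabulary would be another.
g.add((derived, DCT.source, source))

# Temporal coverage for a time-series slice.
g.add((source, DCT.temporal, Literal("2010-01-01/2010-12-31")))

# Institutional origin via a URI for the public body.
g.add((source, DCT.publisher,
       URIRef("http://example.org/body/finance-ministry")))

print(g.serialize(format="turtle"))
```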