Oil Rush on Edgar Creek

12 November 2014

Over the past month, I’ve worked with OpenOil in an effort to find oil contracts which have been published as part of filings to the US stock exchange regulator, the SEC. While these contracts are not usually public, companies are required to file full contract documents as part of their annual reports under certain conditions.

The OpenOil team, Johnny West, Anton Rühling and Don Hubert, had already located a good number of contract filings on the SEC’s EDGAR system, as well as its (captcha-crippled) Canadian twin, SEDAR, through manual searches. At the same time, we wanted to follow a systematic approach to check each EDGAR filing since 1995 by an oil company for any published contracts.

After experimenting with a smaller sample of filings to prove the concept, we decided that downloading and processing the full set of documents on our own computers would take too much time (there are several terabytes of filings on EDGAR). Instead, we used Amazon’s Elastic MapReduce service as a ten-machine scraping cluster to download the full filings to S3 cloud storage.

Having retrieved the filings and split out those filed by hydrocarbon-related companies, we started experimenting with different methods to score the documents by how closely they resembled oil contracts. In total, we evaluated three different approaches:

A manually composed list of weighted search terms. This was our initial approach, and it worked quite well, although it largely failed to separate the interesting host government contracts from around the world from the much more common contracts between private companies in the US.
Using a manually curated list of true positives, we attempted to train a Bayes classifier to recognise the contracts based on n-grams (sets of two, three, four or five words occurring together). The idea was essentially to build a spam filter for oil contracts, but the results turned out to be flaky: some runs would yield excellent precision, while others misclassified the vast bulk of the documents. Most likely, the small number of reference contracts we had was not enough to properly train the algorithm.
Finally, Johnny proposed a ‘watershed’ algorithm which searches for n-grams that occur in oil contracts, but never in any of the other filed documents. Those terms were sometimes surprising, from “b map” (apparently a map of the area is often located in appendix B), to “year means a period of twelve” (giving unwanted insight into the degree to which these contracts are an offence to common sense).
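
To make the first approach in the list above more concrete, here is a minimal sketch of weighted-term scoring. The terms, the weights and the per-1,000-words normalisation are all assumptions made up for illustration; they are not the list we actually used.

```python
import re

# Hypothetical terms and weights, for illustration only.
WEIGHTED_TERMS = {
    "production sharing": 5.0,
    "ministry": 3.0,
    "royalty": 2.0,
    "contractor": 1.0,
}


def score_document(text, weighted_terms=WEIGHTED_TERMS):
    """Sum weight * occurrences over all terms, per 1,000 words of text."""
    lowered = text.lower()
    # Avoid division by zero on empty documents.
    word_count = len(lowered.split()) or 1
    raw = sum(
        weight * len(re.findall(re.escape(term), lowered))
        for term, weight in weighted_terms.items()
    )
    return 1000.0 * raw / word_count
```

Documents can then be ranked by this score. A scheme like this also hints at why it struggles to tell host government contracts from private ones: both kinds of document share much of the same vocabulary.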

While I was initially critical of this approach, it turned out to give results that were as good as, if not better than, the manual set of search terms. An interesting result, I guess, of the highly formalised and closely related language of these contracts.
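
The watershed idea can be sketched roughly as follows. This is a simplified reconstruction under stated assumptions, not the code from our repository: the function names, the whitespace tokenisation and the `min_contracts` threshold are all hypothetical.

```python
from collections import Counter


def ngrams(text, sizes=(2, 3, 4, 5)):
    """Yield all word n-grams of the given sizes (naive whitespace tokens)."""
    tokens = text.lower().split()
    for n in sizes:
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def watershed_terms(contracts, others, min_contracts=2):
    """N-grams found in at least `min_contracts` known contracts
    and in none of the other documents."""
    doc_freq = Counter()
    for text in contracts:
        # Count each n-gram once per contract.
        doc_freq.update(set(ngrams(text)))
    seen_elsewhere = set()
    for text in others:
        seen_elsewhere.update(ngrams(text))
    return {
        gram for gram, count in doc_freq.items()
        if count >= min_contracts and gram not in seen_elsewhere
    }


def watershed_score(text, terms):
    """Number of distinct watershed terms that occur in a document."""
    return len(set(ngrams(text)) & terms)
```

Unknown filings can then be ranked by `watershed_score`. The threshold is there so that a phrase appearing in only a single reference contract does not count as a signal.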

The results of this ranking were often interesting: instead of contracts from the regions I’d expected (Africa and the Middle East), we saw contracts from China, India and many other places that are less associated with the extractive industries come to the top of our result sets.

It was also a surprise to find lists of the subsidiaries of many major extractives companies filed alongside the contracts - a valuable resource that may serve us as a sample case in a document-to-network machine learning experiment in the future.

To me, this SEC mining expedition has been a favourite type of open data project: not only have I learned a lot about the extractive industries - one of the largest forces in shaping the development of many countries - but I’ve also gained some experience in applying techniques commonly used in big data analytics to an open data problem (and it works!). Let’s do more of those!

Oh, and feel free to play with our scripts: pudo/edgar-oil-contracts (Hint: there’s more than just oil companies out there!)

A little tour of aleph, a data search tool for reporters
Over the past six months, I've been working for OCCRP to productise Aleph, a powerful search tool for investigative reporters. This is a little tour of its key features, and a little view into the future development agenda.
A Poor Journalist's Text Mining Toolkit
How can journalists search and analyze collections of documents on their own computers with simple tools? At last weekend's DataHarvest, we ran a workshop trying to answer that question. This write-up covers using Apache Tika for content extraction and regular expressions in Sublime Text as an advanced search tool.
Against Decentralization
In the free software/open web community, the notion that the web should be decentralized is more than a shared ideal, it is a piece of dogma. But are we really promoting a progressive vision of the web, or fighting a losing battle to avoid political engagement?
Keeping stock: investigative data warehouses
Data warehouses are used in industry to manage the many datasets accrued inside a company that might be relevant to reporting and analysis. I want to propose a similar pattern for investigative journalism.
SpenDB, a data analysis tool for government finance, looking for testers!
The first beta version of SpenDB features a small set of well-designed features for data import and analysis. Now the platform is ready to be adopted by anyone interested in exploring financial data, from budgets to procurement.
On Hacks/Hackers, Google and community building
A few weeks ago, the US team of Hacks/Hackers announced their plans to turn the network of journalism innovators into a collaboration with Google News Labs, starting with an event in Berlin. I tweeted about this, and Phillip Smith wrote a thoughtful reaction. Given this invitation to debate, I wanted to outline my criticism in more detail.
SpenDB, a light-weight tool for government financial data
Over the past few months, I have spent my weekends simplifying and modernizing the OpenSpending codebase to create SpenDB - a prototype-stage, light-weight data loading tool and analytical API for government financial data.
Who’s got dirt? - What if robots could do cross-border investigations?
If we want to make open data relevant to investigative journalism, we have to simplify the way people access it. We must create a way for our data tools to talk to each other and trade information about the companies and people we are researching.
