15 March 2012
The Whole of Government Accounts: there’s a bold claim in the name of the UK’s central reporting system. The database, released as part of a larger system called COINS, contains key financial documents – including balance sheets and cash flow statements – for all departments, public bodies and local authorities in Britain. As research for an upcoming OpenSpending journalism project with Lisa Evans and Dan Herbert, I’ve spent some time trying to make sense of the dataset.
The COINS dataset page on data.gov.uk offers a large selection of different exports from the system, most of which are for previous years and proxied through BitTorrent. At the very bottom of the listing, two broken links point at the fact and adjustment table extracts for the WGA slice of COINS. Besides WGA (called Audited Outturn in the data), the original database seems to also contain budget planning and outturn information, as well as ongoing financial projections. Among the various guides available on the related Treasury web site, I’ve found Understanding the COINS data the most useful document.
Unfortunately, the primary CSV file, 2011_fact_table_extract_2009_10_WGA.csv, was
generated by someone with a deep hatred of life. Not only is the file encoded in
UTF-16, it also has random, unquoted line breaks in various cells. The first step
was converting the file to UTF-8 with a call to
iconv -f utf-16 -t utf-8 2011_fact_table_extract_2009_10_WGA.csv \
    > 2011_fact_table_extract_2009_10_WGA.utf8.csv
The newline problem turned out to be harder than expected, and neither Excel nor Refine could correctly join the source lines. Fortunately, Python has support for universal newline magic, so a simple transformation script adds proper quoting to the data:
python fix_csv_newlines.py 2011_fact_table_extract_2009_10_WGA.utf8.csv \
    2011_fact_table_extract_2009_10_WGA.utf8.nl.csv
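The script itself isn't reproduced here, but a minimal sketch of what it might look like follows. It assumes (as holds for this extract) that cells are unquoted and never contain embedded commas, and uses the header row to decide how many columns a complete record has:

```python
import csv
import sys

def fix_newlines(src_path, dst_path):
    """Re-join records that were split by stray line breaks and write
    them back out with proper quoting. The header row determines the
    expected column count; short lines are buffered and merged until
    the field count matches."""
    with open(src_path, encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
        expected, buffer = None, ''
        for line in src:  # universal newlines: \r, \n and \r\n all split here
            buffer = (buffer + ' ' + line.strip()) if buffer else line.strip()
            fields = buffer.split(',')  # cells are unquoted in the extract
            if expected is None:
                expected = len(fields)
            if len(fields) >= expected:
                writer.writerow(f.strip() for f in fields[:expected])
                buffer = ''

if __name__ == '__main__' and len(sys.argv) == 3:
    fix_newlines(sys.argv[1], sys.argv[2])
```

This silently truncates any record that somehow ends up with too many fields, which is fine for a one-off cleanup but worth logging in anything more serious.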
Since the 380MB of data is a bit unwieldy in Excel or other desktop programs,
we’re going to load it into a PostgreSQL database. To do this, we first need to
create a table schema.
csvkit, a nice set of utilities for CSV wrangling, has just the answer:
pip install csvkit
createdb -E utf-8 coins
head -1000 2011_fact_table_extract_2009_10_WGA.utf8.nl.csv | \
    csvsql --table wga_fact -e utf-8 -i postgresql --no-constraints | \
    psql coins
The command will analyze the first thousand rows of the file and guess an appropriate schema based on column values (the resulting schema is also available here). Finally, we can load the data itself:
cat 2011_fact_table_extract_2009_10_WGA.utf8.nl.csv | \
    psql -c "COPY wga_fact FROM STDIN WITH DELIMITER ',' CSV HEADER QUOTE '\"';" coins
This gives us a good million entries in our table – now it's time to make sense of them.
Now that we have the data ready, it's time to get a feel for what's in the various columns. To do this, we can have a quick look at the value ranges:
psql coins
coins=> SELECT "Accounts_capital_current", COUNT(*) AS num
        FROM wga_fact
        GROUP BY "Accounts_capital_current"
        ORDER BY num DESC;

 Accounts_capital_current |  num
--------------------------+--------
 NULL                     | 885613
 A-CUR                    | 128197
 A-DUM                    |  53777
 A-CAP                    |  46534
 A-FIN                    |   7034
(5 rows)
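If you'd rather not fire up Postgres just to eyeball a column, the same kind of frequency count can be done in a few lines of Python against the cleaned-up CSV – a sketch, with the column name taken from the extract:

```python
import csv
from collections import Counter

def column_values(path, column, limit=None):
    """Count the distinct values of one column in a cleaned-up CSV --
    a rough stand-in for GROUP BY ... ORDER BY COUNT(*) DESC."""
    counts = Counter()
    with open(path, encoding='utf-8') as fh:
        for i, row in enumerate(csv.DictReader(fh)):
            if limit is not None and i >= limit:
                break
            counts[row[column] or 'NULL'] += 1
    return counts.most_common()

# e.g. column_values('2011_fact_table_extract_2009_10_WGA.utf8.nl.csv',
#                    'Accounts_capital_current')
```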
Two things quickly become apparent: there is no shortage of obscure financial abbreviations in this data, and most of the information lives in just a few dimensions: Department_description identifies the relevant body, Account_description provides the key descriptive taxonomy, and Value holds the actual figures.
Looking at the submission documents (more on them later), it becomes apparent
that the account dimension is based on something called the Standard Chart of
Accounts, or SCOA. The Treasury website again helps us out with a complete listing
of the available codes, which turn out to be arranged in a deep hierarchy. For example,
41115000, Inheritance tax, is a child of 41110000, Inland Revenue Income, which in turn is part of 41100000, Taxation Income, and so forth. We can generate some of the higher-level category codes by truncating the trailing digits of Account_code and padding with zeroes:
coins=> ALTER TABLE wga_fact ADD COLUMN "Account_code_level1" VARCHAR;
coins=> UPDATE wga_fact
        SET "Account_code_level1" = RPAD(SUBSTR("Account_code", 1, 2), 8, '0');
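The truncation rule is easy to sanity-check outside the database; in Python, the same SUBSTR/RPAD logic reads:

```python
def scoa_level1(code):
    """Roll an 8-digit SCOA code up to its top-level ancestor by
    keeping the first two digits and padding with zeroes -- the
    same logic as RPAD(SUBSTR("Account_code", 1, 2), 8, '0')."""
    return code[:2].ljust(8, '0')

# e.g. 41115000 (Inheritance tax) rolls up to 41000000
```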
The SCOA code listing can be loaded into a second table in much the same way:

psql -f wga_scoa_schema.sql coins
cat wga_scoa.utf8.csv | \
    psql -c "COPY wga_scoa FROM STDIN WITH DELIMITER ',' CSV HEADER QUOTE '\"';" coins
With both of these tables, we can be bold and aggregate the values within each SCOA group for a particular entity:
coins=> SELECT scoa."Label", scoa."Name", SUM("Value"), COUNT(*)
        FROM wga_fact
        JOIN wga_scoa AS scoa ON scoa."Name" = wga_fact."Account_code_level1"
        WHERE "Department_code" = 'E3433'
        GROUP BY scoa."Label", scoa."Name"
        ORDER BY scoa."Name";
This is nice, but it turns out that these aggregates are less than meaningful from an accounting point of view – or at least they don't appear in any published documents. So it's time to step back and ask: what exactly do we want to extract?
Now that we have a basic understanding of the data structure, we can focus on reconstructing some financial reports, such as cash flow statements and balance sheets. For an impression of what this looks like, the Lichfield accounts – obtained via a freedom of information request by Lisa – are very helpful. The released document is actually a copy of the 2009 L-Pack, the spreadsheet used by local government to gather and validate information that is then uploaded to COINS.
Weighing just under four megabytes without any actual data, the L-Pack (and its central government counterpart, the C-Pack) are impressive pieces of government business logic (it should also be mentioned that using notoriously unsafe Excel passwords to ensure the integrity of government financial reporting may not be a brilliant choice).
Looking at the Cash Flow sheet, we can see that most of the values (except for the first three figures on Depreciation and Amortization) have a white cell background, which indicates they were entered directly. Those cells with a gray background are computed using a formula combining other cells (usually from other sheets) – this is true of most of the values on the Balance sheet.
Given that someone in Lichfield typed this in, how did it end up in COINS? The answer to that can be found in another sheet, called Upload Accounts. In this sheet, each row maps a cell in one of the other sheets to a SCOA code. For example, the 'white' Cash Flow cells are mapped to SCOA codes starting with 84. To confirm this, we can look them up in the database:
coins=> SELECT "Account_description", "Account_code", SUM("Value"), COUNT(*)
        FROM wga_fact
        WHERE "Department_code" = 'E3433'
          AND SUBSTR("Account_code", 1, 2) = '84'
        GROUP BY "Account_description", "Account_code"
        ORDER BY "Account_code";
This gives us most of the cash flow statement – the account descriptions don’t exactly match those on the sheet but it’s clear they mean the same thing. The COUNT result – the number of records for each account code – is also interesting: each figure is reported only once. This is not true in proper COINS, since authorities submit various draft forms of the figures. The data was abridged to simplify the dataset and not expose data entry errors in drafts.
So what is needed to complete the cash flow and balance sheet reports? We need to translate the logic in the L-Pack from Excel cell identifiers (such as 'Cash Flow (14)'!E27) to SCOA codes (such as 84000018) and then execute the formulae stored in the grayed-out fields. This is tedious, but not hard – so depending on how we continue with the project, it may make sense to automate the task and write a generator for both report types.
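As a rough illustration of what such a generator would involve, here is a sketch of the two steps. The cell-to-code mapping below is a hypothetical stand-in for the real Upload Accounts entries, and only simple plus/minus formulae are handled:

```python
import re

# Hypothetical mapping as read off the Upload Accounts sheet:
# Excel cell reference -> SCOA code (illustrative pair from the text)
CELL_TO_SCOA = {
    "'Cash Flow (14)'!E27": '84000018',
}

CELL_REF = re.compile(r"'[^']+'![A-Z]+[0-9]+")

def translate_formula(formula, mapping=CELL_TO_SCOA):
    """Replace each Excel cell reference in a formula with the SCOA
    code it is mapped to (raises KeyError for unmapped cells)."""
    return CELL_REF.sub(lambda m: mapping[m.group(0)], formula)

def evaluate(formula, values):
    """Evaluate a translated formula against reported SCOA values.
    Supports only + and - between 8-digit codes -- enough for the
    simple sums found in the grayed-out cells."""
    total, sign = 0, 1
    for token in re.findall(r'[0-9]{8}|[+-]', formula.lstrip('=')):
        if token == '+':
            sign = 1
        elif token == '-':
            sign = -1
        else:
            total += sign * values.get(token, 0)
            sign = 1
    return total
```

A real generator would also need to follow references between sheets and handle SUM ranges, but the basic shape – substitute, then evaluate – stays the same.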
There is a lot of information in the WGA database that is worth exploring. One interesting aspect seems to be the requirement to report only economic activity outside the government. This means that all transactions between different parts of government are filtered out using a system of Counterparty identifiers. I haven't yet wrapped my head around how much of this actually plays a role in WGA, but it seems worth exploring.
To be continued…