15 March 2012
The Whole of Government Accounts: there’s a bold claim in the name of the UK’s central reporting system. The database, released as part of a larger system called COINS, contains key financial documents – including balance sheets and cash flow statements – for all departments, public bodies and local authorities in Britain. As research for an upcoming OpenSpending journalism project with Lisa Evans and Dan Herbert, I’ve spent some time trying to make sense of the dataset.
The COINS dataset page on data.gov.uk offers a large selection of different exports from the system, most of which are for previous years and proxied through BitTorrent. At the very bottom of the listing, two broken links point at the fact and adjustment table extracts for the WGA slice of COINS. Besides WGA (called Audited Outturn in the data), the original database seems to also contain budget planning and outturn information, as well as ongoing financial projections. Among the various guides available on the related Treasury web site, I’ve found Understanding the COINS data the most useful document.
Unfortunately, the primary CSV file, 2011_fact_table_extract_2009_10_WGA.csv, was
generated by someone with a deep hatred of life. Not only is the file encoded in
UTF-16, it also has random, unquoted line breaks in various cells. The first step
was converting the file to UTF-8 with a call to
iconv -f utf-16 -t utf-8 2011_fact_table_extract_2009_10_WGA.csv \
    > 2011_fact_table_extract_2009_10_WGA.utf8.csv
The newline problem turned out to be harder than expected, and neither Excel nor Refine could correctly join the source lines. Fortunately, Python has support for universal newline magic, so a simple transformation script adds proper quoting to the data:
python fix_csv_newlines.py 2011_fact_table_extract_2009_10_WGA.utf8.csv \
    2011_fact_table_extract_2009_10_WGA.utf8.nl.csv
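The script itself isn't reproduced here, but a minimal sketch of what it might look like follows. It assumes (as holds for this extract) that cells are unquoted and never contain embedded commas, and uses the header row to decide how many columns a complete record has:

```python
import csv
import sys

def fix_newlines(src_path, dst_path):
    """Re-join records that were split by stray line breaks and write
    them back out with proper quoting. The header row determines the
    expected column count; short lines are buffered and merged until
    the field count matches."""
    with open(src_path, encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
        expected, buffer = None, ''
        for line in src:  # universal newlines: \r, \n and \r\n all split here
            buffer = (buffer + ' ' + line.strip()) if buffer else line.strip()
            fields = buffer.split(',')  # cells are unquoted in the extract
            if expected is None:
                expected = len(fields)
            if len(fields) >= expected:
                writer.writerow(f.strip() for f in fields[:expected])
                buffer = ''

if __name__ == '__main__' and len(sys.argv) == 3:
    fix_newlines(sys.argv[1], sys.argv[2])
```

This silently truncates any record that somehow ends up with too many fields, which is fine for a one-off cleanup but worth logging in anything more serious.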
Since the 380MB of data is a bit unwieldy in Excel or other desktop programs,
we’re going to load it into a PostgreSQL database. To do this, we first need to
create a table schema.
csvkit, a nice set of utilities for CSV wrangling, has just the answer:
pip install csvkit
createdb -E utf-8 coins
head -1000 2011_fact_table_extract_2009_10_WGA.utf8.nl.csv | \
    csvsql --table wga_fact -e utf-8 -i postgresql --no-constraints | \
    psql coins
The command will analyze the first thousand rows of the file and guess an appropriate schema based on column values (the resulting schema is also available here). Finally, we can load the data itself:
cat 2011_fact_table_extract_2009_10_WGA.utf8.nl.csv | \
    psql -c "COPY wga_fact FROM STDIN WITH DELIMITER ',' CSV HEADER QUOTE '\"';" coins
This gives us a good million entries in our table – now it's time to make sense of them.
Now that we have the data ready, it's time to get a feel for what's in the various columns. To do this, we can have a quick look at the value ranges:
psql coins
coins=> SELECT "Accounts_capital_current", COUNT(*) AS num
        FROM wga_fact
        GROUP BY "Accounts_capital_current"
        ORDER BY num DESC;

 Accounts_capital_current |  num
--------------------------+--------
 NULL                     | 885613
 A-CUR                    | 128197
 A-DUM                    |  53777
 A-CAP                    |  46534
 A-FIN                    |   7034
(5 rows)
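If you'd rather not fire up Postgres just to eyeball a column, the same kind of frequency count can be done in a few lines of Python against the cleaned-up CSV – a sketch, with the column name taken from the extract:

```python
import csv
from collections import Counter

def column_values(path, column, limit=None):
    """Count the distinct values of one column in a cleaned-up CSV --
    a rough stand-in for GROUP BY ... ORDER BY COUNT(*) DESC."""
    counts = Counter()
    with open(path, encoding='utf-8') as fh:
        for i, row in enumerate(csv.DictReader(fh)):
            if limit is not None and i >= limit:
                break
            counts[row[column] or 'NULL'] += 1
    return counts.most_common()

# e.g. column_values('2011_fact_table_extract_2009_10_WGA.utf8.nl.csv',
#                    'Accounts_capital_current')
```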
Two things quickly become apparent: there is no shortage of obscure financial abbreviations in this data, and most of the information lives in just a few dimensions: Department_description identifies the relevant body, Account_description provides the key descriptive taxonomy, and Value holds the actual figures.
Looking at the submission documents (more on them later), it becomes apparent
that the account dimension is based on something called the Standard Chart of
Accounts, or SCOA. The Treasury website again helps us out with a complete listing
of the available codes, which turn out to be arranged in a deep hierarchy. For example,
41115000, Inheritance tax, is a child of 41110000, Inland Revenue Income, which in turn is part of 41100000, Taxation Income, and so forth. We can generate some of the higher-level category codes by truncating the trailing digits of Account_code and padding with zeroes:
coins=> ALTER TABLE wga_fact ADD COLUMN "Account_code_level1" VARCHAR;
coins=> UPDATE wga_fact
        SET "Account_code_level1" = RPAD(SUBSTR("Account_code", 1, 2), 8, '0');
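The truncation rule is easy to sanity-check outside the database; in Python, the same SUBSTR/RPAD logic reads:

```python
def scoa_level1(code):
    """Roll an 8-digit SCOA code up to its top-level ancestor by
    keeping the first two digits and padding with zeroes -- the
    same logic as RPAD(SUBSTR("Account_code", 1, 2), 8, '0')."""
    return code[:2].ljust(8, '0')

# e.g. 41115000 (Inheritance tax) rolls up to 41000000
```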
The SCOA code listing can be loaded into a second table in much the same way:

psql -f wga_scoa_schema.sql coins
cat wga_scoa.utf8.csv | \
    psql -c "COPY wga_scoa FROM STDIN WITH DELIMITER ',' CSV HEADER QUOTE '\"';" coins
With both of these tables, we can be bold and aggregate the values within each SCOA group for a particular entity:
coins=> SELECT scoa."Label", scoa."Name", SUM("Value"), COUNT(*)
        FROM wga_fact
        JOIN wga_scoa AS scoa ON scoa."Name" = wga_fact."Account_code_level1"
        WHERE "Department_code" = 'E3433'
        GROUP BY scoa."Label", scoa."Name"
        ORDER BY scoa."Name";
This is nice, but it turns out that these aggregates are less than meaningful from an accounting point of view – or at least they don't appear in any published documents. So it's time to step back and ask: what exactly do we want to extract?
Now that we have a basic understanding of the data structure, we can focus on reconstructing some financial reports, such as cash flow statements and balance sheets. For an impression of what this looks like, the Lichfield accounts – obtained via a freedom of information request by Lisa – are very helpful. The released document is actually a copy of the 2009 L-Pack, the spreadsheet used by local government to gather and validate information that is then uploaded to COINS.
Weighing just under four megabytes without any actual data, the L-Pack (and its central government counterpart, the C-Pack) are impressive pieces of government business logic (it should also be mentioned that using notoriously unsafe Excel passwords to ensure the integrity of government financial reporting may not be a brilliant choice).
Looking at the Cash Flow sheet, we can see that most of the values (except for the first three figures on Depreciation and Amortization) have a white cell background, which indicates they were entered directly. Those cells with a gray background are computed using a formula combining other cells (usually from other sheets) – this is true of most of the values on the Balance sheet.
Given that someone in Lichfield typed this in, how did it end up in COINS? The answer to that can be found in another sheet, called Upload Accounts. In this sheet, each row maps a cell in one of the other sheets to a SCOA code. For example, the 'white' Cash Flow cells are mapped to SCOA codes starting with 84. To confirm this, we can look them up in the database:
coins=> SELECT "Account_description", "Account_code", SUM("Value"), COUNT(*)
        FROM wga_fact
        WHERE "Department_code" = 'E3433'
          AND SUBSTR("Account_code", 1, 2) = '84'
        GROUP BY "Account_description", "Account_code"
        ORDER BY "Account_code";
This gives us most of the cash flow statement – the account descriptions don’t exactly match those on the sheet but it’s clear they mean the same thing. The COUNT result – the number of records for each account code – is also interesting: each figure is reported only once. This is not true in proper COINS, since authorities submit various draft forms of the figures. The data was abridged to simplify the dataset and not expose data entry errors in drafts.
So what is needed to complete the cash flow and balance sheet reports? We need to translate the logic in the L-Pack from Excel cell identifiers (such as 'Cash Flow (14)'!E27) to SCOA codes (such as 84000018) and then execute the formulae stored in the grayed-out fields. This is tedious, but not hard – so depending on how we continue with the project, it may make sense to automate the task and write a generator for both report types.
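As a rough illustration of what such a generator would involve, here is a sketch of the two steps. The cell-to-code mapping below is a hypothetical stand-in for the real Upload Accounts entries, and only simple plus/minus formulae are handled:

```python
import re

# Hypothetical mapping as read off the Upload Accounts sheet:
# Excel cell reference -> SCOA code (illustrative pair from the text)
CELL_TO_SCOA = {
    "'Cash Flow (14)'!E27": '84000018',
}

CELL_REF = re.compile(r"'[^']+'![A-Z]+[0-9]+")

def translate_formula(formula, mapping=CELL_TO_SCOA):
    """Replace each Excel cell reference in a formula with the SCOA
    code it is mapped to (raises KeyError for unmapped cells)."""
    return CELL_REF.sub(lambda m: mapping[m.group(0)], formula)

def evaluate(formula, values):
    """Evaluate a translated formula against reported SCOA values.
    Supports only + and - between 8-digit codes -- enough for the
    simple sums found in the grayed-out cells."""
    total, sign = 0, 1
    for token in re.findall(r'[0-9]{8}|[+-]', formula.lstrip('=')):
        if token == '+':
            sign = 1
        elif token == '-':
            sign = -1
        else:
            total += sign * values.get(token, 0)
            sign = 1
    return total
```

A real generator would also need to follow references between sheets and handle SUM ranges, but the basic shape – substitute, then evaluate – stays the same.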
There is a lot of information in the WGA database that is worth exploring. One interesting aspect seems to be the requirement to report only economic activity outside the government. This means that all transactions between different parts of government are filtered out using a system of Counterparty identifiers. I haven't yet wrapped my head around how much of this actually plays a role in WGA, but it seems worth exploring.
To be continued…