ICAEW.com works better with JavaScript enabled.
Exclusive

Databases, Data Warehouses and Data Lakes for Accountants

Published: Yesterday at 11: 11 AM BST Update History

Exclusive content
Access to our exclusive resources is for specific groups of students, subscribers and members.

Accountants today rely on software like Oracle, Sage, Excel, and many other technologies to store important data and help generate financial statements and insights. This article provides a high-level overview of different data infrastructure used in accounting and how these fit into contemporary technology such as PowerBI, PowerPivot, Excel, APIs.

Contemporary accountancy increasingly relies on record keeping systems that can be hard to understand and navigate. Long gone are the days when accountants used paper ledgers that remained fully within our control.

Today we rely on software and technology platforms to store our data and help us generate financial statements and insights.

In most cases these record keeping systems can be divided into 3 types:

  • Databases
  • Data Warehouses
  • Data Lakes

In this article I will provide a high-level explanation of these different systems in the context of accounting and how they relate to technology that accountants use on a day-to-day basis.

Databases

As this article is largely focused on the data infrastructure used in accountancy, this section will start with SQL or relational databases.

These databases are similar to spreadsheets in the sense they store data in columns and rows. However they differ in the sense they have a tightly defined structure and, as the name suggests, mapped relationships between different data tables, which is required in order for the database to be trustworthy.

This structure is equivalent to the chart of accounts within the double entry accounting system. Each SQL database has its own structure, in the same way each company can have its own unique chart of accounts.

SQL databases when used properly can store accounting transactions - in the same way T accounts ‘store’ double entries. For example, if a cash sale is made then a journal record should be written to the database to reflect the cash sale by posting a debit and a credit to the relevant tables in the database. There might be a single table for all transactions, or separate tables for different subledgers.

In order for SQL databases to be effective they need to be able to read and write transactions reliably and quickly.

What the database does not do is generate financial reporting. This is done by querying the database, which allows the data to be stored temporarily in the memory of a computer program. Code can then be run on this data in order to generate financial reports.

There are various “brands” of SQL databases, some of which are proprietary and some of which are free. Proprietary versions include Oracle, IBM DB2, Microsoft SQL Server. And there are free versions such as MySQL, PostgreSQL, SQLite.

For most accounting software such as Xero, Sage, QuickBooks etc it is probably one of the aforementioned database brands that behind the scenes is maintaining the underlying journals.

Data Warehouses

Imagine you run a chain of pubs you’ve acquired over many years. Each day you make sales, and you recognise these sales in an SQL database at the respective pub.

However, the problem you face at head office of the pub chain is all the transactional data is stored at each of the pubs and this means you do not have visibility of the performance of your estate of pubs as a whole.

In order to solve this problem what you can do is implement a so-called extract, transform and load (“ETL”) process in order to effectively copy and paste the data from the various databases into a datastore at head office.

You could do this by exporting data in csv and processing it in Excel, however this is not good practice.

Instead, what you should be doing is “loading” the data into another SQL database, called a Data Warehouse.

This Data Warehouse should probably have a slightly different structure to the SQL databases at the pubs. This is because the Data Warehouse will not be used for writing journals and should only be used for analysis, such as generating financial reporting.

The distinction between Databases and Data Warehouses is analogous to the distinction between single entity accounts and consolidated accounts.

Data Lakes

Nowadays it is normal for accounting software to be a cloud-based web application such as Xero, QBO or Freeagent, rather than a local application running on local hardware and a local database (as was the case with software such as Sage50 or Quickbooks Desktop).

This migration to cloud based accounting software has revolutionised the way accountants work. However, it is typically harder to gain access to the ‘back end’ database where your data is stored.In order to access the accounting data, you usually have two imperfect choices. The first is to access the data via the user interface of the web application. This is slow, constrained and error prone. The second is to access the data via an API.

An API connection to a web application is not the same as direct connection to a database. You cannot write efficient SQL queries with an API connection. Instead, you are usually forced to make computationally inefficient requests, which can often be throttled based on speed and/or volume, and provide responses, usually in a text-based format – JSON, CSV or XML being most common.

A Data Lake in this context is essentially a place to save these text files which the API has responded with. That ‘should’ be it. A Data Lake can just be folders with organised files in them. Ideally the contents of these files should be unchanged from what the API responded with.

A Data Lake can also be constructed from files extracted through the user interface of the application.

Once you have a Data Lake you can run code over it. For example, you could construct an SQL database from the Data Lake. Or you could run a machine learning model over the Data Lake to build an AI. Or you could load the data into Excel or Power BI.

The problem with Data Lakes is unlike databases and data warehouses, they do not have a structure imposed on them. As a result they can easily become chaotic and a so called ”data swamp” - unreliable, potentially corrupted data rife with duplication and quality issues. This is especially the case where humans can access the Data Lake directly, a prime example being SharePoint.

Upon reading this as an accountant you may realise you are harbouring Data Lakes (or swamps) on your local machine or inside the company SharePoint. This isn’t necessarily a bad thing – we all need to save files somewhere - however if it’s data that you depend on you should be looking to have the data properly maintained in a curated system.

Data LakeHouse

If your organisation has both a Data Warehouse and a Data Lake then it may make sense to unify these repositories into a single architecture known as a Data Lakehouse.

You can think of a Date Lakehouse as being somewhat analogous to a hybrid combustion and electric engine car in the sense it combines two similar technologies into a single solution, which in theory could make accessing and processing disparate data sources easier. For example, we mentioned above the notion of running code over a Data Lake in order to build a Data Warehouse . A shortcoming of doing this is known as “data redundancy”, whereby the same data is duplicated in the Data Lake and the Data Warehouse, adding to storage costs and complexity.With a Data Lakehouse, the idea is the data is only stored once and is then accessed and referenced accordingly from all places that require it, ensuring consistency and minimising redundancy.

More on Microsoft’s Lakehouse platform, MS Fabric, was covered in this webinar.

NoSQL

NoSQL databases refer to databases which are non-relational and where data is stored in non-tabular formats, different to relational SQL databases which are more suitable for structured data.

If you ever hear somebody say they are using a NoSQL database for financial data, be sceptical: NoSQL databases can lack the data consistency and integrity of relational databases, and in comparison to standard database technology that’s been around for decades, NoSQL databases remain relatively new.

You should seek to understand what kind of “NoSQL” database is being used: Document, Graph, Key Value, Object etc.

Saying a database is NoSQL is akin to saying your method of transport is not a car - it could be any number of alternatives. The whole idea behind NoSQL is greater flexibility in database structure, but they do still have structure.

Power BI and PowerPivot

Power BI (and other equivalents such as Tableau, Looker Studio or Qlik) are data visualisation tools which can connect to databases, data warehouses and data lakes. If you have access to any of these, such tools can be a great way to automate aspects of reporting as you can connect directly to the data source and leverage any definitions and schemas.

If you cannot connect to a SQL database, then an alternative is to use PowerPivot which is Excel’s inbuilt data modelling and analysis solution.

PowerPivot allows you to load data into an analytical database stored in the Excel workbook using Power Query and then create relationships between data in order to perform sophisticated SQL-esque queries on the data using Pivot Tables.

The results of these Pivot Tables can then be rendered into Excel and its graphing features can be used to visualise the data or manipulate it further using Excel.

Summarising

Databases are for writing and reading transactions. They can also be used for analytics.

Data Warehouses are for consolidating data for analytics.

Data Lakes are repositories for unstructured data (which can include CSV, JSON, XML). They can be used for analytics too, but may lead to integrity issues, so data stored in this way for analytical purposes may be more effective in a lakehouse architecture.

About the author

Edward Franklin is the owner of Ledger Legend, Pett Franklin Services Limited and Two Franklins.
Open AddCPD icon

Add Verified CPD Activity

Introducing AddCPD, a new way to record your CPD activities!

Log in to start using the AddCPD tool. Available only to ICAEW members.

Add this page to your CPD activity

Step 1 of 3
Download recorded
Download not recorded

Please download the related document if you wish to add this activity to your record

What time are you claiming for this activity?
Mandatory fields

Add this page to your CPD activity

Step 2 of 3
Mandatory field

Add activity to my record

Step 3 of 3
Mandatory field

Activity added

An error has occurred
Please try again

If the problem persists please contact our helpline on +44 (0)1908 248 250
Open AddCPD icon

Add Verified CPD Activity

Introducing AddCPD, a new way to record your CPD activities!

Log in to start using the AddCPD tool. Available only to ICAEW members.

Add this page to your CPD activity

Step 1 of 3
Download recorded
Download not recorded

Please download the related document if you wish to add this activity to your record

What time are you claiming for this activity?
Mandatory fields

Add this page to your CPD activity

Step 2 of 3
Mandatory field

Add activity to my record

Step 3 of 3
Mandatory field

Activity added

An error has occurred
Please try again

If the problem persists please contact our helpline on +44 (0)1908 248 250