Introduction
What is a Unified Data Warehouse?
“A unified database, also known as an enterprise data warehouse, holds all (or most of) the business information of an organisation and makes it accessible across the company.”
In other words, we are referring to a platform where all, or the vast majority, of an organisation's data is collected; users have autonomous access and can reproduce analytics in real time, gaining continuous insights and benefiting from cloud computing power.
Which platform is the optimal choice?
There are several such platforms on the market, varying in features, capabilities, and user experience, such as Databricks, Snowflake, Google Cloud Dataprep, and Amazon Redshift.
In this article, we are sharing our experience gained from the usage of Microsoft Azure Databricks. We lack firsthand experience with alternative platforms, and therefore you should not consider this an endorsement of Databricks over other tools. We firmly believe that thorough research is needed before making a final selection, taking into consideration the specific business needs, resources, and existing infrastructure of your business or organisation.
Such platforms usually require knowledge of at least one programming language. However, even with elementary knowledge, and with the help of Generative AI, one can run code tailor-made to their tasks.
Through a specific use case, we aim to illustrate the advantages of using a unified data analytics platform in audit.
The use case
The use case was an audit assignment focused on the assessment of specific Risk Metrics of the Bank and the evaluation of their adherence to the applicable regulatory requirements established by the EBA and the ECB.
To this end, one of our audit procedures involved gaining access to the raw data and re-running the calculation of the process, based on the regulatory requirements, to assess compliance and the reasonableness of any assumptions.
Using Databricks, we obtained autonomous access to the Bank's anonymised customer deposit data. Through data cleansing and processing, we implemented the necessary modifications to the raw data, generating distinct, new datasets within Databricks. This approach allowed us to streamline further processing without the need to re-run the code on the original raw data each time. These new datasets, free of unnecessary noise, also allowed us to reduce the processing time of our final calculations.
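For illustration, a minimal sketch of such a cleansing step is shown below; all table and column names are hypothetical, and in a Databricks notebook the `spark` session is pre-provided.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # pre-provided in Databricks notebooks

# Read the raw, anonymised deposit data (hypothetical table name).
raw = spark.table("raw.customer_deposits")

# Basic cleansing: drop records without an identifier, remove exact
# duplicates and normalise the balance column to a consistent type.
clean = (
    raw.dropna(subset=["customer_id"])
       .dropDuplicates(["customer_id", "reporting_date"])
       .withColumn("balance", F.col("balance").cast("decimal(18,2)"))
)

# Persist the cleansed dataset so subsequent steps start from it,
# instead of re-running the cleansing on the raw data every time.
clean.write.mode("overwrite").saveAsTable("audit.deposits_clean")
```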
Databricks also gave us the flexibility to switch between Python, PySpark, R, SQL, or Scala as needed within the same notebook, enabling us to leverage the strengths of each programming language (including their unique libraries and packages) for the specific tasks at hand. Indicatively, SQL is ideal for the efficient retrieval and management of data, Apache Spark through Python (PySpark) optimises performance when handling big data, while further processing in Python – and especially the Pandas library – is a standard choice for exploratory data analysis (EDA), statistical analysis, and visualisation of results.
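Indicatively, a Databricks notebook can mix cells such as the following sketch; the `%sql` magic switches a single cell to SQL, while the Python cells use the pre-provided `spark` session (all table and column names here are hypothetical).

```python
# Cell 1 – SQL, via the %sql cell magic, for efficient retrieval:
# %sql
# SELECT product_type, reporting_date, balance
# FROM audit.deposits_clean
# WHERE reporting_date = '2024-12-31'

# Cell 2 – PySpark, for aggregating the full dataset at scale:
totals = (
    spark.table("audit.deposits_clean")
         .groupBy("product_type")
         .agg({"balance": "sum"})
)

# Cell 3 – Pandas, once the aggregate is small enough for memory,
# for exploratory data analysis and visualisation:
totals.toPandas().plot(kind="bar", x="product_type", y="sum(balance)")
```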
What did we gain from this?
Autonomous, Real-Time Access
One of the most significant benefits was the autonomous access to the Bank’s data in real (or near-real) time, which limited the delays naturally caused by requesting data from the IT department, including the time needed to document the specifications of such a request and the multiple iterations needed to test the adequacy of the data output. Autonomous access allowed us to explore the data ourselves and customise our processing to our specific needs. It also enabled the independent computation of the required variables, allowing us to validate and challenge existing risk metric assumptions.
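As a sketch of such independent re-computation, one could recompute a variable from the cleansed data and compare it against the value reported by the Bank; the metric definition, the reported figure, and the names below are all hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # pre-provided in Databricks notebooks

# Independently recompute one input of the risk metric,
# e.g. total retail deposit balances (hypothetical definition).
recomputed = (
    spark.table("audit.deposits_clean")
         .filter(F.col("segment") == "retail")
         .agg(F.sum("balance").alias("total"))
         .collect()[0]["total"]
)

# Compare with the figure reported by the Bank (hypothetical value).
reported = 1_234_567_890.00
print(f"Recomputed: {recomputed:,.2f} vs reported: {reported:,.2f}")
```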
Undoubtedly, the existence of a comprehensive data dictionary is of utmost importance for successful data extraction and subsequent analysis. The data dictionary encompasses all the metadata (data about the data) for the dataset. Indicatively, it includes, as a minimum, the names and descriptions of the files and systems from which the tables are extracted, the data relationships among those files and systems, and a brief description of each table and its columns, including the type and format of the values they are expected to contain. The power of such a data dictionary lies in the common understanding it creates among all users of the data and in the reliability of the database itself.
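Indicatively, a single entry of such a data dictionary could look like the sketch below; the structure and names are hypothetical and will vary by organisation.

```python
# A hypothetical data dictionary entry for one column of one table.
entry = {
    "table": "audit.deposits_clean",
    "source_system": "core banking system, deposits module",
    "column": "balance",
    "description": "end-of-day deposit balance per customer and product",
    "type": "decimal(18,2)",
    "unit": "EUR",
    "relationships": "customer_id joins to audit.customers.customer_id",
}
```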
Population vs Sample Testing
By harnessing the computational capabilities of cloud technology, you can promptly conduct comprehensive audit tests on the entire dataset, rather than relying on a sampling approach. As a result, the materiality of any potential outliers identified can be estimated more robustly when assessed on the entire population, as opposed to being extrapolated from a sample. This not only enhances the level of audit assurance but also provides Management with more insightful information, which can facilitate and enhance decision-making.
It is noted that cloud technology, and the specific tool used in our case (Databricks), allows processing of the full population even for big data (read more on What is big data?), leveraging the processing-power settings inherent in the tool. Proper configuration allows optimisation between performance and cost. Such processing is usually not feasible on personal computers when handling big data.
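A minimal sketch of such a full-population test is shown below; the plausibility rule and the names are hypothetical, and the point is that every record is tested rather than a sample.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # pre-provided in Databricks notebooks

deposits = spark.table("audit.deposits_clean")

# Test every record in the population against a (hypothetical)
# plausibility rule, e.g. negative or implausibly large balances.
outliers = deposits.filter((F.col("balance") < 0) | (F.col("balance") > 1e9))

# Materiality is then assessed on the full population directly,
# rather than extrapolated from a sample.
print(f"Outliers: {outliers.count()} of {deposits.count()} records")
```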
“Ready-to-reproduce” analysis
Assuming the structure of the datasets remains the same or undergoes only immaterial alterations, we can re-execute the code for different time periods, enabling comparative analysis while also saving potential FTEs (Full-Time Equivalents) by automating the process for future audits in the area or for continuous monitoring.
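In practice, this can be as simple as parameterising the notebook by reporting period, e.g. through a Databricks widget; the sketch below uses hypothetical names, and `dbutils` is pre-provided in Databricks notebooks.

```python
# A reporting-date widget lets the same notebook be re-executed
# for each new period without code changes (dbutils is provided
# by the Databricks notebook environment).
dbutils.widgets.text("reporting_date", "2024-12-31")
reporting_date = dbutils.widgets.get("reporting_date")

# The rest of the analysis simply reads the selected period.
snapshot = spark.table("audit.deposits_clean").filter(
    f"reporting_date = '{reporting_date}'"
)
```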
Collaboration
Another feature of Databricks – and of unified platforms in general – is that a team can work together in a shared workspace, facilitating collaboration in a prompt and efficient manner. In addition, the workspace can be shared among all stakeholders of the firm, enabling interdepartmental collaboration and eliminating duplication of effort.
Could we use a unified platform without knowing a programming language?
It is beneficial to have at least minimal knowledge of a programming language. That said, some unified platforms provide a level of assistance that makes them accessible even to users with limited knowledge or experience. Generative AI can also help you write and enhance code without being an expert, although you should avoid sharing confidential information and should make sure you understand the Generative AI output well enough to confirm it is fit for purpose.