GENERATIVE AI GUIDE

Case study: Software developer using generative AI - EngineB

Franki Hackett, Head of Audit and Ethics at EngineB, shares how EngineB embedded generative AI into their product.

Engine B is an AI-enabled technology company, backed by the ICAEW and the accounting industry. Their technology allows accountants to easily ingest client data from any system and map it to their common data model, facilitating a hassle-free and higher-quality accounting service. “We are using generative AI to help with conforming trial balances, so you could take a trial balance from any kind of organisation and map their accounts to a common model. This means whether your client is a charity, a hairdressing group or anything else, you can use our ‘groupings’ tool to get the trial balance in a consistent format.” What generative AI is doing here is matching language and learning how different types of account, for different organisations, may be categorised within the broader context of standard accounting practices. “We still have humans to review it, and what we are seeing is that using generative AI followed by a human review is getting us to a very high level of accuracy in a fifth of the time it previously took.”

In deciding whether to use the technology, EngineB’s approach was fundamentally the same as many other companies’: they looked at what their customers wanted, what EngineB wanted to achieve and whether AI could help them achieve their objectives faster or more accurately. Franki Hackett said: “We always consider if AI will help us to do a better job, where better is defined as significantly faster and more efficient, much higher quality, or more pleasant for the user. So even if the job will take the same amount of time and be as accurate, if it just makes it nicer for a human being then that would tell us that AI is a good use case here. Sometimes we find it’s overkill and just having good front-end design, or a non-AI automation, is going to achieve the objectives, so AI isn’t necessary. Where AI helps us achieve one of those three things (quality, speed, user experience), then as standard we would go with AI as an option.”

One example use case was conforming trial balances from any customer anywhere in the audit market, quickly and accurately mapping them to a common chart of accounts so that an audit firm can compare, for example, the payables of one client with those of another, without having to go through the work of tidying and preparing the data.

When considering AI versus a manual approach, they first looked at whether there were non-AI ways to do it. The answer was yes, to some extent. It was possible to define some rules based on what historical data showed; for example, account codes starting with a one are frequently asset accounts, and similar rules are possible. The problem was that rules alone do not work very well because the accuracy is just not high enough. The alternative option was to have human beings do it.
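As a rough illustration of the kind of non-AI, rule-based approach described above, the sketch below guesses a high-level category from the first digit of an account code. The prefix rules and category names are hypothetical, not Engine B’s actual logic, and they show why accuracy stays low: anything the rules do not cover is left unmapped.

```python
# A hypothetical sketch of rule-based account classification; prefixes and
# categories are illustrative only.
PREFIX_RULES = {
    "1": "Assets",       # e.g. account codes starting with a one are often asset accounts
    "2": "Liabilities",
    "4": "Income",
    "5": "Expenses",
}

def classify_by_code(account_code: str) -> str:
    """Guess a high-level category from the first digit of the account code."""
    return PREFIX_RULES.get(account_code.strip()[:1], "Unknown")

print(classify_by_code("1200"))  # -> Assets
print(classify_by_code("9999"))  # -> Unknown: the rules leave many accounts unmapped
```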

Franki explained: “Our principal approach is that to deliver real value to our customers, we’ve got to be faster and more efficient than humans doing tasks manually. But human judgement is still the best guide we have to accuracy and quality. So in that sense defining and quantifying the benefits is about knowing when we’ve hit a good enough standard, because the quality and speed of human beings are going to be the gold standard until AI is improved in this kind of use case. We know we can’t beat trained humans on accuracy, so we defined that we want to save significant time and we want the AI, in the very first version, to be at least 50% as accurate as a human, so that we can then improve from there. We knew we wanted to be efficient, and we had to have a level of accuracy that didn’t reduce the efficiency gain or make the tool horrible to use. Having a very tight outcome focussed on benefits for our customers was key to defining benefits.”

Methodology

When deciding whether to use in-house or third-party models and training data, a lot came down to experimentation and using the resources of their internal AI team. They looked at the third-party model market and used fake data to test third-party models. They also tried using the libraries available to them to test some in-house models, using in-house generated data as well as metadata shared by partnering customer firms. They quickly found that in-house models were going to be much more effective because they could build a combination of libraries and code that was truly responsive to the specific context of the problem. “When you build AI, you often don’t start from scratch, and instead you use coding languages that have libraries, which have been developed to perform this kind of use case,” said Hackett. Using these libraries and tailoring them to their context, instead of taking an off-the-shelf generative AI, allowed them to deliver better accuracy.
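As a minimal sketch of what building on existing libraries rather than an off-the-shelf model can look like, the example below trains a small text classifier that maps client account names to a common chart of accounts. It assumes scikit-learn; the data, pipeline and labels are invented for illustration and are not Engine B’s actual model.

```python
# Illustrative only: a tiny in-house classifier built from library components,
# not Engine B's actual architecture or training data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Manually mapped examples: client account name -> common chart of accounts line
train_names = ["Trade debtors", "Bank current account", "Sales - UK", "Audit fees"]
train_labels = ["Trade receivables", "Cash at bank", "Revenue", "Professional fees"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # character n-grams cope with naming variants
    LogisticRegression(max_iter=1000),
)
model.fit(train_names, train_labels)

# The model only suggests a mapping; a human reviewer still confirms or corrects it.
print(model.predict(["Debtors - trade"]))
```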

EngineB also spoke to their connections in other technology firms, including Microsoft, about how they achieve similar goals in other language and classification problems and what works and what doesn’t. They then created a project with a team of permanent and temporary expert staff to build the training data set, using trial balances from various places, which helped to build the model. EngineB mapped the trial balances to the model manually, and this gave them both the training and the testing data set for the first version. They did this process iteratively over six months or so, with the team doing a lot of manual work adding more and more training data so they could scale up. This gave EngineB the mapping foundation to release the tool to customers for feedback in the real world. EngineB recognised that their AI, like all AIs, was never going to be 100% accurate, but it could get stronger with more training and by using their customer base, who could opt in to providing feedback to retrain the models and make improvements.

Preparing data

The first step in data preparation was the planning. EngineB had to work out every step of the process and consider what it would look like from a user’s perspective and what needed to be done first. For example, in the first year of the build the ability to compare the current year’s trial balance with the previous year’s to identify changes was not needed, so they decided to build that functionality later. But some functionality, such as the ability to make an accurate account prediction, was absolutely needed.

EngineB broke down the overall task into identifying what the common chart of accounts should look like, what features it would need to do the classification, where account codes should sit, and additional requirements such as the need to avoid duplication where there is a strict hierarchy. For example, it can be tricky to know whether trade receivables is a current or a non-current asset if it appears twice, so EngineB had to use different names for each, which resulted in a lot of data preparation.
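The disambiguation problem can be pictured with a small sketch: in a strict hierarchy the same caption can sit under more than one parent, so the common chart of accounts needs distinct names. The entries below are hypothetical examples, not Engine B’s actual chart.

```python
# Hypothetical common chart of accounts entries showing how duplicated captions
# are disambiguated by giving each a distinct name and a fixed place in the hierarchy.
COMMON_CHART = {
    "Trade receivables (current)":     ["Assets", "Current assets"],
    "Trade receivables (non-current)": ["Assets", "Non-current assets"],
    "Trade payables":                  ["Liabilities", "Current liabilities"],
}

def full_path(account: str) -> str:
    """Return the account's position in the hierarchy as a readable path."""
    return " > ".join(COMMON_CHART[account] + [account])

print(full_path("Trade receivables (current)"))
# Assets > Current assets > Trade receivables (current)
```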

Hackett says: “The core decision we made very early was to say we’re not going to work with data that’s not tidy. It’s a waste of time and effort. So we leveraged the technology we already have in our platform to conform all our training and test trial balances to our common data model before attempting to use them in the AI model. This has been a significant benefit to us internally. It’s the same benefit our customers see when they use EngineB, because a trial balance always looks the same; the account numbers and account names might be different but where they’re found is always exactly the same. This has probably saved us weeks of time manually tidying up, and that’s something we do as standard now if we are going to do an AI project not focused on how we put data into the common data model itself.”
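A minimal sketch of that “tidy first” step, assuming pandas: client trial balances arrive with different column headings and are conformed to one schema before any model sees them. The column names and target schema here are illustrative, not Engine B’s common data model.

```python
# Illustrative only: conforming one client's trial balance columns to a common schema.
import pandas as pd

COMMON_SCHEMA = ["account_code", "account_name", "debit", "credit"]

# Hypothetical mapping from one client system's headings to the common schema
CLIENT_COLUMN_MAP = {"Acct No": "account_code", "Description": "account_name",
                     "DR": "debit", "CR": "credit"}

client_tb = pd.DataFrame({"Acct No": ["1200"], "Description": ["Trade debtors"],
                          "DR": [5000.0], "CR": [0.0]})

conformed = client_tb.rename(columns=CLIENT_COLUMN_MAP)[COMMON_SCHEMA]
print(conformed)  # every conformed trial balance now has the same shape
```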

Working with vendors: Language matching

Recognising that they are part of a wider ecosystem, EngineB spoke with various people, including audit tech vendors, to understand how they approach understanding and solving language-based technology problems. This included conversations with Microsoft and various others who do language categorisation and language modelling. Hackett says: “Having those ongoing relationships is helpful because they can nudge you in the right direction, and I’ve often seen tech companies work like this. I’ve seen people using AI for very different things, for example journals risk assessment, and being a little bit cagey for obvious reasons about how their algorithms work, but quite often some places will even publish papers or present at tech conferences talking about their algorithms and how successful they are, keeping things almost semi-academic: how are we approaching this problem? What problems have we found? That kind of conversation in the tech world between tech companies and academia really enriches how we do things.”

Hackett added: “I see few audit firms in that space, maybe because they don’t have the time, or possibly because they feel they don’t have the knowledge or feel appropriately qualified, but we benefit enormously from being in it. For anyone who has ambitions to use AI, I would suggest it is worth going to these tech conferences, building those relationships with the universities and speaking to your tech company and asking ‘how does that work for you?’”

Creating governance

Franki explains: “We don’t use a different approach for good quality AI development than for other forms of development – generative AI is another tool for us and we manage it within our existing processes, which makes it less likely that specific items will get lost. Our CTO and our senior architect are responsible for making sure we have the right level of skill in the teams which build our AI tools, which is just good people management. Good recruitment and good people management are processes that are already very familiar to firms and to businesses.”

EngineB use a project management approach called Scrum, a type of agile project management commonly used in tech development. They initiate their projects by defining a requirement, based on their business strategy and customer requests, which is then broken down into tiny actionable chunks, called tickets, each of which must have additional requirements. Hackett says: “One of my jobs as head of audit and ethics is to make sure those requirements meet the needs of the audit customers and are as complete as possible. We also risk assess in the process of developing tickets, which we do for all requirements. So for example, when we develop tickets where users can type in information, we know there’s a risk people might enter malicious code, so we have ways to prevent that. With AI, similarly, we design with risk in mind. For example, when making recommendations we know people can be subject to automation bias – accepting whatever the computer tells them. So we explicitly capture on our tickets mitigations like showing a little robot face to indicate that a robot, not a human, made this suggestion, and requiring users to accept a pop-up if they want to proceed without manually reviewing the work the AI has done. To support the tickets, we also have a built-in design document which includes things like how we manage and deal with data and with AI risks; having that definition up front means it then goes into our development pipeline.”
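As a hypothetical sketch of the automation-bias mitigation described in the ticket example, the code below flags a suggestion as machine-made (driving the robot indicator) and blocks progress unless the user has either reviewed the mapping or explicitly acknowledged a pop-up. The field and function names are invented for illustration.

```python
# Hypothetical automation-bias guard; names and fields are illustrative only.
from dataclasses import dataclass

@dataclass
class Suggestion:
    account_name: str
    suggested_mapping: str
    made_by_ai: bool = True        # drives the "robot face" indicator in the UI
    reviewed_by_user: bool = False

def can_proceed(suggestion: Suggestion, user_acknowledged_popup: bool) -> bool:
    """Allow progress only if the user reviewed the mapping or explicitly accepted the risk."""
    return suggestion.reviewed_by_user or user_acknowledged_popup

s = Suggestion("Trade debtors", "Trade receivables (current)")
print(can_proceed(s, user_acknowledged_popup=False))  # False: review or acknowledgement needed
```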

Once the work has been built, the good governance process continues. Hackett says: “EngineB get all of our developers to test their own work, so before they can unleash the work on the rest of our environment they have to confirm that it is working locally, and quite often they will ask a peer to check it for them, because often you can’t see what you can’t see. Once they’ve checked it, it will then come into a separate testing environment where we have a dedicated team of people who will go back to the requirements as written and try it out against all of them (these are not auditors or accountants; they are technical testers who run test scripts and so on). They will run all those tests to check it’s working, flag if it’s not, and look at fixes.

“Once that’s been done, it goes to our expert testers, who are subject matter experts; these are auditors and accountants who will check that it works from an audit point of view. This team will understand the level of quality we require in terms of our ambitions for accuracy, will feed back any problems, and we will repeat this cycle until we reach a minimum standard. Often this is part of the loop of building up our test and training data, so the expert team will try using the AI to map something it’s never seen before, and then feed back the performance to the AI build team. We don’t release the AI until we’re happy it’s hitting that minimum standard. Often this team will also flag if the tool encourages users to behave in unexpected or unhelpful ways – it was an element of this testing which identified that the tool should prevent the user proceeding where they have relied entirely on the AI without acknowledging that.”
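The “minimum standard” gate can be pictured as a simple check: the expert testers’ results are scored, and the release is blocked until accuracy reaches the agreed threshold. The threshold and mappings below are illustrative only, not Engine B’s actual figures or test suite.

```python
# Illustrative release gate: block release until predicted mappings hit an agreed accuracy.
def accuracy(predictions: list[str], expected: list[str]) -> float:
    correct = sum(p == e for p, e in zip(predictions, expected))
    return correct / len(expected)

MINIMUM_STANDARD = 0.50  # hypothetical first-release threshold

preds    = ["Revenue", "Trade receivables (current)", "Cash at bank", "Revenue"]
expected = ["Revenue", "Trade receivables (current)", "Cash at bank", "Professional fees"]

score = accuracy(preds, expected)
assert score >= MINIMUM_STANDARD, "Do not release: below minimum standard"
print(f"Accuracy {score:.0%} meets the minimum standard")
```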

EngineB have a rigorous testing process, all of which is documented within their internal systems, which means that on the rare occasions something goes wrong they can look back, see what happened and improve their testing procedures. Quite often, if something goes wrong, it is because someone forgot to do something or didn’t realise they should test for it.

EngineB believe there isn’t a substitute for rigorous testing, and it’s about having the right people. Hackett says: “I am reasonably technical, but I can’t check, for example, if somebody has switched off an environment variable; that’s just not within my power, whereas technical testers absolutely can. They can’t check that an account code has been mapped correctly to a common chart of accounts unless we’ve already mapped it in our balance previously, because they just don’t know; if they’ve got something called doubtful debts they are not going to know about bad debt provision. It’s about having the right people doing the right testing, and it’s also about defining things as well as you can prior to testing. And making sure that all your users’ needs are represented in the process.

“It’s the subject matter experts that are doing the usability testing on that kind of “can an auditor use it, can an accountant use it?” level and our technical testers are doing usability in terms of “do all these buttons work?” Periodically as well we have external people come in and review our software and say to us how instinctive is this, how easy is it to use? And does it encourage people to over-rely or to do risky things with their data. They give us massive amounts of feedback, which we will then build back into the tool to improve it. Sometimes getting an extra outside pair of eyes on it can be very valuable.”

“Governing generative AI is not that different from other kinds of project management and other kinds of technology development. These are things that accountants are quite skilled at, certainly in terms of project management: you have to know where you are going, you have to know who’s going with you, and you then have to know how you are going to get there. Once you’ve done that, you need to be checking where you’re going while you’re actually doing it – all of those things are true of generative AI as well as other things. There are some differences: maybe you need different skills and expertise, but fundamentally this is very similar to what you already do as accountants. We think people should feel confident and take heart from that.”

Measuring benefits

For EngineB, the biggest measurable impact is on their customers. Hackett says: “We were able to quite simply measure the time it took an auditor to map a trial balance to a common chart of accounts before and after the AI, and the results showed it’s actually saving 75-80% of the time to get to the same level of accuracy, so that’s an immediate time saving.

“We also tested for accuracy, and could say that human beings were getting it about 98% right (sometimes it’s really hard to tell, as you need client-specific knowledge that we are never going to be able to replicate). We then tested the AI, and the first version was 50% accurate on its first recommendation and 85% accurate within the top 10 recommendations. That’s obviously fantastic; we were very pleased to hit those levels of accuracy.”
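Those two figures correspond to top-1 and top-10 accuracy: a prediction counts as correct at k if the right account appears anywhere in the model’s top k suggestions. The sketch below shows the measurement on made-up data.

```python
# Illustrative top-k accuracy calculation; the ranked suggestions are made up.
def top_k_accuracy(ranked_suggestions: list[list[str]], expected: list[str], k: int) -> float:
    hits = sum(e in suggestions[:k] for suggestions, e in zip(ranked_suggestions, expected))
    return hits / len(expected)

ranked = [["Revenue", "Other income"],
          ["Cash at bank", "Trade receivables (current)"],
          ["Trade payables", "Accruals"]]
expected = ["Revenue", "Trade receivables (current)", "Accruals"]

print(top_k_accuracy(ranked, expected, k=1))  # top-1: only the first suggestion counts
print(top_k_accuracy(ranked, expected, k=2))  # top-2: a hit anywhere in the first two counts
```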

EngineB could then further improve accuracy once the tool was rolled out to customers, who fed information back by sharing the correct and incorrect mappings they were seeing. EngineB took that feedback on board to retrain the AI and continued to enhance and improve it.

Hackett says: “If you’ve defined the benefits and quantified them early, you are going to be able to keep measuring them later. If you don’t do that upfront, you are going to have to scramble to do it later.”

Challenges and lessons from implementation

One of the challenges EngineB faced was making sure that everyone was on the same page and had the same understanding. Hackett says: “This applies to our whole business, and I think it applies to the development of any AI or any technology. We all start off thinking we agree on what’s going to happen, and we write down and agree the requirements, but it is very rare that I don’t have a week where I’m testing something, or someone in my team is testing something, and I have the experience of going ‘that’s not at all what I meant, and yet I totally see why you thought I meant that’, because we all make assumptions, and particularly people with different skills backgrounds will have different views on how things should or would behave.

“The thing that I see all the time is that software people will have assumptions about how data is treated, software engineers or database engineers will have a different set of assumptions, and auditors again will have a totally different set of assumptions about what is normal behaviour. We have our principles document, which is something that we developed over the last year or so; you have to align some basic agreements, otherwise you will find you’re going back over the same issues again and again. Probably the hardest thing to do is to look at the assumptions you are making.

“I teach my team to pretend they are explaining things to a Martian or a very small child: you have to assume no background knowledge, not because people are stupid but because people have other kinds of background knowledge.”

Instead, EngineB found it helpful to list in tickets a set of resilience criteria which outline the instances where a function should break and how, because sometimes the AI is going to do something which to the software developer seems really stupid but to an auditor seems normal, and vice versa. For example, to an auditor an account code called ‘increase in year’ sitting directly underneath ‘freehold property’ is almost certainly an account you want to roll up into assets. To a software engineer who has only an initial overview of what the balance sheet and P&L do, ‘increase in year’ seems like a slam-dunk for an income code. A software developer is just not going to know that the data behaves in a particular way. If the AI predicts an income code, the software developer is happy and the accountant isn’t. So you need to have some resilience built in to allow the AI to make a wrong prediction and then get corrected, and to capture learning from previous work: I see this gap in the data, the account name isn’t the only relevant information, position is relevant too, so I’m just going to tweak my tool to reflect that. Hackett says: “This is why EngineB uses Scrum; it’s reflective, in part it makes you go back and learn. Be aware of the assumptions and be aware of what the customer wants – those are the two big pieces of advice I would give people.”
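That lesson, that the account name alone is not enough, can be illustrated with a small sketch in which an account’s position (here, its nearest parent caption) is combined with its name before it reaches the model. The feature format is hypothetical, not Engine B’s actual design.

```python
# Illustrative only: combining an account's name with its parent caption so that
# position in the trial balance informs the prediction, not just the wording.
def build_features(account_name: str, parent_caption: str) -> str:
    """Combine the account name with its parent caption into one feature string."""
    return f"{parent_caption} > {account_name}"

# "Increase in year" on its own looks like an income code; under "Freehold property"
# it is an asset movement, and the combined feature makes that context visible.
print(build_features("Increase in year", "Freehold property"))
```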
