When first embarking on an AI project, the biggest question concerns the quantity, type and quality of data. It is a tricky question to answer: everything depends on the use case the AI is being built to address.
For instance, if the AI is being trained for facial recognition, the training data must include different lighting conditions, facial angles (front, side, high and low, for instance), single faces, faces within crowds, and a range of ethnicities, genders and ages, among other attributes. Without facial diversity, the AI will innately contain biases because of its limited learning: it will recognise some facial characteristics with a high degree of accuracy, but struggle with others.
In previous entries in this series, we looked at the objectives of our AI project to tackle modern slavery, and the various ways to tackle bias and hallucinations. In the context of modern slavery, the data issue is complex. Modern slavery occurs for a variety of reasons, in various ways, and is inherently concealed by its perpetrators. This means that data is often scarce, disparate, incomplete and in non-digital forms.
When data elements are grouped together, as is typical when problem solving, the result is known as a dataset. The datasets needed for iEARTHS are numerous, precisely because the issue is complex, layered and varied, and the task is specific.
For instance, when considering the datasets needed to develop the economic model used to show the P&L impact of a given action or inaction, we had to normalise average country wage calculations, because individual countries’ calculation methods and standards of living often differ.
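To make the normalisation step concrete, the sketch below shows one common approach: converting local wages into purchasing power parity (PPP) terms so they can be compared across countries. The function name and conversion factors are illustrative assumptions, not the actual iEARTHS methodology.

```python
# A minimal sketch of wage normalisation via purchasing power parity
# (PPP). The conversion factors below are invented for illustration;
# real values would come from a source such as the World Bank.

PPP_FACTORS = {
    "GBR": 0.70,   # local currency units per international dollar
    "BGD": 32.0,
}

def normalise_wage(local_wage: float, country: str) -> float:
    """Convert a local-currency wage into PPP-adjusted international
    dollars so wages can be compared on a like-for-like basis."""
    return local_wage / PPP_FACTORS[country]

# Two nominally very different monthly wages may be closer once adjusted.
uk_wage = normalise_wage(2_000.0, "GBR")    # GBP per month
bd_wage = normalise_wage(15_000.0, "BGD")   # BDT per month
print(f"UK: {uk_wage:,.0f} vs Bangladesh: {bd_wage:,.0f} (int'l $)")
```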
Why is this important? Imagine the AI doesn’t ‘understand’ nuanced wage differences. The costing determination for each recommendation could inadvertently lead a user to decide to continue suboptimal labour practices in lower-cost jurisdictions, where individuals are more likely to be vulnerable and exploited. This would be an example of the AI causing more harm, which is clearly not an acceptable outcome. The key consideration in this aspect of data is quality.
Data quality can help lessen the need for quantity. If we return to our facial recognition example, training the AI with high-resolution photos is more effective than using a larger volume of low-resolution ones.
In the context of modern slavery, all the data consumed by iEARTHS is curated and of high quality. We took the time to ensure datasets were validated by NGOs, economists, supply chain experts, behavioural scientists, survivors and supply chain investigators, among others. It is a painstaking process, but an essential one to ensure a solid foundation.
Data terminology also posed some interesting challenges. How something is defined can differ between actors within the ecosystem; even the same term can mean very different things to different users.
Normalising definitions, or linking them as interoperable, ensures a consistent understanding within the domain and among its users. We addressed this challenge using taxonomies and knowledge graph technology.
Knowledge graphs generally include individually sourced datasets, often with unique structures. These datasets are in turn connected through schemas, identities and context.
A schema provides a wireframe for the knowledge graph, while context distinguishes the setting within which the knowledge exists. In other words, this approach allows the iEARTHS AI to disambiguate terms and concepts that hold multiple meanings. Structuring and contextualising data is the starting point for effective AI learning.
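As an illustration, the sketch below uses the open-source rdflib library to show how a schema plus context can disambiguate a term such as ‘bonded’, which means one thing in a labour-exploitation setting and quite another in customs logistics. The namespace and vocabulary are hypothetical, not the actual iEARTHS knowledge graph.

```python
# A minimal sketch of schema plus context in a knowledge graph,
# using rdflib. The example.org namespace and terms are hypothetical.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/modern-slavery/")
g = Graph()

# Schema: two distinct concepts that share the everyday label "bonded".
g.add((EX.BondedLabour, RDF.type, RDFS.Class))
g.add((EX.BondedLabour, RDFS.label, Literal("bonded (debt bondage)")))
g.add((EX.BondedWarehouse, RDF.type, RDFS.Class))
g.add((EX.BondedWarehouse, RDFS.label, Literal("bonded (customs warehouse)")))

# Context: each concept is anchored to the setting in which it applies.
g.add((EX.BondedLabour, EX.context, EX.LabourExploitation))
g.add((EX.BondedWarehouse, EX.context, EX.TradeLogistics))

# A query can now retrieve the right sense of "bonded" for a given context.
for concept in g.subjects(EX.context, EX.LabourExploitation):
    print(concept, "->", g.value(concept, RDFS.label))
```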
Sourcing the data was also a consideration, because different sources carry different levels of trustworthiness. Some AI tools, such as generative AI (GenAI), are trained on all publicly available data – good and bad – the effects of which we saw in the first article of this miniseries.
The problem that iEARTHS is tackling is, by its very nature, about human impacts, meaning the training data must be of the highest quality, as free of bias and as complete as possible.
Data gaps are addressed through targeted data acquisition, such as from on-the-ground NGOs, or through advanced learning techniques, such as data augmentation. It is also important to recognise that the AI’s data processing is constrained by computing power, which needs to be managed as well.
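As a simple illustration of data augmentation, the sketch below generates extra training variants from a scarce example via synonym substitution. The synonym table and sentence are invented for the example; production pipelines typically use richer techniques such as back-translation or paraphrase models.

```python
# A minimal sketch of text data augmentation by synonym substitution.
# The synonym table is illustrative only.
import random

SYNONYMS = {
    "withheld": ["confiscated", "retained"],
    "wages": ["pay", "earnings"],
    "recruiter": ["agent", "broker"],
}

def augment(sentence: str, rng: random.Random) -> str:
    """Create a variant of a training sentence by swapping words for
    known synonyms, increasing the effective dataset size."""
    words = []
    for word in sentence.split():
        options = SYNONYMS.get(word.lower())
        words.append(rng.choice(options) if options else word)
    return " ".join(words)

rng = random.Random(42)
report = "the recruiter withheld wages for six months"
for _ in range(3):
    print(augment(report, rng))
```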
Finally, large language models (LLMs) are designed to understand and generate human-like text based on the vast amounts of data used to train them. LLMs can infer context, translate between languages, answer questions and more.
They can be a useful technological tool, but they are often quite biased, as a recent scientific report demonstrates: LLMs inherit gender and racial biases from archival training data when that data is not adapted to eliminate historical prejudice, racism and stereotypes.
iEARTHS does not rely on LLMs or proxy approaches to AI training because of the bias findings in a recent Stanford University study. However, we are evaluating mechanisms for neutralising such biases, which could allow for their use in future.
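One widely used mechanism for surfacing such biases is counterfactual testing: swap demographic terms in an input and compare the model’s outputs. The sketch below illustrates the idea; the score_sentiment stand-in and the term pairs are hypothetical, not an iEARTHS component.

```python
# A minimal sketch of counterfactual bias testing. score_sentiment is
# a toy stand-in; a real check would call the model under evaluation.

COUNTERFACTUAL_PAIRS = [("he", "she"), ("his", "her")]

def swap_terms(text: str) -> str:
    """Swap each demographic term for its counterpart."""
    mapping = {}
    for a, b in COUNTERFACTUAL_PAIRS:
        mapping[a], mapping[b] = b, a
    return " ".join(mapping.get(w, w) for w in text.split())

def score_sentiment(text: str) -> float:
    """Toy stand-in for the model under test: counts 'positive' words."""
    positive = {"skilled", "reliable", "trustworthy"}
    return float(sum(word in positive for word in text.lower().split()))

def bias_gap(prompt: str) -> float:
    """A persistently nonzero gap across many prompts suggests the
    model treats the swapped groups differently."""
    return abs(score_sentiment(prompt) - score_sentiment(swap_terms(prompt)))

print(bias_gap("he is a skilled and reliable worker"))  # 0.0 for the toy scorer
```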
Data is the foundation upon which any AI tool learns. Biases are built into, or prevented within, datasets, depending on whether historical data is used as-is or adapted to reflect the future we would like to see (such as one that is equitable and inclusive). Sufficiency is relative to the problem the AI will be tasked with solving. Any AI project requires the user to understand the data’s limitations so they can be actively managed.
In the next and final instalment of the miniseries, we will look at the intersection of ethics with AI, and how finance professionals can play a pivotal role in supporting businesses to transition towards sustainable operating models.