The challenge of biases in the construction of Artificial Intelligence systems
The iDanae Chair (where iDanae stands for intelligence, data, analysis and strategy in Spanish) for Big Data and Analytics, created within the framework of a collaboration between the Polytechnic University of Madrid (UPM) and Management Solutions, has published its 1Q25 quarterly newsletter on the challenge of biases in the construction of Artificial Intelligence systems.
The iDanae Chair for Big Data and Analytics, created within the framework of a collaboration between UPM and Management Solutions, aims to promote the generation and dissemination of knowledge, the transfer of technology, and the furthering of R&D in the Analytics field. In this context, one of the lines of work developed by the iDanae Chair is the analysis of meta-trends in the field of Analytics.
The challenge of biases in the construction of Artificial Intelligence systems
Introduction
In a world increasingly driven by Artificial Intelligence (AI), machine learning models have become essential tools for decision-making across a wide range of applications, from medicine to finance. Their ability to analyse large volumes of data and extract useful patterns has transformed entire sectors, enabling advances that once seemed unattainable. However, this growing adoption has also highlighted critical challenges related to possible biases that originate in business processes, are reflected in training data, and may produce unfair decisions when these models are used. Addressing these aspects is essential to ensure a correct implementation and deployment, as well as a responsible and ethical use of the models.
Biases inherent in data and models can lead to undesired decisions that negatively affect different groups. These biases may be present in the data used to train the models, either because the process of collecting and storing the data is incomplete (e.g. when some relevant information is not collected and stored) or because the information present in the data represents a biased reality (e.g. when there are undesired patterns, such as historical decisions, treatments or behaviours that should be avoided). In both cases, these biases can be perpetuated or even amplified by algorithms. This may jeopardise the correct application of the model and could also raise ethical questions about the decisions taken automatically using the model.
This newsletter aims to address these issues from two key perspectives: (1) biases in machine learning models, considering both the biases in the training datasets and those introduced in the modelling process, and (2) biases in large language models. Through an analysis of metrics, strategies and innovative tools, it examines how research and development communities are working to detect and mitigate biases, contributing to an ethical use of AI systems. Specific metrics to assess the impact of algorithmic decisions on different population groups, as well as data preprocessing techniques and algorithmic adjustments, are explored.
Biases in Artificial Intelligence
Concept
In general terms, machine learning models are trained on observed or generated data and are implemented and used for the execution of general or specific tasks (prediction, classification, decision-making, etc.). However, some characteristics of the different components of the modelling process may have an (undesired) influence on the results, producing a (systematic, i.e. not random) deviation between the actual value and the expected value given by the model. This type of deviation is called bias. Although biases cannot be totally eliminated, they can be mitigated, and to do so it is necessary to know where they come from. The sources of a bias can be classified into two types: (1) bias stemming from the data used for training the model; and (2) bias introduced during the modelling process, typically derived from decisions taken or techniques applied by the modeller. As a result, the presence of biases may imply discrimination against certain groups of people and lead to unfair decisions in practice.
Biases in the training data
The models are trained on a set of data with the objective of capturing the patterns or behaviours present in the training dataset. Indeed, the training process uses a function to minimise the difference between the output of the model and the observed data. If the data show patterns or behaviours that do not accurately represent reality, the output of the model will be biased. Therefore, it is key to assess the representativeness of the dataset, i.e. that the data correctly represent the reality that is to be modelled.
A dataset is said to be representative when there are no significant differences between the behaviours, patterns or characteristics of the data and the reality or population to which the model is expected to be applied. These differences mainly arise for two reasons:
1. The reality reflected in the training dataset has changed (e.g. when there are changes in the strategy of a company, in policies, in decision or business processes, in regulation, in macroeconomic or market conditions, etc.).
2. The process that generates the training data is designed in such a way that it does not represent the reality to be modelled. This occurs when:
a) The business process affects the information produced or the patterns or behaviours that will be present in the dataset (known as historical bias). For example, in the credit admission process of a financial institution, only good creditors are granted a loan and, therefore, no default history can be observed for the rejected loans.
b) The process of data collection or the construction of the dataset is designed in such a way that it distorts the representation of the real patterns or behaviours, or does not include information that is determinant for the represented reality.
To understand and determine the possible lack of representativeness of the data, a representativity analysis must be performed during the data preparation phase, considering both quantitative analysis (e.g. statistical distributions) and qualitative criteria (e.g. population characteristics, similarity of business processes, etc.).
As a first step to address the lack of representativeness, the incorporation of all relevant data into the modelling sample should be ensured, if it is not yet present. For example, in the case of rejected loans an inference technique may be applied: the payment behaviour that would have been observed had the loan been granted is estimated, with a probability, and the rejected loans are then also included in the modelling sample.
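As an illustration, the sketch below shows one simple form of this kind of inference (often called fuzzy augmentation): a provisional model trained on the accepted loans estimates the default probability of each rejected application, and the rejected loans are then added to the sample with weights derived from those probabilities. The column names and the choice of scikit-learn estimator are assumptions made only for this example.

```python
# Minimal reject-inference sketch ("fuzzy augmentation"), for illustration only.
# Assumes two pandas DataFrames with hypothetical columns: the features in
# X_COLS and a binary target "default" observed only for accepted loans.
import pandas as pd
from sklearn.linear_model import LogisticRegression

X_COLS = ["income", "debt_ratio", "age"]  # hypothetical feature names


def fuzzy_augmentation(accepted: pd.DataFrame, rejected: pd.DataFrame) -> pd.DataFrame:
    # 1. Train a provisional model on the accepted population only.
    base = LogisticRegression(max_iter=1000)
    base.fit(accepted[X_COLS], accepted["default"])

    # 2. Estimate the default probability the rejected loans would have shown.
    p_bad = base.predict_proba(rejected[X_COLS])[:, 1]

    # 3. Add each rejected application twice, as a "bad" and as a "good" record,
    #    weighted by the estimated probabilities.
    rej_bad = rejected.assign(default=1, weight=p_bad)
    rej_good = rejected.assign(default=0, weight=1.0 - p_bad)
    return pd.concat([accepted.assign(weight=1.0), rej_bad, rej_good], ignore_index=True)


# The augmented sample can then be used to re-train the final model:
# augmented = fuzzy_augmentation(accepted_df, rejected_df)
# final = LogisticRegression(max_iter=1000).fit(
#     augmented[X_COLS], augmented["default"], sample_weight=augmented["weight"]
# )
```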
The case where a pattern or behaviour is misrepresented in the data, or where a subpopulation shows an undesired pattern or behaviour, should also be analysed. To this end, some techniques could be applied:
- Estimating metrics to determine the presence of said pattern (see some examples of these metrics below). These metrics could be applied both to the training dataset and after the model deployment, to check whether the bias has been corrected.
- Analysing the relations among the relevant variables underlying the data-generation mechanism to identify causal relationships that could be producing the observed misrepresentation. In doing so, the use of Causal Bayesian Networks (CBNs) could be considered. CBNs constitute a powerful graphical tool for representing and reasoning about causal relationships in data. More specifically, in the context of identifying biases, CBNs facilitate the visualisation of how some attributes directly or indirectly influence decisions or outcomes through causal pathways (see the sketch after this list). The use of this tool may enable the identification of misrepresented patterns within datasets and facilitate the design of models that seek to mitigate those biases.
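The following is a minimal sketch of the idea of making causal pathways explicit, using networkx only to enumerate the paths of a hand-specified graph; the node names are hypothetical, and a full CBN analysis would rely on dedicated causal-inference tooling rather than this simplified enumeration.

```python
# Sketch: enumerate the causal pathways from a protected attribute to the
# outcome in a hand-specified causal graph. Node names are hypothetical;
# networkx is used here only to make the pathways explicit.
import networkx as nx

causal_graph = nx.DiGraph([
    ("gender", "occupation"),     # assumed indirect pathway via a proxy variable
    ("occupation", "income"),
    ("income", "loan_decision"),
    ("gender", "loan_decision"),  # assumed direct pathway
    ("education", "income"),
])

for path in nx.all_simple_paths(causal_graph, source="gender", target="loan_decision"):
    kind = "direct" if len(path) == 2 else "indirect"
    print(f"{kind} pathway: {' -> '.join(path)}")
```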
To mitigate biases in this case, rebalancing the dataset is an option to be explored (for example, using oversampling techniques or incorporating synthetic data), as in the sketch below.
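A minimal sketch of such a rebalancing, assuming a pandas DataFrame with a hypothetical group column, could look as follows (random oversampling is only one possible choice; synthetic data generation is another):

```python
# Minimal rebalancing sketch: random oversampling of under-represented groups.
# The column name "group" is a hypothetical placeholder.
import pandas as pd
from sklearn.utils import resample


def oversample_groups(df: pd.DataFrame, group_col: str = "group") -> pd.DataFrame:
    # The size of the largest group defines the target size for every group.
    target_size = df[group_col].value_counts().max()
    pieces = []
    for _, subset in df.groupby(group_col):
        if len(subset) < target_size:
            subset = resample(subset, replace=True, n_samples=target_size, random_state=0)
        pieces.append(subset)
    return pd.concat(pieces, ignore_index=True)


# balanced_df = oversample_groups(training_df)
```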
Biases in the modelling process
During the modelling process, the modeller may take decisions that could influence the result. Therefore, it is important to apply appropriate techniques for detecting and avoiding this kind of undesired consequence. For example, an inappropriate split between training and test samples could bias the samples. The use of a random selection addresses the problem, and a cross-validation technique may be applied to check the correctness of the split.
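A minimal sketch of this practice, using scikit-learn on synthetic data purely for illustration:

```python
# Random, stratified train/test split plus cross-validation to check that the
# measured performance does not depend on one particular split of the data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# A random, stratified split keeps the class proportions in both samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 5-fold cross-validation: a large spread across folds would suggest that the
# result depends on how the data were split.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
print("test accuracy:", model.score(X_test, y_test))
print("cross-validation accuracy per fold:", cv_scores)
```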
If a specific bias has not been resolved in the data treatment phase and is still present in the dataset, the training of the model could help to correct this situation. This is done by training an algorithm that accepts the use of weights, so that the samples can be re-weighted, changing the representation and the “influence” of some samples relative to others in the estimation of the parameters. Other techniques suggest the use of adversarial training (an adversarial model is trained to predict whether a sample belongs to a protected group given the prediction of the model being trained; if this prediction is successful, then the bias could be present), or the modification of the cost function by adding a fairness metric to the error of the model that is minimised to obtain the parameter estimates. For example, Exponentiated Gradient Reduction is a training technique that allows the use of constraints such as demographic parity or equalized odds.
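As an illustration, one open-source implementation of this reduction approach is available in the fairlearn library; the following sketch, which assumes a toy binary protected attribute and synthetic data, fits a classifier under a demographic parity constraint:

```python
# Sketch of Exponentiated Gradient Reduction under a demographic parity
# constraint, using the open-source fairlearn library (one possible choice).
# The data and the binary protected attribute are synthetic placeholders.
import numpy as np
from fairlearn.reductions import DemographicParity, ExponentiatedGradient
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
sensitive = np.random.default_rng(0).integers(0, 2, size=len(y))  # toy protected attribute

mitigator = ExponentiatedGradient(
    estimator=LogisticRegression(max_iter=1000),
    constraints=DemographicParity(),
)
mitigator.fit(X, y, sensitive_features=sensitive)
y_pred = mitigator.predict(X)

# After mitigation, the selection rates per group should be closer to each other.
for g in (0, 1):
    print(f"group {g}: selection rate = {y_pred[sensitive == g].mean():.3f}")
```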
In addition, new techniques have been introduced to address biases in machine learning models. Methods such as marginal direct effect (MDE) measurements or SHapley Additive exPlanations (SHAP) have shown that even advanced unbiased learning techniques can generate biases, either through association or through inverse discrimination, in both real and synthetic datasets. These techniques, MDE and SHAP, when used to construct loss functions, can be beneficial in addressing bias-related issues. A key challenge remains avoiding direct discrimination without incurring indirect discrimination (i.e., inducing discrimination through variables correlated with the protected attribute that act as proxies), such as the phenomenon of "redlining" (the practice whereby banks in the United States systematically denied loans and services to certain racial groups based on the areas in which they lived).
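As a sketch of how SHAP values can support this kind of analysis, the example below checks how much a potential proxy variable contributes to a model's predictions; the feature names and the toy data are assumptions made for illustration only:

```python
# Sketch: use SHAP values to check how much a potential proxy variable
# (here a hypothetical "postcode_segment" feature) drives the predictions.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "income": rng.normal(size=1000),
    "debt_ratio": rng.normal(size=1000),
    "postcode_segment": rng.integers(0, 5, size=1000).astype(float),  # proxy candidate
})
y = (X["income"] - 0.5 * X["debt_ratio"] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.Explainer(model)   # a tree explainer is selected automatically
shap_values = explainer(X)

# Mean absolute SHAP value per feature; a high value for the proxy candidate
# would flag possible indirect discrimination and call for further analysis.
importance = np.abs(shap_values.values).mean(axis=0)
if importance.ndim > 1:             # collapse a per-class dimension if present
    importance = importance.mean(axis=-1)
for name, value in zip(X.columns, importance):
    print(f"{name}: mean |SHAP| = {value:.3f}")
```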
Recent advances have significantly reduced explicit bias in model responses. These methods leverage the ability of such systems to recognize inappropriate behaviour when flagged. While these techniques are not infallible, the prevalence of problematic behaviours is expected to decrease with the continuous refinement of methodologies. However, this does not imply complete control over the models, and constant monitoring remains essential.
A new challenge: biases in Large Language Models
Although large language models (LLMs) often achieve high levels of accuracy, they can be influenced by biases that negatively impact their results, especially for underrepresented minority groups. Biases introduced during the development of an algorithm can lead to issues such as discrimination, lack of fairness, and a shortage of diversity and inclusion. These biases often arise unintentionally due to insufficient representation in the data used to train the models. However, even methods designed to mitigate discrimination often produce biased models, regardless of whether the datasets used contained discriminatory elements or not.
One of the observed biases is the so-called “social bias”: the attribution of specific characteristics to a person based solely on the group to which they belong. For instance, a model might erroneously associate characteristics like "punctuality and efficiency" with an individual from the Nordic countries simply because of their geographical origin. Several studies have confirmed that LLMs tend to incorporate social biases present in unprocessed training data. These biases can influence the decisions made by the model in downstream tasks, thereby compromising fairness. In this context, a truly fair language model must operate impartially.
Metrics for detecting biased behaviour between two subpopulations
A special type of bias may be present when the same behaviour or pattern is expected to apply to two different subpopulations, but the data show a different pattern. A set of metrics could be applied to detect these situations. These metrics can be divided into two different types: group and individual metrics [5].
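As a minimal illustration of the group metrics mentioned above, two common quantities (the demographic parity difference and the equalized odds difference) can be computed directly from predictions, observed labels and group membership; the arrays in the sketch are hypothetical placeholders:

```python
# Sketch of two common group fairness metrics computed with plain NumPy.
# y_true, y_pred and group are hypothetical arrays of the same length.
import numpy as np


def demographic_parity_difference(y_pred, group):
    # Gap in selection rates (share of positive predictions) between groups.
    rates = [y_pred[group == g].mean() for g in np.unique(group)]
    return max(rates) - min(rates)


def equalized_odds_difference(y_true, y_pred, group):
    # Largest gap in true-positive or false-positive rates between groups.
    gaps = []
    for label in (1, 0):  # label 1 gives the TPR gap, label 0 the FPR gap
        rates = [
            y_pred[(group == g) & (y_true == label)].mean()
            for g in np.unique(group)
        ]
        gaps.append(max(rates) - min(rates))
    return max(gaps)


# Toy usage with placeholder data:
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print("demographic parity difference:", demographic_parity_difference(y_pred, group))
print("equalized odds difference:", equalized_odds_difference(y_true, y_pred, group))
```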
Biases stemming from the training of the LLM
Biases affecting language models may originate from the following sources:
- Label bias: training corpora containing harmful information or subjective labels provided by biased annotators can introduce stereotypes into the model.
- Sampling bias: when the representation of different demographic groups in the test data does not match that of the training data, the model exhibits bias influenced by this distribution mismatch.
- Semantic bias: unexpected biases may arise during the encoding process, embedding skewed semantic information into the model’s representations.
- Bias amplification: biases present in the training data can be amplified during the model's learning process, with this effect further reinforced during fine-tuning.
Some training methodologies could also produce a bias. For example:
- Pre-training and fine-tuning: this approach begins with an unsupervised pre-training phase on a large corpus, followed by supervised fine-tuning for a specific task. It is commonly applied to medium-sized LLMs and enables a wide range of applications. However, two types of biases may emerge in this process:
- Intrinsic bias: arises from the representations generated by the pre-trained model and is independent of specific tasks.
- Extrinsic bias: manifests in the model’s outputs for specific tasks, such as predicted labels in classification or generated text in creative tasks.
- Prompt-based paradigm: in this approach, the model responds to task-specific prompts without requiring additional training phases. This method is more suitable for large-scale LLMs but may reflect human biases in the model’s generated responses, particularly in models that have been fine-tuned using techniques like reinforcement learning from human feedback (RLHF).
To address and measure biases in LLMs, evaluation metrics have been developed and categorized into intrinsic and extrinsic bias metrics:
- Intrinsic bias metrics: these metrics evaluate biases embedded in the model’s representations by quantifying statistical associations between target concepts and attributes:
- Similarity-based metrics: use semantically neutral sentence templates to calculate similarities between different demographic groups. They are an adaptation of the Word Embedding Association Test (WEAT), which measures bias in word embeddings.
- Probability-based metrics: formalize intrinsic bias by analyzing probabilities generated by pre-trained LLMs for candidate words or phrases based on evaluation datasets.
- Extrinsic bias metrics: these metrics assess biases in the outputs of downstream tasks, characterizing extrinsic bias through performance gaps. Benchmark datasets are often created to measure bias in specific tasks:
- NLU-based metrics: evaluate classification models, such as BERT, on natural language understanding (NLU) tasks. A task-specific classifier is trained on an evaluation dataset, and its output is used as a metric.
- NLG-based metrics: evaluate autoregressive models, such as GPT-2, on natural language generation (NLG) tasks. The model is fine-tuned on an evaluation dataset containing prompts for different conditions, and the generated outputs are assessed.
Using these metrics, biases in LLMs may be identified and addressed, ensuring these systems operate more fairly and equitably across various applications.
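As an illustration of the similarity-based family described above, the following sketch computes a WEAT-style association effect size; the embeddings are random placeholders (in practice they would be taken from the model under evaluation) and the word sets are chosen only for the example:

```python
# WEAT-style association effect size between two sets of target words (X, Y)
# and two sets of attribute words (A, B). The embeddings below are random
# placeholders; in practice they would come from the model under evaluation.
import numpy as np

rng = np.random.default_rng(0)
DIM = 50
VOCAB = ["engineer", "nurse", "he", "him", "she", "her"]
EMB = {word: rng.normal(size=DIM) for word in VOCAB}  # placeholder embeddings


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def association(word, attrs_a, attrs_b):
    # s(w, A, B): mean similarity to attribute set A minus mean similarity to B.
    return (np.mean([cosine(EMB[word], EMB[a]) for a in attrs_a])
            - np.mean([cosine(EMB[word], EMB[b]) for b in attrs_b]))


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    s_x = [association(w, attrs_a, attrs_b) for w in targets_x]
    s_y = [association(w, attrs_a, attrs_b) for w in targets_y]
    return (np.mean(s_x) - np.mean(s_y)) / np.std(s_x + s_y, ddof=1)


print("WEAT effect size:",
      weat_effect_size(["engineer"], ["nurse"], ["he", "him"], ["she", "her"]))
```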
Biases stemming from the use of the LLM: hallucinations
A key factor related to the presence of biases is the phenomenon known as hallucination. Hallucinations in LLMs represent a significant challenge, as these models can generate incorrect, inconsistent, or fabricated responses. Recent research indicates that the hallucination phenomenon stems from issues within the datasets, such as the lack of relevant data and the presence of repeated data. This problem, inherent to LLMs, has been shown to be impossible to eliminate completely, although various techniques have been proposed for its mitigation.
Hallucinations can be divided into several types:
- Intrinsic hallucinations occur when the model's responses directly contradict the prompts or input information, while extrinsic hallucinations manifest when outputs cannot be verified against the provided information.
- Factual hallucinations highlight discrepancies between the generated content and verifiable facts from the real world. Fidelity hallucinations, on the other hand, encompass inconsistencies related to user instructions, context, or logic, such as deviations from the original instructions, contextual contradictions, and errors in internal coherence.
- Other issues include factual inaccuracy, where responses are incorrect but based on existing data, and misinterpretation, which occurs when the model fails to correctly understand the input or context, resulting in inaccurate responses.
- Finally, the needle-in-a-haystack problem reflects the difficulty of retrieving specific and precise information from a vast corpus, which can lead to omissions of key data or partially correct responses.
Various studies have introduced strategies to address hallucinations in LLMs. These include the use of fact-centered datasets and extensive data cleaning to reduce noise and existing biases. Additionally, reasoning techniques such as chain of thought and tree of thought have been proposed to structure responses in a logical and coherent manner. Another key strategy is to enhance the model's capabilities through advanced prompting, model ensembling, and the implementation of guardrails, which are programmatic rules designed to limit undesirable behaviours. Furthermore, providing additional data, improving dataset adjustments, adding synthetic data, and making architectural improvements strengthen model accuracy.
Another approach to mitigate hallucinations is Retrieval Augmented Generation (RAG), which combines external information retrieval with response generation (see figure 1). This includes techniques such as iterative retrieval, which allows knowledge to be gathered during the generation process, and post-hoc retrieval, which refines model outputs based on external information. Additionally, post-generation fact verification ensures that responses are consistent with reliable sources.
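A minimal sketch of the retrieval step of a RAG pipeline, using TF-IDF similarity over a toy document store; the llm_generate function is a hypothetical placeholder for whichever generation call is used in practice:

```python
# Minimal Retrieval Augmented Generation sketch: retrieve the most similar
# documents with TF-IDF similarity and prepend them to the prompt.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

DOCUMENTS = [
    "The iDanae Chair is a collaboration between UPM and Management Solutions.",
    "Demographic parity compares selection rates between groups.",
    "Chain of thought prompting encourages step-by-step reasoning.",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(DOCUMENTS)


def llm_generate(prompt: str) -> str:
    # Hypothetical placeholder: plug in any LLM completion call here.
    raise NotImplementedError("connect this function to an LLM of your choice")


def retrieve(query: str, k: int = 2) -> list:
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    best = scores.argsort()[::-1][:k]
    return [DOCUMENTS[i] for i in best]


def rag_answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm_generate(prompt)


# rag_answer("What does demographic parity compare?")
```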
Example of Chain of Verification

Standard Prompting (No Chain of Verification)
Input:
Q: A store sells 5 apples for $2. How much would 20 apples cost?
Model Output:
A: The answer is $8.
The answer is CORRECT, but the model did not check its math properly.

Chain of Verification Prompting
Input:
Q: A store sells 5 apples for $2. How much would 20 apples cost?
Model Output (Step-by-Step Verification):
A: First, we know that 5 apples cost $2. To find the cost of 1 apple, we divide $2 by 5, which gives $0.40 per apple. Now, we multiply $0.40 by 20 apples: $0.40 × 20 = $8. Let me verify: I need to multiply the price of 5 apples by 4 (because 20 apples are 4 times 5). $2 × 4 = $8. So, the correct cost of 20 apples is $8. The answer is CORRECT.

The model first breaks down the problem, verifies the intermediate steps, and ensures that the calculation is correct before providing the final answer. The model checks its work and confirms that $8 is indeed the correct answer.
The fact-checking approach consists of two main stages: fact extraction, which involves identifying factual statements within the results generated by the model, and fact verification, which aims to check the accuracy of these statements against reliable knowledge sources. Depending on the sources used, verification methodologies are divided into external retrieval, which involves consulting databases or external resources, and internal verification, which relies on the parametric knowledge stored within the LLMs. In this context, the Chain of Verification (CoVe) was introduced: an approach in which an LLM generates verification questions for an initial draft and then systematically evaluates its consistency, detecting possible inconsistencies or hallucinations. This process allows the model to self-correct errors and provide more accurate responses through an iterative cycle of generation and verification.
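A schematic sketch of the CoVe cycle described above; the llm function is a hypothetical placeholder for any text-generation call, and the prompts are illustrative only:

```python
# Schematic Chain of Verification (CoVe) loop. llm() is a hypothetical
# placeholder for any text-generation call; the prompts are illustrative only.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM completion call here")


def chain_of_verification(question: str) -> str:
    # 1. Produce an initial draft answer.
    draft = llm(f"Answer the question: {question}")

    # 2. Ask the model to plan verification questions about its own draft.
    plan = llm(f"List factual verification questions for this answer:\n{draft}")
    verification_questions = [q.strip() for q in plan.splitlines() if q.strip()]

    # 3. Answer each verification question independently of the draft.
    verified_facts = [llm(q) for q in verification_questions]

    # 4. Revise the draft in light of the verified answers.
    return llm(
        f"Question: {question}\nDraft answer: {draft}\n"
        f"Verified facts: {verified_facts}\n"
        "Rewrite the answer, fixing any inconsistency with the verified facts."
    )
```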
Finally, it is worth highlighting several investigations into LLMs that explore their ability to benefit from explanations when learning new tasks from limited examples. Providing detailed explanations of responses in few-shot tasks can improve model performance. Moreover, experiments have demonstrated that including explanations after the response in prompts enhances few-shot learning in large language models. Another notable approach is explanation tuning, a method that trains smaller language models using step-by-step explanations generated by more advanced models, providing supervision. Experimental results indicate that this approach outperforms conventional instruction tuning in complex zero-shot reasoning tasks.
Understanding the output for detecting and addressing biases
The complexity of algorithms has been one of the main obstacles to understanding the outputs of AI systems. In the context of LLMs, understanding how the LLM has produced a specific output may help to detect possible biases hidden within it. The so-called “chain of thought” is a technique used in language models to make the process of producing an output explicit by encouraging the model to articulate intermediate steps while arriving at an answer. It allows users to trace the logic behind the model's conclusions and identify potential errors or biases. This step-by-step approach may also enhance the model's performance on complex tasks. The greater the number of examples provided in the prompt, the better the response from the LLM (see figure 2).
Additionally, tools like Phoenix and LLMCHECKUP complement these strategies by facilitating observability, evaluation, and experimentation with LLMs:
- Phoenix is an open-source AI observability platform that enables users to experiment with, evaluate, and troubleshoot issues in LLM-based applications. It is also useful for testing applications with different LLMs and for creating customized datasets.
- LLMCHECKUP allows users to interact with any state-of-the-art large language model (LLM) to analyse its behaviour: rather than relying on multiple language models to understand an LLM’s behaviour, LLMCHECKUP uses the same model to understand the user’s intent, facilitating self-explanation of its own functionality. Additionally, it provides users with the flexibility to input customized data, overcoming the limitations of working exclusively with predefined dataset examples. Given that LLMs may occasionally generate inaccurate responses, LLMCHECKUP incorporates the ability to search for information in external knowledge bases through integration with Google Search. This includes generating external links with relevant information for the inputs provided. Users can compare the retrieved information with the explanations offered by the model, achieving a more comprehensive and accurate understanding of its behaviour.
Example of Chain of Thought

Standard Prompting (Few-Shot)
Q: Maria has 10 candies. She buys 3 packs of candies, and each pack contains 5 candies. How many candies does she have now?
A: The answer is 25.
Q: There are 40 students in a class. If 18 of them go on a trip and 5 new students join, how many students are in the class now?
Model Output:
A: The answer is 23.

Chain of Thought Prompting
Q: Maria has 10 candies. She buys 3 packs of candies, and each pack contains 5 candies. How many candies does she have now?
A: Maria starts with 10 candies. Each pack has 5 candies, and there are 3 packs. So, 3 × 5 = 15. Adding the original candies, 10 + 15 = 25. The answer is 25.
Q: There are 40 students in a class. If 18 of them go on a trip and 5 new students join, how many students are in the class now?
Model Output:
A: There were 40 students in the class originally. 18 students left for the trip, so 40 - 18 = 22 students remain. Then, 5 new students joined, so 22 + 5 = 27. The answer is 27. That answer is CORRECT.
Conclusions
The increasing adoption of artificial intelligence across various sectors presents unprecedented opportunities but also significant challenges related to the existence and treatment of biases in the data or in the construction of machine learning models. While these technologies have the potential to positively transform society, their impact will largely depend on how ethical and technical issues are addressed during their development and implementation. The presence of biases in data and algorithms highlights the importance of designing systems that are not only technically proficient but also socially responsible and equitable.
Current advancements in metrics and strategies to mitigate biases are promising, but there is still much work to be done. It is crucial for companies, researchers, and policymakers to collaborate in establishing clear standards and modelling techniques to ensure unbiased decisions when using automated systems. In doing so, the development of tools to identify and correct biases must remain a priority for the community.
In addition, the complexity of modern models, such as large language models, continues to be a critical challenge. The lack of transparency in these systems can lead to mistrust and limit their adoption, particularly in sectors where decisions have a direct impact on people’s lives. Therefore, investing in understanding the process of thought and approaches that foster a better understanding of the results generated by these models is essential for their ethical use and broader acceptance.
Ultimately, the development of artificial intelligence requires an interdisciplinary approach that combines technical, ethical, and social insights. Only through continuous collaboration and a commitment to unbiased decisions and transparency can AI systems respect the values of equity, trust, and responsibility.
The newsletter “The challenge of biases in the construction of Artificial Intelligence systems” is now available for download on the Chair's website in both Spanish and English.