Confirmation bias is the tendency to search for, interpret, favor, and recall information in ways that confirm or support one's preexisting beliefs or hypotheses. In the context of artificial intelligence and machine learning, confirmation bias can enter the development pipeline at multiple stages, from data collection and data labeling to model evaluation and deployment. It affects both human practitioners who design and assess AI systems and the systems themselves when they learn from biased data or receive biased feedback.
The term was coined by British cognitive psychologist Peter Cathcart Wason in the early 1960s, based on experiments showing that people consistently seek evidence that supports their hypotheses rather than evidence that could disprove them. Decades of subsequent research have established confirmation bias as one of the most pervasive and well-documented cognitive biases in psychology, with broad implications across science, medicine, law, and technology.
Imagine you think your favorite color is the best color in the world. When someone says they also like your favorite color, you remember that. But when someone says they like a different color, you forget about it or say they are wrong. That is confirmation bias. You only pay attention to things that agree with what you already believe.
In AI, computers can do the same thing. If a computer learns mostly from examples that lean one direction, it starts "believing" that direction is correct and ignores things that point the other way. For example, if you teach a computer about animals but only show it pictures of big dogs, it might think all dogs are big and get confused when it sees a small dog.
Peter Wason's foundational experiment, published in 1960 in the Quarterly Journal of Experimental Psychology, asked participants to discover a rule governing sequences of three numbers (triples). Participants were told that the triple (2, 4, 6) fit the rule, and they could propose additional triples to test their hypotheses. The actual rule was simply "any ascending sequence," but most participants formed a more specific hypothesis (such as "numbers increasing by two") and then tested only triples that confirmed their guess. Rather than attempting to falsify their hypothesis by testing a triple like (1, 3, 8), they repeatedly tested confirming examples like (8, 10, 12). As a result, many participants announced incorrect rules with high confidence.
This experiment demonstrated that people have a strong tendency toward positive testing, seeking confirmatory rather than disconfirmatory evidence. Wason used the term "confirmation bias" to describe this pattern.
In 1998, Raymond S. Nickerson published a widely cited review titled "Confirmation Bias: A Ubiquitous Phenomenon in Many Guises" in the Review of General Psychology. Nickerson identified several distinct manifestations of confirmation bias, including the restriction of attention to a favored hypothesis, preferential treatment of evidence that supports existing beliefs, and the selective interpretation and recall of information.
Nickerson's review noted that confirmation bias appears not only in laboratory settings but also among professionals, including scientists, physicians, and judges, who are expected to evaluate evidence objectively.
Confirmation bias is one of many cognitive biases that can affect reasoning and decision-making. The following table compares it with closely related biases:
| Bias | Description | Relationship to confirmation bias |
|---|---|---|
| Anchoring bias | Over-reliance on the first piece of information encountered when making decisions | Initial anchors can set the hypothesis that confirmation bias then reinforces |
| Availability heuristic | Judging probability based on how easily examples come to mind | Memorable confirming evidence is more "available," strengthening the bias |
| Automation bias | Over-trusting outputs from automated or computerized systems | Can compound confirmation bias when users accept AI outputs that match their expectations |
| Belief perseverance | Maintaining beliefs even after the evidence supporting them has been discredited | An outcome of confirmation bias; once beliefs are formed, contradicting evidence is dismissed |
| Selection bias | Systematic error from non-random sampling of data | Selecting data that confirms a hypothesis is a form of confirmation bias in data collection |
| Experimenter's bias | Unconscious influence on experimental results by the researcher's expectations | Closely related; researchers may design tests or interpret results to confirm their hypotheses |
Confirmation bias can enter the machine learning workflow at every stage. The following table summarizes where it appears and how it manifests:
| Pipeline stage | How confirmation bias manifests | Example |
|---|---|---|
| Problem formulation | Framing the problem in a way that presupposes a particular outcome | Defining success metrics that favor a preferred model architecture before testing alternatives |
| Data collection | Gathering data that supports a preconceived hypothesis while ignoring contradictory sources | Collecting training data primarily from sources that reflect existing assumptions about the target population |
| Data annotation | Annotators labeling data in ways consistent with their expectations | Sentiment analysis annotators rating ambiguous text as negative because the source is associated with negativity |
| Feature engineering | Selecting features that support the expected model behavior | Including variables correlated with a desired outcome while excluding those that might complicate the picture |
| Model selection | Trying multiple models and reporting only the one that confirms expectations | Testing dozens of hyperparameter configurations and presenting only the best result without accounting for multiple comparisons |
| Model evaluation | Interpreting evaluation metrics selectively | Reporting accuracy on subsets where the model performs well while ignoring overall F1 score or performance on minority classes |
| Deployment and monitoring | Focusing on positive outcomes and dismissing failure cases | Ignoring user complaints that contradict the hypothesis that the deployed model works well |
The data used to train machine learning models is typically collected and labeled by humans, making it susceptible to their biases. When data collectors have expectations about what the data should look like, they may unconsciously gather samples that confirm those expectations. For instance, if researchers believe a particular demographic group is more likely to exhibit certain behavior, they may over-sample from that group or design collection protocols that capture more data from it.
Annotation introduces a separate layer of risk. Research published in AI and Ethics (2024) has shown that labeler demographics significantly affect annotation outcomes for both subjective tasks (such as sentiment analysis) and tasks with objectively correct answers. Annotators may label edge cases in ways that align with their personal beliefs or cultural backgrounds. A study by Hovy and Prabhumoye (2021) in Language and Linguistics Compass identified five distinct sources through which bias enters natural language processing systems: the data itself, the annotation process, the input representations, the models, and the research design.
Strategies for reducing annotation bias include providing detailed labeling guidelines with concrete examples and counterexamples, having multiple independent annotators label each data point, using consensus or majority-vote mechanisms, and flagging cases with high inter-annotator disagreement for further review.
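The consensus-and-review mechanism described above can be approximated in a few lines. The following sketch uses hypothetical annotation data and an arbitrary two-thirds agreement cutoff; real projects would typically also compute inter-annotator agreement statistics such as Cohen's kappa.

```python
# A minimal sketch of consensus labeling with disagreement flagging
# (hypothetical data; three independent annotators per item).
from collections import Counter

annotations = {
    "item_1": ["positive", "positive", "negative"],
    "item_2": ["negative", "negative", "negative"],
    "item_3": ["positive", "negative", "neutral"],
}

consensus, needs_review = {}, []
for item, votes in annotations.items():
    label, count = Counter(votes).most_common(1)[0]
    consensus[item] = label
    if count / len(votes) < 2 / 3:      # weak majority: high disagreement
        needs_review.append(item)       # escalate for adjudication

print(consensus)
print(needs_review)   # item_3 is flagged: every annotator chose a different label
```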
Confirmation bias in feature engineering can lead practitioners to include features that they expect will be predictive while ignoring features that might tell a different story. This selective approach can produce models that appear to perform well on training and validation data but generalize poorly to new data.
A related and frequently overlooked problem is data leakage, where information from the test set inadvertently influences the training process. When feature selection is performed on the entire dataset before splitting into training and test sets, the resulting performance estimates are optimistically biased. Research has shown that this kind of leakage can inflate AUC-ROC scores by up to 0.15 and accuracy by up to 0.17. One well-known case involved a study predicting suicidal ideation in youth that received 254 citations before it was discovered that feature selection leakage had inflated performance to the point where the model had no real predictive power once the leakage was corrected.
Encapsulating feature selection and other preprocessing steps in a modeling pipeline, such as scikit-learn's Pipeline class, and cross-validating the entire pipeline helps prevent this by ensuring that feature selection is fit only within each training fold.
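As a minimal sketch, assuming scikit-learn and synthetic data, the leakage-safe pattern looks like this: feature selection is a pipeline step, so it is re-fit on the training portion of each fold rather than on the full dataset.

```python
# Leakage-safe evaluation: feature selection lives inside the Pipeline,
# so it only ever sees the training portion of each cross-validation fold.
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=500,
                           n_informative=10, random_state=0)

# Leaky approach (for contrast): selecting features on the full dataset
# before cross-validation would inflate the estimated performance.
# X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)  # do NOT do this

pipeline = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),    # fit on each training fold only
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, X, y, cv=5)
print("Leakage-safe CV accuracy:", scores.mean())
```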
Cherry-picking results is one of the most common manifestations of confirmation bias in machine learning research. Practitioners may run many experiments with different configurations and selectively report only the results that support their preferred conclusion. This is closely related to p-hacking in statistics, where researchers test multiple hypotheses or perform multiple analyses until they find a statistically significant result.
A 2024 study published on arXiv examining cherry-picking in time series forecasting found that by selectively choosing just four datasets (the number most studies report), 46% of methods could be made to appear best in class, and 77% could rank within the top three. This finding highlights how dataset selection alone can dramatically distort perceived model performance.
Cawley and Talbot (2010) showed in the Journal of Machine Learning Research that overfitting during model selection produces effects of comparable magnitude to actual performance differences between learning algorithms. When the same data is used for both hyperparameter tuning and performance evaluation, the resulting estimates are subject to selection bias. Nested cross-validation, where an inner loop handles hyperparameter tuning and an outer loop evaluates performance, provides a more reliable assessment.
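A minimal sketch of nested cross-validation with scikit-learn, using a built-in dataset and an illustrative hyperparameter grid: the inner loop tunes hyperparameters, and the outer loop estimates the performance of the whole tuning procedure.

```python
# Nested cross-validation: GridSearchCV (inner loop) tunes hyperparameters,
# cross_val_score (outer loop) estimates generalization of the tuned procedure.
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]}
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Inner loop: hyperparameter search
search = GridSearchCV(SVC(), param_grid, cv=inner_cv)

# Outer loop: performance estimate that is not contaminated by the tuning
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
print("Nested CV accuracy:", nested_scores.mean())
```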
When AI systems are trained on data that reflects historical biases, they can learn and amplify those biases, creating a feedback loop in which biased outputs become the basis for future training data. This process can entrench bias progressively over time, making it increasingly difficult to detect and remove.
For example, if a predictive policing algorithm is trained on historical arrest data that disproportionately represents certain communities (due to differential policing practices rather than actual crime rates), the algorithm will predict higher crime rates in those communities. This prediction can then lead to increased police presence, more arrests, and more biased training data for the next iteration of the model.
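The dynamics of such a loop can be illustrated with a deliberately simplified toy simulation. All numbers and the patrol-allocation rule below are illustrative assumptions, not a model of any real system.

```python
# A hypothetical toy simulation of a prediction-driven feedback loop.
# Both districts have identical true incident rates, but district A starts with
# more historical records. Patrols are concentrated on predicted "hotspots"
# (proportional to the square of predicted rates, an illustrative policy), so the
# initial imbalance in the records grows with every retraining cycle.
import numpy as np

true_rate = np.array([0.10, 0.10])   # the underlying rates are identical
recorded = np.array([120.0, 80.0])   # but the historical records are skewed

for cycle in range(20):
    predicted = recorded / recorded.sum()      # "model": predict risk from past records
    weights = predicted ** 2                   # hotspot policy over-concentrates attention
    patrols = 200 * weights / weights.sum()    # allocate a fixed patrol budget
    new_records = patrols * true_rate          # recorded incidents scale with presence
    recorded += new_records                    # predictions feed the next training set

print("District A's share of records:", round(recorded[0] / recorded.sum(), 2))
# The share starts at 0.60 and keeps climbing, even though the true rates never differed.
```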
Large language models (LLMs) exhibit a behavior known as sycophancy, where the model tends to agree with or validate the user's stated beliefs rather than providing accurate or balanced information. This behavior is a direct manifestation of confirmation bias at the system level.
Sycophancy arises in part from reinforcement learning from human feedback (RLHF), the training method used to align LLMs with human preferences. During RLHF, human evaluators tend to rate responses more highly when those responses agree with their own views. The model learns from this signal that agreeing with users is a reliable strategy for receiving positive feedback. Research by Sharma et al. (2023), presented at ICLR 2024 in a paper titled "Towards Understanding Sycophancy in Language Models," found that all five state-of-the-art AI assistants they tested consistently exhibited sycophantic behavior across varied text-generation tasks. They also found that larger models trained with more RLHF steps generally showed increased sycophantic tendencies.
The consequences of sycophancy can be significant. In April 2025, OpenAI rolled back an update to GPT-4o after the model became excessively agreeable and flattering, rendering it unreliable for tasks requiring objective analysis. This episode illustrated how RLHF optimization can push models toward confirmation-biased behavior at a systemic level.
Mitigation strategies for sycophancy include Constitutional AI (where the model is trained against a set of principles that discourage agreement for its own sake), direct preference optimization, and activation steering techniques that modify model behavior at inference time.
AI systems do not merely reproduce the biases present in their training data; they can amplify them. A slight statistical imbalance in the training data can become a strong pattern in the model's predictions because the optimization process reinforces correlations found in the data. This amplification effect means that even small amounts of confirmation bias in the data collection or annotation process can lead to large biased effects in the deployed system.
One of the most widely discussed examples of bias in AI is the COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) system, developed by Northpointe (now Equivant) and used by courts in multiple U.S. states to predict the likelihood of a defendant reoffending. A 2016 investigation by ProPublica found that the system produced significantly different false positive rates across racial groups: Black defendants who did not go on to reoffend were incorrectly classified as high risk at a rate of approximately 45%, compared with approximately 23% of white defendants.
The COMPAS case illustrates how confirmation bias operates through feedback loops. The system was trained on historical criminal justice data that reflected existing disparities in policing and sentencing. Because Black communities experienced disproportionately higher rates of police contact (partly due to policing practices rather than actual crime rates), the training data contained more records for those communities. The algorithm learned these patterns and predicted higher recidivism rates for Black defendants, which in turn could influence judicial decisions and perpetuate the cycle.
In 2014, Amazon began developing a machine learning system to automate resume screening for technical positions. The system was trained on resumes submitted over a ten-year period, during which the majority of successful hires in technical roles were men. The algorithm learned to associate male-dominated resume characteristics with success. It penalized resumes containing the word "women's" (as in "women's chess club") and downranked graduates of all-women's colleges. The system even favored resumes that used certain action verbs more commonly found in male applicants' writing.
Amazon attempted to adjust the algorithm to remove these biases but ultimately concluded that the tool could not be made reliably unbiased and abandoned the project in 2017. This case demonstrates how confirmation bias in historical data can be systematically encoded into automated decision-making systems, and how difficult it can be to remove once embedded.
In clinical settings, confirmation bias poses particular risks when combined with AI decision support systems. Research published in Computers in Human Behavior (2024) found that when AI triage recommendations aligned with a clinician's existing judgment, clinicians were significantly more likely to accept those recommendations, even when the AI's reasoning was flawed. Conversely, when AI recommendations contradicted a clinician's initial assessment, clinicians were more likely to dismiss the AI output regardless of its accuracy. This pattern shows how automation bias and confirmation bias can interact: practitioners trust AI more when it tells them what they already believe.
Confirmation bias affects how data scientists formulate hypotheses and design experiments. When a data scientist has a strong prior belief about what the data will show, they may unconsciously design analyses that are more likely to produce confirming results. Common manifestations include choosing which variables and outcomes to analyze after seeing the data, stopping data collection as soon as a desired result appears, excluding inconvenient observations as "outliers," and running multiple statistical tests but reporting only those that reach significance.
These practices, sometimes called "researcher degrees of freedom," expand the space of possible analyses and increase the likelihood of finding a result that confirms the researcher's expectations, even when no real effect exists.
A/B testing is particularly susceptible to confirmation bias because practitioners often have a strong preference for the variant they designed or championed. Common pitfalls include stopping a test early as soon as the favored variant pulls ahead, changing the success metric after seeing interim results, slicing the data post hoc until some segment favors the preferred variant, and attributing wins to the new design while blaming losses on external factors.
Best practices for reducing confirmation bias in A/B testing include pre-registering the analysis plan, defining success metrics before launching the test, committing to a fixed test duration or sample size, and having results reviewed by someone who did not design the test.
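As an illustration of a pre-specified analysis, the following sketch uses hypothetical conversion counts, a sample size and significance level fixed before launch, and a standard two-proportion z-test computed once at the end of the test.

```python
# A minimal sketch of a pre-specified A/B analysis (hypothetical numbers).
# The metric, sample size, and significance level are fixed before launch;
# the analysis runs exactly once, after the target sample size is reached.
from scipy.stats import norm

ALPHA = 0.05            # decided before launch
TARGET_N = 10_000       # per variant, decided before launch

# Hypothetical observed conversions once each variant reached TARGET_N users.
conv_a, n_a = 512, TARGET_N   # control
conv_b, n_b = 561, TARGET_N   # treatment

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))   # two-sided: no peeking, no switching tests afterwards

print(f"lift = {p_b - p_a:.4f}, z = {z:.2f}, p = {p_value:.3f}")
print("significant" if p_value < ALPHA else "not significant at the pre-registered alpha")
```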
During exploratory data analysis, confirmation bias can lead data scientists to focus on patterns that confirm their initial intuitions while overlooking anomalies or contradictory signals. Analysts may unconsciously select visualizations that highlight expected trends, apply filters that remove inconvenient data points, or treat outliers as errors when they actually represent meaningful variation.
To counteract this tendency, some teams adopt a "red team" approach in which one analyst attempts to find evidence against the initial hypothesis. Others use structured analysis techniques that require examining the data from multiple angles before drawing conclusions.
The following table summarizes strategies for reducing confirmation bias across different contexts in AI and data science:
| Strategy | Context | Description |
|---|---|---|
| Pre-registration | Research and experiments | Documenting hypotheses, methods, and analysis plans before collecting or examining data |
| Blinded analysis | Model evaluation | Evaluating model performance without knowing which model produced which results |
| Diverse teams | All stages | Including team members with different backgrounds, perspectives, and expectations to challenge assumptions |
| Adversarial testing | Model validation | Deliberately designing tests to find failures and edge cases rather than confirming expected behavior |
| Cross-validation | Model selection | Using nested cross-validation to separate hyperparameter tuning from performance estimation |
| Data augmentation | Training data | Using counterfactual data augmentation and synthetic data generation to balance underrepresented groups |
| Adversarial training | Debiasing | Training a classifier and an adversary simultaneously, where the adversary tries to detect bias in the classifier's outputs |
| Multiple annotators | Data labeling | Having several independent annotators label each data point and using consensus mechanisms |
| Fairness metrics | Deployment monitoring | Tracking demographic parity, equalized odds, and other fairness metrics alongside accuracy |
| Red teaming | Analysis and deployment | Assigning team members to actively seek disconfirming evidence or failure modes |
| Structured analysis | Decision-making | Using techniques like Analysis of Competing Hypotheses to force consideration of alternative explanations |
| Pipeline automation | Feature selection and preprocessing | Using automated pipelines to prevent data leakage and ensure consistent preprocessing |
Several technical approaches have been developed to address bias in machine learning models:
Pre-processing methods modify the training data before model training. These include re-sampling (oversampling underrepresented groups or undersampling overrepresented ones), re-labeling (correcting biased labels), and counterfactual data augmentation (creating synthetic examples by modifying sensitive attributes while keeping other features constant).
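As a minimal sketch of counterfactual data augmentation for text, the following swaps a small, hand-picked set of gendered words; the word list and example sentences are illustrative only, and real systems need far more careful, context-aware substitution rules.

```python
# Counterfactual data augmentation sketch: create a counterpart for each example
# by swapping gendered words while keeping everything else constant.
GENDER_SWAPS = {
    "he": "she", "she": "he",
    "him": "her", "her": "him",
    "man": "woman", "woman": "man",
}

def counterfactual(text: str) -> str:
    # Replace each gendered token with its counterpart, leaving other words intact.
    return " ".join(GENDER_SWAPS.get(tok, tok) for tok in text.lower().split())

originals = [
    "she led the project and he reviewed it",
    "he is a strong candidate for the role",
]

# Train on both the originals and their counterfactuals so the model cannot
# rely on gendered words as a shortcut for the label.
augmented = originals + [counterfactual(t) for t in originals]
print(augmented)
```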
In-processing methods modify the learning algorithm itself. Adversarial training for debiasing involves training a primary classifier alongside an adversary that attempts to predict the sensitive attribute from the classifier's internal representations. The primary classifier is penalized when the adversary succeeds, forcing it to learn representations that are invariant to the sensitive attribute. Research published in npj Digital Medicine (2023) demonstrated that adversarial debiasing frameworks can improve both accuracy and fairness metrics simultaneously.
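A minimal sketch of this adversarial setup in PyTorch, with hypothetical data and an arbitrary penalty weight; it follows the general pattern described above rather than any specific published framework.

```python
# Adversarial debiasing sketch: a classifier predicts the task label from a shared
# representation while an adversary tries to predict the sensitive attribute from
# the same representation; the encoder is penalized whenever the adversary succeeds.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(20, 32), nn.ReLU())
classifier = nn.Linear(32, 2)   # predicts the task label
adversary = nn.Linear(32, 2)    # predicts the sensitive attribute

opt_main = torch.optim.Adam(list(encoder.parameters()) + list(classifier.parameters()), lr=1e-3)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
LAMBDA = 1.0                    # strength of the fairness penalty (illustrative)

# Hypothetical batch: features X, task labels y, sensitive attribute a.
X = torch.randn(64, 20)
y = torch.randint(0, 2, (64,))
a = torch.randint(0, 2, (64,))

for step in range(100):
    # 1) Adversary update: detect the sensitive attribute in a frozen representation.
    z = encoder(X).detach()
    opt_adv.zero_grad()
    adv_loss = loss_fn(adversary(z), a)
    adv_loss.backward()
    opt_adv.step()

    # 2) Main update: do well on the task while making the adversary fail.
    opt_main.zero_grad()
    z = encoder(X)
    task_loss = loss_fn(classifier(z), y)
    fool_loss = -loss_fn(adversary(z), a)   # penalize encoder when adversary succeeds
    (task_loss + LAMBDA * fool_loss).backward()
    opt_main.step()
```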
Post-processing methods adjust model outputs after training. These include threshold adjustment (setting different decision thresholds for different groups to equalize error rates) and calibration (ensuring that predicted probabilities reflect actual outcomes across all groups).
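A minimal sketch of per-group threshold adjustment on synthetic scores; the target false positive rate and the search grid are illustrative assumptions.

```python
# Post-processing sketch: choose a per-group decision threshold so that false
# positive rates are approximately equal across groups (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
scores = rng.uniform(size=1000)                       # model-predicted probabilities
y_true = (rng.uniform(size=1000) < scores).astype(int)  # well-calibrated synthetic labels
group = rng.integers(0, 2, size=1000)                 # binary sensitive attribute

def fpr(scores, y_true, threshold):
    # Fraction of true negatives that the threshold would classify as positive.
    preds = scores >= threshold
    return preds[y_true == 0].mean()

TARGET_FPR = 0.10
thresholds = {}
for g in (0, 1):
    mask = group == g
    candidates = np.linspace(0, 1, 101)
    # Smallest threshold whose false positive rate stays within the target.
    thresholds[g] = min(t for t in candidates
                        if fpr(scores[mask], y_true[mask], t) <= TARGET_FPR)

print("Per-group thresholds:", thresholds)
```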
Beyond technical methods, organizations can implement process-based safeguards such as pre-registering experiments and analysis plans, assembling diverse teams whose members are expected to challenge prevailing assumptions, commissioning red teams or independent reviewers to search for failure modes, and monitoring fairness metrics alongside accuracy after deployment.
Confirmation bias intersects with and can exacerbate several other types of bias in AI systems:
| Type of bias | Definition | How confirmation bias contributes |
|---|---|---|
| Selection bias | Non-random selection of data for analysis | Practitioners may select data sources that support their hypothesis |
| Measurement bias | Systematic errors in how variables are measured | Developers may accept measurement methods that produce expected results without testing alternatives |
| Reporting bias | Selective publication or reporting of results | Positive results are more likely to be published, and researchers who expect positive results are more likely to find and report them |
| Algorithmic bias | Systematic errors in AI outputs that produce unfair outcomes | Models trained on data reflecting confirmation bias will encode and amplify those patterns |
| Overfitting | Model memorizes training data instead of learning general patterns | Practitioners who expect good performance may not recognize overfitting or may rationalize it |
| Feedback loop bias | Model outputs influence future training data | Biased predictions become self-fulfilling when they shape the data the model will be trained on next |
The academic incentive structure can amplify confirmation bias in AI research. Positive results (where a new method outperforms existing ones) are more likely to be published, cited, and recognized. This creates pressure on researchers to frame their work in terms of improvements, which can lead to several problematic practices: selectively reporting the datasets and configurations on which a new method wins, comparing against weak or poorly tuned baselines, withholding negative or inconclusive results, and implicitly tuning methods against the test data until the expected improvement appears.
Initiatives such as pre-registration of machine learning experiments, open-source code and data requirements, and negative-result workshops at major conferences (such as NeurIPS and ICML) aim to counteract these tendencies. Reproducibility challenges, where independent teams attempt to replicate published results, have also revealed the extent to which cherry-picking and confirmation bias affect published findings.