Ground truth refers to verified, correct information that serves as the authoritative reference for training and evaluating machine learning models. In supervised machine learning, ground truth consists of the accurate labels or annotations attached to data points in a training set, representing the ideal output that a model should learn to predict. More formally, ground truth is information that is known to be real or true, provided by direct observation and measurement (empirical evidence) as opposed to information provided by inference.
The term originates from remote sensing and surveying, where "ground truth" described physical measurements taken at the Earth's surface to validate aerial or satellite observations. In modern AI, the concept has become foundational: without reliable ground truth, it is impossible to train accurate models or meaningfully evaluate their performance.
Imagine you are studying for a math test and your teacher gives you a worksheet with all the correct answers on a separate sheet. That answer sheet is the ground truth. You check your work against it to see where you made mistakes. A machine learning model does the same thing: it looks at data, makes a guess, then checks its guess against the ground truth to learn what it got wrong. If the answer sheet has mistakes on it, you would learn the wrong answers. That is why having a correct and trustworthy answer sheet (ground truth) is so important.
The phrase "ground truth" has roots that predate its use in computer science. According to the Oxford English Dictionary, the compound "groundtruth" first appeared in Henry Ellison's 1833 poem "The Tale of a Siberian Exile," where it conveyed the sense of "fundamental truth." The modern technical usage, however, emerged in the 1960s and 1970s within the remote sensing and meteorology communities.
In 1972, NASA formally described ground truth as essential "data about materials on the earth's surface" used to calibrate measurements taken by satellites and aircraft. Scientists would physically visit imaged locations to confirm what satellite pixels represented, establishing the "truth" on the ground. This process of field verification became standard practice for validating remote sensing data, enabling supervised classification of satellite images and correcting for atmospheric distortion.
The U.S. military also adopted the phrase, using "ground truth" to refer to actual facts describing a tactical situation, in contrast to filtered intelligence reports or policy projections.
The statistical modeling and machine learning communities later borrowed the term. By the 2000s, "ground truth" had become standard vocabulary in AI research, referring to the verified labels against which model predictions are compared. Some remote sensing scholars have argued that the term should be retired because it implies a certainty that real-world data rarely provides. Nevertheless, the phrase remains firmly established across multiple disciplines.
The term functions in several grammatical roles:
| Form | Usage | Example |
|---|---|---|
| Noun (unhyphenated) | "ground truth" | "The ground truth labels were verified by three experts." |
| Adjective (hyphenated) | "ground-truth" | "We compared predictions against ground-truth annotations." |
| Verb (hyphenated) | "to ground-truth" | "Field teams ground-truthed the satellite imagery." |
Ground truth is central to the supervised learning pipeline. During training, a model receives input data (such as images, text, or numerical features) along with corresponding ground truth labels. The model generates predictions, compares them to the ground truth using a loss function, and adjusts its internal parameters to reduce the error. This feedback loop repeats over many epochs until the model converges on patterns that generalize to unseen data.
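To make the loop concrete, here is a minimal sketch in PyTorch; the toy model, random batch, and hyperparameters are illustrative stand-ins rather than a reference implementation:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                    # toy two-class classifier
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(32, 10)                # stand-in batch of input features
ground_truth = torch.randint(0, 2, (32,))   # stand-in verified class labels

for epoch in range(10):
    logits = model(inputs)                  # model's predictions
    loss = loss_fn(logits, ground_truth)    # compare predictions to ground truth
    optimizer.zero_grad()
    loss.backward()                         # derive the error signal
    optimizer.step()                        # adjust parameters to reduce the error
```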
During evaluation, ground truth labels in a held-out test set serve as the benchmark for computing metrics like accuracy, precision, recall, and F1 score. Without ground truth, there would be no objective way to measure whether a model's predictions are correct.
| Stage | Role of ground truth | Example |
|---|---|---|
| Training | Provides correct labels for the model to learn from | Labeled images of cats and dogs |
| Validation | Tunes hyperparameters and prevents overfitting | Held-out labeled data used during training |
| Testing | Measures final model performance on unseen data | Test set with verified labels |
| Production monitoring | Detects model drift by comparing predictions to new labels | Periodic human review of live predictions |
It is worth noting that ground truth in machine learning is not always objectively accurate. In some cases, ground truth comprises human judgments or inferences drawn from user behavior. For example, spam filter training labels may be subjective (what one person considers spam, another may not), yet they still function as the ground truth for the system. The key requirement is that the labels are consistently applied and represent the best available approximation of the target variable.
Nearly all standard evaluation metrics in machine learning are computed against ground truth labels. The model's predictions are compared element-by-element against the ground truth, and the resulting counts of correct and incorrect predictions form the basis of a confusion matrix.
| Metric | What it measures | Ground truth requirement |
|---|---|---|
| Accuracy | Proportion of correct predictions out of all predictions | Needs true labels for all test samples |
| Precision | Proportion of true positives among all positive predictions | Needs true labels to identify false positives |
| Recall | Proportion of true positives among all actual positives | Needs true labels to identify false negatives |
| F1 score | Harmonic mean of precision and recall | Requires both precision and recall |
| AUC-ROC | Area under the receiver operating characteristic curve | Needs true binary labels at varying thresholds |
| Intersection over Union (IoU) | Overlap between predicted and true regions | Needs ground truth segmentation masks or bounding boxes |
| Mean Average Precision (mAP) | Average precision across multiple classes or IoU thresholds | Needs ground truth bounding boxes with class labels |
In tasks where ground truth is unavailable at inference time, proxy metrics or human evaluation may be used instead, but these are generally considered less reliable than metrics grounded in verified labels.
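For instance, the confusion-matrix-based metrics above can be computed with scikit-learn once ground truth labels are available; the label arrays here are invented for illustration:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # ground truth labels (illustrative)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions

print(confusion_matrix(y_true, y_pred))   # TN/FP on row 0, FN/TP on row 1
print(f"accuracy:  {accuracy_score(y_true, y_pred):.2f}")    # 0.75
print(f"precision: {precision_score(y_true, y_pred):.2f}")   # 0.75
print(f"recall:    {recall_score(y_true, y_pred):.2f}")      # 0.75
print(f"F1 score:  {f1_score(y_true, y_pred):.2f}")          # 0.75
```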
Creating ground truth data typically requires human annotation: the process of reviewing raw data and assigning the correct labels or tags. Depending on the domain, annotation can involve categorizing text, drawing bounding boxes around objects in images, transcribing audio, or marking pixel-level segmentation masks.
| Method | Description | Typical use case |
|---|---|---|
| Manual expert annotation | Domain specialists label data with high precision | Medical imaging, legal document review |
| Crowdsourcing | Large pools of non-expert workers label data at scale | Image classification, sentiment analysis |
| Programmatic labeling | Heuristic functions generate labels automatically | Text classification with keyword rules |
| Semi-automated labeling | Model pre-labels data, humans correct mistakes | Object detection pre-annotation |
| Active learning | Model selects uncertain samples for human review | Efficient labeling under budget constraints |
| Synthetic data generation | Simulated environments produce data with automatic labels | Autonomous driving simulation, robotics |
Manual expert annotation produces the highest quality ground truth but is expensive and slow. A single medical image may require a radiologist to spend several minutes identifying and outlining abnormalities. In contrast, crowdsourcing platforms can label thousands of images per hour, though the quality of individual annotations varies. Semantic segmentation annotation (outlining every region of interest with a polygon) takes approximately 15 times longer than standard bounding box annotation.
A growing ecosystem of tools supports ground truth creation at scale:
| Platform | Key capabilities |
|---|---|
| Amazon SageMaker Ground Truth | Managed labeling service integrated with Amazon Mechanical Turk; supports automated labeling via active learning |
| Labelbox | Team collaboration workflows with support for multiple annotation formats |
| Scale AI | Handles computer vision, NLP, and audio data labeling |
| Label Studio | Open-source annotation tool with support for inter-annotator agreement metrics |
| CVAT | Open-source tool optimized for video and image annotation |
| Prodigy | Annotation tool with built-in active learning for NLP tasks |
In the annotation community, a gold standard dataset refers to a corpus labeled by expert human annotators with rigorous quality controls. Gold standard data is considered the most reliable form of ground truth and is commonly used as the definitive benchmark for evaluating models.
A silver standard dataset, by contrast, is generated by automated systems or by combining the predictions of multiple models rather than by expert human annotators. Multiple pre-trained models may each label the same data, and their outputs are merged using ensemble or voting strategies. The key advantage is that silver standard datasets can be produced much faster and at a fraction of the cost. However, research has consistently shown that quality does not simply scale with quantity: a smaller gold standard dataset frequently outperforms a larger silver standard one for model training.
| Attribute | Gold standard | Silver standard |
|---|---|---|
| Annotators | Expert humans | Automated systems or model ensembles |
| Quality | High (verified and adjudicated) | Moderate (may contain systematic errors) |
| Cost | Expensive | Low |
| Scale | Typically small | Can be very large |
| Common use | Final evaluation benchmarks | Pre-training or supplementary training data |
The reliability of ground truth depends on how consistently annotators agree on the correct labels. Inter-annotator agreement (IAA) measures this consistency and is a critical quality metric for any labeled dataset. High levels of inter-annotator agreement help reduce noise and subjectivity in model training and evaluation. When human labels diverge significantly, model training becomes unstable and evaluation results become unreliable.
Cohen's kappa is the most widely used IAA metric for two annotators. It measures agreement while correcting for agreement that would occur by chance. The formula is:
kappa = (P_o - P_e) / (1 - P_e)
where P_o is the observed proportion of agreement and P_e is the proportion of agreement expected by random chance. Cohen's kappa ranges from -1 to +1.
| Kappa value | Interpretation |
|---|---|
| 0.81 to 1.00 | Almost perfect agreement |
| 0.61 to 0.80 | Substantial agreement |
| 0.41 to 0.60 | Moderate agreement |
| 0.21 to 0.40 | Fair agreement |
| 0.00 to 0.20 | Slight agreement |
| Below 0.00 | Less than chance agreement |
A common pitfall is relying on simple percentage agreement, which can be misleading. For example, two annotators can agree on 90% of items yet produce a Cohen's kappa below 0.10 when nearly all items belong to a single class, revealing that the apparent agreement is driven by class imbalance rather than genuine consensus.
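The effect is easy to reproduce with scikit-learn; the two annotator label lists below are constructed so that raw agreement is 90% while nearly every item belongs to one class:

```python
from sklearn.metrics import cohen_kappa_score

# 100 items: 90 where both raters say "spam", 5 where only rater A does,
# and 5 where only rater B does. No item is labeled "ham" by both.
rater_a = ["spam"] * 90 + ["spam"] * 5 + ["ham"] * 5
rater_b = ["spam"] * 90 + ["ham"] * 5 + ["spam"] * 5

agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(f"raw agreement: {agreement:.2f}")                            # 0.90
print(f"Cohen's kappa: {cohen_kappa_score(rater_a, rater_b):.3f}")  # ~ -0.05
```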
Fleiss' kappa extends this framework to any number of raters. Unlike Cohen's kappa, which assumes the same two raters have rated every item, Fleiss' kappa allows different items to be rated by different individuals, as long as a fixed number of raters label each item.
Krippendorff's alpha is the most general IAA metric. It accommodates missing data, varying numbers of raters per item, and different measurement scales (nominal, ordinal, interval, and ratio) through configurable distance functions. It is defined as 1 minus the ratio of observed disagreement to expected disagreement. In most practical settings, a coefficient of 0.80 or higher is considered reliable, although the exact threshold depends on the requirements of a particular project.
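A minimal sketch using the third-party `krippendorff` Python package (the ratings matrix is invented; `np.nan` marks items a rater skipped):

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Rows are raters, columns are items; np.nan marks missing ratings.
ratings = np.array([
    [1, 2, 3, 3, 2, 1, 4, 1, 2, np.nan],
    [1, 2, 3, 3, 2, 2, 4, 1, 2, 5],
    [np.nan, 3, 3, 3, 2, 3, 4, 2, 2, 5],
])
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```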
| Metric | Number of raters | Data types supported | Handles missing data |
|---|---|---|---|
| Cohen's kappa | 2 | Nominal | No |
| Fleiss' kappa | 2 or more | Nominal | No |
| Krippendorff's alpha | 2 or more | Nominal, ordinal, interval, ratio | Yes |
When multiple annotators label the same data, the challenge becomes how to combine their annotations into a single ground truth. Simple majority voting is the most common approach, but more sophisticated methods exist.
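A minimal sketch of majority voting (item IDs and labels are invented; ties fall to whichever label was seen first):

```python
from collections import Counter

def majority_vote(annotations):
    """Collapse each item's annotator labels into one label by majority vote."""
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in annotations.items()}

labels = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "dog", "dog"],
}
print(majority_vote(labels))  # {'img_001': 'cat', 'img_002': 'dog'}
```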
The Dawid-Skene model (1979) uses the Expectation-Maximization (EM) algorithm to simultaneously estimate the true labels and each annotator's confusion matrix (their pattern of errors). This allows the system to weight reliable annotators more heavily than unreliable ones.
The STAPLE algorithm (Simultaneous Truth and Performance Level Estimation), developed by Warfield et al. (2004), applies a similar EM-based approach specifically to image segmentation. It takes a collection of segmentations and computes a probabilistic estimate of the true segmentation while simultaneously measuring each rater's performance level. STAPLE has become a standard tool for establishing ground truth in medical image analysis, where multiple radiologists may produce different segmentation boundaries for the same structure.
In practice, ground truth data often contains errors, a problem known as label noise. Sources of label noise include annotator mistakes, ambiguous examples that lack a clear correct answer, inconsistent labeling guidelines, and subjective tasks where reasonable people disagree.
Label noise is considered more damaging than feature noise because the ground truth provides the unique supervisory signal for each training example. Research has shown that deep learning models, particularly neural networks with high capacity, can memorize corrupted labels during training. This memorization degrades the model's ability to generalize to new, unseen data.
Label noise is generally categorized into three types:
| Type | Description | Example |
|---|---|---|
| Uniform (symmetric) noise | Each label has an equal probability of being flipped to any other class | Random annotation errors distributed evenly across classes |
| Class-conditional (asymmetric) noise | Mislabeling probability depends on the true class | "Cat" images more likely to be mislabeled as "dog" than as "airplane" |
| Instance-dependent noise | Mislabeling probability depends on the specific data point | Ambiguous images near decision boundaries are more likely to be mislabeled |
| Noise level | Typical effect on model |
|---|---|
| Less than 5% | Minimal impact; most models remain robust |
| 5% to 15% | Noticeable drop in accuracy; increased overfitting risk |
| 15% to 30% | Significant performance degradation; model may learn incorrect patterns |
| Above 30% | Training becomes unreliable; model predictions approach random chance |
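Researchers often quantify these effects by injecting synthetic noise into clean labels at controlled rates. A minimal sketch for the uniform (symmetric) case, using invented class labels:

```python
import numpy as np

def add_symmetric_noise(y, noise_rate, num_classes, seed=None):
    """Flip each label to a uniformly random *other* class with prob. noise_rate."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y).copy()
    flip = rng.random(len(y)) < noise_rate
    # Offsets in 1..num_classes-1 guarantee the flipped label differs.
    offsets = rng.integers(1, num_classes, size=flip.sum())
    y[flip] = (y[flip] + offsets) % num_classes
    return y

clean = np.array([0, 1, 2, 2, 1, 0, 0, 2])
noisy = add_symmetric_noise(clean, noise_rate=0.25, num_classes=3, seed=42)
print((clean != noisy).mean())  # observed flip rate, close to noise_rate
```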
Quantitative studies on the COCO dataset have shown that using the highest level of annotation noise resulted in mAP degradation of 0.185 for object detection and 0.208 for instance segmentation, demonstrating the concrete impact of ground truth quality on model performance.
Several techniques have been developed to handle noisy ground truth, including robust loss functions, loss correction based on an estimated noise transition matrix, sample selection methods such as co-teaching (in which two networks exchange their confidently learned examples), and automated label-error detection methods such as confident learning.
In computer vision, ground truth takes many forms depending on the task. For image classification, it is a categorical label (e.g., "cat" or "dog"). For object detection, it includes bounding box coordinates and class labels. For image segmentation, every pixel in an image receives a class assignment. For instance segmentation, individual object instances are distinguished even within the same class.
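As an example of how detection predictions are scored against such labels, the sketch below computes IoU between a predicted box and a ground-truth box, assuming the [x, y, width, height] convention used by COCO; the coordinates are invented:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as [x, y, width, height]."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2 = ax1 + aw, ay1 + ah
    bx2, by2 = bx1 + bw, by1 + bh
    # Intersection rectangle; zero area if the boxes do not overlap.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

ground_truth = [100, 100, 50, 80]   # annotated box
prediction   = [110, 105, 50, 80]   # model output
print(f"IoU: {iou(ground_truth, prediction):.2f}")  # 0.60
```

Many detection benchmarks count a prediction as correct only when its IoU with a ground-truth box exceeds a threshold such as 0.5.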
Creating pixel-level segmentation ground truth is particularly labor-intensive. Annotating a single image in the Cityscapes dataset (used for autonomous driving research) takes approximately 1.5 hours on average.
Several benchmark datasets have become standard references for evaluating computer vision models, and each required enormous ground truth annotation efforts:
| Dataset | Year | Images | Object categories | Annotation type | Annotation effort |
|---|---|---|---|---|---|
| PASCAL VOC | 2005-2012 | ~11,000 | 20 | Bounding boxes, segmentation masks | Multi-year community effort |
| ImageNet (ILSVRC) | 2009-2017 | ~1.4 million | 1,000 | Image-level labels, bounding boxes | Extensive use of Amazon Mechanical Turk |
| MS COCO | 2014-present | ~330,000 | 80 | Instance segmentation, captions | Over 70,000 worker hours |
| KITTI | 2012-present | ~15,000 | 8 | 3D bounding boxes, depth maps | Velodyne LiDAR + GPS ground truth |
| Cityscapes | 2016-present | ~25,000 | 30 | Pixel-level semantic segmentation | ~1.5 hours per image |
In natural language processing, ground truth labels vary by task. Sentiment analysis requires polarity labels (positive, negative, neutral). Named entity recognition demands token-level annotations marking person names, organizations, locations, and other entities. Machine translation uses human-translated reference sentences. Question answering datasets provide verified answer spans within passages.
Subjectivity is a particular challenge in NLP ground truth. Tasks like hate speech detection or sarcasm identification often yield low inter-annotator agreement because reasonable annotators can interpret the same text differently based on cultural context or personal perspective. Research has shown that in sentiment analysis, different annotators frequently disagree on the polarity of borderline cases, leading to inconsistencies in the ground truth that propagate into model behavior.
Medical ground truth typically relies on the diagnosis made by experienced physicians, confirmed through biopsy, pathology reports, or other definitive diagnostic procedures. In medical imaging, a common approach involves having multiple radiologists independently annotate the same scan, then using consensus, adjudication, or algorithms like STAPLE to establish the ground truth.
The stakes are especially high in medical applications. If ground truth labels are inaccurate, a model trained on that data could misdiagnose patients, leading to delayed treatment or unnecessary procedures. Medical datasets also suffer from high inter-observer and intra-observer variability, making rigorous annotation protocols essential. Studies on medical image segmentation have shown that inter-expert variability is one of the primary sources of label noise in clinical datasets.
Autonomous driving systems require ground truth for 3D object detection, lane marking, traffic sign recognition, and pedestrian tracking. The primary sensor for generating 3D ground truth is LiDAR (Light Detection and Ranging), which produces detailed point cloud representations of the environment. Human annotators then label objects within these point clouds, drawing 3D bounding boxes around vehicles, pedestrians, cyclists, and other road users.
Annotating LiDAR data is extremely time-consuming due to the sparse and low-resolution nature of point clouds. Objects at a distance may be represented by only a handful of points, making it difficult for annotators to determine object boundaries, class, and orientation. Semi-automated tools have reduced manual annotation time by roughly 50% in some workflows, and automation ratios of up to 70% have been reported for 3D sensor data. Multi-sensor fusion annotation, where LiDAR data is labeled alongside synchronized camera imagery and radar returns, gives annotators additional context that improves labeling accuracy. Despite these improvements, human review remains necessary to ensure safety-critical ground truth quality.
The rise of large language models has introduced new forms of ground truth. In reinforcement learning from human feedback (RLHF), the ground truth takes the form of human preference data: annotators compare pairs of model outputs and indicate which response is better. These preference labels are used to train a reward model, which in turn guides the language model toward producing outputs aligned with human values.
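The reward model is commonly fit with a pairwise (Bradley-Terry style) objective over these preference labels. A minimal PyTorch sketch, where the two tensors stand in for the reward model's scalar scores on the chosen and rejected responses of each pair:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: push the preferred response's reward above the rejected one's."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Stand-in reward scores for a batch of four preference pairs.
chosen   = torch.tensor([1.2, 0.3, 0.8, 2.0])
rejected = torch.tensor([0.5, 0.9, 0.1, 1.5])
print(preference_loss(chosen, rejected))  # shrinks as chosen outscores rejected
```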
A distinctive challenge in this setting is that human annotators often disagree on which output is "better," introducing substantial variance into the preference data. Unlike classification tasks where a definitive correct answer often exists, preference judgments are inherently subjective. Research on human-AI hybrid approaches, such as RLTHF (Reinforcement Learning from Targeted Human Feedback), has shown that combining LLM-based initial alignment with selective human annotation on difficult samples can reach full-human annotation-level alignment with only 6 to 7 percent of the human annotation effort.
Crowdsourcing platforms like Amazon Mechanical Turk (MTurk) and Amazon SageMaker Ground Truth have become widely used for generating labeled datasets at scale. MTurk provides access to a distributed workforce of hundreds of thousands of workers who can perform tasks such as image labeling, text categorization, and audio transcription.
Crowdsourced ground truth requires careful quality control because individual workers may lack domain expertise. Common strategies include qualification tests that workers must pass before joining a task, embedded "gold" questions with known answers to monitor ongoing accuracy, redundant labeling in which several workers annotate each item and their responses are aggregated, and tracking of per-worker agreement rates over time.
Research has shown that models trained on expert-labeled ground truth outperform those trained on crowd-labeled data by at least 8% across standard performance metrics. Amazon SageMaker Ground Truth addresses this gap by incorporating automated labeling: after enough human labels are collected, the system trains an internal model to pre-label remaining data, reducing the need for human review by up to 70% and lowering costs accordingly.
The MS COCO dataset provides an instructive example of crowdsourcing at scale. The COCO team compared precision and recall of seven expert workers with the results obtained by taking the union of one to ten AMT workers. Ground truth was computed using majority vote of the experts, and the study demonstrated that aggregating annotations from multiple non-expert workers can approach expert-level quality when sufficient redundancy is employed.
Traditional ground truth creation through manual annotation does not scale well for the massive datasets that modern deep learning models require. Programmatic labeling offers an alternative by using code to generate labels automatically.
Snorkel, developed at Stanford University, pioneered the approach of weak supervision for ground truth generation. Instead of labeling individual data points by hand, users write labeling functions: small programs that encode heuristics, patterns, or external knowledge bases to assign noisy labels. For example, a labeling function for sentiment analysis might label any review containing the word "excellent" as positive.
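Such a heuristic can be expressed directly as a Snorkel labeling function; a minimal sketch, assuming data points expose an illustrative `.text` field (as they would when applying labeling functions over a pandas DataFrame with a `text` column):

```python
from snorkel.labeling import labeling_function

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1  # Snorkel's convention: -1 means "no vote"

@labeling_function()
def lf_contains_excellent(review):
    # Noisy heuristic: reviews that mention "excellent" are usually positive.
    # Abstain (rather than guess) when the keyword is absent.
    return POSITIVE if "excellent" in review.text.lower() else ABSTAIN
```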
Snorkel then applies a generative model to learn the accuracies and correlations of these labeling functions without access to ground truth. The system resolves conflicts between labeling functions and outputs probabilistic labels that can train a downstream model. In user studies, subject matter experts using Snorkel built models 2.8 times faster and achieved 45.5% higher predictive performance compared to seven hours of manual labeling.
The underlying paradigm, called data programming, models multiple label sources without access to ground truth and generates probabilistic training labels that capture the lineage of each individual label. A notable finding is that source accuracy and correlation structure can be recovered without any hand-labeled training data, using only the agreements and disagreements among labeling functions.
While powerful, weak supervision does not eliminate the need for labeled data entirely. A small development set with ground truth labels is typically needed to design and validate labeling functions, and a gold standard test set remains essential for final evaluation.
Beyond weak supervision, several other paradigms reduce or eliminate the need for manually annotated ground truth:
Semi-supervised learning uses a small amount of labeled data alongside a large pool of unlabeled data. Techniques like self-training have the model generate pseudo-labels for unlabeled examples: the model is first trained on the labeled subset, then makes predictions on unlabeled data, and predictions above a confidence threshold are accepted as pseudo-labels. The model is retrained on the combined original labels and pseudo-labels, iterating until performance stabilizes. One of the biggest challenges is that if the initial model is poorly trained, incorrect pseudo-labels can accumulate, leading to degraded performance.
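A minimal self-training sketch with scikit-learn (the dataset is synthetic and the 0.95 confidence threshold is an illustrative choice):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.95, max_rounds=5):
    """Iteratively pseudo-label confident unlabeled points and retrain."""
    model = LogisticRegression(max_iter=1000)
    for _ in range(max_rounds):
        model.fit(X_lab, y_lab)
        if len(X_unlab) == 0:
            break
        probs = model.predict_proba(X_unlab)
        confident = probs.max(axis=1) >= threshold
        if not confident.any():
            break                        # nothing confident enough: stop early
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, probs[confident].argmax(axis=1)])
        X_unlab = X_unlab[~confident]    # shrink the unlabeled pool
    return model

X, y = make_classification(n_samples=500, random_state=0)
model = self_train(X[:50], y[:50], X[50:])   # 50 labeled, 450 unlabeled points
```

scikit-learn ships a comparable built-in, `sklearn.semi_supervised.SelfTrainingClassifier`, for production use.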
Self-supervised learning derives the supervisory signal from the underlying structure of unlabeled data itself, without any manual annotation. Tasks like masked language modeling (predicting masked words in a sentence) or contrastive learning (learning to distinguish between augmented views of the same image) allow models to learn meaningful representations. After pretraining with self-supervision, models can be fine-tuned on a small amount of labeled data and often achieve performance comparable to fully supervised models trained on much larger labeled datasets.
Synthetic data provides a cost-effective alternative to manual annotation by generating training examples programmatically, often using simulated environments, game engines, or generative models. The ground truth labels are produced automatically as part of the generation process. For example, a driving simulator can render street scenes with perfect bounding box and segmentation labels for every object.
Gartner has forecast that by 2030, synthetic data will be more widely used for AI training than real-world datasets. However, models trained on synthetic data still require benchmarking against real-world ground truth to ensure they generalize beyond the simulated domain. Research also shows that training exclusively on synthetic data from previous model generations can cause model collapse, but mixing synthetic data with real data can maintain or even improve performance.
Active learning minimizes the amount of labeled data needed by having the model strategically select which samples to send for human annotation. Instead of labeling data randomly, the model identifies the most informative or uncertain examples and requests labels only for those. Common query strategies include uncertainty sampling (selecting examples the model is least confident about), diversity sampling (selecting examples that are most different from already labeled data), and query-by-committee (selecting examples where an ensemble of models disagrees most). Research has shown that active learning can achieve comparable model performance with 10 to 100 times fewer labeled examples than random sampling.
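A minimal sketch of uncertainty sampling, assuming any scikit-learn-style classifier with `predict_proba`:

```python
import numpy as np

def uncertainty_sample(model, X_pool, batch_size=10):
    """Return indices of the pool examples the model is least confident about."""
    probs = model.predict_proba(X_pool)
    confidence = probs.max(axis=1)                # top-class probability per example
    return np.argsort(confidence)[:batch_size]   # lowest-confidence items first
```

The returned indices identify which raw examples to route to human annotators in the next labeling round.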
Ground truth data can encode and perpetuate biases, leading to models that produce unfair or discriminatory predictions. Bias in ground truth arises from several sources, including unrepresentative sampling of the underlying data, the cultural and personal perspectives of annotators, ambiguous or exclusionary labeling guidelines, and historical bias baked into labels derived from past human decisions.
Fairness metrics such as Demographic Parity and Equalized Odds explicitly incorporate ground truth when measuring model fairness, comparing predicted labels against ground truth labels across different demographic groups. Addressing ground truth bias requires diverse annotator pools, clear and inclusive annotation guidelines, and transparency about the limitations of the ground truth.
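A minimal sketch of both metrics over two groups; the arrays are invented, and note that only equalized odds requires ground truth labels:

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """Gap in positive-prediction rates between groups 0 and 1 (no ground truth needed)."""
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def equalized_odds_gap(y_true, y_pred, group):
    """Largest gap in error rates across groups, measured against ground truth:
    false positive rates where y_true == 0, true positive rates where y_true == 1."""
    gaps = []
    for label in (0, 1):
        mask = y_true == label
        rate_0 = y_pred[mask & (group == 0)].mean()
        rate_1 = y_pred[mask & (group == 1)].mean()
        gaps.append(abs(rate_0 - rate_1))
    return max(gaps)

y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 1])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(demographic_parity_gap(y_pred, group))      # 0.25
print(equalized_odds_gap(y_true, y_pred, group))  # 0.50 (true-positive-rate gap)
```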
The cost of creating high-quality ground truth is a major bottleneck in AI development. Data preparation, including collection, organization, and labeling, can consume up to 80% of an AI project's total time. The global data labeling market was valued at approximately $18.6 billion in 2024 and is projected to reach $57.6 billion by 2030, reflecting the enormous and growing demand for labeled data.
Leading AI companies such as OpenAI, Google, Meta, and Anthropic each spend on the order of $1 billion per year on human-provided training data. Costs vary dramatically by domain and task complexity:
| Task | Approximate cost per label |
|---|---|
| Simple image classification | $0.01 to $0.05 |
| Bounding box annotation | $0.05 to $0.50 |
| Pixel-level segmentation | $1.00 to $10.00 |
| Medical image annotation (expert) | $10.00 to $100.00+ |
| LiDAR 3D point cloud labeling | $5.00 to $50.00 per frame |
| NLP token-level annotation (NER) | $0.02 to $0.20 per token |
| RLHF preference comparison | $0.50 to $5.00 per pair |
These costs have driven interest in techniques that reduce the labeling burden, including active learning, semi-supervised learning, transfer learning, data augmentation, synthetic data generation, and the programmatic labeling approaches discussed above.
Establishing reliable ground truth requires a systematic approach. The following practices are widely recommended in both academic and industry settings: