See also: Machine learning terms
In machine learning, a label is the target output value associated with a single training example. During supervised machine learning, a model learns to map input features to labels so that it can predict labels for new, unseen data. Labels are also referred to as targets, response variables, dependent variables, ground truth values, or annotations, depending on the field and context.
The process of assigning labels to raw data points is called labeling (or annotation), and it is one of the most time-consuming and expensive steps in building a machine learning system. A dataset in which every example has been assigned a label is called a labeled dataset, while data that lacks labels is referred to as unlabeled data.
In a classification model, labels are discrete categories (also called class labels or classes). Each training example belongs to exactly one class in standard single-label classification. Common examples include:
| Task | Possible class labels | Number of classes |
|---|---|---|
| Email spam detection | Spam, Not Spam | 2 (binary) |
| Handwritten digit recognition (MNIST) | 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 | 10 |
| ImageNet image classification | Dog, Cat, Car, Airplane, ... | 1,000 |
| Sentiment analysis | Positive, Negative, Neutral | 3 |
In binary classification there are only two possible labels (often encoded as 0 and 1). In multiclass classification there are three or more mutually exclusive labels. The model outputs a predicted probability distribution over all classes, and the class with the highest probability is selected as the prediction.
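As a minimal sketch of that final step, here is how a predicted probability distribution is turned into a class label; the class names and probabilities below are invented for illustration:

```python
import numpy as np

# Hypothetical sentiment classes and one example's predicted probability distribution
classes = ["Positive", "Negative", "Neutral"]
predicted_probs = np.array([0.72, 0.08, 0.20])  # sums to 1.0

# The predicted label is the class with the highest predicted probability
predicted_label = classes[int(np.argmax(predicted_probs))]
print(predicted_label)  # -> "Positive"
```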
In a regression model, labels are continuous numerical values rather than discrete categories. The model learns to predict a real number that minimizes the error between the predicted value and the actual label. Examples include predicting house prices, stock returns, temperature forecasts, and patient blood pressure readings. Because the label space is continuous, evaluation metrics such as mean squared error (MSE) and mean absolute error (MAE) are used instead of accuracy.
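A short sketch of how these two metrics compare predictions against continuous labels; the numbers are illustrative (house prices in thousands of dollars):

```python
import numpy as np

# Hypothetical continuous labels and model predictions
y_true = np.array([250.0, 310.0, 480.0, 195.0])
y_pred = np.array([260.0, 295.0, 500.0, 210.0])

mse = np.mean((y_true - y_pred) ** 2)   # mean squared error
mae = np.mean(np.abs(y_true - y_pred))  # mean absolute error
print(mse, mae)
```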
Multi-label classification is a generalization of standard classification in which each example can be assigned multiple non-exclusive labels simultaneously. For instance, a news article might be tagged with both "Politics" and "Economics," or a photograph might contain both a dog and a car. Formally, the model maps each input to a binary vector where each element indicates the presence (1) or absence (0) of a particular label.
Multi-label problems are commonly solved using neural networks with a sigmoid activation function on each output node and binary cross-entropy loss. Other approaches include problem transformation methods (such as Binary Relevance, which trains one binary classifier per label) and algorithm adaptation methods that modify existing algorithms to handle multiple outputs directly.
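A minimal sketch of this independent-sigmoid formulation, written in plain NumPy rather than any particular deep learning framework; the logits and label vector are invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw model outputs (logits) for 4 possible tags on one example
logits = np.array([2.1, -0.5, 0.3, -3.0])
# Multi-label target vector: tags 0 and 2 are present, tags 1 and 3 are absent
y = np.array([1.0, 0.0, 1.0, 0.0])

p = sigmoid(logits)  # independent per-label probabilities

# Binary cross-entropy averaged over the labels
eps = 1e-12
bce = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Predicted label set: every tag whose probability exceeds 0.5
predicted_tags = (p > 0.5).astype(int)
print(predicted_tags, bce)
```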
Before labels can be used by most machine learning algorithms, they must be converted into a numerical representation. The two most common encoding schemes are integer encoding and one-hot encoding.
| Encoding method | Representation | Best suited for | Example (three classes: Cat, Dog, Bird) |
|---|---|---|---|
| Integer (ordinal) encoding | Single integer per class | Ordinal data or tree-based models | Cat = 0, Dog = 1, Bird = 2 |
| One-hot encoding | Binary vector with one 1 | Nominal data, neural networks, logistic regression | Cat = [1,0,0], Dog = [0,1,0], Bird = [0,0,1] |
Integer encoding assigns each category a unique integer. This is compact but can mislead models that interpret numerical proximity as similarity (for example, a linear model might assume Dog is "between" Cat and Bird). One-hot encoding avoids this problem by representing each class as a separate binary column, but it increases the dimensionality of the feature space. The choice between the two depends on the algorithm and the nature of the data.
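A small sketch of both encodings for the three-class example in the table, using plain NumPy; in practice, scikit-learn's LabelEncoder and OneHotEncoder provide equivalent transformations:

```python
import numpy as np

labels = ["Cat", "Dog", "Bird", "Dog", "Cat"]

# Integer (ordinal) encoding: map each class name to a unique integer
class_to_int = {"Cat": 0, "Dog": 1, "Bird": 2}
int_encoded = np.array([class_to_int[c] for c in labels])

# One-hot encoding: a binary vector with a single 1 per example
num_classes = len(class_to_int)
one_hot = np.eye(num_classes, dtype=int)[int_encoded]

print(int_encoded)  # integer code per example
print(one_hot)      # one row per example, one column per class
```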
Label smoothing is a regularization technique introduced by Szegedy et al. (2016) that replaces hard one-hot target vectors with softened versions. Instead of assigning a probability of 1.0 to the correct class and 0.0 to all others, label smoothing redistributes a small amount of probability mass uniformly across all classes. For a smoothing parameter alpha (typically 0.1), the target for the correct class becomes (1 - alpha) + alpha / K and each incorrect class receives alpha / K, where K is the total number of classes, so the smoothed targets still sum to 1.
This prevents the model from becoming overconfident in its predictions and has been shown to improve generalization across image classification, machine translation, and language modeling tasks. Label smoothing has become a standard component in modern training recipes, particularly on large-scale benchmarks like ImageNet.
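A minimal sketch of the smoothing operation described above, applied to a one-hot target with K = 5 classes and alpha = 0.1:

```python
import numpy as np

def smooth_labels(one_hot, alpha=0.1):
    """Blend a one-hot target with the uniform distribution over K classes."""
    K = one_hot.shape[-1]
    return (1.0 - alpha) * one_hot + alpha / K

# Hard one-hot target for class 2 out of K = 5 classes
hard = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
soft = smooth_labels(hard, alpha=0.1)
print(soft)  # [0.02 0.02 0.92 0.02 0.02] -- still sums to 1
```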
Pseudo-labeling is a semi-supervised learning technique introduced by Lee (2013) that leverages unlabeled data by assigning it artificial labels generated by the model itself. The process works as follows:

1. Train an initial model on the available labeled data.
2. Use the trained model to predict labels for the unlabeled data.
3. Treat the predicted class for each unlabeled example (the class with the highest predicted probability, often filtered by a confidence threshold) as its pseudo-label.
4. Retrain the model on the labeled and pseudo-labeled examples together, typically down-weighting the loss on the pseudo-labeled portion, and repeat as the model improves.
Pseudo-labeling is equivalent to entropy regularization, which encourages the model to make confident predictions and favors a low-density separation between classes. Modern semi-supervised methods such as FixMatch and MixMatch build on the pseudo-label concept by adding consistency regularization and advanced data augmentation. The technique is especially valuable when labeled data is scarce but unlabeled data is abundant.
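A simplified sketch of one round of this loop, using a scikit-learn-style classifier with a confidence threshold; the threshold value and model choice are illustrative and are not part of Lee's original formulation, which uses the raw argmax together with a schedule on the unlabeled loss weight:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pseudo_label_round(X_labeled, y_labeled, X_unlabeled, threshold=0.95):
    """One round of pseudo-labeling: train, predict, keep confident predictions, retrain."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_labeled, y_labeled)

    probs = model.predict_proba(X_unlabeled)
    confidence = probs.max(axis=1)
    pseudo_labels = model.classes_[probs.argmax(axis=1)]

    # Keep only the unlabeled examples the model is confident about
    keep = confidence >= threshold
    X_new = np.vstack([X_labeled, X_unlabeled[keep]])
    y_new = np.concatenate([y_labeled, pseudo_labels[keep]])

    # Retrain on the combined labeled + pseudo-labeled data
    model.fit(X_new, y_new)
    return model, X_new, y_new
```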
The quality of labels in a training dataset plays a crucial role in the performance of any supervised learning algorithm. Poorly labeled data can produce incorrect or biased models that fail when deployed in production. Label quality is affected by several factors.
Label noise refers to random errors or inconsistencies in the assigned labels. Noise can arise from annotator mistakes, ambiguous instructions, measurement error, or subjective disagreements among labelers. Studies have shown that even benchmark datasets contain meaningful rates of label error. Northcutt et al. (2021) estimated that popular datasets such as ImageNet, CIFAR-10, and Amazon Reviews contain approximately 3-6% label errors.
Ambiguity occurs when it is genuinely unclear which label should be assigned to a given example. This is common in subjective tasks like sentiment analysis, where reasonable annotators may disagree. Ambiguity can result in inconsistent labels that confuse the model during training.
Label bias refers to systematic errors that skew the distribution of labels in a dataset. Bias can be introduced by annotator prejudice, skewed sampling, or historical patterns embedded in the data. Biased labels lead to biased models, which can produce unfair or harmful outcomes when deployed.
Confident Learning, proposed by Northcutt, Jiang, and Chuang (2021), is a principled framework for finding and fixing label errors in datasets. It directly estimates the joint distribution of noisy (observed) labels and true (hidden) labels using only the model's predicted probabilities, without requiring any guaranteed-clean validation data. The open-source cleanlab Python library implements Confident Learning and has become a widely used tool for data-centric AI practitioners.
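A highly simplified sketch of the core thresholding idea, not the full cleanlab implementation: estimate a per-class confidence threshold as the average predicted probability of each class among examples assigned that class, then flag examples whose given label falls below its threshold while some other class clears its own. The real algorithm builds and calibrates a full confident joint distribution; this only illustrates the flagging step.

```python
import numpy as np

def flag_likely_label_errors(pred_probs, noisy_labels):
    """Simplified confident-learning-style check for suspect labels.

    pred_probs:   (n_examples, n_classes) out-of-sample predicted probabilities
    noisy_labels: (n_examples,) observed (possibly wrong) integer labels
    """
    n, K = pred_probs.shape
    # Per-class threshold: average self-confidence of examples labeled with that class
    thresholds = np.array([pred_probs[noisy_labels == k, k].mean() for k in range(K)])

    flagged = []
    for i in range(n):
        given = noisy_labels[i]
        # Classes whose predicted probability clears their own threshold
        confident = np.where(pred_probs[i] >= thresholds)[0]
        # Suspect if another class is confidently predicted but the given label is not
        if len(confident) > 0 and given not in confident:
            flagged.append(i)
    return np.array(flagged)
```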
Obtaining high-quality labels is one of the most significant bottlenecks in applied machine learning. Several strategies exist for acquiring labels, each with different tradeoffs in cost, speed, and quality.
Domain experts or trained annotators manually assign labels to each data point. This produces the highest-quality labels but is the slowest and most expensive approach. For specialized domains like medical imaging or legal document review, expert annotators may charge hundreds of dollars per hour, and labeling a single complex image (such as pixel-level semantic segmentation) can take several minutes.
Platforms like Amazon Mechanical Turk, Scale AI, and Toloka distribute labeling tasks to a large pool of non-expert workers. Crowdsourcing is faster and cheaper than expert annotation but introduces more noise. Quality control mechanisms such as majority voting, gold-standard questions, and inter-annotator agreement metrics are used to improve reliability.
Programmatic labeling uses heuristic rules, knowledge bases, and existing models to automatically generate labels without manual annotation. Snorkel, developed by Ratner et al. (2017) at Stanford, pioneered the data programming paradigm in which users write labeling functions (small programs that each provide a noisy label for a subset of data points). Snorkel then models the accuracies and correlations of these labeling functions to produce a single probabilistic label for each example. In user studies, subject matter experts using Snorkel built models 2.8x faster than with hand labeling, and predictive performance improved by an average of 45.5% compared to seven hours of manual labeling.
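A toy sketch of the labeling-function idea, using plain Python and simple majority voting rather than Snorkel's actual generative label model; the heuristic rules, abstain convention, and example texts are invented for illustration:

```python
import numpy as np

SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

# Labeling functions: small heuristics that each vote on a subset of examples
def lf_contains_free_money(text):
    return SPAM if "free money" in text.lower() else ABSTAIN

def lf_has_unsubscribe(text):
    return SPAM if "unsubscribe" in text.lower() else ABSTAIN

def lf_looks_like_reply(text):
    return NOT_SPAM if text.startswith("Re:") else ABSTAIN

labeling_functions = [lf_contains_free_money, lf_has_unsubscribe, lf_looks_like_reply]

def weak_label(text):
    """Combine labeling-function votes by majority; None means no function fired."""
    votes = [lf(text) for lf in labeling_functions if lf(text) != ABSTAIN]
    if not votes:
        return None
    return int(np.bincount(votes).argmax())

print(weak_label("FREE MONEY!!! click to unsubscribe"))  # -> 1 (spam)
print(weak_label("Re: meeting notes"))                   # -> 0 (not spam)
```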
Active learning selects the most informative unlabeled examples for an annotator to label next, reducing the total number of labels needed to achieve a given level of model performance. By focusing human effort on examples near the decision boundary or where the model is most uncertain, active learning can achieve strong results with a fraction of the labels required by random sampling.
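A minimal sketch of uncertainty sampling, one common active-learning query strategy: rank unlabeled examples by the model's least-confident prediction and send the top ones to an annotator. The batch size is arbitrary, and any classifier exposing predict_proba would do:

```python
import numpy as np

def select_for_annotation(model, X_unlabeled, batch_size=10):
    """Pick the unlabeled examples the model is least confident about."""
    probs = model.predict_proba(X_unlabeled)   # (n_examples, n_classes)
    confidence = probs.max(axis=1)             # top predicted probability per example
    # The lowest-confidence examples are the most informative to label next
    return np.argsort(confidence)[:batch_size]
```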
| Acquisition method | Cost | Speed | Label quality | Best for |
|---|---|---|---|---|
| Manual expert annotation | High | Slow | Highest | Medical, legal, safety-critical domains |
| Crowdsourcing | Medium | Fast | Moderate (with QC) | Large-scale image and text annotation |
| Programmatic labeling (Snorkel) | Low | Very fast | Moderate | Rapid prototyping, large unlabeled corpora |
| Active learning | Medium | Moderate | High | Limited annotation budget |
Label imbalance (also called class imbalance) occurs when certain labels appear far more frequently than others in the training data. For example, in fraud detection, legitimate transactions may outnumber fraudulent ones by 1,000 to 1. Models trained on imbalanced data tend to be biased toward the majority class, achieving high overall accuracy while performing poorly on the minority class, which is often the class of greatest interest.
Common strategies for addressing a class-imbalanced dataset include:

- Oversampling the minority class, either by random duplication or with synthetic techniques such as SMOTE.
- Undersampling the majority class.
- Class-weighted loss functions that penalize errors on the minority class more heavily (a minimal weighting sketch follows this list).
- Adjusting the decision threshold after training.
- Evaluating with metrics such as precision, recall, F1 score, and AUC instead of raw accuracy.
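As a minimal sketch of the class-weighting strategy, here is the common inverse-frequency heuristic computed with plain NumPy; the counts are illustrative:

```python
import numpy as np

# Hypothetical imbalanced labels: 990 legitimate (0) vs. 10 fraudulent (1) transactions
y = np.array([0] * 990 + [1] * 10)

classes, counts = np.unique(y, return_counts=True)
# Inverse-frequency weights: rarer classes receive proportionally larger weights
weights = len(y) / (len(classes) * counts)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)  # approx {0: 0.505, 1: 50.0}

# Per-example weights that a weighted loss function would use during training
sample_weight = np.array([class_weight[label] for label in y])
```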
Label leakage (a specific form of data leakage) occurs when information about the target label inadvertently makes its way into the input features. This causes the model to achieve artificially high performance during training and evaluation but fail in production, where the leaked information is unavailable.
Label leakage can take several forms:

- Target leakage: a feature that is directly or indirectly derived from the label itself, such as a "chargeback issued" flag in a fraud-detection dataset.
- Temporal leakage: features computed from information that only becomes available after the moment the prediction would be made.
- Train/test contamination: duplicate or near-duplicate examples appearing in both the training and evaluation splits.
- Preprocessing leakage: label-dependent statistics (for example, target encodings) computed on the full dataset, including the evaluation split, before the train/test split.
Detecting leakage typically involves inspecting feature importance scores for suspiciously predictive features, examining the data collection timeline, and verifying that the train/test split respects temporal ordering.
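A small sketch of the feature-inspection step described above: score each feature on its own against a binary label and flag anything that is suspiciously predictive by itself. The AUC cutoff of 0.99 is an arbitrary illustration, and roc_auc_score comes from scikit-learn:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def flag_suspicious_features(X, y, auc_cutoff=0.99):
    """Flag features that are near-perfectly predictive of a binary label on their own."""
    suspicious = []
    for j in range(X.shape[1]):
        score = roc_auc_score(y, X[:, j])
        # AUC is symmetric around 0.5, so check both directions
        if max(score, 1.0 - score) >= auc_cutoff:
            suspicious.append(j)
    return suspicious
```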
Several commercial and open-source platforms have been developed to streamline the data labeling workflow.
| Tool | Type | Key features |
|---|---|---|
| Label Studio | Open-source | Supports text, image, audio, video, and time series; customizable labeling interfaces; self-hosted or cloud |
| Labelbox | Commercial | Model-assisted labeling, workflow automation, cloud integrations (AWS, GCP, Azure) |
| Amazon SageMaker Ground Truth | Commercial (AWS) | Active learning to reduce costs, integration with AWS ecosystem, built-in workforce management |
| Prodigy | Commercial | Built by the spaCy team, scriptable annotation with active learning, NLP-focused |
| CVAT | Open-source | Computer vision focused, supports bounding boxes, polygons, segmentation masks |
| Scale AI | Commercial platform | Managed labeling workforce, API-driven, used by autonomous vehicle companies |
Data labeling is widely recognized as one of the most expensive components of machine learning projects. By some estimates, data preparation (including labeling) accounts for roughly 80% of the total time spent on a machine learning project.
Costs vary dramatically based on task complexity:
| Task type | Approximate cost per annotation | Notes |
|---|---|---|
| Simple binary classification | $0.01 - $0.05 | Spam/not spam, sentiment polarity |
| Object bounding boxes | $0.05 - $0.50 | Varies by number of objects per image |
| Named entity recognition (text) | $0.10 - $1.00 | Depends on entity density and domain expertise |
| Semantic segmentation | $1.00 - $10.00+ | Pixel-level annotation is highly labor-intensive |
| Medical image annotation | $10.00 - $100.00+ | Requires trained radiologists or pathologists |
The high cost of manual labeling has driven research into weak supervision, semi-supervised learning, self-supervised learning, and foundation models that can be fine-tuned with minimal labeled data.
Imagine you are teaching a friend to sort a pile of toy animals into the right boxes. You hold up each toy and say, "This one is a dinosaur" or "This one is a horse." The name you say out loud is the label. Your friend listens, looks at each toy, and eventually learns to sort them on their own, even when you stop telling them the answers. In machine learning, labels work the same way: they are the correct answers that a computer uses to learn patterns from examples.