See also: Machine learning terms
In machine learning, a label is the target output value associated with a single training example. During supervised machine learning, a model learns to map input features to labels so that it can predict labels for new, unseen data. Labels are also referred to as targets, response variables, dependent variables, ground truth values, or annotations, depending on the field and context.
The process of assigning labels to raw data points is called labeling (or annotation), and it is one of the most time-consuming and expensive steps in building a machine learning system. A dataset in which every example has been assigned a label is called a labeled dataset, while data that lacks labels is referred to as unlabeled data.
In a classification model, labels are discrete categories (also called class labels or classes). Each training example belongs to exactly one class in standard single-label classification. Common examples include:
| Task | Possible class labels | Number of classes |
|---|---|---|
| Email spam detection | Spam, Not Spam | 2 (binary) |
| Handwritten digit recognition (MNIST) | 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 | 10 |
| ImageNet image classification | Dog, Cat, Car, Airplane, ... | 1,000 |
| Sentiment analysis | Positive, Negative, Neutral | 3 |
In binary classification there are only two possible labels (often encoded as 0 and 1). In multiclass classification there are three or more mutually exclusive labels. The model outputs a predicted probability distribution over all classes, and the class with the highest probability is selected as the prediction.
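As a minimal sketch of that final step, here is how a predicted probability distribution is turned into a class label; the class names and probabilities below are invented for illustration:

```python
import numpy as np

# Hypothetical sentiment classes and one example's predicted probability distribution
classes = ["Positive", "Negative", "Neutral"]
predicted_probs = np.array([0.72, 0.08, 0.20])  # sums to 1.0

# The predicted label is the class with the highest predicted probability
predicted_label = classes[int(np.argmax(predicted_probs))]
print(predicted_label)  # -> "Positive"
```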
In a regression model, labels are continuous numerical values rather than discrete categories. The model learns to predict a real number that minimizes the error between the predicted value and the actual label. Examples include predicting house prices, stock returns, temperature forecasts, and patient blood pressure readings. Because the label space is continuous, evaluation metrics such as mean squared error (MSE) and mean absolute error (MAE) are used instead of accuracy.
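A short sketch of how these two metrics compare predictions against continuous labels; the numbers are illustrative (house prices in thousands of dollars):

```python
import numpy as np

# Hypothetical continuous labels and model predictions
y_true = np.array([250.0, 310.0, 480.0, 195.0])
y_pred = np.array([260.0, 295.0, 500.0, 210.0])

mse = np.mean((y_true - y_pred) ** 2)   # mean squared error
mae = np.mean(np.abs(y_true - y_pred))  # mean absolute error
print(mse, mae)
```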
Multi-label classification is a generalization of standard classification in which each example can be assigned multiple non-exclusive labels simultaneously. For instance, a news article might be tagged with both "Politics" and "Economics," or a photograph might contain both a dog and a car. Formally, the model maps each input to a binary vector where each element indicates the presence (1) or absence (0) of a particular label.
Multi-label problems are commonly solved using neural networks with a sigmoid activation function on each output node and binary cross-entropy loss. Other approaches include problem transformation methods (such as Binary Relevance, which trains one binary classifier per label) and algorithm adaptation methods that modify existing algorithms to handle multiple outputs directly.
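A minimal sketch of this independent-sigmoid formulation, written in plain NumPy rather than any particular deep learning framework; the logits and label vector are invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw model outputs (logits) for 4 possible tags on one example
logits = np.array([2.1, -0.5, 0.3, -3.0])
# Multi-label target vector: tags 0 and 2 are present, tags 1 and 3 are absent
y = np.array([1.0, 0.0, 1.0, 0.0])

p = sigmoid(logits)  # independent per-label probabilities

# Binary cross-entropy averaged over the labels
eps = 1e-12
bce = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Predicted label set: every tag whose probability exceeds 0.5
predicted_tags = (p > 0.5).astype(int)
print(predicted_tags, bce)
```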
Before labels can be used by most machine learning algorithms, they must be converted into a numerical representation. The two most common encoding schemes are integer encoding and one-hot encoding.
| Encoding method | Representation | Best suited for | Example (three classes: Cat, Dog, Bird) |
|---|---|---|---|
| Integer (ordinal) encoding | Single integer per class | Ordinal data or tree-based models | Cat = 0, Dog = 1, Bird = 2 |
| One-hot encoding | Binary vector with one 1 | Nominal data, neural networks, logistic regression | Cat = [1,0,0], Dog = [0,1,0], Bird = [0,0,1] |
Integer encoding assigns each category a unique integer. This is compact but can mislead models that interpret numerical proximity as similarity (for example, a linear model might assume Dog is "between" Cat and Bird). One-hot encoding avoids this problem by representing each class as a separate binary column, but it increases the dimensionality of the feature space. The choice between the two depends on the algorithm and the nature of the data.
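A small sketch of both encodings for the three-class example in the table, using plain NumPy; in practice, scikit-learn's LabelEncoder and OneHotEncoder provide equivalent transformations:

```python
import numpy as np

labels = ["Cat", "Dog", "Bird", "Dog", "Cat"]

# Integer (ordinal) encoding: map each class name to a unique integer
class_to_int = {"Cat": 0, "Dog": 1, "Bird": 2}
int_encoded = np.array([class_to_int[c] for c in labels])

# One-hot encoding: a binary vector with a single 1 per example
num_classes = len(class_to_int)
one_hot = np.eye(num_classes, dtype=int)[int_encoded]

print(int_encoded)  # integer code per example
print(one_hot)      # one row per example, one column per class
```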
Label smoothing is a regularization technique introduced by Szegedy et al. (2016) that replaces hard one-hot target vectors with softened versions. Instead of assigning a probability of 1.0 to the correct class and 0.0 to all others, label smoothing redistributes a small amount of probability mass uniformly across all classes. For a smoothing parameter alpha (typically 0.1), the target for the correct class becomes (1 - alpha) + alpha / K and each incorrect class receives alpha / K, where K is the total number of classes, so the smoothed targets still sum to 1.
This prevents the model from becoming overconfident in its predictions and has been shown to improve generalization across image classification, machine translation, and language modeling tasks. Label smoothing has become a standard component in modern training recipes, particularly on large-scale benchmarks like ImageNet.
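A minimal sketch of the smoothing operation described above, applied to a one-hot target with K = 5 classes and alpha = 0.1:

```python
import numpy as np

def smooth_labels(one_hot, alpha=0.1):
    """Blend a one-hot target with the uniform distribution over K classes."""
    K = one_hot.shape[-1]
    return (1.0 - alpha) * one_hot + alpha / K

# Hard one-hot target for class 2 out of K = 5 classes
hard = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
soft = smooth_labels(hard, alpha=0.1)
print(soft)  # [0.02 0.02 0.92 0.02 0.02] -- still sums to 1
```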
Pseudo-labeling is a semi-supervised learning technique introduced by Lee (2013) that leverages unlabeled data by assigning it artificial labels generated by the model itself. The process works as follows:

1. Train an initial model on the available labeled data.
2. Use the trained model to predict labels for the unlabeled data.
3. Treat the predicted class for each unlabeled example (the class with the highest predicted probability, often filtered by a confidence threshold) as its pseudo-label.
4. Retrain the model on the labeled and pseudo-labeled examples together, typically down-weighting the loss on the pseudo-labeled portion, and repeat as the model improves.
Pseudo-labeling is equivalent to entropy regularization, which encourages the model to make confident predictions and favors a low-density separation between classes. Modern semi-supervised methods such as FixMatch and MixMatch build on the pseudo-label concept by adding consistency regularization and advanced data augmentation. The technique is especially valuable when labeled data is scarce but unlabeled data is abundant.
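A simplified sketch of one round of this loop, using a scikit-learn-style classifier with a confidence threshold; the threshold value and model choice are illustrative and are not part of Lee's original formulation, which uses the raw argmax together with a schedule on the unlabeled loss weight:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pseudo_label_round(X_labeled, y_labeled, X_unlabeled, threshold=0.95):
    """One round of pseudo-labeling: train, predict, keep confident predictions, retrain."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_labeled, y_labeled)

    probs = model.predict_proba(X_unlabeled)
    confidence = probs.max(axis=1)
    pseudo_labels = model.classes_[probs.argmax(axis=1)]

    # Keep only the unlabeled examples the model is confident about
    keep = confidence >= threshold
    X_new = np.vstack([X_labeled, X_unlabeled[keep]])
    y_new = np.concatenate([y_labeled, pseudo_labels[keep]])

    # Retrain on the combined labeled + pseudo-labeled data
    model.fit(X_new, y_new)
    return model, X_new, y_new
```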
The quality of labels in a training dataset plays a crucial role in the performance of any supervised learning algorithm. Poorly labeled data can produce incorrect or biased models that fail when deployed in production. Label quality is affected by several factors.
Label noise refers to random errors or inconsistencies in the assigned labels. Noise can arise from annotator mistakes, ambiguous instructions, measurement error, or subjective disagreements among labelers. Studies have shown that even benchmark datasets contain meaningful rates of label error. Northcutt et al. (2021) estimated that popular datasets such as ImageNet, CIFAR-10, and Amazon Reviews contain approximately 3-6% label errors.
Ambiguity occurs when it is genuinely unclear which label should be assigned to a given example. This is common in subjective tasks like sentiment analysis, where reasonable annotators may disagree. Ambiguity can result in inconsistent labels that confuse the model during training.
Label bias refers to systematic errors that skew the distribution of labels in a dataset. Bias can be introduced by annotator prejudice, skewed sampling, or historical patterns embedded in the data. Biased labels lead to biased models, which can produce unfair or harmful outcomes when deployed.
Confident Learning, proposed by Northcutt, Jiang, and Chuang (2021), is a principled framework for finding and fixing label errors in datasets. It directly estimates the joint distribution of noisy (observed) labels and true (hidden) labels using only the model's predicted probabilities, without requiring any guaranteed-clean validation data. The open-source cleanlab Python library implements Confident Learning and has become a widely used tool for data-centric AI practitioners.
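A highly simplified sketch of the core thresholding idea, not the full cleanlab implementation: estimate a per-class confidence threshold as the average predicted probability of each class among examples assigned that class, then flag examples whose given label falls below its threshold while some other class clears its own. The real algorithm builds and calibrates a full confident joint distribution; this only illustrates the flagging step.

```python
import numpy as np

def flag_likely_label_errors(pred_probs, noisy_labels):
    """Simplified confident-learning-style check for suspect labels.

    pred_probs:   (n_examples, n_classes) out-of-sample predicted probabilities
    noisy_labels: (n_examples,) observed (possibly wrong) integer labels
    """
    n, K = pred_probs.shape
    # Per-class threshold: average self-confidence of examples labeled with that class
    thresholds = np.array([pred_probs[noisy_labels == k, k].mean() for k in range(K)])

    flagged = []
    for i in range(n):
        given = noisy_labels[i]
        # Classes whose predicted probability clears their own threshold
        confident = np.where(pred_probs[i] >= thresholds)[0]
        # Suspect if another class is confidently predicted but the given label is not
        if len(confident) > 0 and given not in confident:
            flagged.append(i)
    return np.array(flagged)
```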
Obtaining high-quality labels is one of the most significant bottlenecks in applied machine learning. Several strategies exist for acquiring labels, each with different tradeoffs in cost, speed, and quality.
Domain experts or trained annotators manually assign labels to each data point. This produces the highest-quality labels but is the slowest and most expensive approach. For specialized domains like medical imaging or legal document review, expert annotators may charge hundreds of dollars per hour, and labeling a single complex image (such as pixel-level semantic segmentation) can take several minutes.
Platforms like Amazon Mechanical Turk, Scale AI, and Toloka distribute labeling tasks to a large pool of non-expert workers. Crowdsourcing is faster and cheaper than expert annotation but introduces more noise. Quality control mechanisms such as majority voting, gold-standard questions, and inter-annotator agreement metrics are used to improve reliability.
Programmatic labeling uses heuristic rules, knowledge bases, and existing models to automatically generate labels without manual annotation. Snorkel, developed by Ratner et al. (2017) at Stanford, pioneered the data programming paradigm in which users write labeling functions (small programs that each provide a noisy label for a subset of data points). Snorkel then models the accuracies and correlations of these labeling functions to produce a single probabilistic label for each example. In user studies, subject matter experts using Snorkel built models 2.8x faster than with hand labeling, and predictive performance improved by an average of 45.5% compared to seven hours of manual labeling.
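A toy sketch of the labeling-function idea, using plain Python and simple majority voting rather than Snorkel's actual generative label model; the heuristic rules, abstain convention, and example texts are invented for illustration:

```python
import numpy as np

SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

# Labeling functions: small heuristics that each vote on a subset of examples
def lf_contains_free_money(text):
    return SPAM if "free money" in text.lower() else ABSTAIN

def lf_has_unsubscribe(text):
    return SPAM if "unsubscribe" in text.lower() else ABSTAIN

def lf_looks_like_reply(text):
    return NOT_SPAM if text.startswith("Re:") else ABSTAIN

labeling_functions = [lf_contains_free_money, lf_has_unsubscribe, lf_looks_like_reply]

def weak_label(text):
    """Combine labeling-function votes by majority; None means no function fired."""
    votes = [lf(text) for lf in labeling_functions if lf(text) != ABSTAIN]
    if not votes:
        return None
    return int(np.bincount(votes).argmax())

print(weak_label("FREE MONEY!!! click to unsubscribe"))  # -> 1 (spam)
print(weak_label("Re: meeting notes"))                   # -> 0 (not spam)
```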
Active learning selects the most informative unlabeled examples for an annotator to label next, reducing the total number of labels needed to achieve a given level of model performance. By focusing human effort on examples near the decision boundary or where the model is most uncertain, active learning can achieve strong results with a fraction of the labels required by random sampling.
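A minimal sketch of uncertainty sampling, one common active-learning query strategy: rank unlabeled examples by the model's least-confident prediction and send the top ones to an annotator. The batch size is arbitrary, and any classifier exposing predict_proba would do:

```python
import numpy as np

def select_for_annotation(model, X_unlabeled, batch_size=10):
    """Pick the unlabeled examples the model is least confident about."""
    probs = model.predict_proba(X_unlabeled)   # (n_examples, n_classes)
    confidence = probs.max(axis=1)             # top predicted probability per example
    # The lowest-confidence examples are the most informative to label next
    return np.argsort(confidence)[:batch_size]
```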
| Acquisition method | Cost | Speed | Label quality | Best for |
|---|---|---|---|---|
| Manual expert annotation | High | Slow | Highest | Medical, legal, safety-critical domains |
| Crowdsourcing | Medium | Fast | Moderate (with QC) | Large-scale image and text annotation |
| Programmatic labeling (Snorkel) | Low | Very fast | Moderate | Rapid prototyping, large unlabeled corpora |
| Active learning | Medium | Moderate | High | Limited annotation budget |
Label imbalance (also called class imbalance) occurs when certain labels appear far more frequently than others in the training data. For example, in fraud detection, legitimate transactions may outnumber fraudulent ones by 1,000 to 1. Models trained on imbalanced data tend to be biased toward the majority class, achieving high overall accuracy while performing poorly on the minority class, which is often the class of greatest interest.
Common strategies for addressing a class-imbalanced dataset include:

- Oversampling the minority class, either by random duplication or with synthetic techniques such as SMOTE.
- Undersampling the majority class.
- Class-weighted loss functions that penalize errors on the minority class more heavily (a minimal weighting sketch follows this list).
- Adjusting the decision threshold after training.
- Evaluating with metrics such as precision, recall, F1 score, and AUC instead of raw accuracy.
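As a minimal sketch of the class-weighting strategy, here is the common inverse-frequency heuristic computed with plain NumPy; the counts are illustrative:

```python
import numpy as np

# Hypothetical imbalanced labels: 990 legitimate (0) vs. 10 fraudulent (1) transactions
y = np.array([0] * 990 + [1] * 10)

classes, counts = np.unique(y, return_counts=True)
# Inverse-frequency weights: rarer classes receive proportionally larger weights
weights = len(y) / (len(classes) * counts)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)  # approx {0: 0.505, 1: 50.0}

# Per-example weights that a weighted loss function would use during training
sample_weight = np.array([class_weight[label] for label in y])
```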
Label leakage (a specific form of data leakage) occurs when information about the target label inadvertently makes its way into the input features. This causes the model to achieve artificially high performance during training and evaluation but fail in production, where the leaked information is unavailable.
Label leakage can take several forms:

- Target leakage: a feature that is directly or indirectly derived from the label itself, such as a "chargeback issued" flag in a fraud-detection dataset.
- Temporal leakage: features computed from information that only becomes available after the moment the prediction would be made.
- Train/test contamination: duplicate or near-duplicate examples appearing in both the training and evaluation splits.
- Preprocessing leakage: label-dependent statistics (for example, target encodings) computed on the full dataset, including the evaluation split, before the train/test split.
Detecting leakage typically involves inspecting feature importance scores for suspiciously predictive features, examining the data collection timeline, and verifying that the train/test split respects temporal ordering.
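A small sketch of the feature-inspection step described above: score each feature on its own against a binary label and flag anything that is suspiciously predictive by itself. The AUC cutoff of 0.99 is an arbitrary illustration, and roc_auc_score comes from scikit-learn:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def flag_suspicious_features(X, y, auc_cutoff=0.99):
    """Flag features that are near-perfectly predictive of a binary label on their own."""
    suspicious = []
    for j in range(X.shape[1]):
        score = roc_auc_score(y, X[:, j])
        # AUC is symmetric around 0.5, so check both directions
        if max(score, 1.0 - score) >= auc_cutoff:
            suspicious.append(j)
    return suspicious
```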
Several commercial and open-source platforms have been developed to streamline the data labeling workflow.
| Tool | Type | Key features |
|---|---|---|
| Label Studio | Open-source | Supports text, image, audio, video, and time series; customizable labeling interfaces; self-hosted or cloud |
| Labelbox | Commercial | Model-assisted labeling, workflow automation, cloud integrations (AWS, GCP, Azure) |
| Amazon SageMaker Ground Truth | Commercial (AWS) | Active learning to reduce costs, integration with AWS ecosystem, built-in workforce management |
| Prodigy | Commercial | Built by the spaCy team, scriptable annotation with active learning, NLP-focused |
| CVAT | Open-source | Computer vision focused, supports bounding boxes, polygons, segmentation masks |
| Scale AI | Commercial platform | Managed labeling workforce, API-driven, used by autonomous vehicle companies |
Data labeling is widely recognized as one of the most expensive components of machine learning projects. By some estimates, data preparation (including labeling) accounts for roughly 80% of the total time spent on a machine learning project.
Costs vary dramatically based on task complexity:
| Task type | Approximate cost per annotation | Notes |
|---|---|---|
| Simple binary classification | $0.01 - $0.05 | Spam/not spam, sentiment polarity |
| Object bounding boxes | $0.05 - $0.50 | Varies by number of objects per image |
| Named entity recognition (text) | $0.10 - $1.00 | Depends on entity density and domain expertise |
| Semantic segmentation | $1.00 - $10.00+ | Pixel-level annotation is highly labor-intensive |
| Medical image annotation | $10.00 - $100.00+ | Requires trained radiologists or pathologists |
The high cost of manual labeling has driven research into weak supervision, semi-supervised learning, self-supervised learning, and foundation models that can be fine-tuned with minimal labeled data.
Imagine you are teaching a friend to sort a pile of toy animals into the right boxes. You hold up each toy and say, "This one is a dinosaur" or "This one is a horse." The name you say out loud is the label. Your friend listens, looks at each toy, and eventually learns to sort them on their own, even when you stop telling them the answers. In machine learning, labels work the same way: they are the correct answers that a computer uses to learn patterns from examples.