Labeled example
Last reviewed: May 11, 2026
Sources: 13 citations
Review status: Source-backed
Revision: v2 · 2,198 words
See also: Machine learning terms
In machine learning, a labeled example is a data point that consists of one or more input features and a corresponding target value called the label. Google's Machine Learning Glossary defines a labeled example as "an example that contains one or more features and a label," and describes an unlabeled example as one that "consists of one or more features but no label." Labeled examples are the fuel for supervised learning algorithms, which study many such examples to learn a function that maps features to labels and can then make predictions on new, unseen data.
A labeled example is often visualized as a single row in a spreadsheet: feature values in the left-hand columns, and the label in the right-hand column. Strip the label column and you are left with an unlabeled example, which is what the model encounters at inference time.
For a house price task, a labeled example might contain features such as square footage, number of bedrooms, year built, and ZIP code, paired with a label of $725,000 (the actual sale price). For a spam filter, a labeled example would be the text of an email along with a label of "spam" or "not spam." The model only sees labels during training and evaluation.
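As a minimal sketch of the spreadsheet-row picture, the house price example can be written as a small record whose last field is the label. The class name `HouseExample`, its field names, and every feature value here are hypothetical; only the $725,000 label comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HouseExample:
    # Features (the left-hand columns of the row)
    square_feet: int
    bedrooms: int
    year_built: int
    zip_code: str
    # Label (the right-hand column); None means the example is unlabeled
    sale_price: Optional[float] = None

    @property
    def is_labeled(self) -> bool:
        return self.sale_price is not None

    def without_label(self) -> "HouseExample":
        """Return the same features with the label stripped."""
        return HouseExample(
            self.square_feet, self.bedrooms, self.year_built, self.zip_code
        )


# Hypothetical feature values paired with the label from the text.
labeled = HouseExample(1850, 3, 1996, "94110", sale_price=725_000.0)
unlabeled = labeled.without_label()  # what a model would see at inference time

print(labeled.is_labeled, unlabeled.is_labeled)  # True False
```

Keeping the label as an optional field makes the labeled and unlabeled cases two states of the same record, which mirrors the "strip the label column" description.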
Labels in machine learning fall into two broad families depending on the task.
Categorical labels, also called class labels, are used in classification tasks where the output is one of a finite set of discrete categories. Common examples include image classification (cat, dog, car), sentiment analysis (positive, negative, neutral), and medical diagnosis (benign or malignant). Categorical labels can be binary or multi-class, and a single example can sometimes carry several labels at once in a multi-label setting (for example, an image tagged with both "beach" and "sunset").
Continuous labels are used in regression tasks where the goal is to predict a real-valued number. Typical examples include predicting housing prices from location, size, and age; estimating a person's weight from height and age; or forecasting next-day temperature from sensor readings. Unlike categorical labels, continuous labels have a natural ordering and meaningful distances between values.
Some tasks blur the line. Ordinal labels (a star rating from 1 to 5) are ordered like continuous labels but take values from a small fixed set like categorical ones. Structured prediction tasks output labels that are themselves complex objects: a sequence of part-of-speech tags, a bounding box, a segmentation mask, or a parse tree.
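The label families above are usually turned into numbers before training. The sketch below shows one common convention, assuming integer class indices for single-label classification and a multi-hot vector for the multi-label case. The helper names and the extra "mountain" tag are illustrative, not taken from any particular library.

```python
# Class and tag vocabularies; "mountain" is added only for illustration.
CLASSES = ["cat", "dog", "car"]
TAGS = ["beach", "sunset", "mountain"]


def encode_class(label: str, classes: list[str]) -> int:
    """Map a single categorical label to its integer class index."""
    return classes.index(label)


def multi_hot(labels: set[str], vocabulary: list[str]) -> list[int]:
    """Map a set of labels to a 0/1 vector with one slot per known tag."""
    return [1 if tag in labels else 0 for tag in vocabulary]


print(encode_class("dog", CLASSES))          # 1
print(multi_hot({"beach", "sunset"}, TAGS))  # [1, 1, 0]

# A continuous (regression) label needs no encoding: it is already a number.
house_price_label = 725_000.0
```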
Building a labeled dataset is usually the most expensive and slowest part of a supervised machine learning project. The work breaks down into three stages.
The first stage is collecting raw data relevant to the problem. Sources include sensor logs, web scrapes, internal databases, user-generated content, scientific instruments, and licensed third-party datasets. A persistent concern is whether the data is representative of the situations the model will face in production. A driving model trained only on photos shot in daylight, for example, is unlikely to perform well at night.
Before labels are attached, raw data is usually cleaned and normalized: removing duplicates, handling missing values, stripping personally identifiable information, and standardizing units. For images this often means resizing, cropping, and color-space conversion; for text it can mean tokenization, lowercasing, and stripping HTML.
The final stage is attaching a label to each example. Human annotators are the most common source, especially when judgment is required, but labels can also come from logged user behavior (clicks, purchases, ratings), later-observed outcomes and instrument readings (the actual stock price the next day), or programmatic rules. The quality of the labels directly bounds the quality of any model trained on them.
Different tasks call for different annotation formats. For images, annotators may pick a category, draw a bounding box around each object, or produce a pixel-level segmentation mask. For text, they may classify sentiment, mark named entities, or write a reference translation. For speech, they transcribe audio or mark phoneme boundaries. The richer the annotation, the more expensive each labeled example tends to be.
Large labeled datasets are rarely produced by a single person. Instead, the work is distributed across many annotators, often through online platforms.
Amazon Mechanical Turk (MTurk) is a crowdsourcing marketplace launched by Amazon in 2005. Requesters post small tasks called HITs (Human Intelligence Tasks), and a global pool of workers completes them for small payments. MTurk became famous as the workhorse behind ImageNet, the dataset built by Stanford computer scientist Fei-Fei Li starting in 2007. Labeling for ImageNet ran from July 2008 through April 2010 and involved roughly 49,000 workers from 167 countries who filtered and labeled more than 160 million candidate images. Each of the roughly 14 million accepted images was reviewed by three independent annotators. The dataset became the standard benchmark for the deep learning revolution that followed AlexNet in 2012.
Scale AI is a more recent industrial labeling provider. Founded in 2016 by Alexandr Wang and Lucy Guo, it runs a network of human annotators augmented by machine-assisted pre-labeling and consensus-based quality assurance. Scale's main early business was labeling sensor data for autonomous vehicle companies: drawing 3D cuboids around cars and pedestrians in LiDAR point clouds, annotating high-definition maps, and segmenting camera frames. The company has labeled data for Google, Microsoft, Meta, General Motors, and OpenAI. In June 2025, Meta agreed to take a 49% non-voting stake in Scale AI for about $14.8 billion, a sign of how strategic labeled data has become to frontier AI development.
Other providers include Appen, Labelbox, Sama, iMerit, and Snorkel AI. The global data labeling market was estimated at roughly $18.6 billion in 2024 and is projected to grow at over 20% annually through 2030, according to Grand View Research.
Labeled examples are only as useful as their labels are accurate. "Label noise" refers to errors and inconsistencies in the labels themselves, whether they come from annotator mistakes, ambiguous cases, inconsistent instructions, or class boundaries that are genuinely fuzzy.
A 2021 study by researchers at MIT found that 10 of the most-cited machine learning benchmark datasets, including ImageNet and CIFAR-10, had an average label error rate of about 3.4% in their test splits. In one common benchmark, a CNN trained on CIFAR-10 with clean labels reached around 73.6% accuracy; the same architecture trained on a version with 30% injected label noise dropped to 64.1%. Random label noise tends to act like a weak regularizer, while systematic noise (where mistakes follow a pattern) is much more damaging, because models cheerfully learn the pattern.
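Experiments with injected noise usually start from clean labels and corrupt a chosen fraction of them. The text does not say how the 30% noise in the CIFAR-10 comparison was generated, so the sketch below assumes the simplest scheme, symmetric (uniform random) noise, in which a corrupted label is replaced by a different class chosen uniformly at random. The function name and parameters are illustrative.

```python
import random


def inject_symmetric_noise(labels, num_classes, noise_rate, seed=0):
    """Return a copy of `labels` where roughly `noise_rate` of the entries
    are flipped to a different class chosen uniformly at random."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            # Draw from the other num_classes - 1 classes, skipping y itself.
            other = rng.randrange(num_classes - 1)
            noisy.append(other if other < y else other + 1)
        else:
            noisy.append(y)
    return noisy


clean = [i % 10 for i in range(10_000)]  # synthetic 10-class labels
noisy = inject_symmetric_noise(clean, num_classes=10, noise_rate=0.3)
flipped = sum(c != n for c, n in zip(clean, noisy)) / len(clean)
print(f"fraction of labels changed: {flipped:.3f}")
```

Systematic noise would instead flip labels according to a pattern (for example, always confusing one specific class with another), which is the case the paragraph above describes as more damaging.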
The usual way to measure annotation quality is inter-annotator agreement. Cohen's kappa, introduced by psychologist Jacob Cohen in 1960, measures agreement between two annotators while correcting for chance agreement. Fleiss's kappa generalizes this to more than two annotators, and Krippendorff's alpha handles missing data and different measurement scales (nominal, ordinal, interval, and ratio). High agreement suggests the task is well defined and the annotators are calibrated; low agreement is a sign that the labeling instructions need work, or that the task itself is too ambiguous to label reliably.
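Cohen's kappa is defined as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed fraction of items on which the two annotators agree and p_e is the agreement expected by chance given how often each annotator uses each label. A minimal sketch for two annotators follows; the example labels are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


annotator_a = ["spam"] * 5 + ["ham"] * 5
annotator_b = ["spam"] * 4 + ["ham"] * 5 + ["spam"]
# They agree on 8 of 10 items (p_o = 0.8); chance agreement is p_e = 0.5.
print(round(cohens_kappa(annotator_a, annotator_b), 3))  # 0.6
```

The example shows why the correction matters: raw agreement is 80%, but because half of that would be expected by chance, kappa is only 0.6.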
Common tactics for improving label quality include using multiple annotators per example with majority vote, building a "gold standard" set that annotators are silently retested on, running automated consensus checks, and having experienced reviewers audit a random sample of work.
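Two of these tactics are easy to sketch: aggregating several annotators' votes by majority, and scoring an annotator against a gold-standard set. The function names, the tie-handling rule (return no label so the item can be escalated), and the sample data are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(votes):
    """Return (label, agreement). The label is None when the top vote is tied."""
    if not votes:
        raise ValueError("need at least one vote")
    ranked = Counter(votes).most_common()
    top_label, top_count = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == top_count
    agreement = top_count / len(votes)
    return (None if tied else top_label), agreement


def gold_accuracy(annotator_labels, gold_labels):
    """Share of gold-standard items the annotator labeled correctly,
    or None if the annotator saw none of them."""
    checked = [item for item in gold_labels if item in annotator_labels]
    if not checked:
        return None
    correct = sum(annotator_labels[item] == gold_labels[item] for item in checked)
    return correct / len(checked)


print(majority_vote(["cat", "cat", "dog"]))  # ('cat', 0.6666666666666666)
print(majority_vote(["cat", "dog"]))         # (None, 0.5)

gold = {"img1": "cat", "img2": "dog"}
annotator = {"img1": "cat", "img2": "cat", "img3": "car"}
print(gold_accuracy(annotator, gold))        # 0.5
```

The agreement share returned alongside the winning label gives a cheap signal for which items to route to an experienced reviewer.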
When labels are expensive, it pays to be careful about which examples you ask humans to label. Active learning is the family of techniques that lets a model choose its own training data by repeatedly picking the unlabeled examples it expects to learn the most from.
The most common setup is pool-based active learning. The system starts with a small labeled seed set and a large pool of unlabeled examples. A model is trained on the seed set, then ranks every example in the pool by some measure of informativeness. The top-ranked examples are sent to human annotators, their labels are added, the model is retrained, and the loop repeats until the labeling budget runs out or accuracy is sufficient.
Key query strategies include uncertainty sampling (ask about the examples the current model is least confident on), query by committee (ask where an ensemble of models disagrees most), expected model change, and core-set selection (pick examples that cover the input distribution well). Reported savings vary widely by task and dataset: in favorable cases, active learning can reduce the number of labels needed to reach a target accuracy by 50% to 90%. A stream-based variant decides on the spot whether to spend a label on each arriving example, which is useful for labeling rare events in a live data feed.
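The pool-based loop with uncertainty sampling can be sketched on a toy problem. Everything here is an illustrative assumption: the data is one-dimensional, the "model" is a single decision threshold, the `oracle` function stands in for a human annotator, and its hidden cutoff of 0.62 is arbitrary. For a threshold model, the pool example closest to the current boundary is the one the model is least confident about.

```python
import random


def oracle(x):
    """Stand-in for a human annotator; the hidden rule is hypothetical."""
    return int(x >= 0.62)


def fit_threshold(labeled):
    """Toy model: midpoint between the largest labeled negative
    and the smallest labeled positive."""
    largest_negative = max(x for x, y in labeled if y == 0)
    smallest_positive = min(x for x, y in labeled if y == 1)
    return (largest_negative + smallest_positive) / 2


def active_learning(pool, seed_set, budget):
    labeled = list(seed_set)
    pool = list(pool)
    for _ in range(budget):
        if not pool:
            break
        threshold = fit_threshold(labeled)  # retrain on all labels so far
        # Uncertainty sampling: query the example nearest the boundary.
        query = min(pool, key=lambda x: abs(x - threshold))
        pool.remove(query)
        labeled.append((query, oracle(query)))  # pay for one label
    return fit_threshold(labeled), labeled


rng = random.Random(0)
pool = [rng.random() for _ in range(1000)]   # unlabeled pool
seed_set = [(0.0, 0), (1.0, 1)]              # small labeled seed set
threshold, labeled = active_learning(pool, seed_set, budget=10)
print(f"learned threshold after 10 queries: {threshold:.3f}")
```

In this toy setting the loop behaves like a binary search, so ten labels are enough to locate the boundary closely. Real tasks are rarely this clean, and the other strategies listed above exist partly because pure uncertainty sampling can over-focus on outliers.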
Another way to attack the labeling bottleneck is to skip individual human annotation almost entirely. Weak supervision generates labels with noisy automated sources: heuristic rules, distant supervision from existing knowledge bases, crowdsourced labels of varying quality, or outputs from other models.
Snorkel is the best-known system in this family. Developed at the Stanford AI Lab and first described in a 2017 VLDB paper by Alexander Ratner and colleagues, Snorkel asks domain experts to write "labeling functions," short pieces of code that programmatically label data points or abstain. Each function on its own is noisy and incomplete, but Snorkel's generative model learns how reliable each is, denoises their outputs, and produces probabilistic training labels. In one published user study, experts using Snorkel built models 2.8 times faster and achieved 45.5% higher predictive performance than with seven hours of hand labeling. Snorkel DryBell, a follow-up case study at Google, used the same approach to train production text classifiers without manually labeled training data.
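To make the idea of a labeling function concrete, here is a plain-Python sketch that does not use the Snorkel library. The three heuristics are hypothetical, and the votes are combined by simple majority rather than by Snorkel's generative model, so this shows only the shape of the approach: each function either votes for a label or abstains.

```python
SPAM, HAM, ABSTAIN = 1, 0, -1


# Hypothetical labeling functions: each encodes one noisy heuristic.
def lf_contains_free_money(text):
    return SPAM if "free money" in text.lower() else ABSTAIN


def lf_many_exclamations(text):
    return SPAM if text.count("!") >= 3 else ABSTAIN


def lf_mentions_meeting(text):
    return HAM if "meeting" in text.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_free_money, lf_many_exclamations, lf_mentions_meeting]


def weak_label(text):
    """Combine non-abstaining votes by majority; abstain on ties or no votes."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    spam_votes = sum(v == SPAM for v in votes)
    ham_votes = len(votes) - spam_votes
    if spam_votes == ham_votes:
        return ABSTAIN
    return SPAM if spam_votes > ham_votes else HAM


print(weak_label("FREE MONEY!!! Click now"))        # 1
print(weak_label("Agenda for tomorrow's meeting"))  # 0
print(weak_label("Hello there"))                    # -1
```

A majority vote treats every function as equally reliable. The point of a learned label model is to weight functions by estimated accuracy instead, which this sketch deliberately leaves out.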
Related approaches include distant supervision (using a knowledge base to auto-label text mentions), self-training (a model labels new examples for itself), and semi-supervised methods that combine a small labeled set with a much larger unlabeled set.
The contrast with unlabeled examples is fundamental to understanding what each branch of machine learning does:
| Concept | Has features | Has label | Used by |
|---|---|---|---|
| Labeled example | Yes | Yes | Supervised learning, evaluation |
| Unlabeled example | Yes | No | Unsupervised learning, inference, semi-supervised learning |
| Partially labeled example | Yes | Some labels missing | Semi-supervised, multi-task learning |
Supervised learning needs labeled examples for training and evaluation. Unsupervised learning works directly on unlabeled examples and tries to find structure such as clusters or low-dimensional manifolds. Semi-supervised learning combines a small labeled set with a much larger unlabeled set, using techniques like pseudo-labeling, consistency regularization, and self-training. Self-supervised learning, the approach behind modern foundation models, sidesteps the labeling problem by inventing prediction tasks from the raw data: predict the next word, predict a masked patch of an image, predict whether two sentences are adjacent.
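The next-word task shows how self-supervised learning manufactures labeled examples from raw data. In this minimal sketch the features are the preceding words and the label is the word that follows. Whitespace tokenization and the fixed context size are simplifying assumptions.

```python
def next_word_examples(text, context_size=3):
    """Turn raw text into (context, next_word) pairs: features and label."""
    tokens = text.split()
    return [
        (tuple(tokens[i - context_size:i]), tokens[i])
        for i in range(context_size, len(tokens))
    ]


pairs = next_word_examples("the cat sat on the mat", context_size=3)
for context, label in pairs:
    print(context, "->", label)
# ('the', 'cat', 'sat') -> on
# ('cat', 'sat', 'on') -> the
# ('sat', 'on', 'the') -> mat
```

No human annotated anything here: every label is simply a piece of the original data that was held back from the input.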
A subtle failure mode is label leakage. This happens when a feature in the training data is actually a downstream consequence of the label, or even a near-duplicate of it. A model trained on such features looks great on the validation set but fails in production, because the leaky feature is not available at prediction time. A 2023 review of leakage in scientific machine learning identified at least 294 affected academic publications across 17 disciplines. Careful definition of what counts as a feature versus what counts as label-adjacent information is a routine part of dataset design.
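One cheap screening heuristic, sketched below, is to flag any feature whose correlation with the label is suspiciously high. This is an illustration, not the method used in the review cited above. A high correlation is only a prompt for manual inspection, and leakage can also hide in features with modest correlation. The loan-default columns, their values, and the 0.9 cutoff are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def suspicious_features(columns, labels, threshold=0.9):
    """Names of features whose absolute correlation with the label
    meets the threshold and therefore deserve a manual look."""
    return [
        name for name, values in columns.items()
        if abs(pearson(values, labels)) >= threshold
    ]


# Hypothetical loan data: label 1 means the borrower defaulted.
defaulted = [0, 0, 1, 0, 1, 1, 0, 1]
columns = {
    "income_thousands": [52, 48, 50, 61, 47, 55, 49, 58],
    # Collection calls happen only AFTER a default, so this column is a
    # downstream consequence of the label and unavailable at prediction time.
    "collections_calls": [0, 0, 3, 0, 4, 2, 0, 5],
}
print(suspicious_features(columns, defaulted))  # ['collections_calls']
```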
Imagine you are learning to sort fruit. Your mom shows you a basket and tells you the name of each piece: "that's an apple, that's an orange, that's a banana." Each fruit is an example. The features are what you see and feel: the color, the shape, the smell. The name your mom tells you is the label. After enough rounds, you can sort new fruits on your own.
Machine learning models learn the same way. A labeled example gives the computer both the inputs (features) and the right answer (label) so it can learn the link between them. An unlabeled example is a fruit with no name attached: the computer can still describe it, but to learn what it is, somebody has to do the labeling first.