Labeled example

Machine Learning

Updated Jun 22, 2026

See also: Machine learning terms

A labeled example is a single data point used to train a machine learning model. It consists of one or more input features paired with the correct answer, called the label. Google's Machine Learning Glossary defines it simply as "an example that contains one or more features and a label," in contrast with an unlabeled example, which "consists of one or more features but no label."^[1] Labeled examples are the building block of supervised learning: a model studies many of them to learn a function mapping features to labels, then predicts labels for new, unseen data.^[2] Producing them is typically the most expensive and time-consuming stage of a machine learning project, which helps explain why the global data labeling market was estimated at roughly $18.6 billion in 2024.^[12]

The simplest mental model is a single row in a spreadsheet: feature values in the left-hand columns and the label in the right-hand column. A collection of labeled examples forms a training set. Strip the label column and you are left with an unlabeled example, which is what the model encounters at inference time.

What is a labeled example in machine learning?

Definition

In machine learning, a labeled example is a data point that consists of one or more input features and a corresponding target value called the label. Google's Machine Learning Glossary defines a labeled example as "an example that contains one or more features and a label," and describes an unlabeled example as one that "consists of one or more features but no label."^[1] Labeled examples are the fuel for supervised learning algorithms, which study many such examples to learn a function that maps features to labels and can then make predictions on new, unseen data.^[2]

A labeled example is often visualized as a single row in a spreadsheet: feature values in the left-hand columns, and the label in the right-hand column. Strip the label column and you are left with an unlabeled example, which is what the model encounters at inference time.

For a house price task, a labeled example might contain features such as square footage, number of bedrooms, year built, and ZIP code, paired with a label of $725,000 (the actual sale price). For a spam filter, a labeled example would be the text of an email along with a label of "spam" or "not spam." The model only sees labels during training and evaluation.
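The two examples above can be written down as a small data structure. This is a minimal sketch, not a standard API: the `Example` class, its field names, and the feature values are all hypothetical, and using `None` to mean "no label" is an assumption made for brevity.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Example:
    """One data point: a feature mapping plus an optional label."""
    features: Dict[str, Any]
    label: Optional[Any] = None  # assumption: None means "unlabeled"

    @property
    def is_labeled(self) -> bool:
        return self.label is not None

    def without_label(self) -> "Example":
        """Return the unlabeled view a model would see at inference time."""
        return Example(features=dict(self.features))


# Hypothetical house-price example (regression): the label is the sale price.
house = Example(
    features={"sqft": 1850, "bedrooms": 3, "year_built": 1998, "zip": "94110"},
    label=725_000,
)

# Hypothetical spam-filter example (classification): the label is a class name.
email = Example(features={"text": "You won a free cruise!"}, label="spam")

# Stripping the label leaves the same features with nothing to learn from.
unlabeled_house = house.without_label()
```

A list of such labeled `Example` objects would play the role of a training set.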

What types of labels are there?

Labels in machine learning fall into two broad families depending on the task.

Categorical labels

Categorical labels, also called class labels, are used in classification tasks where the output is one of a finite set of discrete categories. Common examples include image classification (cat, dog, car), sentiment analysis (positive, negative, neutral), and medical diagnosis (benign or malignant). Categorical labels can be binary or multi-class, and a single example can sometimes carry several labels at once in a multi-label setting (for example, an image tagged with both "beach" and "sunset").

Continuous labels

Continuous labels are used in regression tasks where the goal is to predict a real-valued number. Typical examples include predicting housing prices from location, size, and age; estimating a person's weight from height and age; or forecasting next-day temperature from sensor readings. Unlike categorical labels, continuous labels have a natural ordering and meaningful distances between values.

Some tasks blur the line. Ordinal labels (such as a star rating from 1 to 5) are ordered but take only a small, fixed set of values. Structured prediction tasks output labels that are themselves complex objects: a sequence of part-of-speech tags, a bounding box, a segmentation mask, or a parse tree.
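Before training, these label types are usually turned into numbers. The sketch below shows one common convention, assuming integer class ids for multi-class labels and a multi-hot vector for multi-label tags; the helper names and class lists are illustrative only.

```python
from typing import Dict, List, Sequence, Set


def build_class_index(class_names: Sequence[str]) -> Dict[str, int]:
    """Map each class name to an integer id (sorted so ids are deterministic)."""
    return {name: i for i, name in enumerate(sorted(set(class_names)))}


def encode_multiclass(label: str, index: Dict[str, int]) -> int:
    """Single categorical label -> integer class id."""
    return index[label]


def encode_multilabel(tags: Set[str], index: Dict[str, int]) -> List[int]:
    """Set of tags -> multi-hot vector with one slot per known class."""
    vector = [0] * len(index)
    for tag in tags:
        vector[index[tag]] = 1
    return vector


sentiment_index = build_class_index(["positive", "negative", "neutral"])
tag_index = build_class_index(["beach", "sunset", "city"])

sentiment_id = encode_multiclass("neutral", sentiment_index)
image_tags = encode_multilabel({"beach", "sunset"}, tag_index)
price_label = 725_000.0  # a continuous label is already a number
```

Note that integer ids for categorical labels carry no meaningful order or distance, which is exactly the property that separates them from continuous labels.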

How are labeled examples created?

Building a labeled dataset is usually the most expensive and slowest part of a supervised machine learning project. The work breaks down into three stages.^[13]

Data acquisition

The first stage is collecting raw data relevant to the problem. Sources include sensor logs, web scrapes, internal databases, user-generated content, scientific instruments, and licensed third-party datasets. A persistent concern is whether the data is representative of the situations the model will face in production. A driving model trained only on photos shot in daylight, for example, is unlikely to perform well at night.

Data preprocessing

Before labels are attached, raw data is usually cleaned and normalized: removing duplicates, handling missing values, stripping personally identifiable information, and standardizing units. For images this often means resizing, cropping, and color-space conversion; for text it can mean tokenization, lowercasing, and stripping HTML.
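For text, the cleaning steps described above might look like the following. This is a rough sketch using only the Python standard library; the record layout (`id`, `text`) is hypothetical, and the regex-based tag stripping is a simplification (a real pipeline would more likely use an HTML parser).

```python
import html
import re
from typing import Any, Dict, Iterable, List


def clean_text(raw: str) -> str:
    """Strip HTML tags, unescape entities, collapse whitespace, lowercase."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)  # naive tag removal (assumption)
    unescaped = html.unescape(no_tags)
    return re.sub(r"\s+", " ", unescaped).strip().lower()


def preprocess(records: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop records with missing text and remove duplicates after cleaning."""
    seen = set()
    cleaned = []
    for record in records:
        text = record.get("text")
        if not text:  # missing or empty value
            continue
        normalized = clean_text(text)
        if not normalized or normalized in seen:
            continue
        seen.add(normalized)
        cleaned.append({**record, "text": normalized})
    return cleaned


raw_records = [
    {"id": 1, "text": "<p>Free   Cruise &amp; more</p>"},
    {"id": 2, "text": "free cruise & more"},  # duplicate once normalized
    {"id": 3, "text": None},                  # missing value
    {"id": 4, "text": "Meeting at 10am"},
]
ready_for_labeling = preprocess(raw_records)
```

Deduplicating after normalization, rather than before, catches records that differ only in markup or casing.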

Label assignment

The final stage is attaching a label to each example. Human annotators are the most common source, especially when judgment is required, but labels can also come from logged user behavior (clicks, purchases, ratings), recorded outcomes and instrument readings (such as the actual stock price the next day), or programmatic rules.^[13] Label quality places a ceiling on the quality of any model trained on those labels.

Different tasks call for different annotation formats. For images, annotators may pick a category, draw a bounding box around each object, or produce a pixel-level segmentation mask. For text, they may classify sentiment, mark named entities, or write a reference translation. For speech, they transcribe audio or mark phoneme boundaries. The richer the annotation, the more expensive each labeled example tends to be.

Who labels the data? Crowdsourcing and labeling providers

Large labeled datasets are rarely produced by a single person. Instead, the work is distributed across many annotators, often through online platforms.

Amazon Mechanical Turk (MTurk) is a crowdsourcing marketplace launched by Amazon in 2005. Requesters post small tasks called HITs (Human Intelligence Tasks), and a global pool of workers completes them for small payments.^[11] MTurk became famous as the workhorse behind ImageNet, the dataset built by Stanford computer scientist Fei-Fei Li starting in 2007. Labeling for ImageNet ran from July 2008 through April 2010 and involved roughly 49,000 workers from 167 countries who filtered and labeled more than 160 million candidate images. Each of the roughly 14 million accepted images was reviewed by three independent annotators.^[3] The dataset became the standard benchmark for the deep learning revolution that followed AlexNet in 2012.

Scale AI is a more recent industrial labeling provider. Founded in 2016 by Alexandr Wang and Lucy Guo, it runs a network of human annotators augmented by machine-assisted pre-labeling and consensus-based quality assurance. Scale's main early business was labeling sensor data for autonomous vehicle companies: drawing 3D cuboids around cars and pedestrians in LiDAR point clouds, annotating high-definition maps, and segmenting camera frames. The company has labeled data for Google, Microsoft, Meta, General Motors, and OpenAI. In June 2025, Meta agreed to take a 49% non-voting stake in Scale AI in a deal worth roughly $14.3 billion that valued the startup at about $29 billion, a sign of how strategic labeled data has become to frontier AI development. As part of the deal, Scale CEO Alexandr Wang left to lead Meta's superintelligence effort, and chief strategy officer Jason Droege was named interim CEO.^[4]

Other providers include Appen, Labelbox, Sama, iMerit, and Snorkel AI. The global data labeling market was estimated at roughly $18.6 billion in 2024 and is projected to grow at over 20% annually through 2030, according to Grand View Research.^[12]

What is label noise, and how is label quality measured?

Labeled examples are only as useful as the labels are correct. "Label noise" refers to errors and inconsistencies in the labels themselves, whether from annotator mistakes, ambiguous cases, inconsistent instructions, or genuine class boundaries that are fuzzy.

A 2021 study by researchers at MIT found that ten of the most-cited machine learning benchmarks, including ImageNet and CIFAR-10, had an average label error rate of about 3.4% in their test sets.^[9] Noise in training labels also costs accuracy: in one common benchmark, a CNN trained on CIFAR-10 with clean labels reached around 73.6% accuracy, while the same architecture trained on a version with 30% injected label noise dropped to 64.1%. Random label noise tends to act like a weak regularizer, while systematic noise (where mistakes follow a pattern) is much more damaging, because models readily learn the pattern.

The usual way to measure annotation quality is inter-annotator agreement. Cohen's kappa, introduced by psychologist Jacob Cohen in 1960, measures agreement between two annotators while correcting for the agreement expected by chance.^[10] Fleiss' kappa extends the idea to more than two annotators, and Krippendorff's alpha handles missing data and different measurement scales. High agreement suggests the task is well defined and the annotators are calibrated; low agreement is a sign that the labeling instructions need work, or that the task itself is too ambiguous to label reliably.
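Cohen's kappa is defined as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance from each annotator's label frequencies. A minimal sketch of that calculation follows; the function name and the sample annotations are made up for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must label the same non-empty item list.")
    n = len(rater_a)

    # p_o: fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # p_e: chance agreement implied by each rater's own label distribution.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of ten emails by two annotators.
annotator_1 = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "ham", "spam", "ham"]
annotator_2 = ["spam", "ham", "ham", "ham", "spam", "ham", "spam", "ham", "spam", "ham"]

kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 8 of 10 items (p_o = 0.8), chance agreement is 0.52, and kappa comes out to about 0.58, noticeably lower than the raw agreement rate.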

Common tactics for improving label quality include using multiple annotators per example with majority vote, building a "gold standard" set that annotators are silently retested on, running automated consensus checks, and having experienced reviewers audit a random sample of work.
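Two of these tactics, majority voting and gold-standard checks, are simple enough to sketch directly. The helpers below are illustrative only: the names, the strict-majority threshold, and the sample data are assumptions, not a description of any particular labeling platform.

```python
from collections import Counter
from typing import Dict, Hashable, Optional, Sequence, Tuple


def majority_vote(
    votes: Sequence[Hashable], min_agreement: float = 0.5
) -> Tuple[Optional[Hashable], float]:
    """Return (label, agreement); label is None unless agreement exceeds the threshold."""
    label, top_count = Counter(votes).most_common(1)[0]
    agreement = top_count / len(votes)
    if agreement <= min_agreement:
        return None, agreement  # no clear winner: escalate to a reviewer
    return label, agreement


def gold_accuracy(annotator_labels: Dict[str, Hashable], gold: Dict[str, Hashable]) -> float:
    """Share of gold-standard items the annotator labeled correctly."""
    scored = [item for item in gold if item in annotator_labels]
    if not scored:
        raise ValueError("Annotator has not labeled any gold items.")
    correct = sum(annotator_labels[item] == gold[item] for item in scored)
    return correct / len(scored)


# Three annotators label the same image.
label, agreement = majority_vote(["cat", "cat", "dog"])

# Gold items mixed into one annotator's queue (hypothetical ids).
gold_set = {"img1": "cat", "img2": "dog", "img3": "cat"}
annotator = {"img1": "cat", "img2": "cat", "img3": "cat", "img9": "dog"}
accuracy = gold_accuracy(annotator, gold_set)
```

Returning `None` on ties or weak majorities keeps ambiguous items out of the training set until someone reviews them.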

How can you label fewer examples? Active learning

When labels are expensive, it pays to be careful about which examples you ask humans to label. Active learning is the family of techniques that lets a model choose its own training data by repeatedly picking the unlabeled examples it expects to learn the most from.^[5]

The most common setup is pool-based active learning. The system starts with a small labeled seed set and a large pool of unlabeled examples. A model is trained on the seed set, then ranks every example in the pool by some measure of informativeness. The top-ranked examples are sent to human annotators, their labels are added, the model is retrained, and the loop repeats until the labeling budget runs out or accuracy is sufficient.

Key query strategies include uncertainty sampling (ask about the examples the current model is least confident on), query by committee (ask where an ensemble of models disagrees most), expected model change, and core-set selection (pick examples that cover the input distribution well).^[5] In practice, active learning can reduce the number of labels needed to reach a target accuracy, by as much as 50% to 90% on favorable tasks, though the savings vary widely from task to task. A stream-based variant decides on the spot whether to spend a label on each arriving example, which is useful for labeling rare events in a live data feed.
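The ranking step of pool-based uncertainty sampling can be sketched in a few lines. The code below assumes the current model has already produced class probabilities for each pool example; the example ids, probabilities, and function names are hypothetical, and the training and retraining steps of the loop are left out.

```python
import math
from typing import Callable, Dict, List, Sequence


def least_confidence(probs: Sequence[float]) -> float:
    """Uncertainty as 1 minus the probability of the most likely class."""
    return 1.0 - max(probs)


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of the predicted distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_queries(
    pool_probs: Dict[str, Sequence[float]],
    budget: int,
    score: Callable[[Sequence[float]], float] = least_confidence,
) -> List[str]:
    """Return ids of the `budget` pool examples the model is least sure about."""
    ranked = sorted(pool_probs, key=lambda ex_id: score(pool_probs[ex_id]), reverse=True)
    return ranked[:budget]


# Hypothetical model outputs (P(not spam), P(spam)) for four unlabeled emails.
pool_probs = {
    "email_1": [0.98, 0.02],
    "email_2": [0.55, 0.45],
    "email_3": [0.70, 0.30],
    "email_4": [0.51, 0.49],
}

to_label = select_queries(pool_probs, budget=2)
```

In a full loop, the selected ids would go to annotators, their labels would be added to the training set, and the model would be retrained before the next round of ranking.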

Can labels be generated without humans? Programmatic labeling and weak supervision

Another way to attack the labeling bottleneck is to skip individual human annotation almost entirely. Weak supervision generates labels with noisy automated sources: heuristic rules, distant supervision from existing knowledge bases, crowdsourced labels of varying quality, or outputs from other models.^[8]

Snorkel is the best-known system in this family. Developed at the Stanford AI Lab and first described in a 2017 VLDB paper by Alexander Ratner and colleagues, Snorkel asks domain experts to write "labeling functions," short pieces of code that programmatically label data points or abstain.^[7] Each function on its own is noisy and incomplete, but Snorkel's generative model learns how reliable each is, denoises their outputs, and produces probabilistic training labels. In one published user study, experts using Snorkel built models 2.8 times faster and achieved 45.5% higher predictive performance on average, compared with seven hours of hand labeling.^[7] Snorkel DryBell, a follow-up case study at Google, used the same approach to train production text classifiers without manually labeled training data.^[8]
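To make the idea of labeling functions concrete, here is a plain-Python sketch that does not use the Snorkel library or its API. The three heuristics are invented for illustration, and their votes are combined with a simple majority vote, which is a much cruder stand-in for the generative model described above.

```python
from collections import Counter
from typing import Callable, List

SPAM, HAM, ABSTAIN = 1, 0, -1  # label constants chosen for this sketch


def lf_contains_free(text: str) -> int:
    """Heuristic: the word 'free' hints at spam."""
    return SPAM if "free" in text.lower() else ABSTAIN


def lf_has_url(text: str) -> int:
    """Heuristic: a link hints at spam."""
    return SPAM if "http://" in text or "https://" in text else ABSTAIN


def lf_greeting(text: str) -> int:
    """Heuristic: a personal greeting hints at legitimate mail."""
    return HAM if text.lower().startswith(("hi ", "hello ")) else ABSTAIN


LABELING_FUNCTIONS: List[Callable[[str], int]] = [lf_contains_free, lf_has_url, lf_greeting]


def weak_label(text: str) -> int:
    """Combine non-abstaining votes by majority; abstain on ties or no votes."""
    votes = [v for v in (lf(text) for lf in LABELING_FUNCTIONS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting evidence with no majority
    return ranked[0][0]


label_a = weak_label("Claim your FREE prize at http://example.com")
label_b = weak_label("Hi team, notes from today are attached")
label_c = weak_label("Quarterly report")
```

Each function is cheap to write and individually unreliable; the value comes from combining many of them over a large unlabeled pool.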

Related approaches include distant supervision (using a knowledge base to auto-label text mentions), self-training (a model labels new examples for itself), and semi-supervised methods that combine a small labeled set with a much larger unlabeled set.

How does a labeled example differ from an unlabeled example?

The contrast with unlabeled examples is fundamental to understanding what each branch of machine learning does:

Concept	Has features	Has label	Used by
Labeled example	Yes	Yes	Supervised learning, evaluation
Unlabeled example	Yes	No	Unsupervised learning, inference, semi-supervised learning
Partially labeled example	Yes	Some labels missing	Semi-supervised, multi-task learning

Supervised learning needs labeled examples for training and evaluation. Unsupervised learning works directly on unlabeled examples and tries to find structure such as clusters or low-dimensional manifolds. Semi-supervised learning combines a small labeled set with a much larger unlabeled set, using techniques like pseudo-labeling, consistency regularization, and self-training. Self-supervised learning, the approach behind modern foundation models, sidesteps the labeling problem by inventing prediction tasks from the raw data: predict the next word, predict a masked patch of an image, predict whether two sentences are adjacent.
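The "predict the next word" task shows how self-supervised learning manufactures labeled examples from raw data. In this toy sketch the function name is hypothetical and whitespace splitting stands in for a real tokenizer; each example uses a fixed window of preceding words as features and the following word as the label.

```python
from typing import List, Tuple


def next_word_examples(text: str, context_size: int = 3) -> List[Tuple[List[str], str]]:
    """Turn raw text into (context words, next word) pairs; no annotator needed."""
    tokens = text.split()  # assumption: whitespace tokenization for simplicity
    examples = []
    for i in range(context_size, len(tokens)):
        features = tokens[i - context_size:i]  # the preceding words
        label = tokens[i]                      # the word to predict
        examples.append((features, label))
    return examples


pairs = next_word_examples("the cat sat on the mat", context_size=3)
```

The six-word sentence yields three labeled examples, such as the context `["the", "cat", "sat"]` paired with the label `"on"`.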

What is label leakage?

A subtle failure mode is label leakage. This happens when a feature in the training data is actually a downstream consequence of the label, or even a near-duplicate of it. A model trained on such features looks great on the validation set but fails in production, because the leaky feature is not available at prediction time.^[6] A 2023 review of leakage in scientific machine learning identified at least 294 affected academic publications across 17 disciplines.^[6] Careful definition of what counts as a feature versus what counts as label-adjacent information is a routine part of dataset design.
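One crude screening heuristic, offered here as an illustrative sketch rather than an established method, is to flag any feature whose every value maps to a single label in the training data. The function name, the loan-default rows, and the `collections_call` feature are all hypothetical, and a flagged feature still needs human review because it may simply be a legitimately strong predictor.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence


def flag_label_proxies(rows: Sequence[Dict[str, Any]], label_key: str) -> List[str]:
    """Flag features whose every value is associated with exactly one label."""
    flagged = []
    for name in rows[0]:
        if name == label_key:
            continue
        labels_by_value = defaultdict(set)
        for row in rows:
            labels_by_value[row[name]].add(row[label_key])
        distinct = len(labels_by_value)
        is_pure = all(len(labels) == 1 for labels in labels_by_value.values())
        # Require repeated values so unique IDs are not flagged just for being unique.
        if is_pure and 1 < distinct < len(rows):
            flagged.append(name)
    return flagged


# Hypothetical loan data: a collections call only happens after a default,
# so the feature is a consequence of the label, not a cause of it.
loans = [
    {"income": 40, "collections_call": True, "defaulted": True},
    {"income": 40, "collections_call": False, "defaulted": False},
    {"income": 85, "collections_call": False, "defaulted": False},
    {"income": 85, "collections_call": True, "defaulted": True},
    {"income": 60, "collections_call": False, "defaulted": False},
]

suspects = flag_label_proxies(loans, label_key="defaulted")
```

A check like this only catches the most blatant cases; asking whether each feature would actually be known at prediction time remains the more reliable test.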

Explain like I'm 5 (ELI5)

Imagine you are learning to sort fruit. Your mom shows you a basket and tells you the name of each piece: "that's an apple, that's an orange, that's a banana." Each fruit is an example. The features are what you see and feel: the color, the shape, the smell. The name your mom tells you is the label. After enough rounds, you can sort new fruits on your own.

Machine learning models learn the same way. A labeled example gives the computer both the inputs (features) and the right answer (label) so it can learn the link between them. An unlabeled example is a fruit with no name attached: the computer can still describe it, but to learn what it is, somebody has to do the labeling first.

References

1. Google Developers. "Machine Learning Glossary: ML Fundamentals." https://developers.google.com/machine-learning/glossary/fundamentals
2. Google Developers. "Supervised Learning." Introduction to Machine Learning. https://developers.google.com/machine-learning/intro-to-ml/supervised
3. Wikipedia. "ImageNet." https://en.wikipedia.org/wiki/ImageNet
4. Wikipedia. "Scale AI." https://en.wikipedia.org/wiki/Scale_AI
5. Wikipedia. "Active learning (machine learning)." https://en.wikipedia.org/wiki/Active_learning_(machine_learning)
6. Wikipedia. "Leakage (machine learning)." https://en.wikipedia.org/wiki/Leakage_(machine_learning)
7. Ratner, A., Bach, S. H., Ehrenberg, H., Fries, J., Wu, S., and Ré, C. "Snorkel: Rapid Training Data Creation with Weak Supervision." The VLDB Journal, 2019. https://link.springer.com/article/10.1007/s00778-019-00552-1
8. Stanford AI Lab Blog. "Weak Supervision: A New Programming Paradigm for Machine Learning." https://ai.stanford.edu/blog/weak-supervision/
9. Northcutt, C. G., Athalye, A., and Mueller, J. "Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks." NeurIPS 2021 Datasets and Benchmarks Track.
10. Cohen, J. "A Coefficient of Agreement for Nominal Scales." Educational and Psychological Measurement, 1960.
11. Amazon Mechanical Turk. https://www.mturk.com/
12. Grand View Research. "Data Collection And Labeling Market Size Report, 2030." https://www.grandviewresearch.com/industry-analysis/data-collection-labeling-market
13. AWS. "What is Data Labeling?" https://aws.amazon.com/what-is/data-labeling/
