Unlabeled example

Updated Jun 23, 2026

See also: Machine learning terms

What is an unlabeled example?

An unlabeled example is a data instance that has one or more features but no label, meaning it carries the inputs a model reads but not the target answer the model is meant to produce. Google's Machine Learning Glossary defines it as "an example that contains one or more features but no label," and contrasts it with a labeled example, which adds the target answer.^[1]

In supervised learning, labeled examples train the model and unlabeled examples are used during inference, since the label is what the trained model needs to produce.^[2] In semi-supervised learning, unsupervised learning, and self-supervised learning, unlabeled examples take a bigger role: they become part of training itself. Unlabeled examples are the raw material that makes large language models and modern self-supervised vision systems possible, because they are abundant and cheap while labels are scarce and costly.

Unlabeled data is the default state of most information in the world. A photo on a phone, a paragraph scraped from the web, a click stream from a web server: none of these arrives with a target attached. Labels are added by people (or sometimes by other models), and that process is slow and expensive. The point of working with unlabeled examples is to avoid paying for labels you do not strictly need.

What does an unlabeled example look like?

A labeled training example for a weather model might include three features and a label such as next-day temperature. An unlabeled example drops the label column and keeps only the features:

Temperature	Precipitation	Humidity
25	9	12
20	54	32
31	0	87

Each row is a single unlabeled example. A clustering algorithm could group these rows by similarity without being told which weather pattern they belong to. A semi-supervised model could combine these rows with a smaller set of labeled examples to fit a better predictor than either piece of data could produce alone.
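As a rough illustration, the rows above can be held in a small Python structure where the label field is simply empty. The class name `Example`, its field names, and the label value `27.0` are assumptions made for this sketch; they do not come from the glossary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Example:
    """One row of weather data; `label` stays None for an unlabeled example."""
    temperature: float
    precipitation: float
    humidity: float
    label: Optional[float] = None  # hypothetical target, e.g. next-day temperature

    @property
    def is_labeled(self) -> bool:
        return self.label is not None


# The three unlabeled rows from the table above: features only.
unlabeled = [Example(25, 9, 12), Example(20, 54, 32), Example(31, 0, 87)]

# The same features with a target attached. The value 27.0 is invented
# purely for illustration.
labeled = Example(25, 9, 12, label=27.0)

print([e.is_labeled for e in unlabeled], labeled.is_labeled)
```

The only structural difference between the two kinds of example is whether the target field is filled in.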

How does an unlabeled example differ from a labeled example?

Property	Labeled example	Unlabeled example
Contains features	Yes	Yes
Contains a label	Yes	No
Typical role	Supervised training	Inference, clustering, pretraining, semi-supervised training
Cost to obtain	Higher (human annotation)	Lower (raw collection is often cheap)
Typical volume	Smaller	Much larger
Used in self-supervised learning	Not directly	Yes, with labels derived from the input itself

A label can be a class ("cat"), a continuous number (a house price), a sequence (a translation), or a structured object (a bounding box). What turns a labeled example into an unlabeled one is just the absence of that target signal at training time.

Why do unlabeled examples matter?

Unlabeled data is usually abundant and cheap. Labeled data is usually scarce and expensive. A web crawl can hand you trillions of tokens of raw text for the cost of the bandwidth. Asking annotators to write a clean reference summary for each of those documents would not be remotely affordable, and for most documents nobody would agree on a single "correct" summary anyway.

This imbalance shapes modern machine learning. A widely cited statement of the idea is Yann LeCun's "cake" analogy, delivered at NIPS 2016: "If intelligence is a cake, the bulk of the cake is unsupervised learning, the icing on the cake is supervised learning, and the cherry on the cake is reinforcement learning." LeCun later revised "unsupervised" to "self-supervised" at the 2019 ISSCC conference, but the point was the same: the bulk of what a system learns has to come from unlabeled data, because that is where almost all of the data is.^[12]

Large language models work because the internet supplies a nearly endless stream of unlabeled text. Modern vision systems generalize well because self-supervised methods can squeeze structure out of unlabeled images before any human opens a labeling tool. Active, semi-supervised, and self-supervised learning all start from the same observation: there is far more unlabeled data than labeled data, so any technique that puts it to work is worth using.

How are unlabeled examples used?

What is unsupervised learning?

Unsupervised learning trains a model on unlabeled examples alone. The goal is to find structure in the data, not to predict a specific label. The two most common families are clustering and dimensionality reduction.^[9]

Clustering groups similar examples. K-means partitions a dataset into a fixed number of clusters by assigning each example to the nearest centroid and recomputing centroids until they stabilize. Hierarchical clustering builds a tree of nested clusters by merging similar points or by splitting a single cluster. DBSCAN and Gaussian mixture models are common alternatives. Clustering shows up in customer segmentation, document organization, anomaly detection, and exploratory analysis.^[9]
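The assign-then-recompute loop described for k-means can be sketched in plain Python. This is a minimal version for illustration, not a production implementation: the function name `kmeans`, the random initialization from the data points, and the toy `points` list are all assumptions of this sketch.

```python
import math
import random


def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means: assign to nearest centroid, recompute, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignments = None
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for every point.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing, so centroids are stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster emptied out
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids


# Toy unlabeled data: two well-separated groups, no labels anywhere.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cluster_ids, centers = kmeans(points, k=2)
print(cluster_ids)
```

The cluster ids are arbitrary integers; which group is called 0 and which is called 1 depends on the initialization.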

Dimensionality reduction compresses examples into a smaller number of features while keeping as much information as possible. Principal component analysis (PCA) finds the directions of maximum variance and projects the data onto them. t-SNE and UMAP are used for visualization. Autoencoders learn a compressed representation by trying to reconstruct their input through a bottleneck layer, and the compressed code can be reused as a feature vector for downstream models.
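As a sketch of the PCA step, the directions of maximum variance can be read off a singular value decomposition of the centered data. The example assumes NumPy is available; the function name `pca` and the toy matrix `X` are hypothetical.

```python
import numpy as np


def pca(X, n_components):
    """Project X onto its top principal components using an SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = singular_values[:n_components] ** 2 / (len(X) - 1)
    projected = X_centered @ components.T
    return projected, components, explained_variance


# Toy unlabeled data lying on the line y = 2x: one direction carries all variance.
X = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
projected, components, explained_variance = pca(X, n_components=1)
print(projected.shape, components[0], explained_variance)
```

The sign of a principal direction is arbitrary, so the component may point along either (1, 2) or (-1, -2) after normalization.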

What is semi-supervised learning?

Semi-supervised learning sits between supervised and unsupervised learning. It uses a small set of labeled examples together with a much larger set of unlabeled examples. The Wikipedia article on weak supervision describes it as a setting where models combine information from both sources to surpass what supervised learning could do on the labeled portion alone.^[6]

The most common semi-supervised technique is pseudo-labeling, also called self-training. It was formalized for deep networks by Dong-Hyun Lee in a 2013 ICML workshop paper, which described it as "the simple and efficient semi-supervised learning method for deep neural networks."^[13] The workflow is straightforward. First, train a model on the labeled data. Second, run it on the unlabeled examples to produce predictions and confidence scores. Third, keep the predictions where the model is very confident (a probability threshold of 0.95 is common) and treat those predictions as if they were true labels. Fourth, retrain on the combined set. Repeat as needed.^[8] The risk is confirmation bias: if the first model is wrong, the pseudo-labels reinforce its mistakes.^[10]
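The four steps above can be sketched with a deliberately simple stand-in model. A nearest-centroid classifier with softmax confidences replaces the deep network from Lee's paper, which is an assumption of this example and not the method described there; the function names and the toy data are also hypothetical. NumPy is assumed to be available.

```python
import numpy as np


def fit_centroids(X, y, n_classes):
    """Stand-in 'model': one mean vector per class."""
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])


def predict_proba(centroids, X):
    """Softmax over negative squared distances to each class centroid."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


def pseudo_label(X_lab, y_lab, X_unlab, n_classes, threshold=0.95, rounds=3):
    centroids = fit_centroids(X_lab, y_lab, n_classes)            # step 1: train
    for _ in range(rounds):
        proba = predict_proba(centroids, X_unlab)                 # step 2: predict
        confident = proba.max(axis=1) >= threshold                # step 3: filter
        X_train = np.vstack([X_lab, X_unlab[confident]])
        y_train = np.concatenate([y_lab, proba.argmax(axis=1)[confident]])
        centroids = fit_centroids(X_train, y_train, n_classes)    # step 4: retrain
    return centroids, confident


X_lab = np.array([[0.0, 0.0], [4.0, 4.0]])
y_lab = np.array([0, 1])
X_unlab = np.array([[0.5, 0.2], [3.8, 4.1], [2.0, 2.0]])
centroids, confident = pseudo_label(X_lab, y_lab, X_unlab, n_classes=2)
print(confident)
```

The point halfway between the two classes never clears the threshold, so it is left out of training. Note that nothing in the loop checks whether a confident prediction is actually correct, which is where the confirmation bias comes from.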

More recent methods add consistency regularization on top of pseudo-labeling.^[10] FixMatch (Sohn et al., 2020) generates a pseudo-label from a weakly augmented unlabeled image, keeps it only if the model is confident, and then trains the model to produce the same label on a heavily augmented version. With this approach FixMatch reached 94.93% accuracy on CIFAR-10 with only 250 labels, and 88.61% with just 40 labels (4 labels per class), close to fully supervised training while using a tiny fraction of the labels.^[14] MixMatch combines consistency, entropy minimization, and the MixUp interpolation trick into one loss that works on labeled and unlabeled examples at once.^[10]
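A rough sketch of the unlabeled part of the FixMatch objective follows, written with NumPy on precomputed logits so that it runs without a deep learning framework. The function names and toy logits are assumptions; a real implementation would compute this inside an autodiff framework and block gradients through the pseudo-label.

```python
import numpy as np


def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def fixmatch_unlabeled_loss(logits_weak, logits_strong, threshold=0.95):
    """Cross-entropy on strong views against confident weak-view pseudo-labels."""
    probs_weak = softmax(logits_weak)
    pseudo_labels = probs_weak.argmax(axis=1)
    mask = probs_weak.max(axis=1) >= threshold  # keep only confident examples
    log_probs_strong = np.log(softmax(logits_strong))
    ce = -log_probs_strong[np.arange(len(pseudo_labels)), pseudo_labels]
    # Average over the whole unlabeled batch; masked-out examples contribute zero.
    return (ce * mask).mean(), mask


# Hypothetical logits for two unlabeled images and three classes.
logits_weak = np.array([[10.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
logits_strong = np.array([[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
loss, mask = fixmatch_unlabeled_loss(logits_weak, logits_strong)
print(loss, mask)
```

The second image has a uniform prediction under weak augmentation, so it is masked out and only the first image contributes to the loss.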

Classic semi-supervised approaches predate these. Label propagation and label spreading treat the dataset as a graph and propagate labels from labeled nodes to unlabeled neighbors. Scikit-learn ships both in its sklearn.semi_supervised module.^[11]
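A minimal usage sketch of scikit-learn's `LabelPropagation` is shown below, assuming scikit-learn and NumPy are installed. In this module, unlabeled examples are marked with the label `-1`. The toy points and the `gamma` value are chosen for illustration only.

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Two tight groups of points; only one example per group has a label.
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
])
y = np.array([0, -1, -1, -1, 1, -1, -1, -1])  # -1 marks an unlabeled example

model = LabelPropagation(kernel="rbf", gamma=20)
model.fit(X, y)

# transduction_ holds the label inferred for every training point.
print(model.transduction_)
```

`LabelSpreading` in the same module is used the same way; it differs in how the graph is normalized and in how strongly the original labels are clamped.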

What is self-supervised learning?

Self-supervised learning is the technique that has done the most to make unlabeled examples valuable. The idea is to invent a pretext task that turns the unlabeled data into its own supervisor: hide part of the input, then train the model to predict the hidden part. Because the label comes from the data itself, no human annotation is needed, and the dataset can scale to the entire internet if that is what is available.

SimCLR (Chen et al., 2020) is a contrastive method for images. It generates two augmented views of the same image and trains the model to pull their representations together while pushing apart representations of different images. The paper reported 76.5% top-1 accuracy on ImageNet with a linear classifier on self-supervised features, a 7% relative improvement over the previous state of the art and a match for a supervised ResNet-50.^[3]
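The pull-together, push-apart objective can be sketched as a normalized, temperature-scaled cross-entropy over a batch of paired views. The NumPy version below works on precomputed embeddings; the function name `nt_xent_loss`, the temperature of 0.5, and the toy vectors are assumptions for illustration, and the encoder and augmentations are omitted.

```python
import numpy as np


def nt_xent_loss(z1, z2, temperature=0.5):
    """Contrastive loss over n pairs; z1[i] and z2[i] are two views of image i."""
    n = len(z1)
    z = np.concatenate([z1, z2])
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)  # a view is never compared with itself
    # Row i's positive is the other view of the same image.
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    log_denominator = np.log(np.exp(sim).sum(axis=1))
    return (log_denominator - sim[np.arange(2 * n), positives]).mean()


z1 = np.array([[1.0, 0.0], [0.0, 1.0]])
z2_aligned = np.array([[1.0, 0.1], [0.1, 1.0]])   # views agree with their pairs
z2_shuffled = np.array([[0.1, 1.0], [1.0, 0.1]])  # views agree with the wrong image

print(nt_xent_loss(z1, z2_aligned), nt_xent_loss(z1, z2_shuffled))
```

The loss is lower when each embedding sits close to the other view of the same image and far from the views of other images.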

BYOL (Bootstrap Your Own Latent, Grill et al., 2020) showed that self-supervised representation learning does not need the negative pairs that contrastive methods rely on. An online network tries to predict a target network's representation of a different augmented view of the same image; the target network is a slow-moving average of the online network. BYOL reached 74.3% top-1 accuracy on ImageNet with a ResNet-50 and 79.6% with a larger backbone, under the linear evaluation protocol.^[4]
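The "slow-moving average" can be written as an exponential moving average over parameters. The sketch below uses plain Python dictionaries of floats in place of real network weights; the function name `ema_update`, the decay `tau=0.99`, and the single parameter `w` are assumptions for illustration. It covers only the target update, not the predictor or the loss.

```python
def ema_update(target, online, tau=0.99):
    """Move each target parameter a small step toward the online parameter."""
    return {
        name: tau * target[name] + (1.0 - tau) * online[name]
        for name in target
    }


# Hypothetical single-parameter "networks".
target_params = {"w": 0.0}
online_params = {"w": 1.0}

for step in range(100):
    target_params = ema_update(target_params, online_params)

# After 100 steps the target has closed 1 - 0.99**100 of the gap.
print(target_params["w"])
```

A decay close to 1 makes the target change slowly, which gives the online network a stable prediction target.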

Masked autoencoders (MAE, He et al., 2022) brought masking back to vision. The model masks a high fraction of image patches (typically 75%) and trains a Vision Transformer to reconstruct the missing pixels. Because the encoder only sees the visible 25% of patches, training is fast enough to run for 1600 epochs while using less total compute than contrastive methods trained for fewer epochs. The paper reported 87.8% top-1 accuracy on ImageNet using only ImageNet-1K data with a ViT-Huge backbone.^[5]
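The masking step can be sketched as a random shuffle of patch indices, keeping the first quarter as visible. The NumPy example below is an illustration under assumptions: the function name `random_masking`, the 196-patch grid, and the 16-dimensional patch vectors are hypothetical, and the encoder and decoder are omitted.

```python
import numpy as np


def random_masking(patches, mask_ratio=0.75, seed=0):
    """Split patches into a visible subset and the indices to reconstruct."""
    rng = np.random.default_rng(seed)
    num_patches = patches.shape[0]
    num_keep = int(num_patches * (1 - mask_ratio))
    shuffled = rng.permutation(num_patches)
    keep_idx = np.sort(shuffled[:num_keep])    # patches the encoder sees
    masked_idx = np.sort(shuffled[num_keep:])  # patches the decoder must rebuild
    return patches[keep_idx], keep_idx, masked_idx


# Hypothetical image split into a 14 x 14 grid of patches, 16 values each.
patches = np.zeros((196, 16))
visible, keep_idx, masked_idx = random_masking(patches)
print(visible.shape, len(masked_idx))
```

Only the visible patches are passed to the encoder, which is where the compute saving described above comes from.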

In natural language processing, the same family of ideas shows up as masked language modeling and next-token prediction. BERT (Devlin et al., 2018) masks about 15% of the tokens in a sentence and predicts the originals from the surrounding context; of those masked positions, 80% are replaced with a [MASK] token, 10% with a random token, and 10% are left unchanged.^[7] Each raw sentence is a free training example: the "label" for every selected position is just the token that was originally there. Causal language models like GPT train on next-token prediction instead, where the model sees previous tokens and tries to predict the next one. Every position in every document becomes a training example, which is how models can be trained on trillions of tokens of unlabeled text drawn from books, web pages, and code.
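Both ways of deriving targets from raw text can be sketched in a few lines of plain Python. The function names, the tiny vocabulary, and the whitespace tokenization are assumptions for illustration. Selecting each position independently with probability 0.15 only approximates choosing 15% of the positions.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, targets); targets[i] is None where nothing is predicted."""
    rng = random.Random(seed)
    inputs, targets = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue
        targets[i] = token  # the "label" is the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return inputs, targets


def next_token_pairs(tokens):
    """Every prefix becomes an input and the token after it becomes the target."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = "the cat sat on the mat and looked at the dog".split()
vocab = sorted(set(tokens))
inputs, targets = mask_tokens(tokens, vocab)
pairs = next_token_pairs(["the", "cat", "sat"])
print(inputs, targets, pairs, sep="\n")
```

In both cases the targets are copied from the text itself, so no annotator is involved at any point.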

LLM pretraining is now the most economically important application of unlabeled examples. The dataset is unlabeled in the human-annotation sense, but the data structure itself supplies a target for every token. That is why next-token prediction is called a self-supervised objective rather than an unsupervised one.

How does active learning choose which examples to label?

Active learning treats unlabeled examples as candidates for human labeling. The model has access to a large pool of unlabeled data and a small budget for asking an annotator to label a few of them. As Burr Settles put it in his widely cited 2009 active learning survey, "a machine learning algorithm can achieve greater accuracy with fewer training labels if it is allowed to choose the data from which it learns."^[15] The interesting question is which examples to ask about.

Pool-based active learning is the most common setup. The model scores every unlabeled example with a query strategy, picks the highest-scoring ones, sends them to a human labeler, adds the new labels to the training set, and retrains. Common strategies include uncertainty sampling (pick examples the model is least confident about), query-by-committee (train an ensemble and pick examples where its members disagree most), and diversity-based methods that try to cover the input space rather than cluster queries near similar uncertain points.^[15] Active learning can reach the accuracy of a model trained on a fully labeled pool while labeling only a fraction of it, but it only works if you have a real pool of unlabeled examples to choose from in the first place.
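Uncertainty sampling, the simplest of these strategies, can be sketched as a ranking over the model's predicted probabilities. The function name `least_confident` and the probabilities below are hypothetical; in practice they would come from the current model's predictions on the unlabeled pool.

```python
def least_confident(probabilities, budget):
    """Return indices of the `budget` examples with the lowest top-class probability."""
    # Score = 1 - top class probability; a higher score means more uncertain.
    scores = [1.0 - max(p) for p in probabilities]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]


# Hypothetical class probabilities for four unlabeled examples.
pool_probs = [
    [0.98, 0.02],
    [0.55, 0.45],
    [0.70, 0.30],
    [0.51, 0.49],
]
to_label = least_confident(pool_probs, budget=2)
print(to_label)  # → [3, 1]
```

The selected indices are the examples that would be sent to the annotator; after labeling they move from the pool to the training set and the model is retrained.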

Where do unlabeled examples come from?

Source	Typical use
Web text and web pages	Pretraining LLMs, search
User images, video, audio	Self-supervised vision and audio models
Server and application logs	Anomaly detection, behavior modeling
Sensor data (LiDAR, IMU, cameras)	Robotics, autonomous driving
Books, scientific papers, code	LLM pretraining and retrieval
Synthetic data from simulators or models	Augmentation, robotics, rare-event coverage

None of these come with labels for any specific downstream task. The same image collection can be repurposed for self-supervised pretraining, clustering, retrieval, or, after partial labeling, for supervised classification.

What are the common pitfalls?

Using unlabeled examples is not free of risk. A few patterns come up often.

Distribution mismatch. If the unlabeled pool is drawn from a different distribution than the labeled set, semi-supervised methods can hurt performance instead of helping.^[8]
Confirmation bias in pseudo-labeling. A weak base model produces wrong pseudo-labels that get reinforced on retraining. High confidence thresholds and ensembles help, but the problem does not go away.^[10]
Pretext task mismatch. A self-supervised task can learn features that look strong on the pretext loss but transfer poorly to the downstream task. Evaluation usually requires a linear probe or fine-tuning on a known benchmark.
Privacy and consent. Unlabeled does not mean anonymous. Photos, messages, and logs may still contain personal information.

Relationship to other terms

An example is a single row of data, with or without a label.
A labeled example is an example with a label.
Unlabeled data is a collection of unlabeled examples.
Labeled data is a collection of labeled examples.
A training set can contain only labeled examples, only unlabeled examples, or a mix.

Explain like I'm 5 (ELI5)

Imagine a giant stack of photos. On a few of them, a teacher has written what the photo shows on the back: "dog," "cat," "bicycle." Those are labeled examples. On most of the photos, the back is blank. Those are unlabeled examples.

If you only look at the photos with writing, you will run out fast because the teacher only had time to write on a few. The blank photos are still useful. You can sort them into piles based on what looks similar, even without knowing the right names. You can also play a game with yourself: hide part of a photo, then guess what was hidden. Each time you play, you get a little better at understanding photos in general. Later, when the teacher labels a few more, you only need a handful to start getting the names right.

Unlabeled examples are the blank-backed photos. There are a lot of them, they are easy to collect, and the trick is to find clever ways to learn from them.

References

1. Google for Developers. "Machine Learning Glossary." https://developers.google.com/machine-learning/glossary
2. Google for Developers. "Machine Learning Glossary: ML Fundamentals." https://developers.google.com/machine-learning/glossary/fundamentals
3. Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. "A Simple Framework for Contrastive Learning of Visual Representations" (SimCLR), 2020. https://arxiv.org/abs/2002.05709
4. Grill, J.-B., et al. "Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning" (BYOL), 2020. https://arxiv.org/abs/2006.07733
5. He, K., Chen, X., Xie, S., Li, Y., Dollar, P., and Girshick, R. "Masked Autoencoders Are Scalable Vision Learners" (MAE), CVPR 2022. https://arxiv.org/abs/2111.06377
6. Wikipedia. "Weak supervision." https://en.wikipedia.org/wiki/Weak_supervision
7. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," 2018. https://arxiv.org/abs/1810.04805
8. IBM. "What Is Semi-Supervised Learning?" https://www.ibm.com/think/topics/semi-supervised-learning
9. IBM. "What Is Unsupervised Learning?" https://www.ibm.com/think/topics/unsupervised-learning
10. Weng, L. "Learning with not Enough Data Part 1: Semi-Supervised Learning," 2021. https://lilianweng.github.io/posts/2021-12-05-semi-supervised/
11. scikit-learn. "1.14. Semi-supervised learning." https://scikit-learn.org/stable/modules/semi_supervised.html
12. LeCun, Y. "Predictive Learning," NIPS 2016 keynote (cake analogy; revised to self-supervised learning at ISSCC 2019). https://syncedreview.com/2019/02/22/yann-lecun-cake-analogy-2-0/
13. Lee, D.-H. "Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks," ICML 2013 Workshop on Challenges in Representation Learning. https://www.semanticscholar.org/paper/Pseudo-Label-:-The-Simple-and-Efficient-Learning-Lee/798d9840d2439a0e5d47bcf5d164aa46d5e7dc26
14. Sohn, K., et al. "FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence," NeurIPS 2020. https://arxiv.org/abs/2001.07685
15. Settles, B. "Active Learning Literature Survey," Computer Sciences Technical Report 1648, University of Wisconsin-Madison, 2009. https://burrsettles.com/pub/settles.activelearning.pdf
