Unlabeled example
Last reviewed
May 11, 2026
Sources
11 citations
Review status
Source-backed
Revision
v2 · 2,193 words
See also: Machine learning terms
An unlabeled example is a data point that contains one or more features but no label. Google's Machine Learning Glossary defines it as "an example consisting of one or more features but no label," and contrasts it with a labeled example, which carries the target answer.
In supervised learning, labeled examples train the model and unlabeled examples are used during inference, since the label is what the trained model needs to produce. In semi-supervised learning, unsupervised learning, and self-supervised learning, unlabeled examples take a bigger role: they become part of training itself.
Unlabeled data is the default state of most information in the world. A photo on a phone, a paragraph scraped from the web, a click stream from a web server: none of these arrives with a target attached. Labels are added by people (or sometimes by other models), and that process is slow and expensive. The point of working with unlabeled examples is to avoid paying for labels you do not strictly need.
A labeled training example for a weather model might include three features and a label such as next-day temperature. An unlabeled example drops the label column and keeps only the features:
| Temperature | Precipitation | Humidity |
|---|---|---|
| 25 | 9 | 12 |
| 20 | 54 | 32 |
| 31 | 0 | 87 |
Each row is a single unlabeled example. A clustering algorithm could group these rows by similarity without being told which weather pattern they belong to. A semi-supervised model could combine these rows with a smaller set of labeled examples to fit a better predictor than either piece of data could produce alone.
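As a rough illustration, the difference between the two kinds of example can be shown in a few lines of Python. The feature values come from the table above; the `next_day_temperature` values and the helper name `strip_label` are made up for this sketch.

```python
# Hypothetical labeled rows: the three features from the table plus an
# invented next-day temperature label (the label values are illustrative only).
LABEL_KEY = "next_day_temperature"

labeled_rows = [
    {"temperature": 25, "precipitation": 9, "humidity": 12, LABEL_KEY: 27},
    {"temperature": 20, "precipitation": 54, "humidity": 32, LABEL_KEY: 18},
    {"temperature": 31, "precipitation": 0, "humidity": 87, LABEL_KEY: 30},
]


def strip_label(example, label_key=LABEL_KEY):
    """Return a copy of the example that keeps the features and drops the label."""
    return {key: value for key, value in example.items() if key != label_key}


# Each resulting dict is one unlabeled example: features only.
unlabeled_rows = [strip_label(row) for row in labeled_rows]
```

Nothing about the features changes; the only difference is that the target column is absent.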
| Property | Labeled example | Unlabeled example |
|---|---|---|
| Contains features | Yes | Yes |
| Contains a label | Yes | No |
| Typical role | Supervised training | Inference, clustering, pretraining, semi-supervised training |
| Cost to obtain | Higher (human annotation) | Lower (raw collection is often cheap) |
| Typical volume | Smaller | Much larger |
| Used in self-supervised learning | Not directly | Yes, with labels derived from the input itself |
A label can be a class ("cat"), a continuous number (a house price), a sequence (a translation), or a structured object (a bounding box). What turns a labeled example into an unlabeled one is just the absence of that target signal at training time.
Unlabeled data is usually abundant and cheap. Labeled data is usually scarce and expensive. A web crawl can hand you trillions of tokens of raw text for the cost of the bandwidth. Asking annotators to write a clean reference summary for each of those documents would not be remotely affordable, and for most documents nobody would agree on a single "correct" summary anyway.
This imbalance shapes modern machine learning. Large language models work because the internet supplies a nearly endless stream of unlabeled text. Modern vision systems generalize well because self-supervised methods can squeeze structure out of unlabeled images before any human opens a labeling tool. Active, semi-supervised, and self-supervised learning all start from the same observation: there is far more unlabeled data than labeled data, so any technique that puts it to work is worth using.
Unsupervised learning trains a model on unlabeled examples alone. The goal is to find structure in the data, not to predict a specific label. The two most common families are clustering and dimensionality reduction.
Clustering groups similar examples. K-means partitions a dataset into a fixed number of clusters by assigning each example to the nearest centroid and recomputing centroids until they stabilize. Hierarchical clustering builds a tree of nested clusters by merging similar points or by splitting a single cluster. DBSCAN and Gaussian mixture models are common alternatives. Clustering shows up in customer segmentation, document organization, anomaly detection, and exploratory analysis.
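The k-means loop described above can be sketched with NumPy. This is a minimal version for illustration, assuming random initialization from the data and a fixed iteration cap; the synthetic two-blob dataset is made up, and production code would normally rely on a library implementation.

```python
import numpy as np


def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means: alternate nearest-centroid assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct examples chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for every example.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of each cluster (keep the old centroid if a cluster is empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids have stabilized
        centroids = new_centroids
    return labels, centroids


# Two well-separated synthetic blobs (made-up data, no labels involved).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(20, 2)),
    rng.normal(5.0, 0.5, size=(20, 2)),
])
labels, centroids = kmeans(X, k=2)
```

The cluster indices in `labels` are arbitrary identifiers, not class names: the algorithm never sees a label.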
Dimensionality reduction compresses examples into a smaller number of features while keeping as much information as possible. Principal component analysis (PCA) finds the directions of maximum variance and projects the data onto them. t-SNE and UMAP are used for visualization. Autoencoders learn a compressed representation by trying to reconstruct their input through a bottleneck layer, and the compressed code can be reused as a feature vector for downstream models.
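PCA in particular is short enough to sketch directly. The version below assumes the standard formulation through a singular value decomposition of the centered data; the function name `pca` and the synthetic data are illustrative.

```python
import numpy as np


def pca(X, n_components):
    """Project X onto its top principal directions using an SVD of the centered data."""
    # Center the data so the components describe variance, not the mean.
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = (S ** 2) / (len(X) - 1)
    return Xc @ components.T, components, explained_variance[:n_components]


# Synthetic 3-feature data that mostly varies along the direction (1, 2, 0).
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2.0 * t + rng.normal(scale=0.1, size=200),
    rng.normal(scale=0.1, size=200),
])
Z, components, variance = pca(X, n_components=1)
```

Here `Z` is the compressed one-feature representation, and `components[0]` should point close to the direction along which the data was generated (up to sign).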
Semi-supervised learning sits between supervised and unsupervised learning. It uses a small set of labeled examples together with a much larger set of unlabeled examples. The Wikipedia article on weak supervision describes it as a setting where models combine information from both sources to surpass what supervised learning could do on the labeled portion alone.
The most common semi-supervised technique is pseudo-labeling, also called self-training. The workflow is straightforward. First, train a model on the labeled data. Second, run it on the unlabeled examples to produce predictions and confidence scores. Third, keep the predictions where the model is very confident (a probability threshold of 0.95 is common) and treat those predictions as if they were true labels. Fourth, retrain on the combined set. Repeat as needed. The risk is confirmation bias: if the first model is wrong, the pseudo-labels reinforce its mistakes.
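The four steps above can be sketched as a short loop. To keep the example self-contained, a nearest-centroid classifier with softmax confidences stands in for the real model; that classifier, the function names, and the synthetic data are assumptions of this sketch, not part of any particular method.

```python
import numpy as np


def fit_centroids(X, y, n_classes):
    """Stand-in 'model': one centroid per class (assumes every class has an example)."""
    return np.array([X[y == c].mean(axis=0) for c in range(n_classes)])


def predict_proba(centroids, X):
    """Softmax over negative squared distances to each class centroid."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


def self_train(X_lab, y_lab, X_unlab, n_classes, threshold=0.95, rounds=5):
    X_train, y_train = X_lab.copy(), y_lab.copy()
    remaining = X_unlab.copy()
    for _ in range(rounds):
        model = fit_centroids(X_train, y_train, n_classes)   # train / retrain
        if len(remaining) == 0:
            break
        proba = predict_proba(model, remaining)               # predict on unlabeled data
        confident = proba.max(axis=1) >= threshold            # keep confident predictions
        if not confident.any():
            break
        # Treat confident predictions as labels and move them into the training set.
        X_train = np.vstack([X_train, remaining[confident]])
        y_train = np.concatenate([y_train, proba[confident].argmax(axis=1)])
        remaining = remaining[~confident]
    return fit_centroids(X_train, y_train, n_classes), len(remaining)


# Four labeled points and 100 unlabeled points from two synthetic blobs.
rng = np.random.default_rng(0)
X_lab = np.array([[0.2, -0.1], [-0.3, 0.4], [4.1, 3.8], [3.7, 4.2]])
y_lab = np.array([0, 0, 1, 1])
X_unlab = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(4.0, 0.5, size=(50, 2)),
])
final_centroids, n_left = self_train(X_lab, y_lab, X_unlab, n_classes=2)
```

On cleanly separated data like this the loop is benign. The confirmation-bias risk shows up when the initial model is confidently wrong, because nothing in the loop ever revisits a pseudo-label once it has been accepted.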
More recent methods add consistency regularization on top of pseudo-labeling. FixMatch generates a pseudo-label from a weakly augmented unlabeled image, keeps it only if the model is confident, and then trains the model to produce the same label on a heavily augmented version. MixMatch combines consistency, entropy minimization, and the MixUp interpolation trick into one loss that works on labeled and unlabeled examples at once. On standard image-classification benchmarks, both methods report accuracy close to fully supervised training while using a small fraction of the labels.
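The unlabeled part of a FixMatch-style loss can be sketched as follows. The function takes predicted class probabilities for the weak and strong views as plain arrays; the function name and the toy inputs are illustrative, and a real training loop would also stop gradients from flowing through the pseudo-label.

```python
import numpy as np


def fixmatch_unlabeled_loss(p_weak, p_strong, threshold=0.95):
    """p_weak, p_strong: (batch, classes) probabilities for weak and strong views."""
    pseudo = p_weak.argmax(axis=1)            # hard pseudo-label from the weak view
    mask = p_weak.max(axis=1) >= threshold    # keep only confident predictions
    # Cross-entropy of the strong-view prediction against the pseudo-label.
    ce = -np.log(p_strong[np.arange(len(pseudo)), pseudo] + 1e-12)
    # Average over the whole batch; masked-out examples contribute zero.
    return (ce * mask).mean(), mask


# Toy batch: the first example is confident under weak augmentation, the second is not.
p_weak = np.array([[0.97, 0.03], [0.60, 0.40]])
p_strong = np.array([[0.50, 0.50], [0.10, 0.90]])
loss, mask = fixmatch_unlabeled_loss(p_weak, p_strong)
```

Only the first example passes the confidence threshold, so only it contributes to the loss.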
Classic semi-supervised approaches predate these. Label propagation and label spreading treat the dataset as a graph and propagate labels from labeled nodes to unlabeled neighbors. Scikit-learn ships both in its sklearn.semi_supervised module.
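The graph idea behind label propagation can be sketched in NumPy. This is a simplified version written for illustration, assuming a dense RBF affinity matrix and hard clamping of the known labels; it is not scikit-learn's implementation, although it borrows that library's convention of marking unlabeled points with `-1`.

```python
import numpy as np


def propagate_labels(X, y, n_classes, gamma=1.0, n_iters=200):
    """Spread labels over an RBF similarity graph; y uses -1 for unlabeled points."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-gamma * d2)                  # affinity between every pair of points
    T = W / W.sum(axis=1, keepdims=True)     # row-normalized transition matrix
    idx = np.flatnonzero(y >= 0)             # positions of the labeled points
    Y0 = np.zeros((len(X), n_classes))
    Y0[idx, y[idx]] = 1.0                    # one-hot rows for known labels
    Y = Y0.copy()
    for _ in range(n_iters):
        Y = T @ Y                            # each node averages its neighbors' scores
        Y[idx] = Y0[idx]                     # clamp the known labels
    return Y.argmax(axis=1)


# Two groups of points on a line; only one point per group is labeled.
X = np.array([[0.0], [0.5], [1.0], [1.5], [5.0], [5.5], [6.0], [6.5]])
y = np.array([0, -1, -1, -1, -1, -1, -1, 1])
predicted = propagate_labels(X, y, n_classes=2)
```

Each unlabeled point ends up with the label of the group it is connected to most strongly, even though only two of the eight points were labeled.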
Self-supervised learning is the technique that has done the most to make unlabeled examples valuable. The idea is to invent a pretext task that turns the unlabeled data into its own supervisor: for example, hide part of the input, then train the model to predict the hidden part. Because the label comes from the data itself, no human annotation is needed, and the dataset can scale to the entire internet if that is what is available.
SimCLR (Chen et al., 2020) is a contrastive method for images. It generates two augmented views of the same image and trains the model to pull their representations together while pushing apart representations of different images. The paper reported 76.5% top-1 accuracy on ImageNet with a linear classifier on self-supervised features, matching a supervised ResNet-50.
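The "pull together, push apart" objective can be sketched as a normalized temperature-scaled cross-entropy loss over a batch of paired embeddings. The version below works on precomputed embedding arrays; the temperature default and the random embeddings are assumptions made for illustration.

```python
import numpy as np


def nt_xent_loss(z1, z2, temperature=0.5):
    """z1[i] and z2[i] are embeddings of two augmented views of image i."""
    n = len(z1)
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit vectors: dot = cosine
    sim = (z @ z.T) / temperature
    np.fill_diagonal(sim, -np.inf)                      # a view is never its own candidate
    # Row i's positive is the other view of the same image.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    sim = sim - sim.max(axis=1, keepdims=True)          # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()


rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 16))
aligned = z1 + 0.05 * rng.normal(size=(8, 16))   # second views close to the first
unrelated = rng.normal(size=(8, 16))             # second views unrelated to the first
loss_aligned = nt_xent_loss(z1, aligned)
loss_unrelated = nt_xent_loss(z1, unrelated)
```

The loss is lower when the two views of each image already map to nearby embeddings, which is the behavior the encoder is trained toward.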
BYOL (Bootstrap Your Own Latent, Grill et al., 2020) showed that this style of self-supervised representation learning does not need negative pairs. An online network tries to predict a target network's representation of a different augmented view of the same image; the target network is a slow-moving average of the online network. BYOL reached 74.3% top-1 accuracy on ImageNet with a ResNet-50 and 79.6% with a larger backbone.
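The two moving parts, the slow-moving target and the prediction loss, can be sketched in a few lines. Parameters are represented here as a plain dictionary of arrays, and the decay value `tau=0.99` is an assumption chosen for the example; in actual training the target's output is treated as a constant (no gradient flows through it).

```python
import numpy as np


def ema_update(target_params, online_params, tau=0.99):
    """Move each target parameter a small step toward its online counterpart."""
    return {
        name: tau * target_params[name] + (1.0 - tau) * online_params[name]
        for name in target_params
    }


def byol_loss(prediction, target):
    """Mean squared error between L2-normalized vectors, written as 2 - 2 * cosine."""
    p = prediction / np.linalg.norm(prediction, axis=1, keepdims=True)
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    return (2.0 - 2.0 * (p * t).sum(axis=1)).mean()


# Toy parameters: the target starts at 0 and drifts toward an online value of 1.
target = {"w": np.zeros(3)}
online = {"w": np.ones(3)}
for _ in range(100):
    target = ema_update(target, online)
```

After 100 updates the target has moved only part of the way toward the online weights, which is what "slow-moving average" means in practice.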
Masked autoencoders (MAE, He et al., 2022) brought masking back to vision. The model masks a high fraction of image patches (typically 75%) and trains a Vision Transformer to reconstruct the missing pixels. The encoder only sees visible patches, which keeps compute low. The paper reported 87.8% top-1 accuracy on ImageNet using only ImageNet-1K data with a ViT-Huge backbone.
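The masking step itself is simple to sketch. The example assumes a 224×224 image split into 16×16 patches, a common Vision Transformer configuration; the function name `random_patch_mask` is made up for this illustration.

```python
import numpy as np


def random_patch_mask(n_patches, mask_ratio=0.75, seed=0):
    """Randomly split patch indices into a visible set and a masked set."""
    rng = np.random.default_rng(seed)
    n_keep = int(n_patches * (1.0 - mask_ratio))
    perm = rng.permutation(n_patches)
    visible = np.sort(perm[:n_keep])   # indices fed to the encoder
    masked = np.sort(perm[n_keep:])    # indices the decoder must reconstruct
    return visible, masked


# A 224x224 image cut into 16x16 patches gives a 14x14 grid of 196 patches.
n_patches = (224 // 16) ** 2
visible, masked = random_patch_mask(n_patches)
```

With a 75% mask ratio the encoder processes 49 of the 196 patches, which is where the compute savings come from.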
In natural language processing, the same family of ideas shows up as masked language modeling and next-token prediction. BERT masks about 15% of the tokens in a sentence and predicts the originals from the surrounding context. Each raw sentence is a free training example: the "label" for each masked position is just the token that was originally there. Causal language models like GPT train on next-token prediction instead, where the model sees previous tokens and tries to predict the next one. Every position in every document becomes a training example, which is how models can be trained on trillions of tokens of unlabeled text drawn from books, web pages, and code.
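Both objectives derive their targets from raw text, which a small sketch makes concrete. The masking function below is deliberately simplified (every selected token is replaced with a mask symbol, which is not BERT's full replacement scheme), and the whitespace tokenization and function names are assumptions of the example.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Simplified masked-LM example: targets exist only at masked positions."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            targets.append(token)     # the original token is the label
        else:
            inputs.append(token)
            targets.append(None)      # no loss at unmasked positions
    return inputs, targets


def next_token_pairs(tokens):
    """Causal-LM examples: each prefix is paired with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = "unlabeled text supplies its own training signal".split()
masked_inputs, masked_targets = mask_tokens(tokens)
pairs = next_token_pairs(tokens)
```

A sentence of seven tokens yields six next-token examples without any annotation, which is why the amount of usable training signal grows with the size of the corpus.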
LLM pretraining is now the most economically important application of unlabeled examples. The dataset is unlabeled in the human-annotation sense, but the data structure itself supplies a target for every token. That is why next-token prediction is called a self-supervised objective rather than an unsupervised one.
Active learning treats unlabeled examples as candidates for human labeling. The model has access to a large pool of unlabeled data and a small budget for asking an annotator to label a few of them. The interesting question is which examples to ask about.
Pool-based active learning is the most common setup. The model scores every unlabeled example with a query strategy, picks the highest-scoring ones, sends them to a human labeler, adds the new labels to the training set, and retrains. Common strategies include uncertainty sampling (pick examples the model is least confident about), query-by-committee (train an ensemble and pick examples where its members disagree most), and diversity-based methods that try to cover the input space rather than cluster queries near similar uncertain points. Active learning can often reach accuracy comparable to training on a fully labeled set while labeling only a fraction of the data, but it only works if you have a real pool of unlabeled examples to choose from in the first place.
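Uncertainty sampling, the simplest of these query strategies, can be sketched as a ranking over the model's predicted probabilities for the pool. The function name `least_confident` and the toy probabilities are illustrative.

```python
import numpy as np


def least_confident(proba, k):
    """Return indices of the k pool examples whose top predicted probability is lowest."""
    confidence = proba.max(axis=1)
    return np.argsort(confidence)[:k]


# Predicted class probabilities for four unlabeled pool examples (made-up values).
pool_proba = np.array([
    [0.90, 0.10],
    [0.55, 0.45],
    [0.70, 0.30],
    [0.51, 0.49],
])
query_indices = least_confident(pool_proba, k=2)
```

The selected indices are the examples to send to the annotator; after they are labeled, they move from the pool into the training set and the model is retrained.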
| Source | Typical use |
|---|---|
| Web text and web pages | Pretraining LLMs, search |
| User images, video, audio | Self-supervised vision and audio models |
| Server and application logs | Anomaly detection, behavior modeling |
| Sensor data (LiDAR, IMU, cameras) | Robotics, autonomous driving |
| Books, scientific papers, code | LLM pretraining and retrieval |
| Synthetic data from simulators or models | Augmentation, robotics, rare-event coverage |
None of these come with labels for any specific downstream task. The same image collection can be repurposed for self-supervised pretraining, clustering, retrieval, or, after partial labeling, for supervised classification.
Using unlabeled examples is not free of risk. A few patterns come up often.
Imagine a giant stack of photos. On a few of them, a teacher has written what the photo shows on the back: "dog," "cat," "bicycle." Those are labeled examples. On most of the photos, the back is blank. Those are unlabeled examples.
If you only look at the photos with writing, you will run out fast because the teacher only had time to write on a few. The blank photos are still useful. You can sort them into piles based on what looks similar, even without knowing the right names. You can also play a game with yourself: hide part of a photo, then guess what was hidden. Each time you play, you get a little better at understanding photos in general. Later, when the teacher labels a few more, you only need a handful to start getting the names right.
Unlabeled examples are the blank-backed photos. There are a lot of them, they are easy to collect, and the trick is to find clever ways to learn from them.