See also: Machine learning terms
Semi-supervised learning is a machine learning approach that combines elements of supervised machine learning and unsupervised machine learning. It leverages a small amount of labeled data along with a larger volume of unlabeled data to train models. In most real-world settings, acquiring labeled data is expensive and time-consuming because it requires human annotators with domain expertise, while unlabeled data can often be collected cheaply and in large quantities. Semi-supervised learning bridges this gap by extracting useful structure from the unlabeled data to improve the learning process.
Formally, a semi-supervised learning algorithm receives a labeled dataset $D_l = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ and an unlabeled dataset $D_u = \{x_{n+1}, \ldots, x_{n+m}\}$, where typically $m \gg n$. The objective is to learn a function $f$ that performs better than what could be learned from the labeled data alone. The general training loss for semi-supervised methods can be expressed as:
$$\mathcal{L} = \mathcal{L}_s + \mu(t) \mathcal{L}_u$$
where $\mathcal{L}_s$ is the supervised loss on labeled data, $\mathcal{L}_u$ is the unsupervised loss on unlabeled data, and $\mu(t)$ is a weighting function that typically ramps up the importance of the unsupervised loss over the course of training.
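As a concrete illustration, the following sketch combines the two loss terms with a sigmoid-shaped ramp-up schedule for $\mu(t)$. The schedule shape, ramp-up length, and maximum weight are common choices in the consistency-regularization literature, not values prescribed by the formula above.

```python
import math

def ramp_up_weight(step, ramp_up_steps=10_000, max_weight=1.0):
    """Sigmoid-shaped ramp-up for the unsupervised loss weight mu(t)."""
    if step >= ramp_up_steps:
        return max_weight
    phase = 1.0 - step / ramp_up_steps
    return max_weight * math.exp(-5.0 * phase * phase)

def total_loss(loss_supervised, loss_unsupervised, step):
    """L = L_s + mu(t) * L_u, with mu(t) ramped up over the course of training."""
    return loss_supervised + ramp_up_weight(step) * loss_unsupervised

# Early in training the unsupervised term contributes almost nothing,
# later it is weighted fully.
print(ramp_up_weight(0))       # ~0.0067
print(ramp_up_weight(10_000))  # 1.0
```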
Semi-supervised learning has roots stretching back to the 1960s and 1970s. The earliest forms of self-training (sometimes called "self-teaching") appeared in the work of Scudder (1965) and Fralick (1967), who proposed iterative methods in which a classifier labels its own training data. Around the same time, the Expectation-Maximization (EM) algorithm (Dempster, Laird, and Rubin, 1977) provided a principled framework for learning from incomplete data using generative mixture models, laying the statistical groundwork for semi-supervised methods.
In the 1990s, the field gained renewed momentum. Nigam et al. (1998) applied EM with naive Bayes models to text classification, demonstrating significant gains from unlabeled documents. Blum and Mitchell (1998) introduced co-training and proved learning-theoretic guarantees for multi-view semi-supervised learning. Joachims (1999) proposed Transductive SVMs, applying the margin-maximization principle to unlabeled data. Zhu and Ghahramani (2002) formalized label propagation on graphs, and Grandvalet and Bengio (2004) introduced entropy minimization as a regularization principle.
The publication of the comprehensive textbook Semi-Supervised Learning by Chapelle, Schölkopf, and Zien (2006) consolidated the field's theoretical and algorithmic foundations. The deep learning era brought a new wave of methods starting around 2017, with consistency regularization approaches (Pi-Model, Temporal Ensembling, Mean Teacher) followed by holistic frameworks like MixMatch (2019) and FixMatch (2020) that achieved dramatic performance gains. Meanwhile, the pre-train-then-fine-tune paradigm in natural language processing (exemplified by BERT and GPT) demonstrated that implicit semi-supervised learning at massive scale could transform an entire subfield of AI.
One of the main motivations for using semi-supervised learning is the scarcity of labeled example data. Obtaining a large amount of labeled data can be expensive, time-consuming, and often requires domain expertise. For example, labeling medical images requires trained radiologists, and annotating legal documents demands legal professionals. Semi-supervised learning takes advantage of the available unlabeled example data to improve model performance without requiring a substantial increase in labeled data.
Semi-supervised learning can lead to improved model performance compared to strictly supervised or unsupervised learning methods. By leveraging both labeled and unlabeled data, semi-supervised learning can help models generalize better and reduce overfitting, which may result in better predictions on new, unseen data. Research has shown that even modest amounts of unlabeled data can provide measurable improvements when the underlying assumptions of the semi-supervised method hold.
The semi-supervised setting mirrors many practical scenarios. In medical imaging, vast archives of unlabeled scans exist alongside a small number of expert-annotated cases. In natural language processing, billions of sentences are available on the internet while task-specific labels (such as sentiment annotations or translation pairs) are limited. In speech recognition, hours of unlabeled audio far exceed the amount of transcribed speech. Semi-supervised learning provides a principled framework for exploiting these abundant unlabeled resources.
Semi-supervised learning relies on certain assumptions about the relationship between the input distribution $p(x)$ and the conditional label distribution $p(y|x)$. Without such assumptions, unlabeled data (which only reveals information about $p(x)$) cannot help with the prediction task. The four core assumptions are described below.
The smoothness assumption states that if two data points $x_1$ and $x_2$ are close in the input space and connected by a path through a high-density region, then their corresponding labels $y_1$ and $y_2$ should also be similar. This assumption implies that the decision boundary of the classifier should preferentially lie in regions of low data density. A consequence is that nearby points in a dense region are likely to share the same label.
The cluster assumption posits that data points belonging to the same cluster (a group of points more similar to each other than to points outside the group) are likely to belong to the same class. This assumption is closely related to the low-density separation assumption, which states that the decision boundary should pass through regions of low data density rather than cutting through clusters. Methods such as Transductive SVM and entropy minimization directly exploit this assumption.
The low-density separation assumption states that the decision boundary of the classifier should lie in regions where few data points are observed. Put another way, it asserts that class boundaries correspond to low-density areas of the input space, while high-density regions should be assigned consistently to a single class. This principle is what drives techniques such as entropy minimization and Transductive SVMs, both of which push the decision surface away from data points and into sparse regions. The low-density separation assumption can be seen as the classification-oriented restatement of the cluster assumption: if clusters correspond to classes, then the boundaries between classes must fall between clusters, where the data density is naturally low.
The manifold assumption states that the high-dimensional input data lies on or near a lower-dimensional manifold embedded in the input space, and that data points located on the same manifold sub-structure share the same label. This assumption motivates graph-based semi-supervised methods that construct neighborhood graphs on the data and propagate labels along the manifold structure. It also underpins dimensionality reduction techniques that attempt to discover and exploit this low-dimensional structure.
These four assumptions are not independent; they can be viewed as different perspectives on the same underlying principle. The smoothness assumption defines similarity through proximity in dense regions. The cluster assumption defines similarity through membership in the same cluster. The low-density separation assumption characterizes where class boundaries should fall. The manifold assumption defines similarity through co-location on a low-dimensional subspace. In practice, all four assumptions can be seen as more specific instances of the general principle that the structure of $p(x)$ is informative about $p(y|x)$.
Self-training, also known as pseudo-labeling, is one of the oldest and simplest semi-supervised learning methods. The approach was used as early as the 1960s under the name "self-teaching" and was formalized as pseudo-labeling by Lee (2013). In self-training, an initial model (called the teacher) is trained using the available labeled data. The model then generates predictions on the unlabeled data, and the most confident predictions are assigned as pseudo-labels. These pseudo-labeled examples are incorporated into the training set, and the model is retrained on the combined dataset. The process repeats iteratively.
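A minimal sketch of this loop is shown below, assuming a scikit-learn-style classifier; the base model, confidence threshold, and stopping rule are illustrative choices rather than part of any specific published method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled,
               confidence_threshold=0.95, max_rounds=10):
    """Iteratively pseudo-label the most confident unlabeled examples."""
    X_l, y_l = X_labeled.copy(), y_labeled.copy()
    X_u = X_unlabeled.copy()
    model = LogisticRegression(max_iter=1000)

    for _ in range(max_rounds):
        model.fit(X_l, y_l)                        # train the teacher on the current labeled set
        if len(X_u) == 0:
            break
        probs = model.predict_proba(X_u)           # predictions on the unlabeled pool
        confidence = probs.max(axis=1)
        confident = confidence >= confidence_threshold
        if not confident.any():                    # nothing confident enough: stop
            break
        pseudo_idx = probs.argmax(axis=1)[confident]
        # Move the confident examples into the labeled set with their pseudo-labels.
        X_l = np.vstack([X_l, X_u[confident]])
        y_l = np.concatenate([y_l, model.classes_[pseudo_idx]])
        X_u = X_u[~confident]

    return model
```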
Lee (2013) showed that pseudo-labeling is theoretically equivalent to entropy regularization, which encourages the model to make confident (low-entropy) predictions on unlabeled data. This connection to entropy minimization provides theoretical justification for the approach. A practical concern with self-training is confirmation bias: if the initial model makes systematic errors, those errors can be reinforced through the pseudo-labels. Strategies to mitigate this include setting a high confidence threshold for accepting pseudo-labels, using data augmentation to add noise during student training, and periodically resetting the pseudo-labels.
A notable modern variant is Noisy Student (Xie et al., 2020), which achieved strong results on ImageNet by training a student model that is equal to or larger than the teacher, injecting noise (stochastic depth, dropout, and RandAugment) during student training while keeping the teacher clean, and iterating the process multiple times.
Co-training was introduced by Blum and Mitchell (1998) in their paper "Combining Labeled and Unlabeled Data with Co-Training." The method assumes that each data instance can be described by two conditionally independent "views" (feature sets), where each view is sufficient on its own to learn the target concept. The original application classified web pages using two views: the text content of the page and the anchor text of hyperlinks pointing to the page.
The algorithm works as follows:
1. Train two classifiers, one on each view, using the labeled data.
2. Have each classifier predict labels for the unlabeled data and select the examples it labels most confidently.
3. Add each classifier's confidently labeled examples (with their predicted labels) to the other classifier's training set.
4. Retrain both classifiers on their enlarged training sets and repeat until the unlabeled pool is exhausted or a stopping criterion is met.
Blum and Mitchell proved that under the conditional independence and sufficiency assumptions, co-training can learn from a combination of labeled and unlabeled data in the PAC learning framework. In practice, the strict conditional independence assumption rarely holds perfectly, but co-training has been found to work well even when the views are only approximately independent. Extensions of co-training include multi-view learning, where more than two views are used, and single-view co-training methods such as tri-training (Zhou and Li, 2005), which trains three classifiers on different bootstrap samples of the labeled data.
Graph-based semi-supervised methods model the relationships between data points as a graph, where nodes represent data points (both labeled and unlabeled) and edges represent similarity between them. The label propagation algorithm, proposed by Zhu and Ghahramani (2002), is the foundational method in this family. It works by constructing a similarity graph (typically using a Gaussian kernel or k-nearest neighbors) and iteratively propagating labels from labeled nodes to their neighbors.
The algorithm proceeds by initializing labels for unlabeled nodes and then repeatedly updating each unlabeled node's label as a weighted average of its neighbors' labels, while clamping the labels of labeled nodes to their known values. The process converges to a unique solution determined by the graph structure. Label spreading (Zhou et al., 2003) is a related variant that uses the normalized graph Laplacian and allows labeled nodes to change their labels slightly, providing robustness to noisy labels. The key difference is that label propagation performs hard clamping of labeled nodes, while label spreading uses soft clamping controlled by a parameter $\alpha$, retaining most of the original label distribution while permitting small adjustments.
Graph-based methods directly exploit the manifold and cluster assumptions: if two points are connected by a path of high-similarity edges, they are likely to share the same label. A limitation of these methods is scalability, since constructing and operating on the full similarity graph can be computationally expensive for large datasets. Implementations of both label propagation and label spreading are available in scikit-learn.
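Both scikit-learn estimators can be used roughly as follows; the toy dataset, kernel choice, and hyperparameters are illustrative only. Unlabeled points are marked with the label -1, as the library expects.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelPropagation, LabelSpreading

# Two-moons toy data: keep only 10 labels, mark the rest as unlabeled (-1).
X, y_true = make_moons(n_samples=300, noise=0.1, random_state=0)
y = np.full_like(y_true, -1)
labeled_idx = np.random.RandomState(0).choice(len(X), size=10, replace=False)
y[labeled_idx] = y_true[labeled_idx]

# Hard clamping of the labeled nodes.
lp = LabelPropagation(kernel='knn', n_neighbors=7).fit(X, y)

# Soft clamping: alpha controls how much labeled nodes may change.
ls = LabelSpreading(kernel='knn', n_neighbors=7, alpha=0.2).fit(X, y)

print("label propagation accuracy:", (lp.transduction_ == y_true).mean())
print("label spreading accuracy:  ", (ls.transduction_ == y_true).mean())
```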
Generative approaches to semi-supervised learning model the joint distribution $p(x, y)$ using a mixture model, where each component corresponds to a class. The Expectation-Maximization (EM) algorithm (Dempster et al., 1977) is used to estimate the model parameters. In the E-step, the algorithm computes the posterior probability that each unlabeled point belongs to each class. In the M-step, it re-estimates the parameters using both labeled and (soft-assigned) unlabeled data. While generative methods have a solid theoretical foundation, they are sensitive to model misspecification: if the assumed generative model does not match the true data distribution, unlabeled data can actually hurt performance.
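The following toy sketch runs this EM procedure for a class-conditional, diagonal-covariance Gaussian mixture, with fixed (hard) responsibilities for labeled points and soft assignments for unlabeled points. It is a minimal illustration under strong modeling assumptions, not the formulation used in any particular paper.

```python
import numpy as np

def semi_supervised_em(X_l, y_l, X_u, n_classes, n_iter=50, eps=1e-6):
    """EM for a diagonal-covariance Gaussian mixture with one component per class.
    y_l contains integer class indices in {0, ..., n_classes - 1}."""
    X = np.vstack([X_l, X_u])
    R_l = np.eye(n_classes)[y_l]                      # fixed one-hot responsibilities

    # Initialize parameters from the labeled data alone.
    priors = R_l.mean(axis=0)
    means = np.stack([X_l[y_l == c].mean(axis=0) for c in range(n_classes)])
    variances = np.stack([X_l[y_l == c].var(axis=0) + eps for c in range(n_classes)])

    for _ in range(n_iter):
        # E-step: posterior p(y = c | x) for each unlabeled point.
        log_lik = np.stack([
            -0.5 * (((X_u - means[c]) ** 2) / variances[c]
                    + np.log(2 * np.pi * variances[c])).sum(axis=1)
            for c in range(n_classes)], axis=1)
        log_post = np.log(priors + eps) + log_lik
        log_post -= log_post.max(axis=1, keepdims=True)
        R_u = np.exp(log_post)
        R_u /= R_u.sum(axis=1, keepdims=True)

        # M-step: re-estimate parameters from labeled + soft-assigned unlabeled data.
        R = np.vstack([R_l, R_u])
        Nc = R.sum(axis=0)
        priors = Nc / len(X)
        means = (R.T @ X) / Nc[:, None]
        variances = np.stack([
            (R[:, c:c + 1] * (X - means[c]) ** 2).sum(axis=0) / Nc[c] + eps
            for c in range(n_classes)])

    return priors, means, variances, R_u
```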
Nigam et al. (1998) demonstrated that EM with a naive Bayes generative model could substantially improve text classification by incorporating unlabeled documents alongside a small labeled set. This early result was influential in establishing the practical value of generative semi-supervised methods.
Kingma et al. (2014) extended the generative approach to deep learning by introducing semi-supervised learning with deep generative models. Their work combined variational autoencoders (VAEs) with semi-supervised objectives, proposing models that treat the class label as a latent variable for unlabeled examples. The approach includes two key variants: the M1 model, which learns a latent representation using a standard VAE and then trains a classifier on top, and the M2 model, which incorporates the class label directly into the generative model so that for unlabeled data the label is marginalized out during training.
This framework achieved state-of-the-art results at the time on benchmarks such as MNIST with very few labels. The success of semi-supervised VAEs demonstrated that deep generative models could effectively leverage unlabeled data by jointly modeling the data distribution and the label distribution. Subsequent work combined VAEs with generative adversarial networks (GANs) for semi-supervised learning. Salimans et al. (2016) proposed techniques for training GANs that included a semi-supervised component, where the discriminator was extended to perform classification in addition to distinguishing real from generated samples.
Entropy minimization, introduced by Grandvalet and Bengio (2004), adds a regularization term that encourages the model to make confident (low-entropy) predictions on unlabeled data. The entropy of the predicted distribution for an unlabeled sample $x$ is:
$$H(p(y|x)) = -\sum_c p(y=c|x) \log p(y=c|x)$$
Minimizing this quantity pushes the decision boundary away from unlabeled data points, enforcing the low-density separation (cluster) assumption. Entropy minimization is not a standalone method but rather a regularization principle that appears as a component in many modern semi-supervised methods, including MixMatch and UDA.
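In a deep learning framework the regularizer amounts to a few lines; the PyTorch-style sketch below is a generic illustration of the entropy term, not any one paper's implementation.

```python
import torch
import torch.nn.functional as F

def entropy_minimization_loss(logits):
    """Average entropy H(p(y|x)) over a batch of unlabeled examples;
    penalizing it pushes predictions toward a single confident class."""
    probs = F.softmax(logits, dim=1)
    log_probs = F.log_softmax(logits, dim=1)
    return -(probs * log_probs).sum(dim=1).mean()

# A confident prediction has much lower entropy than a uniform one.
confident = torch.tensor([[5.0, 0.0, 0.0]])
uniform = torch.tensor([[1.0, 1.0, 1.0]])
print(entropy_minimization_loss(confident))  # ~0.08
print(entropy_minimization_loss(uniform))    # log(3) ~ 1.10
```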
Transductive Support Vector Machines extend the standard SVM framework to the semi-supervised setting by finding a decision boundary that has a large margin on both labeled and unlabeled data. The optimization problem seeks to assign labels to the unlabeled data such that the resulting SVM has the maximum margin overall. This directly implements the low-density separation assumption. TSVMs are computationally expensive because the optimization is non-convex, but they demonstrated early on that unlabeled data could meaningfully improve classification in the semi-supervised setting.
Consistency regularization is one of the most influential paradigms in modern semi-supervised learning. The core idea is that a model should produce the same output for an input regardless of the perturbation applied to it. If a data point is slightly transformed (through noise, augmentation, or dropout), the model's prediction should remain unchanged. This principle exploits the smoothness assumption by encouraging the model to learn a decision function that is locally smooth around each data point.
The Pi-Model, introduced by Laine and Aila (2017), is one of the earliest consistency regularization methods for deep learning. For each input $x$, the model performs two forward passes with different stochastic perturbations (such as dropout masks and random augmentations), producing two predictions $z$ and $\tilde{z}$. The unsupervised loss is the mean squared error (MSE) between the two predictions:
$$\mathcal{L}_u^{\Pi} = \sum_{x \in D} \text{MSE}(f_\theta(x), f'_\theta(x))$$
The Pi-Model is simple and effective, but it has the drawback of doubling the computational cost because two forward passes are required for each sample. The quality of the consistency targets is also limited because they come from a single noisy pass of the current (potentially unstable) model.
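A minimal sketch of the Pi-Model's unsupervised term follows, assuming a stochastic model (dropout kept active) and an augmentation function supplied by the caller; both are assumptions of this illustration.

```python
import torch.nn.functional as F

def pi_model_unsupervised_loss(model, x, augment):
    """Two stochastic forward passes (different augmentations and dropout masks);
    the unsupervised loss is the MSE between the two softmax outputs."""
    model.train()                                  # keep dropout / noise layers active
    z1 = F.softmax(model(augment(x)), dim=1)       # first stochastic pass
    z2 = F.softmax(model(augment(x)), dim=1)       # second stochastic pass
    return F.mse_loss(z1, z2)
```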
Temporal Ensembling, also proposed by Laine and Aila (2017), addresses the noisy targets problem of the Pi-Model. Instead of computing consistency targets from a second forward pass, Temporal Ensembling maintains an exponential moving average (EMA) of the model's predictions for each training example across epochs. This ensemble of past predictions serves as a more stable and less noisy target for consistency regularization.
The EMA prediction for example $i$ is updated at the end of each epoch: $\tilde{z}_i \leftarrow \alpha \tilde{z}_i + (1 - \alpha) z_i$, where $\alpha$ is the EMA decay rate. The targets are then bias-corrected in a manner similar to the Adam optimizer. While Temporal Ensembling produces smoother targets and requires only a single forward pass per sample, the targets are only updated once per epoch, which makes the method slow to adapt when the dataset is large.
The Mean Teacher method, proposed by Tarvainen and Valpola (2017), improves upon Temporal Ensembling by tracking the exponential moving average of the model weights rather than the model outputs. The student model is trained normally, and its weights are used to update a teacher model via EMA:
$$\theta'_{\text{teacher}} \leftarrow \beta \theta'_{\text{teacher}} + (1 - \beta) \theta_{\text{student}}$$
The teacher model generates consistency targets for the student. Because the teacher's weights are updated after every training step (not just every epoch), Mean Teacher adapts much faster than Temporal Ensembling. The authors found that $\beta = 0.99$ works well during the initial ramp-up phase and $\beta = 0.999$ is better later in training. They also observed that MSE outperforms KL divergence as the consistency cost and that input augmentation or dropout is necessary for the method to work, since without any stochastic perturbation, the student and teacher would produce identical outputs. Mean Teacher achieved an error rate of 4.35% on SVHN with 250 labels, outperforming Temporal Ensembling trained with 1,000 labels.
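A minimal sketch of the teacher update, applied after each optimizer step; handling of buffers such as batch-norm statistics is omitted here.

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, beta=0.999):
    """EMA update of the teacher's weights from the student's weights,
    applied after every training step (not once per epoch)."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(beta).add_(s_param, alpha=1.0 - beta)
```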
Virtual Adversarial Training, proposed by Miyato et al. (2018), replaces random perturbations with adversarial perturbations. For each input, VAT computes the small perturbation $r$ that maximally changes the model's output distribution, then penalizes the KL divergence between the original and perturbed predictions. This produces tighter and more informative consistency constraints because the perturbations are directed at the model's most vulnerable directions.
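A sketch of the virtual adversarial loss with a single power-iteration step is shown below; the `xi` and `epsilon` values are illustrative and in practice depend on the input scale.

```python
import torch
import torch.nn.functional as F

def vat_loss(model, x, xi=1e-6, epsilon=8.0, n_power_iter=1):
    """Find the perturbation r (of norm epsilon) that most changes the prediction,
    then penalize the KL divergence it induces."""
    with torch.no_grad():
        p = F.softmax(model(x), dim=1)             # reference prediction

    # Random unit direction, refined by power iteration.
    d = torch.randn_like(x)
    d = d / (d.flatten(1).norm(dim=1).view(-1, *[1] * (x.dim() - 1)) + 1e-8)
    for _ in range(n_power_iter):
        d.requires_grad_(True)
        log_p_hat = F.log_softmax(model(x + xi * d), dim=1)
        adv_div = F.kl_div(log_p_hat, p, reduction="batchmean")
        grad = torch.autograd.grad(adv_div, d)[0]
        d = grad / (grad.flatten(1).norm(dim=1).view(-1, *[1] * (x.dim() - 1)) + 1e-8)
        d = d.detach()

    r_adv = epsilon * d                            # virtual adversarial perturbation
    log_p_hat = F.log_softmax(model(x + r_adv), dim=1)
    return F.kl_div(log_p_hat, p, reduction="batchmean")
```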
Unsupervised Data Augmentation (UDA), introduced by Xie et al. (2020), advances consistency regularization by demonstrating that the quality of the applied perturbation matters significantly. Instead of using simple noise (like Gaussian noise or dropout), UDA applies advanced data augmentation strategies: RandAugment for images and back-translation or TF-IDF-based word replacement for text. UDA also incorporates several training strategies, including Training Signal Annealing (TSA), which gradually releases the labeled examples' training signal to prevent overfitting the small labeled set; confidence-based masking, which excludes unlabeled examples whose predictions fall below a confidence threshold; and sharpening of the predicted distributions used as consistency targets.
On the IMDb sentiment classification dataset with only 20 labeled examples, UDA achieved an error rate of 4.20%, surpassing models trained on the full 25,000 labeled examples. On CIFAR-10 with 250 labels, UDA achieved 5.43% error.
Starting around 2019, a series of papers proposed holistic semi-supervised methods that unify multiple techniques (consistency regularization, pseudo-labeling, entropy minimization, and augmentation) into single cohesive frameworks.
MixMatch, introduced by Berthelot et al. (2019), was a landmark paper that combined three key ideas into a unified algorithm:
- Label guessing: the model's predictions are averaged over several weak augmentations of each unlabeled example to produce a guessed label.
- Sharpening: the averaged prediction is sharpened with a temperature parameter, lowering its entropy (an implicit form of entropy minimization).
- MixUp: labeled and unlabeled examples (with their labels or guessed labels) are linearly interpolated before computing the supervised and consistency losses.
MixMatch reduced the error rate on CIFAR-10 with 250 labels from roughly 38% (previous best) to about 11%, a factor-of-four improvement. Ablation studies showed that MixUp (especially on unlabeled data), temperature sharpening, and averaging multiple augmentations were all critical components.
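A short sketch of the label-guessing and sharpening step is given below; the MixUp and loss-combination stages of the full algorithm are omitted, and the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def guess_and_sharpen(model, unlabeled_augs, T=0.5):
    """MixMatch-style label guessing: average predictions over K augmentations of the
    same unlabeled batch, then sharpen the average with temperature T."""
    with torch.no_grad():
        # unlabeled_augs: list of K augmented versions of the same batch
        avg = torch.stack(
            [F.softmax(model(x), dim=1) for x in unlabeled_augs]).mean(dim=0)
    sharpened = avg ** (1.0 / T)                   # lower temperature -> lower entropy
    return sharpened / sharpened.sum(dim=1, keepdim=True)
```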
ReMixMatch, also by Berthelot et al. (2020), extended MixMatch with two additional techniques:
- Distribution alignment, which encourages the marginal distribution of predictions on unlabeled data to match the marginal distribution of the ground-truth labels.
- Augmentation anchoring, which uses the prediction on a weakly augmented example as the fixed target for several strongly augmented versions of the same example.
These additions further improved performance, especially in very low-label regimes.
FixMatch, introduced by Sohn et al. (2020), simplified the semi-supervised learning pipeline by combining pseudo-labeling with consistency regularization in a straightforward way. The method works in two steps:
1. A weakly augmented version of each unlabeled image is passed through the model; if the highest predicted class probability exceeds a threshold $\tau$, the prediction is converted into a hard pseudo-label.
2. The model is trained to predict that pseudo-label for a strongly augmented version of the same image, using a standard cross-entropy loss.
The supervised and unsupervised losses are:
$$\mathcal{L}_s = \frac{1}{B} \sum_{b=1}^{B} \text{CE}(y_b, p_\theta(y \mid \mathcal{A}_{\text{weak}}(x_b)))$$
$$\mathcal{L}_u = \frac{1}{\mu B} \sum_{b=1}^{\mu B} \mathbb{1}[\max(\hat{y}_b) \geq \tau] \cdot \text{CE}(\hat{y}_b, p_\theta(y \mid \mathcal{A}_{\text{strong}}(u_b)))$$
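A sketch of the unlabeled-loss computation is shown below, assuming weak and strong augmentation functions supplied by the caller; the threshold value matches the commonly reported default.

```python
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(model, u_batch, weak_aug, strong_aug, tau=0.95):
    """Pseudo-label from a weakly augmented view; train on the strongly augmented view
    only for examples whose pseudo-label confidence exceeds the threshold tau."""
    with torch.no_grad():
        probs = F.softmax(model(weak_aug(u_batch)), dim=1)
        confidence, pseudo_labels = probs.max(dim=1)
        mask = (confidence >= tau).float()         # the indicator term in L_u

    logits_strong = model(strong_aug(u_batch))
    per_example = F.cross_entropy(logits_strong, pseudo_labels, reduction="none")
    return (mask * per_example).mean()
```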
Despite its simplicity, FixMatch achieved state-of-the-art results across multiple benchmarks. On CIFAR-10 with just 4 labels per class (40 total), FixMatch achieved 88.61% accuracy. With 250 labels, it achieved approximately 95% accuracy (roughly 5% error). Its clarity and effectiveness made it a widely adopted baseline.
FlexMatch, introduced by Zhang et al. (2021), identified a limitation of FixMatch: using a fixed confidence threshold for all classes ignores the fact that different classes are learned at different rates. Classes with fewer labeled examples or higher intrinsic difficulty may rarely produce predictions above the threshold, causing them to receive very few pseudo-labels.
FlexMatch addresses this through Curriculum Pseudo Labeling (CPL), which dynamically adjusts the confidence threshold for each class based on the model's current learning status. Classes that the model has learned well receive higher thresholds, while classes that the model struggles with receive lower thresholds. This curriculum-style approach ensures that all classes contribute to the unsupervised loss throughout training. FlexMatch achieved error rate reductions of 13.96% on CIFAR-100 and 18.96% on STL-10 compared to FixMatch when only 4 labels per class were available.
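The simplified sketch below conveys the idea of class-adaptive thresholds; the exact normalization, warm-up, and mapping function used in FlexMatch differ from this illustration.

```python
import torch

def class_adaptive_thresholds(confident_counts, base_tau=0.95):
    """Curriculum-style thresholds in the spirit of FlexMatch's CPL: each class's
    threshold is scaled by its estimated learning status, here the fraction of
    confident pseudo-labels it receives relative to the best-learned class."""
    status = confident_counts.float() / confident_counts.float().max().clamp_min(1.0)
    return base_tau * status

# Class 2 is under-learned, so it receives a lower threshold and more pseudo-labels.
counts = torch.tensor([120, 100, 10])
print(class_adaptive_thresholds(counts))  # tensor([0.9500, 0.7917, 0.0792])
```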
Several other methods have extended the FixMatch framework:
- CoMatch (Li et al., 2021) combines pseudo-labeling with graph-based contrastive learning, jointly refining class probabilities and embeddings.
- FreeMatch (Wang et al., 2023) replaces the fixed threshold with self-adaptive global and per-class thresholds estimated from the model's learning status.
- SoftMatch (Chen et al., 2023) replaces hard thresholding with a soft weighting function that down-weights, rather than discards, low-confidence pseudo-labels.
Contrastive learning has become an important bridge between self-supervised learning and semi-supervised learning. In contrastive learning, a model learns representations by pulling together different augmented views of the same example (positive pairs) while pushing apart views of different examples (negative pairs). Although contrastive methods are often categorized as self-supervised (since they do not require labels during pre-training), they have direct and powerful applications in the semi-supervised setting.
SimCLR (Chen et al., 2020) demonstrated that a simple contrastive framework, combining strong data augmentation, a neural network projection head, and the NT-Xent contrastive loss, could learn visual representations competitive with supervised pre-training. SimCLRv2 (Chen et al., 2020) extended this to the semi-supervised setting by showing that larger models are more label-efficient. After self-supervised contrastive pre-training, a small amount of labeled data is used to fine-tune the model, and then knowledge distillation transfers the representations to a smaller model. With 1% of ImageNet labels, SimCLRv2 achieved 73.9% top-1 accuracy, demonstrating the power of contrastive pre-training for semi-supervised tasks.
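A compact sketch of the NT-Xent loss for two augmented views of the same batch is given below; the projection head, augmentation pipeline, and temperature value are outside the scope of this illustration.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """Normalized temperature-scaled cross-entropy over a batch of N examples;
    z1 and z2 are the projections of two augmented views of the same examples."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d), unit-norm rows
    sim = z @ z.t() / temperature                         # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                     # exclude self-similarity
    # The positive for example i is its other view: i <-> i + n.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Toy usage with random projections for a batch of 8 examples.
z1, z2 = torch.randn(8, 32), torch.randn(8, 32)
print(nt_xent_loss(z1, z2))
```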
Momentum Contrast (MoCo), proposed by He et al. (2020), addresses the large batch size requirement of SimCLR by maintaining a dynamic queue of negative embeddings and a momentum-updated encoder. This design makes contrastive pre-training feasible on standard hardware. Like SimCLR, MoCo-pretrained representations can be fine-tuned with a small labeled set for semi-supervised learning, achieving strong results on downstream tasks.
S4L, proposed by Zhai et al. (2019), explicitly bridges self-supervised and semi-supervised learning. The key insight is that self-supervised pretext tasks (which learn representations from unlabeled data alone) can serve as a powerful form of regularization in the semi-supervised setting. The paper introduced two variants:
- S4L-Rotation, which adds a pretext task of predicting the rotation (0, 90, 180, or 270 degrees) applied to an unlabeled image.
- S4L-Exemplar, which adds an exemplar-based objective that encourages representations to be invariant to heavy data augmentation.
Both self-supervised objectives are combined with the standard supervised loss on labeled data. S4L achieved a new state-of-the-art on semi-supervised ILSVRC-2012 (ImageNet) with 10% of labels. The work demonstrated that the rapidly advancing field of self-supervised representation learning can directly benefit semi-supervised methods.
Semi-supervised learning has had a profound impact on natural language processing, particularly through the pre-training and fine-tuning paradigm that has become the dominant approach in the field.
Universal Language Model Fine-tuning (ULMFiT), proposed by Howard and Ruder (2018), was one of the first methods to demonstrate the effectiveness of transfer learning and semi-supervised pre-training for NLP tasks. ULMFiT follows a three-stage process:
1. General-domain language model pre-training on a large unlabeled corpus (Wikitext-103 in the original paper).
2. Target-task language model fine-tuning on the unlabeled text of the target domain.
3. Target-task classifier fine-tuning using the available labeled examples.
ULMFiT introduced key techniques such as discriminative fine-tuning (using different learning rates for different layers), slanted triangular learning rates, and gradual unfreezing (progressively unfreezing layers from top to bottom during fine-tuning). With only 100 labeled examples, ULMFiT matched the performance of models trained on 100 times more data. The method reduced classification error by 18-24% across six text classification benchmarks.
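A sketch of discriminative fine-tuning with PyTorch parameter groups, using the 2.6 division factor mentioned in the paper; the toy model and layer grouping here are illustrative assumptions.

```python
import torch

def discriminative_param_groups(layer_groups, base_lr=1e-3, decay=2.6):
    """Discriminative fine-tuning in the spirit of ULMFiT: each deeper (earlier)
    layer group gets a smaller learning rate, divided by `decay` per group."""
    groups = []
    for depth, layers in enumerate(reversed(layer_groups)):   # last group -> highest LR
        params = [p for layer in layers for p in layer.parameters()]
        groups.append({"params": params, "lr": base_lr / (decay ** depth)})
    return groups

# Toy model split into three layer groups, fine-tuned with per-group learning rates.
model = torch.nn.Sequential(torch.nn.Linear(10, 10),
                            torch.nn.Linear(10, 10),
                            torch.nn.Linear(10, 2))
layer_groups = [[model[0]], [model[1]], [model[2]]]
optimizer = torch.optim.SGD(discriminative_param_groups(layer_groups), lr=1e-3)
print([g["lr"] for g in optimizer.param_groups])  # [0.001, ~0.000385, ~0.000148]
```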
ULMFiT's success inspired a wave of large-scale pre-trained language models that follow the same semi-supervised principle of learning from unlabeled text before fine-tuning on labeled data, including BERT (masked language modeling), the GPT series (autoregressive language modeling), and subsequent models such as RoBERTa and T5.
The pre-train-then-fine-tune paradigm can be understood as a form of implicit semi-supervised learning: the model first learns general linguistic representations from massive unlabeled corpora, then adapts to specific tasks with relatively few labeled examples. This connection highlights how semi-supervised learning principles have scaled to become the foundation of modern NLP.
Beyond the pre-training paradigm, classical semi-supervised techniques have been adapted for NLP. UDA demonstrated the effectiveness of back-translation as a data augmentation strategy for consistency regularization in text classification. Self-training has been applied to named entity recognition, machine translation, and question answering, where pseudo-labels from a teacher model are used to expand the training set.
Computer vision has been one of the primary application domains for semi-supervised learning research, partly because obtaining pixel-level annotations (for segmentation) or bounding box annotations (for detection) is particularly labor-intensive.
The ImageNet ILSVRC-2012 dataset, with its 1.28 million training images across 1,000 classes, has served as a key benchmark for semi-supervised methods. Common evaluation protocols use 1% (approximately 12,800 images) or 10% (approximately 128,000 images) of the labels while treating the remaining images as unlabeled; SimCLRv2, for example, reaches 73.9% top-1 accuracy using only 1% of the labels under this protocol.
The CIFAR datasets are the most commonly used benchmarks for semi-supervised learning methods. Standard protocols evaluate with 40, 250, or 4,000 labels on CIFAR-10 (10 classes) and similar small-label configurations on CIFAR-100 (100 classes). The progression of results on CIFAR-10 with 250 labels illustrates the rapid progress in the field: from approximately 38% error with earlier methods, to 11% with MixMatch, to roughly 5% with FixMatch, and below 5% with FlexMatch and subsequent methods.
Semi-supervised learning is especially valuable in medical imaging, where labeled data is scarce because annotation requires trained medical professionals. Consistency regularization methods, such as Mean Teacher variants, have been successfully applied to medical image segmentation tasks including cardiac MRI segmentation, retinal fundus image analysis, and histopathology. Semi-supervised approaches using pseudo-labeling and GANs have also been applied to medical image classification, reducing the dependence on large-scale labeled datasets. The ability to leverage large archives of unlabeled clinical images alongside a small number of expert-annotated cases makes semi-supervised learning a natural fit for healthcare applications.
Beyond computer vision and NLP, semi-supervised learning has found success in a range of domains:
| Domain | Application | Why SSL helps |
|---|---|---|
| Medical imaging | Tumor detection, organ segmentation, disease classification | Expert annotations are expensive; large archives of unlabeled scans exist |
| Speech recognition | Acoustic model training, speaker adaptation | Transcribed speech is costly to produce; untranscribed audio is abundant |
| Natural language processing | Text classification, named entity recognition, sentiment analysis | Task-specific labels are limited; raw text is plentiful |
| Bioinformatics | Protein function prediction, gene expression analysis | Labeled biological data requires wet-lab experiments |
| Remote sensing | Land cover classification, satellite image segmentation | Manual annotation of satellite imagery is slow and expensive |
| Cybersecurity | Intrusion detection, malware classification | New attack types emerge constantly; labeled attack data is scarce |
| Autonomous driving | Object detection, lane segmentation | Pixel-level driving annotations are extremely labor-intensive |
Semi-supervised learning is one of several approaches designed to address the challenge of limited labeled data. Understanding the relationships and differences between these paradigms is important for choosing the right approach for a given problem.
| Aspect | Semi-Supervised Learning | Self-Supervised Learning | Active Learning | Transfer Learning |
|---|---|---|---|---|
| Labeled data required | Small amount | None (labels derived from data) | Small amount, iteratively expanded | Labeled data from source task |
| Unlabeled data usage | Used alongside labeled data | Used exclusively during pre-training | Pool for querying | Not used directly |
| Human involvement | Labeling initial small set | None during pre-training | Annotator in the loop | Labeling source task data |
| Core mechanism | Exploits structure in $p(x)$ to improve $p(y \mid x)$ | Learns representations via pretext tasks | Selects most informative samples for labeling | Reuses knowledge from related tasks |
| When to use | Abundant unlabeled data, few labels | Need general representations, no labels | Budget for selective labeling | Related source task available |
| Computational cost | Moderate to high | High (pre-training) | Moderate (requires retraining) | Low to moderate (fine-tuning) |
| Example methods | FixMatch, MixMatch, Label Propagation | SimCLR, MAE, DINO | Uncertainty sampling, Query-by-Committee | Fine-tuning BERT, ImageNet pre-training |
Semi-supervised vs. self-supervised learning: Self-supervised learning uses no labeled data at all during pre-training and instead derives supervisory signals from the data itself (for example, predicting masked tokens or image rotations). Semi-supervised learning requires at least some labeled data. The two paradigms are complementary: self-supervised pre-training can be followed by semi-supervised fine-tuning, as demonstrated by S4L and SimCLRv2.
Semi-supervised vs. active learning: Active learning also operates in a limited-label setting but takes a different strategy. Rather than extracting information from unlabeled data, active learning strategically selects which unlabeled examples should be labeled by a human annotator to maximize the model's improvement per label. Active learning requires a human-in-the-loop, while semi-supervised learning does not. The two approaches can be combined: active learning can select the most informative points for labeling, and semi-supervised learning can exploit the remaining unlabeled points.
Semi-supervised vs. transfer learning: Transfer learning leverages knowledge from a related source task (often with abundant labels) to improve performance on a target task. Pre-trained language models like BERT are an example. Semi-supervised learning does not require a separate source task but instead uses unlabeled data from the same domain. The pre-train-then-fine-tune paradigm that dominates modern NLP and computer vision blurs the boundary between transfer learning and semi-supervised learning.
Semi-supervised learning is not a universal solution. Its effectiveness depends on whether the unlabeled data provides useful structural information about the classification task. Understanding the conditions under which SSL succeeds or fails is essential for practitioners.
When it helps:
- The unlabeled data comes from the same distribution (and the same set of classes) as the labeled data.
- The core assumptions approximately hold, i.e., the structure of $p(x)$ is genuinely informative about $p(y|x)$.
- Labeled data is scarce relative to what the task requires, while unlabeled data is abundant.
When it does not help (or hurts):
- The structural assumptions are violated, for example when classes are not separated by low-density regions or the assumed generative model is misspecified.
- The unlabeled data contains out-of-distribution samples or classes absent from the labeled set (the open-set setting).
- Enough labeled data is already available that a carefully tuned supervised baseline leaves little room for improvement.
Oliver et al. (2018) published an influential paper, "Realistic Evaluation of Deep Semi-Supervised Learning Algorithms," which highlighted that many reported gains in the literature depended on specific experimental setups and that semi-supervised methods sometimes failed to improve over carefully tuned supervised baselines. This work underscored the importance of rigorous evaluation in semi-supervised learning research.
| Method | Year | Category | Key Innovation | Authors |
|---|---|---|---|---|
| EM with Generative Models | 1970s | Generative | Expectation-Maximization for mixture models | Dempster, Laird, Rubin |
| Self-Training | 1960s/2013 | Pseudo-labeling | Iterative pseudo-label assignment | Scudder (1965); Lee (2013) |
| Co-Training | 1998 | Multi-view | Two classifiers on conditionally independent views | Blum and Mitchell |
| Label Propagation | 2002 | Graph-based | Propagate labels through similarity graph | Zhu and Ghahramani |
| Entropy Minimization | 2004 | Regularization | Minimize prediction entropy on unlabeled data | Grandvalet and Bengio |
| Transductive SVM | 1999 | Margin-based | Maximum margin with unlabeled data | Joachims |
| Semi-Supervised VAE | 2014 | Deep generative | Class label as latent variable in VAE | Kingma et al. |
| Pi-Model | 2017 | Consistency regularization | MSE between two stochastic passes | Laine and Aila |
| Temporal Ensembling | 2017 | Consistency regularization | EMA of predictions across epochs | Laine and Aila |
| Mean Teacher | 2017 | Consistency regularization | EMA of model weights | Tarvainen and Valpola |
| VAT | 2018 | Consistency regularization | Adversarial perturbations for consistency | Miyato et al. |
| ULMFiT | 2018 | Pre-training + fine-tuning | Three-stage language model transfer | Howard and Ruder |
| MixMatch | 2019 | Holistic | Combines consistency, entropy min., MixUp | Berthelot et al. |
| S4L | 2019 | Self-supervised + semi-supervised | Self-supervised pretext tasks as regularization | Zhai et al. |
| UDA | 2020 | Consistency regularization | High-quality augmentation (RandAugment, back-translation) | Xie et al. |
| FixMatch | 2020 | Holistic | Weak-strong augmentation with pseudo-labels | Sohn et al. |
| Noisy Student | 2020 | Self-training | Noise injection during student training | Xie et al. |
| ReMixMatch | 2020 | Holistic | Distribution alignment + augmentation anchoring | Berthelot et al. |
| SimCLRv2 | 2020 | Contrastive + semi-supervised | Contrastive pre-training + semi-supervised distillation | Chen et al. |
| CoMatch | 2021 | Contrastive + pseudo-labeling | Joint contrastive learning and pseudo-labels | Li et al. |
| FlexMatch | 2021 | Holistic | Class-adaptive confidence thresholds (CPL) | Zhang et al. |
| FreeMatch | 2023 | Holistic | Self-adaptive global and per-class thresholds | Wang et al. |
| SoftMatch | 2023 | Holistic | Soft weighting function for pseudo-labels | Chen et al. |
A persistent challenge in semi-supervised learning is confirmation bias, where incorrect pseudo-labels from the model are treated as ground truth, reinforcing the model's mistakes. This is especially problematic in self-training methods. Strategies to mitigate confirmation bias include using high confidence thresholds, MixUp regularization (Arazo et al., 2019), and ensuring a minimum number of labeled examples per mini-batch.
Semi-supervised methods rely on assumptions about the data distribution. When these assumptions are violated (for example, when classes are not well-separated or when the manifold assumption does not hold), unlabeled data can actually degrade performance compared to using labeled data alone. Careful validation and model selection are important in practice.
If the class distribution in the unlabeled data differs significantly from the labeled data, semi-supervised methods may learn biased models. This is common in real-world settings where the labeled subset may not be representative of the overall population. Distribution alignment techniques (as in ReMixMatch) and class-adaptive thresholds (as in FlexMatch) partially address this issue.
Some semi-supervised methods, particularly graph-based approaches, face scalability challenges with very large datasets. Constructing and storing a full similarity graph over millions of data points can be prohibitively expensive. Modern deep learning-based methods scale better because they operate in mini-batch fashion, but they still require careful tuning of hyperparameters such as the confidence threshold, EMA decay rate, and unsupervised loss weight.
Standard semi-supervised methods assume that the unlabeled data comes from the same classes as the labeled data. When the unlabeled data contains samples from unknown classes (the open-set setting), naive pseudo-labeling can assign these samples to known classes, harming performance. Robust semi-supervised methods that can detect and handle out-of-distribution unlabeled samples remain an active area of research.
Imagine you are learning to recognize different types of fruits by looking at pictures. Your teacher shows you a few pictures of apples, oranges, and bananas with their names (labeled data). Then, your teacher gives you a large pile of pictures without names (unlabeled data).
Semi-supervised learning is like using the pictures you already know (labeled data) to help you guess the names of the new pictures (unlabeled data). You might notice that some of the unnamed pictures look a lot like the apples you already know, so you can be pretty confident they are apples too. Once you have made some good guesses, you can use those guesses to help you figure out even more pictures. You keep doing this until you have learned to recognize most of the fruits, even though your teacher only told you the names of a few pictures at the start.