Self-training

Updated Jul 12, 2026

See also: Machine learning terms

Self-training is a semi-supervised learning procedure in which a model trained on a small labeled set is used to generate predictions on unlabeled data, and the model is then retrained on its own most confident predictions (called pseudo-labels) as if they were ground-truth labels, repeating the cycle. The procedure is also called self-labeling, self-teaching, decision-directed learning, or bootstrapping, and it is one of the oldest ideas in machine learning that uses unlabeled data: the basic loop appears in H. J. Scudder's 1965 IEEE paper on adaptive pattern recognition machines ^[1] and was popularized for natural language processing by David Yarowsky's 1995 word sense disambiguation paper ^[2]. The same loop sits at the heart of modern deep learning techniques such as Pseudo-Label ^[3], Noisy Student ^[4], FixMatch ^[5], and SimCLRv2 ^[6], and it appears again in large language model post-training pipelines such as Anthropic's Constitutional AI ^[7] and Stanford's STaR (Self-Taught Reasoner) ^[8].

The appeal of self-training is its simplicity. The algorithm is essentially "train, predict on unlabeled data, retrain on the confident predictions, repeat," and it is wrapper-style: any supervised learner can be plugged into the loop. The risk is equally simple to describe. If the initial model is wrong about a pseudo-label and that pseudo-label gets reinforced in the next round, the error compounds. This failure mode is often called confirmation bias, and it is a self-reinforcing feedback loop ^[21]. A substantial fraction of the modern self-training literature (FixMatch, Mean Teacher, Noisy Student, Curriculum Labeling) consists of techniques to keep this loop from going off the rails.

The single result that established self-training as a frontier method, not just a low-resource trick, is Noisy Student (Xie et al., 2020): trained with 300 million unlabeled images, an EfficientNet-L2 student reached 88.4% top-1 accuracy on ImageNet, 2.0 percentage points better than the previous state of the art, which had required 3.5 billion weakly labeled Instagram images ^[4].

This article describes the basic algorithm, its history, the major variants in vision and NLP, the theoretical understanding (focusing on Wei, Shen, Chen, and Ma's ICLR 2021 analysis ^[9]), self-distillation as a special case, the recent application of self-training to LLMs, the main failure modes, and how self-training compares to related semi-supervised methods such as co-training and tri-training.

What is self-training?

Self-training is a wrapper procedure that turns a supervised learner into a semi-supervised one. Given a labeled set $L = \{(x_i, y_i)\}$ and an unlabeled set $U = \{x_j\}$ , self-training repeats the following steps:

1. Train a base model f on L using ordinary supervised learning.
2. Use f to predict labels y_hat on a subset S of U.
3. Filter S to keep only confidently predicted examples (often by thresholding the predicted probability).
4. Add the filtered (x, y_hat) pairs to L (or to a separate pseudo-label set used jointly with L).
5. Retrain f on the augmented set.
6. Repeat steps 2 through 5 until a stopping criterion is met.

The set of pseudo-labels added in step 4 may be removed and recomputed each round (the standard formulation) or accumulated across rounds. The retraining in step 5 may be from scratch or by continuing from the previous parameters. The model in step 5 may be the same architecture as in step 1 or a larger one (the Noisy Student variant).

Self-training is distinguished from the broader family of semi-supervised learning methods (which also includes co-training, graph-based label propagation, generative methods, and consistency regularization) by its wrapper-style structure: it does not require the labeled and unlabeled losses to share a common functional form, and it does not require multiple views of the data. It is closely related to but distinct from transductive learning, in which the model is asked only to label a fixed unlabeled set rather than to generalize to new data.

How does self-training work?

The canonical self-training loop, as described by Yarowsky ^[2] and formalized for deep neural networks by Lee ^[3], takes the form below.

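function SelfTraining(L, U, threshold tau, max_iters T):
    f = TrainSupervised(L)                       # step 1: fit on labeled data only
    for t in 1..T:
        Y_hat = f.predict_proba(U)               # step 2: class probabilities on U
        # step 3: keep predictions whose top probability clears tau
        confident = { (x, argmax_c Y_hat(x, c))
                      for x in U if max_c Y_hat(x, c) >= tau }
        if confident is empty: break             # no new pseudo-labels to add
        L = L union confident                    # step 4: this skeleton accumulates pseudo-labels
        U = U minus { x : (x, _) in confident }
        f = TrainSupervised(L)                   # step 5: retrain on the augmented set
        if stopping_criterion(f): break
    return f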

A few choices in this skeleton determine the behaviour:

Choice	Common values	Effect
Confidence threshold tau	0.7 to 0.95 in vision, 0.5 to 1.0 in NLP	Higher thresholds reduce noisy pseudo-labels at the cost of fewer additions per round
Selection rule	Top-K most confident, all above threshold, or class-balanced top-K	Class-balanced sampling helps prevent class collapse on imbalanced data
Hard vs soft labels	Hard (argmax) labels or soft (full distribution) labels	Soft labels carry uncertainty information and tend to perform better in deep learning
Restart vs continue	Train from scratch each round, or continue from previous parameters	Continuing is faster but more vulnerable to confirmation bias
Pseudo-label refresh	Recompute every round, or keep accumulated pseudo-labels	Refreshing tracks a moving teacher; accumulating is more stable
Stopping criterion	Fixed number of rounds, no improvement on validation, or no new confident pseudo-labels	Fixed-round stopping is most common in deep learning

The confidence threshold in step 3 is the load-bearing hyperparameter: it is the verification gate that decides which pseudo-labels are trusted enough to retrain on. Set it too low and the model trains on its own mistakes (confirmation bias). Set it too high and almost no unlabeled data enters the loop, so self-training reduces to ordinary supervised learning on the original labeled set. This skeleton is what later variants (Pseudo-Label, Noisy Student, FixMatch, SimCLRv2) modify in specific ways. The next section traces those modifications historically.
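
The skeleton can be made concrete with a small runnable sketch. The version below is only an illustration under stated assumptions: the base learner is a toy nearest-centroid classifier on one-dimensional inputs, confidence is a softmax over negative squared distances, and the names `fit_centroids`, `predict_proba`, and `self_train` are hypothetical rather than taken from any library. It uses the accumulating variant and stops when no prediction clears the threshold.

```python
import math

def fit_centroids(labeled):
    """Toy base learner (an assumption): one mean per class on 1-D inputs."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_proba(centroids, x):
    """Softmax over negative squared distances to each class centroid."""
    logits = {y: -(x - c) ** 2 for y, c in centroids.items()}
    top = max(logits.values())
    exps = {y: math.exp(v - top) for y, v in logits.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}

def self_train(labeled, unlabeled, tau=0.9, max_iters=10):
    labeled, unlabeled = list(labeled), list(unlabeled)
    model = fit_centroids(labeled)                 # step 1
    for _ in range(max_iters):
        confident, remaining = [], []
        for x in unlabeled:                        # steps 2 and 3
            probs = predict_proba(model, x)
            label = max(probs, key=probs.get)
            if probs[label] >= tau:
                confident.append((x, label))
            else:
                remaining.append(x)
        if not confident:                          # nothing passed the gate
            break
        labeled += confident                       # step 4 (accumulate)
        unlabeled = remaining
        model = fit_centroids(labeled)             # step 5
    return model, labeled, unlabeled

seed = [(0.0, "a"), (4.0, "b")]
pool = [0.2, 0.5, 3.6, 3.9, 2.0]
model, grown, leftover = self_train(seed, pool, tau=0.9)
```

In this toy run the four points near a seed are absorbed in the first round, while the ambiguous point at 2.0 never clears the threshold and stays unlabeled, which is the behaviour the confidence gate is meant to produce.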

History

Scudder (1965): the original self-learning machine

The earliest reference to a self-training procedure in machine learning is H. J. Scudder's 1965 paper in the IEEE Transactions on Information Theory, "Probability of Error of Some Adaptive Pattern-Recognition Machines" ^[1]. Scudder analyzed a classifier that uses its own predictions to update its parameters in the absence of teacher labels (a setting he called "decision-directed" learning) and derived bounds on the asymptotic probability of error. The work is the standard reference for the observation that self-training can converge to a useful classifier even when the training signal comes entirely from the model's own past decisions, although Scudder also noted that the procedure can fail badly if the initial parameters are far from a good solution. Modern self-training papers usually cite Scudder as the origin point of the technique.

Yarowsky (1995): bootstrapping for word sense disambiguation

David Yarowsky's 1995 ACL paper, "Unsupervised Word Sense Disambiguation Rivaling Supervised Methods," is the modern reference for self-training in NLP ^[2]. Yarowsky tackled word sense disambiguation: given a polysemous word such as "plant" (factory or vegetation), decide which sense is intended in a given sentence.

Yarowsky's algorithm exploits two linguistic observations:

One sense per collocation: a polysemous word usually takes the same sense in any given local context (the words immediately around it).
One sense per discourse: a polysemous word usually has a single sense in any given document.

The algorithm starts from a small set of seed collocations for each sense (for example, "plant life" for the vegetation sense, "manufacturing plant" for the factory sense). A decision-list classifier is trained on the seed-tagged examples. The classifier is then applied to the rest of the unlabeled corpus, and high-confidence labels are added to the training pool. The process iterates, with the per-discourse constraint applied as a global consistency check at each round.

Yarowsky reported accuracies of 96.5% on a 12-word evaluation, matching or beating fully supervised classifiers trained on hand-tagged corpora. The paper also explicitly used the words "bootstrapping" and "self-training," which are the terms still used today. Subsequent theoretical analysis by Abney ^[10] and others has shown that the Yarowsky algorithm can be understood as approximately optimizing a log-likelihood under a particular probabilistic model, which gave the procedure a firmer footing.

McClosky, Charniak, Johnson (2006): self-training for parsing

For a long period after Yarowsky, the consensus in NLP was that self-training did not work for syntactic parsing, because parsers were already strong and pseudo-labels were too noisy. McClosky, Charniak, and Johnson's 2006 NAACL paper, "Effective Self-Training for Parsing," reversed this view by showing that a Charniak parser could be improved by self-training when paired with a separate reranker ^[11]. The reranker effectively broke the symmetry of the self-training loop (the parser was not training on its own raw predictions but on reranked predictions), which prefigured the multi-model schemes (teacher-student, FixMatch, Mean Teacher) that came to dominate the deep-learning era.

Pseudo-Label (Lee 2013): self-training comes to deep learning

Dong-Hyun Lee's 2013 ICML Workshop paper, "Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks," applied the self-training idea to deep neural networks under the name pseudo-labeling ^[3]. Lee's procedure picks the class with the maximum predicted probability for each unlabeled example and treats that class "as if they were true labels," then trains on the labeled and pseudo-labeled data together ^[3]. The one important detail is that the labeled-loss term and the pseudo-labeled-loss term are combined within each minibatch with a time-dependent weighting $\alpha(t)$ that ramps up from zero. This avoided the problem of the model learning the wrong thing too early in training, before the supervised classifier was strong enough to produce reliable pseudo-labels.

Lee showed that pseudo-labeling improved MNIST accuracy in low-label regimes, reported it as state-of-the-art semi-supervised performance for deep neural networks without any unsupervised pretraining, and connected the approach to entropy regularization in semi-supervised learning ^[3]. The paper also established the term "pseudo-label" that is now standard. Pseudo-Label was for several years the simplest deep-learning baseline for semi-supervised image classification.
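
A time-dependent weight of this kind can be sketched in a few lines. The article only states that $\alpha(t)$ ramps up from zero, so the piecewise-linear shape and the default constants below (`t1`, `t2`, `alpha_final`) are illustrative assumptions, and the function names are hypothetical.

```python
def pseudo_label_weight(step, t1=100, t2=600, alpha_final=3.0):
    """Ramp the pseudo-label loss weight from 0 to alpha_final.

    The schedule shape and the default constants are assumptions
    chosen for illustration.
    """
    if step < t1:
        return 0.0                      # supervised loss only at first
    if step < t2:
        return alpha_final * (step - t1) / (t2 - t1)   # linear ramp
    return alpha_final                  # full weight after the ramp

def combined_loss(labeled_loss, pseudo_loss, step):
    """Labeled term plus the time-weighted pseudo-label term."""
    return labeled_loss + pseudo_label_weight(step) * pseudo_loss
```

Early in training the pseudo-label term contributes nothing, so the network is fitted to real labels before its own predictions are allowed to influence the gradient.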

Noisy Student (Xie et al. 2020): self-training at ImageNet scale

Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le's CVPR 2020 paper, "Self-Training with Noisy Student Improves ImageNet Classification," pushed self-training to a state-of-the-art result on the standard benchmark of computer vision ^[4]. The recipe is:

Train a teacher EfficientNet on labeled ImageNet.
Use the teacher to generate pseudo-labels for 300 million unlabeled images from the JFT dataset.
Train an equal-or-larger student EfficientNet on the union of labeled and pseudo-labeled data, with noise injected into the student via dropout, stochastic depth, and RandAugment data augmentation.
Make the student the new teacher and repeat.

The authors describe the central mechanism directly: "During the learning of the student, we inject noise such as dropout, stochastic depth, and data augmentation via RandAugment to the student so that the student generalizes better than the teacher" ^[4]. The final EfficientNet-L2 model reached 88.4% top-1 accuracy on ImageNet, beating the previous state of the art (which used 3.5 billion weakly labeled Instagram images) by 2.0 percentage points, and it substantially improved robustness: ImageNet-A top-1 accuracy rose from 61.0% to 83.7%, ImageNet-C mean corruption error fell from 45.7 to 28.3, and ImageNet-P mean flip rate fell from 27.8 to 12.2 ^[4]. Two design choices are essential to the result. First, the student is at least as large as the teacher, and typically larger, so it has enough capacity to fit the augmented dataset. Second, the student is trained with strong noise (the teacher is not), which forces the student to learn more invariant features than the teacher. Without these choices, naive self-training plateaus on ImageNet rather than improving.

Noisy Student was the proof point that self-training was not just a low-resource trick but could push the absolute frontier of vision models when paired with enough unlabeled data and enough compute.

FixMatch (Sohn et al. 2020): self-training meets consistency regularization

Kihyuk Sohn and colleagues' NeurIPS 2020 paper, "FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence," demonstrated that the self-training and consistency regularization families of semi-supervised learning could be unified into a strikingly simple algorithm ^[5]. FixMatch works as follows:

For each unlabeled image, apply a weak augmentation (for example, a flip and crop) and obtain the model's prediction.
If the maximum predicted probability exceeds a threshold (the paper uses 0.95), record the argmax as a pseudo-label.
Apply a strong augmentation (RandAugment or CTAugment) to the same image and require the model to predict the recorded pseudo-label.

FixMatch reaches 94.93% accuracy on CIFAR-10 with only 250 labeled images, and 88.61% with only 40 labels (4 per class), competitive with fully supervised training that uses 50,000 labels ^[5]. The strong-versus-weak augmentation asymmetry is the key trick: the pseudo-label comes from the easy view (which the model is more likely to get right), and the supervised loss is applied to the hard view (which forces invariance to perturbations). This gives FixMatch the noise-injection benefits of Noisy Student in a single-pass training procedure rather than an iterative teacher-student loop.
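
The unlabeled part of the objective described above can be sketched as a masked cross-entropy. This is a minimal illustration, not the authors' implementation: it takes already-computed class probabilities for the weak and strong views as plain Python lists, and the function name `fixmatch_unlabeled_loss` is hypothetical. In a real training loop the pseudo-label would be treated as a constant (no gradient flows through the weak view).

```python
import math

def fixmatch_unlabeled_loss(weak_probs, strong_probs, tau=0.95):
    """Mean masked cross-entropy over a batch of unlabeled examples.

    weak_probs[i] and strong_probs[i] are class-probability lists for
    the weakly and strongly augmented views of the same image.
    """
    total = 0.0
    for p_weak, p_strong in zip(weak_probs, strong_probs):
        confidence = max(p_weak)
        if confidence >= tau:                       # confidence gate
            pseudo_label = p_weak.index(confidence) # hard (argmax) label
            total += -math.log(p_strong[pseudo_label])
    # Examples below the threshold contribute zero but still count
    # in the denominator.
    return total / len(weak_probs)

weak = [[0.97, 0.03], [0.60, 0.40]]
strong = [[0.50, 0.50], [0.10, 0.90]]
loss = fixmatch_unlabeled_loss(weak, strong)
```

Only the first example passes the 0.95 gate here, so the second one is ignored no matter what the model predicts on its strong view.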

SimCLRv2 (Chen et al. 2020): self-supervised pretraining plus self-training distillation

Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton's NeurIPS 2020 paper, "Big Self-Supervised Models Are Strong Semi-Supervised Learners," applied a three-stage recipe in which self-training appears in the final stage ^[6]. The pipeline is:

Self-supervised pretraining of a large ResNet using SimCLRv2 contrastive learning on unlabeled ImageNet.
Supervised fine-tuning of the pretrained model on the small labeled subset (1% or 10% of ImageNet labels).
Self-training distillation: use the fine-tuned big model as a teacher to generate soft pseudo-labels on all unlabeled examples, and distill those labels into a smaller student.

The result was 73.9% top-1 ImageNet accuracy with only 1% of labels (using ResNet-50 as the student), a tenfold improvement in label efficiency over the previous state of the art. The two main lessons were that bigger pretrained models are more label-efficient, and that the big model can be distilled into a small one with little loss in accuracy. The distillation step is essentially a one-round self-training procedure with a fixed teacher, which puts SimCLRv2 in the same algorithmic family as Pseudo-Label and Noisy Student.

Self-training for neural machine translation: back-translation

In neural machine translation, self-training takes a specialized form known as back-translation, introduced by Sennrich, Haddow, and Birch in their 2016 ACL paper "Improving Neural Machine Translation Models with Monolingual Data" ^[12]. Given a translation system that maps source language A to target language B, back-translation works as follows:

Train a reverse model from B to A using the available parallel data.
Use the reverse model to translate a large monolingual B-language corpus into pseudo-source A-language sentences.
Pair each pseudo-source A sentence with its real B sentence and add this synthetic parallel corpus to the training data of the forward (A to B) model.

Back-translation is self-training with a twist: the pseudo-labels are inputs (synthetic source sentences) rather than outputs (target sentences). The technique gave +2.8 to +3.7 BLEU on WMT 15 English-German and +2.1 to +3.4 BLEU on the low-resource IWSLT 14 Turkish-English benchmark, and it has been a standard ingredient of competitive machine translation systems ever since. Iterative back-translation, in which the forward and reverse models alternate training rounds, is a direct analogue of the iterated teacher-student loop in Noisy Student.
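
The data flow of the three steps can be illustrated with a deliberately tiny example. Everything here is a stand-in: the "reverse model" is a hypothetical word-for-word lexicon lookup rather than a trained translation system, and the point is only to show that the synthetic side of each pair is the source sentence while the target side stays real.

```python
def back_translate(reverse_model, monolingual_target):
    """Build synthetic (pseudo_source, real_target) training pairs."""
    return [(reverse_model(sentence), sentence)
            for sentence in monolingual_target]

# Toy B -> A "model" (an assumption for illustration only).
TOY_LEXICON = {"das": "the", "haus": "house", "ist": "is", "klein": "small"}

def toy_reverse_model(sentence):
    return " ".join(TOY_LEXICON.get(word, word) for word in sentence.split())

parallel = [("the house", "das haus")]          # real A-B pairs
monolingual = ["das haus ist klein"]            # target-side text only
synthetic = back_translate(toy_reverse_model, monolingual)
training_data = parallel + synthetic            # data for the A -> B model
```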

What are the main variants of self-training?

The canonical self-training loop has been extended in many directions. The table below summarizes the most influential variants.

Variant	Year	Authors	Key idea	Domain
Yarowsky algorithm	1995	Yarowsky	Decision-list classifier with one-sense-per-collocation and one-sense-per-discourse constraints	NLP (word sense)
Self-training for parsing	2006	McClosky, Charniak, Johnson	Reranker breaks symmetry of self-training loop	NLP (parsing)
Pseudo-Label	2013	Lee	Time-ramped weighting on pseudo-label loss for deep nets	Vision
Back-translation	2016	Sennrich, Haddow, Birch	Reverse model produces synthetic source sentences	Machine translation
Mean Teacher	2017	Tarvainen, Valpola	Teacher is exponential moving average of student weights	Vision
Noisy Student	2020	Xie, Luong, Hovy, Le	Larger student, strong noise injection on student, iterated	Vision (ImageNet)
FixMatch	2020	Sohn et al.	Weak augmentation produces pseudo-label, strong augmentation receives it	Vision
SimCLRv2 distillation	2020	Chen, Kornblith, Swersky, Norouzi, Hinton	Big self-supervised teacher distilled into small student	Vision
Curriculum Labeling	2021	Cascante-Bonilla, Tan, Qi, Ordonez	Quantile-based threshold over training rounds	Vision
STaR	2022	Zelikman, Wu, Mu, Goodman	Self-train an LLM on its own correct chain-of-thought rationales	LLMs
Constitutional AI / RLAIF	2022	Bai et al. (Anthropic)	LLM critiques and revises its own outputs against a constitution	LLMs

Mean Teacher: an alternative form of self-training

Antti Tarvainen and Harri Valpola's 2017 NeurIPS paper, "Mean Teachers Are Better Role Models," introduced a particularly influential variant in which the teacher is not a separate model but the exponential moving average (EMA) of the student's own weights over time ^[13]. The student is trained with the usual supervised loss on labeled data plus a consistency loss that pushes the student's predictions on unlabeled data toward the EMA-teacher's predictions. Mean Teacher is technically a consistency-regularization method rather than a strict pseudo-labeling method, but it sits in the same algorithmic neighbourhood: the EMA-teacher's predictions function as soft, slowly evolving pseudo-labels. Mean Teacher reached 4.35% error on SVHN with only 250 labels and was for several years the strongest semi-supervised baseline before FixMatch.
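
The two ingredients described above, the EMA teacher and the consistency term, can be sketched as follows. This is a simplified illustration under assumptions: parameters are a flat dictionary of floats, the consistency cost is a mean squared error between probability vectors, and the `decay` default and function names are hypothetical choices.

```python
def ema_update(teacher, student, decay=0.99):
    """Move each teacher weight a small step toward the student weight."""
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}

def consistency_loss(student_probs, teacher_probs):
    """Mean squared difference between student and teacher predictions."""
    return sum((s - t) ** 2
               for s, t in zip(student_probs, teacher_probs)) / len(student_probs)

teacher = ema_update({"w": 1.0}, {"w": 0.0}, decay=0.9)
penalty = consistency_loss([1.0, 0.0], [0.5, 0.5])
```

Because the teacher changes slowly, its predictions act as targets that lag behind the student, which damps the feedback that makes plain pseudo-labeling unstable.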

Curriculum Labeling and confidence-aware variants

A recurring theme in modern self-training is that the confidence threshold should not be fixed throughout training. Curriculum Labeling (Cascante-Bonilla, Tan, Qi, Ordonez, AAAI 2021) ^[14] uses a quantile-based threshold so that, for example, the top 20% of pseudo-labels are added in round 1, the top 40% in round 2, and so on. This curriculum-style schedule mirrors the Yarowsky algorithm's tendency to start with the most certain examples and expand to harder ones over time. UPS (Uncertainty-aware Pseudo-Label Selection) ^[15] adds Monte Carlo dropout to estimate epistemic uncertainty and selects pseudo-labels with both high probability and low uncertainty.
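
A quantile-style schedule like the one described (top 20% in round 1, top 40% in round 2, and so on) can be sketched with a simple ranking. The function name `curriculum_select`, the tuple layout, and the integer-percent arithmetic are assumptions made for this illustration.

```python
def curriculum_select(scored, round_index, step_percent=20):
    """Keep the most confident pseudo-labels for the given round.

    scored: list of (example, pseudo_label, confidence) tuples.
    round_index: 1 for the first round, 2 for the second, and so on.
    """
    percent = min(100, step_percent * round_index)
    k = (percent * len(scored)) // 100          # integer math avoids float drift
    ranked = sorted(scored, key=lambda item: item[2], reverse=True)
    return ranked[:k]

scored = [("a", 0, 0.90), ("b", 1, 0.60), ("c", 0, 0.99),
          ("d", 1, 0.70), ("e", 0, 0.80)]
first_round = curriculum_select(scored, round_index=1)
second_round = curriculum_select(scored, round_index=2)
```

Unlike a fixed probability threshold, the cut-off here is relative, so a fixed share of the pool is admitted each round even if the model's raw confidences are poorly calibrated.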

How does self-training compare to co-training and other semi-supervised methods?

Self-training is one of several major families of semi-supervised learning. The clearest contrast is with co-training (Blum and Mitchell, 1998), which trains two classifiers on two conditionally independent and individually sufficient "views" of each example and lets each classifier's confident predictions on unlabeled data become training labels for the other ^[16]. Self-training uses a single model and a single view, so it is simpler to apply but lacks the error-correcting cross-check that co-training's second view provides. Tri-training (Zhou and Li, 2005) reaches a similar cross-checking effect without requiring two views by training three classifiers on bootstrap samples and adding a pseudo-label only when two of the three agree ^[17]. The table below contrasts the major families.

Method	Mechanism	Requires multiple views?	Requires unlabeled loss?	Failure mode
Self-training (pseudo-labeling)	Train, predict, retrain on confident predictions	No	No (uses standard supervised loss on pseudo-labels)	Confirmation bias on noisy pseudo-labels
Co-training (Blum and Mitchell 1998) ^[16]	Two models on two conditionally independent views label data for each other	Yes (two sufficient views)	No	Breaks down when views are not conditionally independent
Tri-training (Zhou and Li 2005) ^[17]	Three models; an unlabeled example is labeled when two agree	No (uses bootstrap samples instead of views)	No	Less label noise than self-training but more computation
Consistency regularization (Pi-Model, Mean Teacher, FixMatch)	Penalize differences between predictions on perturbed copies of the same input	No	Yes (consistency loss)	Sensitive to choice of perturbation
Generative models (mixture models, deep generative SSL)	Model joint distribution $p(x, y)$ using unlabeled data	No	Yes (likelihood term)	Model misspecification can hurt
Graph-based label propagation	Spread labels through a similarity graph	No (graph encodes structure)	Yes (smoothness loss)	Requires meaningful similarity metric
Entropy minimization	Add a low-entropy preference on unlabeled predictions	No	Yes (entropy term)	Encourages overconfident predictions
Self-supervised pretraining	Pretrain on a pretext task without labels, then fine-tune	No	Yes (pretext loss)	Pretext task may not transfer

Many modern systems combine families. FixMatch combines self-training (pseudo-labeling on weak augmentation) with consistency regularization (matching prediction on strong augmentation). SimCLRv2 combines self-supervised pretraining with self-training distillation. The boundary between "pure self-training" and "hybrid semi-supervised method" is blurry in practice.
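
The agreement rule that distinguishes tri-training from plain self-training can be sketched in a few lines. This is a simplified illustration, assuming each classifier's predictions on the unlabeled pool are already available as a list. One common reading of the rule, used here, is that each classifier receives the examples on which the other two agree. The function names are hypothetical.

```python
def agreed_labels(preds_j, preds_k):
    """Indices and labels on which two classifiers agree."""
    return [(i, yj) for i, (yj, yk) in enumerate(zip(preds_j, preds_k))
            if yj == yk]

def tri_training_round(preds):
    """For each of three classifiers, collect pseudo-labels from the other two.

    preds: list of three prediction lists over the same unlabeled pool.
    """
    result = {}
    for target in range(3):
        others = [preds[i] for i in range(3) if i != target]
        result[target] = agreed_labels(others[0], others[1])
    return result

predictions = [[0, 1, 1], [0, 1, 0], [1, 1, 0]]
new_labels = tri_training_round(predictions)
```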

Why does self-training work? Theoretical understanding

A persistent challenge for self-training has been that it lacked a theoretical justification beyond linear models for a long time. Why should retraining on a model's own predictions improve the model? Naively, the new pseudo-labels carry no information that was not already in the original predictions, so the gradient updates should average to zero.

Wei, Shen, Chen, Ma (2021)

Colin Wei, Kendrick Shen, Yining Chen, and Tengyu Ma's ICLR 2021 oral paper, "Theoretical Analysis of Self-Training with Deep Networks on Unlabeled Data," gave the first theoretical analysis of self-training that applies to deep neural networks ^[9]. Their core idea is that self-training works when the data satisfy an expansion property: any low-probability subset of the data has a neighborhood (the points reachable by a small input perturbation) whose probability is large relative to the subset itself. They also assume that neighborhoods of examples from different classes barely overlap, so that nearby points tend to belong to the same class. Under these assumptions, the consistency loss imposed by self-training ("the prediction on a perturbed copy of an input should match the original prediction") propagates labels from the labeled set outward through neighborhoods, eventually covering the entire data manifold.

Wei et al. showed that this analysis applies to neural networks, not just linear models, because the loss landscape is no longer the bottleneck once the expansion assumption is in place. Their result also covers consistency regularization (FixMatch, Mean Teacher) and unsupervised domain adaptation as special cases, providing a unified theoretical framework. The paper is now the standard theoretical reference for why deep self-training works.

The Wei et al. analysis is consistent with the long-standing folk wisdom in self-training: the labeled examples must be sufficiently representative of the underlying classes that nearby unlabeled examples are correctly classified by the initial model. When this holds, the loop expands the labeled neighbourhoods. When it does not, the loop drifts.

Earlier theory

Before the deep learning era, theoretical analysis of self-training was restricted to linear models, decision lists, and mixture models. Abney's 2004 paper, "Understanding the Yarowsky Algorithm," reformulated the Yarowsky procedure as approximate optimization of a log-likelihood ^[10] and clarified what assumptions it implicitly relied on. Haffari and Sarkar's analyses ^[18] gave conditions under which self-training converges. The literature on the related EM algorithm provided additional results because EM with unlabeled data can be viewed as a soft version of self-training.

High-dimensional and overparameterized analyses

More recent analyses have examined self-training in high-dimensional Gaussian mixture settings, showing that pseudo-labeling can either help or hurt depending on the signal-to-noise ratio and the relative sizes of the labeled and unlabeled sets ^[19]. The general finding is that self-training is most beneficial when the initial model is already reasonably good (so pseudo-labels are mostly correct) and when the unlabeled set is large enough to substantially expand the effective training distribution. When the initial model is weak, self-training can amplify rather than correct its errors.

What is self-distillation?

Self-distillation is the special case of self-training in which the teacher and student have the same architecture (or even identical model classes). The idea was given a name and a careful empirical study by Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar in their 2018 ICML paper, "Born-Again Neural Networks" ^[20]. Their procedure trains a teacher in the usual way, then trains a student of the same architecture using the teacher's outputs as soft targets, then trains a third generation using the second's outputs, and so on. Surprisingly, each generation slightly outperforms the previous one on CIFAR-10 and CIFAR-100, with the BAN-DenseNets reaching 3.5% error on CIFAR-10 and 15.5% error on CIFAR-100.

Self-distillation can be viewed through several lenses:

As an ensembling procedure that compresses the implicit ensemble defined by training noise into a single model.
As a regularizer that smooths the training targets and avoids overfitting to hard labels.
As a knowledge transfer procedure where the dark knowledge in the teacher's output distribution carries information about class similarity that is missing from one-hot labels.

Self-distillation is widely used in practice. It is one of the components of SimCLRv2 (the supervised fine-tuned big model is distilled into the smaller deployment model) and of Noisy Student (later iterations are essentially self-distillation with noise). It is also used heavily in knowledge distillation pipelines for compressing large models into smaller deployable ones.
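
The soft-target training signal behind these lenses can be written as a cross-entropy between the teacher's and the student's output distributions. The sketch below is an illustration under assumptions: it works on raw logit lists, and the optional `temperature` parameter is a common knowledge-distillation convention rather than something the text above specifies.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Cross-entropy of the student against the teacher's soft targets."""
    p = softmax(teacher_logits, temperature)    # soft targets ("dark knowledge")
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

matched = distillation_loss([2.0, 0.0], [2.0, 0.0])
mismatched = distillation_loss([2.0, 0.0], [0.0, 2.0])
```

The loss is smallest when the student reproduces the teacher's full distribution, including the relative probabilities of the wrong classes, which is the information one-hot labels discard.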

How is self-training used in large language models?

The most active area of self-training research in 2025 and 2026 is the post-training of large language models. Several distinct lines of work apply the self-training pattern in different ways.

STaR (Zelikman et al. 2022): self-taught reasoner

Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman's NeurIPS 2022 paper, "STaR: Bootstrapping Reasoning with Reasoning," applied self-training to chain-of-thought reasoning ^[8]. The loop is:

Prompt the LLM with a few examples of question-answer-rationale triples to generate rationales for new questions in a large unlabeled dataset.
Compare the model's final answer to a known correct answer (which is available in many reasoning datasets even without rationales).
Keep the rationales whose final answers are correct.
Fine-tune the LLM on the kept rationales.
To handle questions the model gets wrong, apply rationalization: prompt the model with the correct answer and ask it to construct a rationale that arrives at it. Add successful rationalizations to the training set.
Repeat.

STaR showed that this loop substantially improves reasoning accuracy on math word problems and CommonsenseQA; on CommonsenseQA the self-trained model performed comparably to a fine-tuned model roughly 30 times larger ^[8]. The STaR pattern has since been generalized into many "self-improvement" pipelines for LLMs, including ReST (Reinforced Self-Training), V-STaR, and various rejection-sampling fine-tuning procedures used in the post-training of frontier models.
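
One round of the filtering logic in the loop above can be sketched with the language model replaced by stubs. The callables `generate` and `rationalize` and the toy functions below are hypothetical stand-ins for LLM calls, so this shows only the data selection, not the fine-tuning step.

```python
def star_round(problems, generate, rationalize):
    """Collect rationales whose final answer matches the known answer.

    problems: list of (question, correct_answer) pairs.
    generate(question) -> (rationale, answer)
    rationalize(question, correct_answer) -> (rationale, answer)
    """
    kept = []
    for question, gold in problems:
        rationale, answer = generate(question)
        if answer != gold:
            # Rationalization: retry with the correct answer as a hint.
            rationale, answer = rationalize(question, gold)
        if answer == gold:
            kept.append((question, rationale, gold))
    return kept

# Toy stand-ins for model calls (assumptions for illustration).
def toy_generate(question):
    canned = {"2+2": ("2 plus 2 is 4", "4"), "3*3": ("3 plus 3 is 6", "6")}
    return canned[question]

def toy_rationalize(question, gold):
    return (f"{question} works out to {gold}", gold)

problems = [("2+2", "4"), ("3*3", "9")]
finetune_set = star_round(problems, toy_generate, toy_rationalize)
```

The answer check plays the role that the confidence threshold plays in classical self-training: it is the gate that decides which self-generated examples are trusted.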

Constitutional AI and RLAIF (Anthropic 2022)

Anthropic's 2022 paper, "Constitutional AI: Harmlessness from AI Feedback" by Yuntao Bai and colleagues, applied self-training to alignment ^[7]. The supervised stage works as follows:

Sample a response from an initial helpful-only model.
Prompt the model to critique its own response against a written constitution (a list of principles such as "please choose the response that is least harmful and most ethical").
Prompt the model to revise the response in light of the critique.
Fine-tune the original model on the revised responses.

This is a form of self-training in which the pseudo-labels are not class predictions but improved completions, and the "correctness" signal comes from the model's own application of the constitution. The reinforcement-learning stage, RLAIF (Reinforcement Learning from AI Feedback), then trains a preference model on AI-generated comparisons and uses it as a reward signal in PPO-style fine-tuning, again replacing the human labels in RLHF with AI-generated ones. Constitutional AI was the proof of concept that AI feedback could substitute for substantial fractions of human feedback in the alignment pipeline. RLAIF has since become a standard ingredient in frontier model post-training, including in subsequent Claude models.

Synthetic data and rejection sampling fine-tuning

A broader pattern in modern LLM post-training is rejection sampling fine-tuning (RFT), which is essentially self-training applied to instruction following. The loop is:

Sample many candidate completions from the current model.
Score the completions using a reward model, an automatic verifier (for code or math), or a separate judge model.
Keep the top-scoring completions and discard the rest.
Fine-tune the model on the kept completions.
Repeat.

This procedure has been used in the post-training of LLaMA, Qwen, DeepSeek, and other open and closed frontier models. It is essentially the STaR loop applied to general instruction following rather than to chain-of-thought reasoning specifically. The success of these methods has reopened the question of how much improvement is available from self-distillation alone, without new external data.
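
The sample, score, and keep steps can be sketched as a best-of-n filter. This is a minimal illustration with hypothetical names: `sample` and `score` stand in for the model and for a reward model or verifier, and the `min_score` cut-off is an added assumption so that prompts with no acceptable candidate contribute nothing.

```python
def rejection_sample(prompts, sample, score, n=4, keep=1, min_score=0.5):
    """Build a fine-tuning set from the best-scoring sampled completions."""
    dataset = []
    for prompt in prompts:
        candidates = [sample(prompt, i) for i in range(n)]
        ranked = sorted(candidates, key=lambda c: score(prompt, c), reverse=True)
        for completion in ranked[:keep]:
            if score(prompt, completion) >= min_score:   # drop weak winners
                dataset.append((prompt, completion))
    return dataset

# Toy stand-ins (assumptions): the "model" guesses 3, 4, 5, 6 in turn and
# the "verifier" checks simple additions exactly.
def toy_sample(prompt, i):
    return str(3 + i)

def toy_score(prompt, completion):
    a, b = prompt.split("+")
    return 1.0 if int(completion) == int(a) + int(b) else 0.0

kept = rejection_sample(["2+3", "1+1"], toy_sample, toy_score)
```

With an exact verifier the kept set contains only correct completions, whereas with a learned reward model the same loop inherits that model's blind spots.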

What are the risks of self-training?

The central failure mode of self-training is confirmation bias. Eric Arazo, Diego Ortego, Paul Albert, Noel O'Connor, and Kevin McGuinness analyzed this carefully in their 2020 IJCNN paper, "Pseudo-Labeling and Confirmation Bias in Deep Semi-Supervised Learning" ^[21]. They showed that naive pseudo-labeling is prone to a vicious cycle: incorrect pseudo-labels reinforce themselves, the model becomes increasingly confident in its mistakes, and performance degrades sharply, particularly under class imbalance where minority classes can be erased entirely. Their proposed mitigations include mixup augmentation, ensuring a minimum number of labeled samples per minibatch, and oversampling underrepresented classes ^[21].

A related set of failure modes is summarized below.

Failure mode	Mechanism	Common mitigation
Confirmation bias	Wrong pseudo-labels are reinforced in later rounds and become harder to correct	Strong noise injection, confidence thresholding, mixup
Class collapse	Majority classes capture all pseudo-labels; minority classes vanish	Class-balanced selection, per-class thresholds (FlexMatch)
Overconfident calibration	Deep networks are often overconfident, so a fixed confidence threshold admits many wrong pseudo-labels	Temperature scaling, MC dropout, ensemble teachers
Domain shift	Unlabeled data has a different distribution from labeled data	Domain-adversarial training, importance weighting
Premature commitment	Hard pseudo-labels lock in early errors	Soft pseudo-labels, time-ramped loss weighting (Pseudo-Label)
Reward hacking (in LLM self-training)	Model learns to game the judge or verifier rather than improve true performance	Process supervision, multiple judges, holdout evaluation
Mode collapse (in synthetic data loops)	Repeated self-training on synthetic data narrows the output distribution	Curate human data into the loop, control diversity, test on held-out distributions
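
As one example from the table, the interaction between temperature scaling and a confidence threshold can be sketched as below. The logits, the threshold of 0.95, and the temperature of 2.0 are arbitrary illustrative values; in practice the temperature would typically be fitted on held-out labeled data.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits divided by a temperature (T > 1 softens the output)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def passes_threshold(logits: List[float], threshold: float, temperature: float = 1.0) -> bool:
    """True if the top-class probability clears the pseudo-label threshold."""
    return max(softmax(logits, temperature)) >= threshold


logits = [5.0, 1.0, 0.0]
print(passes_threshold(logits, threshold=0.95))                   # uncalibrated
print(passes_threshold(logits, threshold=0.95, temperature=2.0))  # softened
```

With these numbers the uncalibrated prediction clears the threshold and the temperature-scaled one does not, so the same example is accepted or rejected as a pseudo-label depending on calibration.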

The 2024 Nature paper by Shumailov et al. on "model collapse" ^[22] extended this concern to generative models trained on model-generated data: models trained recursively on their own outputs progressively lose the tails of the original distribution and converge to a narrower one. Alemohammad et al. describe a closely related phenomenon as model autophagy disorder (MAD) ^[23]. Both can be read as the generative-model analogue of the confirmation bias that Arazo et al. identified in image classification.
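
The narrowing effect can be illustrated with a toy simulation that repeatedly fits a one-dimensional Gaussian to samples drawn from the previous fit. This is a simplified sketch of the mechanism, not a reproduction of the experiments in either paper; the function name `recursive_gaussian_fit`, the sample size of 20, and the 300 generations are assumptions chosen to make the effect visible.

```python
import random
import statistics
from typing import List


def recursive_gaussian_fit(generations: int, sample_size: int, seed: int = 0) -> List[float]:
    """Refit a Gaussian to its own samples and record the fitted std each round."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: stands in for the real data distribution
    history = [sigma]
    for _ in range(generations):
        # Each generation sees only data produced by the previous generation.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)  # maximum-likelihood estimate
        history.append(sigma)
    return history


history = recursive_gaussian_fit(generations=300, sample_size=20)
print(f"std at generation 0:   {history[0]:.3f}")
print(f"std at generation 300: {history[-1]:.3e}")
```

Each refit slightly underestimates the spread on average and estimation noise accumulates across generations, so the fitted standard deviation drifts toward zero: the tails disappear first, then most of the distribution.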

Practical lessons

The self-training literature, taken as a whole, suggests several rules of thumb:

Never run self-training without a held-out evaluation set that contributes no pseudo-labels. Use it to detect when the loop starts hurting rather than helping.
Inject noise into the student. Noisy Student and FixMatch both rely on this. Without noise, the student has no reason to generalize beyond the teacher.
Keep labeled data in every minibatch. Without it, gradient updates can drift away from the true distribution.
Use class-balanced selection. Otherwise minority classes can collapse.
Be conservative early. Time-ramped loss weighting (Pseudo-Label) and curriculum thresholds (Curriculum Labeling) both encode this principle.
Prefer soft labels. They preserve uncertainty and are less prone to lock-in.
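
Several of these rules of thumb (a conservative threshold, class-balanced selection, soft labels) can be combined in one selection step. The sketch below is one possible arrangement, not a recipe taken from any of the cited papers; `select_pseudo_labels`, the 0.9 threshold, and the per-class cap of 2 are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def select_pseudo_labels(
    probs: List[List[float]], threshold: float, max_per_class: int
) -> List[Tuple[int, List[float]]]:
    """Pick confident unlabeled examples, capped per predicted class.

    Returns (example_index, soft_label) pairs; the soft label is the full
    probability vector rather than a one-hot argmax.
    """
    by_class: Dict[int, List[Tuple[float, int]]] = defaultdict(list)
    for idx, p in enumerate(probs):
        confidence = max(p)
        if confidence >= threshold:
            by_class[p.index(confidence)].append((confidence, idx))

    selected = []
    for cls in sorted(by_class):
        # Keep only the most confident examples of each class so that a
        # majority class cannot dominate the pseudo-labeled set.
        top = sorted(by_class[cls], reverse=True)[:max_per_class]
        selected.extend((idx, probs[idx]) for _, idx in top)
    return selected


probs = [
    [0.97, 0.03], [0.99, 0.01], [0.96, 0.04], [0.98, 0.02],  # class 0 dominates
    [0.08, 0.92], [0.55, 0.45],
]
chosen = select_pseudo_labels(probs, threshold=0.9, max_per_class=2)
print([idx for idx, _ in chosen])  # prints [1, 3, 4]
```

Four examples clear the threshold for class 0 but only the two most confident are kept, the single confident class 1 example survives, and the ambiguous last example is left unlabeled.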

Where is self-training used?

Self-training appears in many production systems and research benchmarks. A non-exhaustive list:

Domain	Example	System
Word sense disambiguation	Yarowsky algorithm and successors	NLP toolkits
Image classification	Noisy Student, FixMatch, SimCLRv2	EfficientNet-L2 (state-of-the-art ImageNet at release)
Object detection	Self-training detectors (STAC, Soft Teacher)	Modern COCO models
Semantic segmentation	Pseudo-label-based SSL for dense prediction	DeepLab variants
Speech recognition	Pseudo-labeling and noisy student for ASR	wav2vec 2.0 plus self-training
Machine translation	Back-translation, iterative back-translation	Most competitive WMT systems
Parsing	McClosky-Charniak-Johnson reranker self-training	Charniak parser
Recommendation	Pseudo-labeled implicit feedback	Industrial recommenders
LLM reasoning	STaR, V-STaR, ReST	Stanford and DeepMind systems
LLM alignment	Constitutional AI, RLAIF	Claude and similar assistants
Code generation	RFT on verified completions	DeepSeek-Coder and similar
Robotics	Self-training on simulated trajectories	Sim-to-real pipelines

The pattern of "use the model to generate training data for itself, then verify, then retrain" appears across all of these domains. The verification step (a confidence threshold, a separate verifier, a constitution, an automatic test) is what distinguishes useful self-training from runaway confirmation bias.

How does self-training relate to other concepts?

Self-training is closely connected to a number of adjacent ideas in machine learning.

Knowledge distillation: standard knowledge distillation transfers knowledge from a large teacher to a small student using soft labels. Self-distillation is the special case where teacher and student share an architecture. Self-training can be viewed as distillation in which the teacher is an earlier snapshot of the same training procedure and its predictions on unlabeled data serve as the student's targets.
Active learning: active learning asks a human to label the most uncertain examples, while self-training assigns pseudo-labels to the least uncertain ones. The two procedures are complementary and are sometimes combined.
Transfer learning: transfer learning brings knowledge from a different labeled dataset; self-training brings information from unlabeled data of the same task.
Self-supervised learning: self-supervised learning constructs surrogate labels from the structure of the input itself (for example, predicting masked tokens). Self-training generates surrogate labels from the model's own predictions about the actual task. The two are complementary and are often stacked, as in SimCLRv2.
EM algorithm: the soft-pseudo-label form of self-training is closely related to the Expectation-Maximization algorithm for mixture models, where the E-step computes soft assignments (analogous to soft pseudo-labels) and the M-step updates parameters (analogous to retraining).
Generative models in semi-supervised learning: generative SSL methods (variational mixtures, GAN-based SSL, deep generative models) model $p(x)$ jointly with $p(y \mid x)$ and use unlabeled data to fit the marginal. Self-training models only $p(y \mid x)$ and uses unlabeled data implicitly through pseudo-labels.
Co-training and tri-training: see the comparison table above. Self-training uses one model; co-training uses two models on two views; tri-training uses three models on bootstrap samples.
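
The EM connection in the list above can be made concrete with a small sketch. It assumes a two-component one-dimensional Gaussian mixture with equal priors and unit variance, so that only the class means are learned; the function name `em_self_training` and the toy data are hypothetical.

```python
import math
from typing import List, Tuple


def em_self_training(
    labeled: List[Tuple[float, int]], unlabeled: List[float], steps: int = 20
) -> Tuple[float, float]:
    """Estimate two class means from labeled points plus soft-assigned unlabeled points."""
    # Initialize each class mean from the labeled data only.
    mu = [
        sum(x for x, y in labeled if y == k) / sum(1 for _, y in labeled if y == k)
        for k in (0, 1)
    ]
    for _ in range(steps):
        # E-step: responsibilities play the role of soft pseudo-labels.
        resp = []
        for x in unlabeled:
            w0 = math.exp(-0.5 * (x - mu[0]) ** 2)
            w1 = math.exp(-0.5 * (x - mu[1]) ** 2)
            resp.append(w1 / (w0 + w1))  # P(class 1 | x)
        # M-step: "retrain" on labeled points (weight 1) and soft pseudo-labels.
        num = [0.0, 0.0]
        den = [0.0, 0.0]
        for x, y in labeled:
            num[y] += x
            den[y] += 1.0
        for x, r in zip(unlabeled, resp):
            num[0] += (1.0 - r) * x
            den[0] += 1.0 - r
            num[1] += r * x
            den[1] += r
        mu = [num[0] / den[0], num[1] / den[1]]
    return mu[0], mu[1]


labeled = [(-1.0, 0), (1.0, 1)]
unlabeled = [-2.2, -1.9, -2.0, -2.1, 2.1, 1.8, 2.0, 2.2]
mu0, mu1 = em_self_training(labeled, unlabeled)
print(f"class means: {mu0:.2f}, {mu1:.2f}")
```

Starting from means estimated on two labeled points, the unlabeled clusters pull each mean outward. Replacing the responsibilities with a hard argmax would turn the same loop into hard-label self-training.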

Modern relevance

In 2026, self-training is no longer a single technique but a design pattern that appears in nearly every part of modern machine learning. The Noisy Student, FixMatch, and SimCLRv2 papers established that self-training can drive state-of-the-art results in vision when paired with sufficient unlabeled data and noise. The STaR and Constitutional AI papers extended the same pattern to LLM reasoning and alignment. The various rejection-sampling fine-tuning recipes used in frontier model post-training are essentially self-training loops with verification gates.

The theoretical work by Wei, Shen, Chen, and Ma ^[9] gave self-training a respectable footing in the deep-learning era, and the practical work by Arazo et al. ^[21] and Shumailov et al. ^[22] mapped out the failure modes carefully enough that practitioners can reason about them in advance. The result is that self-training has moved from a heuristic that sometimes works to a design pattern that is well understood and widely used, with known limitations and known mitigations.

Where self-training is going next is unclear. The most active questions are whether iterated self-training on synthetic data can sustain progress without external data input (the model collapse literature suggests probably not without curation), how to combine self-training with verifiable reward signals for chain-of-thought training, and how to scale RLAIF-style self-improvement to harder reasoning and agentic tasks. All three questions are central to current LLM research, and self-training in some form is part of every proposed answer.

Explain Like I'm 5 (ELI5)

Imagine you're learning to recognize different types of animals. At first, you only know a few animals (the labeled data), but you see many more animals you don't know (the unlabeled data). In self-training, you first learn from the animals you know, then you start making guesses about the animals you don't know. If you're very sure about some of your guesses, you add them to the animals you know and keep learning. You repeat this until you stop getting better.

The smart trick is to add only the guesses you're really, really sure about. The dangerous part is that if you guess wrong on some animals and add those wrong guesses to your list, you might keep getting more confident in your wrong answers. That's called confirmation bias, and it's the main reason this kind of learning sometimes goes badly.

A more recent twist is asking the model to guess and then check itself. In language models, this looks like asking the AI to write out its reasoning, keeping only the answers where the reasoning leads to the right final answer, and then training on those good reasoning examples. That's how STaR works. Constitutional AI uses a similar idea, except the AI checks its answers against a written list of rules instead of an answer key. In both cases the AI helps train itself, but with a check that is meant to keep it from learning the wrong things.

References

Scudder, H. J. (1965). "Probability of Error of Some Adaptive Pattern-Recognition Machines." *IEEE Transactions on Information Theory*, 11(3), 363-371. https://ieeexplore.ieee.org/document/1053799
Yarowsky, D. (1995). "Unsupervised Word Sense Disambiguation Rivaling Supervised Methods." *Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics*, 189-196. https://www.cis.upenn.edu/~danroth/Teaching/CS598-05/Papers/Yarowsky-ACL95.pdf
Lee, D.-H. (2013). "Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks." *ICML 2013 Workshop on Challenges in Representation Learning*. http://deeplearning.net/wp-content/uploads/2013/03/pseudo_label_final.pdf
Xie, Q., Luong, M.-T., Hovy, E., and Le, Q. V. (2020). "Self-Training with Noisy Student Improves ImageNet Classification." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, 10687-10698. https://arxiv.org/abs/1911.04252
Sohn, K., Berthelot, D., Carlini, N., Zhang, Z., Zhang, H., Raffel, C. A., Cubuk, E. D., Kurakin, A., and Li, C.-L. (2020). "FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence." *Advances in Neural Information Processing Systems (NeurIPS)*, 33, 596-608. https://arxiv.org/abs/2001.07685
Chen, T., Kornblith, S., Swersky, K., Norouzi, M., and Hinton, G. E. (2020). "Big Self-Supervised Models Are Strong Semi-Supervised Learners." *Advances in Neural Information Processing Systems (NeurIPS)*, 33, 22243-22255. https://arxiv.org/abs/2006.10029
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." *arXiv preprint arXiv:2212.08073*. https://arxiv.org/abs/2212.08073
Zelikman, E., Wu, Y., Mu, J., and Goodman, N. D. (2022). "STaR: Bootstrapping Reasoning with Reasoning." *Advances in Neural Information Processing Systems (NeurIPS)*. https://arxiv.org/abs/2203.14465
Wei, C., Shen, K., Chen, Y., and Ma, T. (2021). "Theoretical Analysis of Self-Training with Deep Networks on Unlabeled Data." *Proceedings of the 9th International Conference on Learning Representations (ICLR)*, oral. https://openreview.net/forum?id=rC8sJ4i6kaH
Abney, S. (2004). "Understanding the Yarowsky Algorithm." *Computational Linguistics*, 30(3), 365-395. https://aclanthology.org/J04-3002/
McClosky, D., Charniak, E., and Johnson, M. (2006). "Effective Self-Training for Parsing." *Proceedings of the Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (NAACL)*, 152-159. https://aclanthology.org/N06-1020/
Sennrich, R., Haddow, B., and Birch, A. (2016). "Improving Neural Machine Translation Models with Monolingual Data." *Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL)*, 86-96. https://aclanthology.org/P16-1009/
Tarvainen, A., and Valpola, H. (2017). "Mean Teachers Are Better Role Models: Weight-Averaged Consistency Targets Improve Semi-Supervised Deep Learning Results." *Advances in Neural Information Processing Systems (NIPS)*, 30, 1195-1204. https://arxiv.org/abs/1703.01780
Cascante-Bonilla, P., Tan, F., Qi, Y., and Ordonez, V. (2021). "Curriculum Labeling: Revisiting Pseudo-Labeling for Semi-Supervised Learning." *Proceedings of the AAAI Conference on Artificial Intelligence*, 35(8), 6912-6920. https://arxiv.org/abs/2001.06001
Rizve, M. N., Duarte, K., Rawat, Y. S., and Shah, M. (2021). "In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-Label Selection Framework for Semi-Supervised Learning." *Proceedings of the 9th International Conference on Learning Representations (ICLR)*. https://openreview.net/forum?id=-ODN6SbiUU
Blum, A., and Mitchell, T. (1998). "Combining Labeled and Unlabeled Data with Co-Training." *Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT)*, 92-100. https://www.cs.cmu.edu/~avrim/Papers/cotrain.pdf
Zhou, Z.-H., and Li, M. (2005). "Tri-Training: Exploiting Unlabeled Data Using Three Classifiers." *IEEE Transactions on Knowledge and Data Engineering*, 17(11), 1529-1541. https://ieeexplore.ieee.org/document/1512038
Haffari, G. R., and Sarkar, A. (2007). "Analysis of Semi-Supervised Learning with the Yarowsky Algorithm." *Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI)*. https://www.researchgate.net/publication/228058928_Analysis_of_Semi-Supervised_Learning_with_the_Yarowsky_Algorithm
Mai, X., and Couillet, R. (2022). "The Role of Pseudo-Labels in Self-Training Linear Classifiers on High-Dimensional Gaussian Mixture Data." *arXiv preprint arXiv:2205.07739*. https://arxiv.org/abs/2205.07739
Furlanello, T., Lipton, Z. C., Tschannen, M., Itti, L., and Anandkumar, A. (2018). "Born-Again Neural Networks." *Proceedings of the 35th International Conference on Machine Learning (ICML)*, PMLR 80:1607-1616. https://arxiv.org/abs/1805.04770
Arazo, E., Ortego, D., Albert, P., O'Connor, N. E., and McGuinness, K. (2020). "Pseudo-Labeling and Confirmation Bias in Deep Semi-Supervised Learning." *Proceedings of the International Joint Conference on Neural Networks (IJCNN)*. https://arxiv.org/abs/1908.02983
Shumailov, I., Shumaylov, Z., Zhao, Y., Papernot, N., Anderson, R., and Gal, Y. (2024). "AI Models Collapse When Trained on Recursively Generated Data." *Nature*, 631, 755-759. https://www.nature.com/articles/s41586-024-07566-y
Alemohammad, S., Casco-Rodriguez, J., Luzi, L., Humayun, A. I., Babaei, H., LeJeune, D., Siahkoohi, A., and Baraniuk, R. G. (2024). "Self-Consuming Generative Models Go MAD." *Proceedings of the 12th International Conference on Learning Representations (ICLR)*. https://arxiv.org/abs/2307.01850
Chapelle, O., Schölkopf, B., and Zien, A. (Eds.). (2006). *Semi-Supervised Learning*. MIT Press. https://mitpress.mit.edu/9780262514125/semi-supervised-learning/
Zhu, X., and Goldberg, A. B. (2009). "Introduction to Semi-Supervised Learning." *Synthesis Lectures on Artificial Intelligence and Machine Learning*. Morgan and Claypool. https://doi.org/10.2200/S00196ED1V01Y200906AIM006
Van Engelen, J. E., and Hoos, H. H. (2020). "A Survey on Semi-Supervised Learning." *Machine Learning*, 109(2), 373-440. https://doi.org/10.1007/s10994-019-05855-6
Amini, M.-R., Feofanov, V., Pauletto, L., Devijver, E., and Maximov, Y. (2024). "Self-Training: A Survey." *Neurocomputing*, 616. https://arxiv.org/abs/2202.12040
Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., and Raffel, C. (2019). "MixMatch: A Holistic Approach to Semi-Supervised Learning." *Advances in Neural Information Processing Systems (NeurIPS)*, 32. https://arxiv.org/abs/1905.02249
Anthropic. (2022). "Constitutional AI: Harmlessness from AI Feedback." Anthropic Research Blog. https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback
Sohn, K., Zhang, Z., Li, C.-L., Zhang, H., Lee, C.-Y., and Pfister, T. (2020). "A Simple Semi-Supervised Learning Framework for Object Detection." *arXiv preprint arXiv:2005.04757*. https://arxiv.org/abs/2005.04757
