Co-training is a semi-supervised learning algorithm that leverages both labeled and unlabeled data by training two classifiers on two distinct "views" of the data, allowing them to teach each other iteratively. Introduced by Avrim Blum and Tom Mitchell in 1998, co-training is one of the foundational techniques in multi-view learning and remains widely used in settings where labeled data is scarce but unlabeled data is plentiful.
In many real-world machine learning problems, obtaining labeled data is expensive, time-consuming, or requires domain expertise. At the same time, unlabeled data is often available in large quantities. Co-training addresses this imbalance by exploiting the structure of the feature space itself.
The core idea behind co-training is that the features describing each data instance can be naturally divided into two separate subsets, called "views." Each view should contain enough information on its own to make accurate predictions. Two separate classifiers are trained, one on each view. After initial training on a small labeled dataset, each classifier labels some of the unlabeled data. The most confident predictions from one classifier are then added to the training set of the other classifier. This process repeats for several rounds, with both classifiers gradually improving as they receive more training data from each other.
Co-training belongs to the broader family of "wrapper" or "proxy-label" methods in semi-supervised learning, where pseudo-labels generated by one model are used to expand the training set for another.
Co-training was first proposed by Avrim Blum and Tom Mitchell in their 1998 paper "Combining Labeled and Unlabeled Data with Co-Training," presented at the 11th Annual Conference on Computational Learning Theory (COLT). The paper received the 10-Year Best Paper Award at the International Conference on Machine Learning (ICML) in 2008, recognizing its lasting impact on the field. As of the mid-2020s, the original paper has been cited over 5,000 times.
Blum and Mitchell's motivating application was classifying web pages as "academic course home pages" or not. They observed that web pages have two natural views: (1) the text content of the page itself, and (2) the anchor text of hyperlinks on other pages that point to it. Using just 12 labeled web pages as initial training examples, their co-training algorithm correctly classified 95% of 788 web pages. This result demonstrated the power of leveraging unlabeled data with multiple views.
The standard co-training algorithm proceeds through the following steps:
| Step | Action |
|---|---|
| 1. Feature split | Divide the input features into two disjoint sets, View 1 (V1) and View 2 (V2) |
| 2. Initialization | Train two classifiers, C1 on V1 and C2 on V2, using the small set of labeled examples |
| 3. Labeling | Apply C1 and C2 to a pool of unlabeled examples; each classifier produces predictions with confidence scores |
| 4. Selection | From C1's predictions, select the most confident positive and negative examples; do the same for C2 |
| 5. Augmentation | Add C1's confident predictions to C2's training set, and C2's confident predictions to C1's training set |
| 6. Retraining | Retrain both classifiers on their expanded training sets |
| 7. Iteration | Repeat steps 3 through 6 for a fixed number of rounds or until no more confident predictions can be made |
A critical aspect of the algorithm is that each classifier labels data for the other classifier, not for itself. This cross-labeling mechanism helps prevent the reinforcement of errors that can occur in self-training, where a single classifier labels data for its own retraining.
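The procedure can be sketched in a few lines of code. The following Python sketch is illustrative only: the naive Bayes base learners, the number of rounds, and the number of examples selected per round are arbitrary choices, and the bookkeeping is simplified relative to Blum and Mitchell's original pool-based procedure. It assumes the two views are supplied as separate feature matrices.

```python
# Minimal co-training sketch (illustrative, not the exact Blum & Mitchell procedure).
# X1_* / X2_* are the two views of the labeled and unlabeled data as NumPy arrays.
import numpy as np
from sklearn.naive_bayes import GaussianNB


def co_train(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab,
             n_rounds=10, n_confident=5):
    """Return two classifiers trained by cross-labeling confident predictions."""
    c1, c2 = GaussianNB(), GaussianNB()
    # Separate training sets so each classifier learns from the other's labels.
    train1 = (X1_lab.copy(), y_lab.copy())
    train2 = (X2_lab.copy(), y_lab.copy())
    pool = np.arange(len(X1_unlab))            # indices of still-unlabeled examples

    for _ in range(n_rounds):
        if len(pool) == 0:
            break
        c1.fit(*train1)
        c2.fit(*train2)

        # Each classifier scores the unlabeled pool using its own view only.
        conf1 = c1.predict_proba(X1_unlab[pool]).max(axis=1)
        conf2 = c2.predict_proba(X2_unlab[pool]).max(axis=1)

        # The most confident predictions from C1 go to C2's training set, and vice versa.
        top1 = pool[np.argsort(conf1)[-n_confident:]]
        top2 = pool[np.argsort(conf2)[-n_confident:]]
        train2 = (np.vstack([train2[0], X2_unlab[top1]]),
                  np.concatenate([train2[1], c1.predict(X1_unlab[top1])]))
        train1 = (np.vstack([train1[0], X1_unlab[top2]]),
                  np.concatenate([train1[1], c2.predict(X2_unlab[top2])]))

        # Remove the newly pseudo-labeled examples from the unlabeled pool.
        pool = np.setdiff1d(pool, np.concatenate([top1, top2]))

    c1.fit(*train1)
    c2.fit(*train2)
    return c1, c2
```

Note that in this sketch each classifier's confident predictions augment the other classifier's training set, never its own, which is exactly the cross-labeling property described above.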
The theoretical guarantees of co-training rely on two main assumptions about the feature views:
| Assumption | Description |
|---|---|
| Sufficiency | Each view alone contains enough information to train an accurate classification model. In other words, the class label can be reliably predicted from either view independently. |
| Conditional independence | Given the class label, the two views are conditionally independent of each other. This means that knowing one view provides no additional information about the other view once the class is known. |
A third assumption, sometimes called compatibility, requires that the target functions learned from both views assign the same label to any given instance with high probability.
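Stated a little more formally (the notation here is only illustrative), write each instance as a pair of views $x = (x_1, x_2)$ with label $y$. Conditional independence requires

$$P(x_1, x_2 \mid y) \;=\; P(x_1 \mid y)\, P(x_2 \mid y),$$

while sufficiency and compatibility together require target functions $f_1$ and $f_2$ such that $f_1(x_1) = f_2(x_2) = y$ for (almost) all instances drawn from the distribution.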
Blum and Mitchell provided a PAC-style (Probably Approximately Correct) theoretical analysis showing that if these assumptions hold and each view has a weakly useful classifier, co-training can learn arbitrarily accurate classifiers from unlabeled data. In practice, however, these assumptions are rarely satisfied perfectly. Research by Krogel and Scheffer (2004) demonstrated that when classifier dependence exceeded approximately 60%, co-training performance actually worsened compared to using labeled data alone. Despite this, co-training often works well even when the assumptions are only approximately satisfied, as long as the views provide somewhat complementary information.
Co-training is closely related to self-training, but there are important differences between the two approaches.
| Aspect | Self-Training | Co-Training |
|---|---|---|
| Number of classifiers | One | Two (or more) |
| Feature views | Single view (all features) | Multiple views (disjoint feature sets) |
| Labeling mechanism | The classifier labels data for itself | Each classifier labels data for the other |
| Error correction | Cannot correct its own mistakes; errors may compound | Cross-labeling provides a form of error correction |
| Requirements | No special assumptions about features | Requires naturally separable, sufficient, and conditionally independent views |
| Applicability | Any supervised learning task | Tasks with naturally separable feature representations |
Self-training uses a single model that iteratively adds its own most confident predictions on unlabeled data to the training set. The main risk is that the model cannot correct its own mistakes: if it confidently assigns a wrong label, that error becomes part of the training data and may compound over subsequent iterations.
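For comparison, single-view self-training is available off the shelf in scikit-learn. In the minimal example below (the synthetic dataset, base learner, and confidence threshold are arbitrary choices), unlabeled points are marked with the label -1 and a single model iteratively pseudo-labels data for itself.

```python
# Self-training for comparison: one classifier pseudo-labels data for itself.
# Illustrative sketch using scikit-learn; unlabeled points carry the label -1.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=1000, random_state=0)
y_partial = y.copy()
y_partial[50:] = -1                      # keep only 50 labeled examples

model = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
model.fit(X, y_partial)                  # iteratively adds its own confident labels
print(model.score(X, y))
```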
Co-training mitigates this risk through diversification. Because the two classifiers operate on different feature sets, they tend to make different errors. When one classifier is wrong about an example, the other may correctly label it, providing a natural mechanism for error correction. This complementary relationship is the primary advantage of co-training over self-training.
Since Blum and Mitchell's original paper, numerous variants have been developed to address the limitations of standard co-training or to extend it to new settings.
Co-EM, introduced by Nigam and Ghani (2000), combines the multi-view framework of co-training with the Expectation-Maximization (EM) algorithm. Unlike standard co-training, which adds only the most confident predictions in a bootstrap fashion, co-EM operates on all unlabeled samples simultaneously in an iterative batch mode. Each classifier provides probabilistic labels for the unlabeled data, and these soft labels are used to retrain the other classifier. Co-EM has been shown to outperform standard co-training on several text classification benchmarks. However, it requires classifiers that can handle probabilistically labeled data and output class probabilities, and it can suffer from local maxima problems.
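A rough sketch of the idea is given below, assuming a binary problem with labels 0 and 1. The base learners and the soft-label encoding (duplicating each unlabeled example with both classes, weighted by predicted probability) are illustrative choices, not the exact formulation of Nigam and Ghani.

```python
# Co-EM sketch: every unlabeled example is soft-labeled on each round, and the
# soft labels produced from one view retrain the classifier for the other view.
# Assumes binary labels {0, 1}; simplified relative to the published algorithm.
import numpy as np
from sklearn.naive_bayes import GaussianNB


def soft_fit(clf, X_lab, y_lab, X_unlab, proba_unlab):
    """Fit clf on labeled data plus probabilistically labeled unlabeled data."""
    X = np.vstack([X_lab, X_unlab, X_unlab])
    y = np.concatenate([y_lab,
                        np.zeros(len(X_unlab), dtype=int),
                        np.ones(len(X_unlab), dtype=int)])
    w = np.concatenate([np.ones(len(X_lab)),
                        proba_unlab[:, 0],      # weight for class 0
                        proba_unlab[:, 1]])     # weight for class 1
    return clf.fit(X, y, sample_weight=w)


def co_em(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab, n_rounds=10):
    c1, c2 = GaussianNB(), GaussianNB()
    c1.fit(X1_lab, y_lab)
    for _ in range(n_rounds):
        # C1's soft labels on ALL unlabeled data retrain C2, and vice versa.
        c2 = soft_fit(c2, X2_lab, y_lab, X2_unlab, c1.predict_proba(X1_unlab))
        c1 = soft_fit(c1, X1_lab, y_lab, X1_unlab, c2.predict_proba(X2_unlab))
    return c1, c2
```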
Tri-training, proposed by Zhou and Li (2005), uses three classifiers instead of two and eliminates the requirement for multiple views of the data. The three classifiers are initialized by training on different bootstrap samples of the labeled data, which introduces diversity among them. In each round, an unlabeled example is labeled for a given classifier if and only if the other two classifiers agree on its label. This majority-vote mechanism provides a natural safeguard against noisy pseudo-labels.
Tri-training has several advantages over standard co-training: it does not require the feature space to be split into redundant and independent views, it imposes no constraints on the type of supervised learning algorithm used, and it has broader applicability. Experiments on UCI datasets and web page classification tasks have shown that tri-training effectively exploits unlabeled data. Although it was proposed nearly two decades ago, empirical studies have repeatedly confirmed that tri-training remains a strong baseline for semi-supervised learning in natural language processing.
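A simplified sketch of the agreement rule is shown below. The decision-tree base learners and fixed number of rounds are arbitrary, and the error-rate safeguards of the full Zhou and Li algorithm are omitted; integer class labels are assumed.

```python
# Tri-training sketch: three classifiers trained on bootstrap samples of the
# labeled data; an unlabeled example is pseudo-labeled for one classifier only
# when the other two agree on it. Illustrative simplification of Zhou & Li (2005).
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def tri_train(X_lab, y_lab, X_unlab, n_rounds=5):
    rng = np.random.default_rng(0)
    clfs = []
    for _ in range(3):
        idx = rng.integers(0, len(X_lab), len(X_lab))   # bootstrap sample
        clfs.append(DecisionTreeClassifier().fit(X_lab[idx], y_lab[idx]))

    for _ in range(n_rounds):
        preds = [c.predict(X_unlab) for c in clfs]
        new_clfs = []
        for i in range(3):
            j, k = [m for m in range(3) if m != i]
            agree = preds[j] == preds[k]                # the other two agree
            X_i = np.vstack([X_lab, X_unlab[agree]])
            y_i = np.concatenate([y_lab, preds[j][agree]])
            new_clfs.append(DecisionTreeClassifier().fit(X_i, y_i))
        clfs = new_clfs
    return clfs


def predict(clfs, X):
    # Final prediction by majority vote of the three classifiers (integer labels).
    votes = np.stack([c.predict(X) for c in clfs])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```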
A notable variant is tri-training with disagreement, which tightens the labeling condition: an example is added to a classifier's training set only if the other two classifiers agree on its label and the receiving classifier disagrees. This targets learning toward the examples where each classifier is weakest, preventing the labeled data from becoming skewed.
Democratic co-learning, proposed by Zhou and Goldman (2004), replaces the requirement for multiple feature views with multiple learning algorithms. Instead of splitting features, it trains several classifiers using different algorithms (for example, a decision tree, a naive Bayes classifier, and a support vector machine) on the same complete feature set. These classifiers have different inductive biases, which provides the diversity needed for effective mutual teaching.
For each unlabeled example, a prediction is made by majority vote of the classifiers. If the majority confidently agrees, the example is added to the training sets of the dissenting classifiers. The final prediction for new instances uses weighted majority voting, where each classifier's vote is weighted by its estimated accuracy. This approach is particularly useful in single-view settings where natural feature splits do not exist.
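A condensed sketch of this scheme is shown below, with plain (unweighted) majority voting standing in for the confidence-interval machinery of the original algorithm; the three base learners are arbitrary choices and integer class labels are assumed.

```python
# Democratic co-learning sketch: different algorithms share the full feature set;
# when a majority agrees on an unlabeled example, it is added to the training
# sets of the classifiers that dissented. Simplified relative to Zhou & Goldman (2004).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def democratic_co_learn(X_lab, y_lab, X_unlab, n_rounds=5):
    clfs = [DecisionTreeClassifier(), GaussianNB(), LogisticRegression(max_iter=1000)]
    train = [(X_lab, y_lab) for _ in clfs]          # one training set per learner

    for _ in range(n_rounds):
        for c, (X_i, y_i) in zip(clfs, train):
            c.fit(X_i, y_i)
        preds = np.stack([c.predict(X_unlab) for c in clfs])   # shape (3, n_unlab)
        # Majority label for each unlabeled example (assumes integer labels).
        majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, preds)
        confident = (preds == majority).sum(axis=0) >= 2       # at least 2 of 3 agree
        for i in range(len(clfs)):
            # Give the majority-labeled examples to the learners that dissented.
            dissent = confident & (preds[i] != majority)
            if dissent.any():
                X_i, y_i = train[i]
                train[i] = (np.vstack([X_i, X_unlab[dissent]]),
                            np.concatenate([y_i, majority[dissent]]))
    return clfs
```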
Several researchers have developed methods to apply co-training principles when only a single view of the data is available. Goldman and Zhou (2000) proposed using two different learning algorithms on the same feature set, relying on algorithmic diversity rather than feature diversity to generate complementary classifiers.
Multi-Head Co-Training is a more recent development that consolidates multiple classifiers into a single neural network with multiple output heads. This approach adds minimal extra parameters compared to maintaining entirely separate models, making it more computationally efficient. Each head provides pseudo-labels for the others, maintaining the co-training principle while sharing lower-level feature representations. Research has demonstrated accuracy improvements of up to 3.1% on semi-supervised benchmarks like CIFAR compared to other recent methods.
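The mechanism can be sketched as follows in PyTorch. The two-layer encoder, the confidence threshold, and the loss structure are placeholder choices for illustration, not the published configuration; `x_lab`, `y_lab`, and `x_unlab` are assumed to be labeled and unlabeled mini-batches.

```python
# Multi-head co-training sketch (illustrative): one shared encoder with two
# classification heads; each head's confident predictions supervise the other
# head on unlabeled data. Placeholder architecture and hyperparameters.
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadNet(nn.Module):
    def __init__(self, in_dim, n_classes, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(hidden, n_classes) for _ in range(2)])

    def forward(self, x):
        z = self.encoder(x)                       # shared lower-level representation
        return [head(z) for head in self.heads]


def co_training_loss(model, x_lab, y_lab, x_unlab, threshold=0.95):
    # Supervised loss: every head sees the labeled batch.
    loss = sum(F.cross_entropy(logits, y_lab) for logits in model(x_lab))

    logits_unlab = model(x_unlab)
    for i, j in [(0, 1), (1, 0)]:
        # Head i's confident predictions become pseudo-labels for head j.
        probs = F.softmax(logits_unlab[i].detach(), dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = conf > threshold
        if mask.any():
            loss = loss + F.cross_entropy(logits_unlab[j][mask], pseudo[mask])
    return loss
```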
Co-training and its variants have been applied successfully across many domains.
| Domain | Views / Setup | Description |
|---|---|---|
| Web page classification | Page text + anchor text from incoming hyperlinks | The original application by Blum and Mitchell (1998); achieved 95% accuracy with only 12 labeled examples |
| Natural language processing | Different feature representations of text (e.g., lexical features + syntactic features) | Used for named entity recognition, sentiment analysis, and statistical parsing |
| Bioinformatics | Sequence features + structural features of proteins | Applied to protein function prediction, gene expression analysis, and biomedical text mining |
| Computer vision | Visual features from different modalities (e.g., RGB + depth, or image + text caption) | Used in image classification, object detection, and medical image segmentation |
| Speech recognition | Audio features + visual features (lip movements) | Multi-modal speech processing leverages audio and visual streams as natural views |
| Information retrieval | Document content + query-click data | Web search ranking where page content and user interaction signals form separate views |
Co-training was also used in commercial applications, including FlipDog.com (a job search site) and the U.S. Department of Labor's directory of continuing and distance education programs.
Co-training is most beneficial in the following situations:

- Labeled examples are scarce or expensive to obtain, while unlabeled examples are plentiful.
- The features divide naturally into two views, each of which is sufficient on its own to predict the label.
- The two views are at least approximately conditionally independent given the class, so the classifiers tend to make complementary errors.
- The base classifiers can produce reliable confidence estimates for their predictions on unlabeled data.
Conversely, co-training may fail when the independence assumption is severely violated, when the initial classifiers are too weak, or when the unlabeled data distribution differs significantly from the labeled data distribution.
Beyond the original PAC-learning analysis by Blum and Mitchell, several theoretical contributions have deepened the understanding of co-training.
Dasgupta, Littman, and McAllester (2001) provided PAC generalization bounds for co-training, proving that, under the conditional independence assumption, the rate at which the two classifiers disagree on unlabeled data upper-bounds each classifier's error rate. This result formalized the intuition that maximizing agreement between diverse classifiers on unlabeled data is a principled learning strategy.
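Informally, and omitting the finite-sample terms, the result has the flavor

$$\Pr\big[h_1(x) \neq y\big] \;\le\; \Pr\big[h_1(x) \neq h_2(x)\big],$$

so the unobservable error of each classifier can be controlled by the observable rate at which the two classifiers disagree on unlabeled data.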
Balcan, Blum, and Yang (2004) relaxed the conditional independence assumption, showing that co-training can succeed under weaker conditions. Specifically, they proved that an "expansion" property of the data distribution, combined with sufficient views, is enough for co-training to work. This explained why co-training often succeeds in practice even when strict conditional independence does not hold.
Wang and Zhou (2007) analyzed co-training from the perspective of large-margin classifiers and showed connections between co-training and the maximum entropy principle, providing further theoretical justification for the algorithm's effectiveness.
Imagine you have a bag of differently shaped and colored toys, and you want to teach your two friends, Alice and Bob, how to sort them into two groups. However, you only have a few toys with stickers telling which group they belong to.
Alice will learn to sort toys based on their shape, while Bob will learn to sort them based on their color. First, you show Alice and Bob the labeled toys and explain how to sort them. Then, you give them some unlabeled toys to practice on. They start sorting the toys, and if they are very sure about a toy's group, they put a sticker on it. Then, they swap the newly stickered toys with each other to learn from each other's confident decisions.
Alice might notice that a round toy belongs in Group A based on its shape, and she tells Bob. Bob did not know that because he was only looking at colors. Now Bob learns something new. In the next round, Bob might confidently sort a blue toy into Group B based on its color, and he tells Alice. This process continues until Alice and Bob agree on how to sort most of the toys.
This is how co-training works in machine learning. Two classifiers learn from a small amount of labeled data and then teach each other using unlabeled data that they are confident about. Because they look at different features, they catch things the other one misses, helping both of them get better over time.