Co-Training
Last reviewed
May 9, 2026
Sources
28 citations
Review status
Source-backed
Revision
v3 · 6,363 words
Co-training is a semi-supervised learning algorithm that leverages both labeled and unlabeled data by training two classifiers on two distinct "views" of the data, allowing them to teach each other iteratively. Introduced by Avrim Blum and Tom Mitchell in 1998, co-training is one of the foundational techniques in multi-view learning and remains widely used in settings where labeled data is scarce but unlabeled data is plentiful.
Although modern deep learning has shifted attention toward consistency-based methods such as FixMatch and MixMatch, co-training continues to influence the design of multi-view representation learners and contrastive systems like CLIP, where two encoders trained on paired views of the same content learn from one another through shared signal.
In many real-world machine learning problems, obtaining labeled data is expensive, time-consuming, or requires domain expertise. At the same time, unlabeled data is often available in large quantities. Co-training addresses this imbalance by exploiting the structure of the feature space itself.
The core idea behind co-training is that the features describing each data instance can be naturally divided into two separate subsets, called "views." Each view should contain enough information on its own to make accurate predictions. Two separate classifiers are trained, one on each view. After initial training on a small labeled dataset, each classifier labels some of the unlabeled data. The most confident predictions from one classifier are then added to the training set of the other classifier. This process repeats for several rounds, with both classifiers gradually improving as they receive more training data from each other.
Co-training belongs to the broader family of "wrapper" or "proxy-label" methods in semi-supervised learning, where pseudo-labels generated by one model are used to expand the training set for another. Other members of this family include self-training, tri-training, and many of the pseudo-labeling approaches that became popular in deep learning during the 2010s.
| Property | Co-Training |
|---|---|
| Year introduced | 1998 |
| Authors | Avrim Blum, Tom Mitchell |
| Original venue | COLT (Conference on Learning Theory) |
| Family | Wrapper or proxy-label semi-supervised learning |
| Required input | Small labeled set, large unlabeled pool, two views per example |
| Number of base learners | Two (one per view) |
| Output | Two trained classifiers, often combined for prediction |
| Status as of 2026 | Influential reference baseline; rarely the state of the art |
Co-training was first proposed by Avrim Blum and Tom Mitchell in their 1998 paper "Combining Labeled and Unlabeled Data with Co-Training," presented at the 11th Annual Conference on Computational Learning Theory (COLT). The paper received the 10-Year Best Paper Award at the International Conference on Machine Learning (ICML) in 2008, recognizing its lasting impact on the field. As of the mid-2020s, the original paper has been cited well over 5,000 times.
Blum and Mitchell's motivating application was classifying web pages as "academic course home pages" or not. They observed that web pages have two natural views: (1) the text content of the page itself, and (2) the anchor text of hyperlinks on other pages that point to it. Starting from just 12 labeled web pages, with the rest of a 788-page training pool left unlabeled, their co-training algorithm classified roughly 95% of held-out test pages correctly. This result demonstrated the power of leveraging unlabeled data with multiple views.
The paper sat at the intersection of two research traditions. From the computational learning theory side, it built on PAC learning analyses of how unlabeled data could relax sample complexity bounds. From the empirical side, it connected to growing interest in web mining and information extraction, where researchers had enormous amounts of HTML and few human labels. Co-training proved compelling because it produced a clean theorem and a working algorithm in the same paper.
In the years after publication, co-training inspired a wave of follow-up work, including Co-EM (Nigam and Ghani, 2000), Democratic Co-Learning (Zhou and Goldman, 2004), and Tri-Training (Zhou and Li, 2005). It also influenced the more general study of multi-view learning, which now spans kernel methods, neural networks, and contrastive objectives.
The standard co-training algorithm proceeds through the following steps.
| Step | Action |
|---|---|
| 1. Feature split | Divide the input features into two disjoint sets, View 1 (V1) and View 2 (V2) |
| 2. Initialization | Train two classifiers, C1 on V1 and C2 on V2, using the small set of labeled examples |
| 3. Labeling | Apply C1 and C2 to a pool of unlabeled examples; each classifier produces predictions with confidence scores |
| 4. Selection | From C1's predictions, select the most confident positive and negative examples; do the same for C2 |
| 5. Augmentation | Add C1's confident predictions to C2's training set, and C2's confident predictions to C1's training set |
| 6. Retraining | Retrain both classifiers on their expanded training sets |
| 7. Iteration | Repeat steps 3 through 6 for a fixed number of rounds or until no more confident predictions can be made |
A critical aspect of the algorithm is that each classifier labels data for the other classifier, not for itself. This cross-labeling mechanism helps prevent the reinforcement of errors that can occur in self-training, where a single classifier labels data for its own retraining.
A minimal version of the original Blum and Mitchell procedure can be written in Python-like pseudocode as follows.
    def co_training(L, U, view_split, classifier_factory,
                    p, n, k, u_pool_size):
        """
        L: small labeled set of (x1, x2, y) triples; extended in place
        U: large unlabeled set of (x1, x2) pairs
        view_split: function that returns the two views (x1, x2);
            unused here because L and U are already split into views
        classifier_factory: returns a fresh classifier instance
        p: number of confident positive picks per round
        n: number of confident negative picks per round
        k: number of co-training rounds
        u_pool_size: size of the random working pool drawn from U

        Assumed helpers (not defined here): sample_random(items, m) returns
        a list of m random items; top_indices(scores, m) returns a list of
        the indices of the m highest scores.
        """
        c1 = classifier_factory()
        c2 = classifier_factory()
        pool = sample_random(U, u_pool_size)
        for round_id in range(k):
            # Train each classifier on its own view of the labeled set.
            c1.fit([x1 for x1, x2, y in L], [y for _, _, y in L])
            c2.fit([x2 for x1, x2, y in L], [y for _, _, y in L])
            # Score the pool: column 1 is P(positive), column 0 is P(negative).
            s1 = c1.predict_proba([x1 for x1, x2 in pool])
            s2 = c2.predict_proba([x2 for x1, x2 in pool])
            top_pos_1 = top_indices(s1[:, 1], p)
            top_neg_1 = top_indices(s1[:, 0], n)
            top_pos_2 = top_indices(s2[:, 1], p)
            top_neg_2 = top_indices(s2[:, 0], n)
            # Add every confident pick, with its pseudo-label, to the labeled set.
            for i in top_pos_1:
                L.append((pool[i][0], pool[i][1], 1))
            for i in top_neg_1:
                L.append((pool[i][0], pool[i][1], 0))
            for i in top_pos_2:
                L.append((pool[i][0], pool[i][1], 1))
            for i in top_neg_2:
                L.append((pool[i][0], pool[i][1], 0))
            # Drop the picked examples from the pool, then refill it from U.
            used = set(top_pos_1) | set(top_neg_1) | set(top_pos_2) | set(top_neg_2)
            pool = [pool[i] for i in range(len(pool)) if i not in used]
            pool += sample_random(U, 2 * (p + n))
        return c1, c2
This matches the structure used in the original 1998 paper. The authors maintained a smaller working pool of unlabeled examples and refilled it after each round, which keeps the per-round computation roughly constant and concentrates classifier attention on a manageable batch of candidate pseudo-labels.
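For readers who want something executable, the following is a minimal sketch of the same loop, not the authors' implementation. It assumes NumPy and scikit-learn are installed, uses `GaussianNB` as the base learner, and runs on synthetic two-view data invented for illustration. The function name `co_train`, the index-based bookkeeping, and the rule that the first classifier's pick wins when both select the same example are assumptions of this sketch.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB


def co_train(X1, X2, y, labeled_idx, p=1, n=1, rounds=10, pool_size=50, seed=0):
    """Co-training sketch; y is only read at the positions in labeled_idx."""
    rng = np.random.default_rng(seed)
    labels = {int(i): int(y[i]) for i in labeled_idx}  # index -> true or pseudo label
    unlabeled = [int(i) for i in rng.permutation(len(X1)) if int(i) not in labels]
    pool = [unlabeled.pop() for _ in range(min(pool_size, len(unlabeled)))]
    c1, c2 = GaussianNB(), GaussianNB()

    def refit():
        idx = sorted(labels)
        targets = [labels[i] for i in idx]
        c1.fit(X1[idx], targets)
        c2.fit(X2[idx], targets)

    for _ in range(rounds):
        refit()
        if not pool:
            break
        picked = {}
        for clf, X in ((c1, X1), (c2, X2)):
            proba = clf.predict_proba(X[pool])  # columns follow classes [0, 1]
            for cls, count in ((1, p), (0, n)):
                for j in np.argsort(proba[:, cls])[::-1][:count]:
                    picked.setdefault(pool[int(j)], cls)  # on conflict, first pick wins
        labels.update(picked)
        pool = [i for i in pool if i not in picked]
        while unlabeled and len(pool) < pool_size:
            pool.append(unlabeled.pop())
    refit()  # final fit so the last round's pseudo-labels are used
    return c1, c2, labels


# Synthetic data (an assumption of this sketch): two 2-D views whose means
# depend on the class, drawn independently given the label.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
X1 = rng.normal(loc=2.0 * y[:, None], scale=1.0, size=(400, 2))
X2 = rng.normal(loc=2.0 * y[:, None], scale=1.0, size=(400, 2))
seed_idx = np.concatenate([np.flatnonzero(y == 1)[:3], np.flatnonzero(y == 0)[:3]])

c1, c2, labels = co_train(X1, X2, y, seed_idx)
combined = (c1.predict_proba(X1) + c2.predict_proba(X2)) / 2
print("labeled examples after co-training:", len(labels))
print("accuracy of averaged views:", (combined.argmax(axis=1) == y).mean())
```

The sketch tracks examples by index rather than copying feature tuples, which makes it easy to guarantee that an example is pseudo-labeled at most once. It also requires the seed set to contain both classes so that the probability columns line up with classes 0 and 1.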
Researchers and practitioners have settled on rules of thumb that often work well across domains.
| Hyperparameter | Typical value | Notes |
|---|---|---|
| Working pool size U' | 50 to 100 examples | Original paper used 75 |
| Confident positives per round p | 1 to 5 | Smaller values keep noise low |
| Confident negatives per round n | Approximately p times the ratio of negatives to positives in the class prior | Preserves class balance |
| Number of rounds k | 20 to 40 | More rounds rarely help |
| Classifier choice | Naive Bayes or logistic regression | Probabilistic outputs simplify confidence ranking |
| Confidence threshold | Top-k ranking instead of fixed threshold | Avoids issues with miscalibrated probabilities |
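As a small illustration of how these settings interact, the sketch below derives n from p and an assumed positive-class prior, and then bounds how many pseudo-labels a full run can add. The helper names `negatives_per_round` and `max_pseudo_labels` are hypothetical, and the 25% prior is an illustrative assumption chosen to reproduce a 1:3 positive-to-negative ratio.

```python
def negatives_per_round(p, positive_prior):
    """Negatives to add per round so additions roughly follow the class prior."""
    negative_prior = 1.0 - positive_prior
    return max(1, round(p * negative_prior / positive_prior))


def max_pseudo_labels(p, n, rounds, num_classifiers=2):
    """Upper bound on pseudo-labels added, reached when no picks overlap."""
    return rounds * num_classifiers * (p + n)


n = negatives_per_round(p=1, positive_prior=0.25)
total = max_pseudo_labels(p=1, n=n, rounds=30)
print(n, total)
```

With one positive per round and a 25% positive prior, each classifier adds three negatives per round, so 30 rounds with two classifiers add at most 240 pseudo-labels.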
At prediction time, several combination strategies are common.
| Strategy | Description |
|---|---|
| Multiply probabilities | Treat the views as conditionally independent given the class and multiply per-class scores |
| Average probabilities | Take the mean of the two probability vectors and pick the argmax |
| Confidence-weighted | Weight each classifier by an estimate of its accuracy on a held-out set |
| One-view fallback | Use only the view that is available at test time, useful when one view is sometimes missing |
The multiplication rule has the cleanest theoretical interpretation under conditional independence, but averaging is more robust when independence is only approximate.
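The first three strategies can be sketched in a few lines of NumPy. This is an illustrative sketch rather than a reference implementation: the function names are hypothetical, the inputs are assumed to be per-class probability arrays of shape (examples, classes), and the product rule divides by the class prior only when one is supplied, which otherwise amounts to assuming a uniform prior.

```python
import numpy as np


def combine_product(p1, p2, prior=None):
    """Product rule; divides by the class prior when one is supplied."""
    joint = p1 * p2
    if prior is not None:
        joint = joint / prior
    return joint / joint.sum(axis=1, keepdims=True)  # renormalize each row


def combine_average(p1, p2):
    """Mean of the two probability vectors."""
    return (p1 + p2) / 2.0


def combine_weighted(p1, p2, acc1, acc2):
    """Weight each view by its held-out accuracy estimate."""
    w1 = acc1 / (acc1 + acc2)
    return w1 * p1 + (1.0 - w1) * p2


# Made-up probabilities for two examples and two classes.
p_view1 = np.array([[0.9, 0.1], [0.4, 0.6]])
p_view2 = np.array([[0.6, 0.4], [0.2, 0.8]])

product = combine_product(p_view1, p_view2)
average = combine_average(p_view1, p_view2)
weighted = combine_weighted(p_view1, p_view2, acc1=0.9, acc2=0.6)
print(product.argmax(axis=1), average.argmax(axis=1), weighted.argmax(axis=1))
```

In this toy input all three rules choose the same classes, but the product rule yields more extreme probabilities than the average, which is why it is more sensitive to a single overconfident view.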
The theoretical guarantees of co-training rely on three main assumptions about the feature views.
| Assumption | Description |
|---|---|
| Sufficiency | Each view alone contains enough information to train an accurate classification model. The class label can be reliably predicted from either view independently. |
| Conditional independence | Given the class label, the two views are conditionally independent. Knowing one view provides no extra information about the other view once the class is known. |
| Compatibility | The target functions learned from both views agree on the label of any given instance with high probability. |
Blum and Mitchell provided a PAC-style (Probably Approximately Correct) theoretical analysis showing that if these assumptions hold and each view has a weakly useful classifier, co-training can learn arbitrarily accurate classifiers from unlabeled data. In practice, however, these assumptions are rarely satisfied perfectly. Research by Krogel and Scheffer (2004) demonstrated that when classifier dependence exceeded approximately 60%, co-training performance actually worsened compared to using labeled data alone. Despite this, co-training often works well even when the assumptions are only approximately satisfied, as long as the views provide somewhat complementary information.
A helpful way to think about the assumptions is in terms of the joint distribution. Conditional independence requires that p(x1, x2 | y) factorizes as p(x1 | y) p(x2 | y). Sufficiency requires classifiers f1 and f2 with low error on each view alone. Compatibility requires f1(x1) and f2(x2) to agree on most examples. When these conditions hold, agreement on unlabeled data becomes a meaningful proxy for accuracy, giving the algorithm a sound mathematical basis.
Co-training is closely related to self-training, but there are important differences between the two approaches.
| Aspect | Self-Training | Co-Training |
|---|---|---|
| Number of classifiers | One | Two (or more) |
| Feature views | Single view (all features) | Multiple views (disjoint feature sets) |
| Labeling mechanism | The classifier labels data for itself | Each classifier labels data for the other |
| Error correction | Cannot correct its own mistakes; errors may compound | Cross-labeling provides a form of error correction |
| Requirements | No special assumptions about features | Requires naturally separable, sufficient, and conditionally independent views |
| Applicability | Any supervised learning task | Tasks with naturally separable feature representations |
Self-training uses a single model that iteratively adds its own most confident predictions on unlabeled data to the training set. The main risk is that the model cannot correct its own mistakes: if it confidently assigns a wrong label, that error becomes part of the training data and may compound over subsequent iterations. This failure mode is sometimes called confirmation bias or self-amplification.
Co-training mitigates this risk through diversification. Because the two classifiers operate on different feature sets, they tend to make different errors. When one classifier is wrong about an example, the other may correctly label it, providing a natural mechanism for error correction. This complementary relationship is the primary advantage of co-training over self-training.
In deep learning practice, however, self-training has made a strong comeback. Modern self-training methods use very large unlabeled corpora, strong pretrained backbones, and confidence thresholds that filter out uncertain pseudo-labels. Approaches such as Noisy Student (Xie et al., 2020) and back-translation for machine translation achieve state-of-the-art results without any explicit two-view structure. These methods can be viewed as self-training with implicit diversity coming from data augmentation rather than a feature split.
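The classical single-model loop described above can be tried with scikit-learn's `SelfTrainingClassifier`, which expects unlabeled examples to be marked with the label -1. The snippet below is a minimal sketch on synthetic data invented for illustration; the 0.9 threshold and the choice of logistic regression are assumptions, not recommendations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic single-view data: class-dependent means in two dimensions.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=300)
X = rng.normal(loc=2.0 * y_true[:, None], scale=1.0, size=(300, 2))

# Keep five labels per class and hide the rest behind the -1 marker.
labeled = np.concatenate([np.flatnonzero(y_true == 0)[:5],
                          np.flatnonzero(y_true == 1)[:5]])
y_partial = np.full(300, -1)
y_partial[labeled] = y_true[labeled]

# The base estimator is passed positionally because its keyword name has
# changed between scikit-learn releases.
model = SelfTrainingClassifier(LogisticRegression(), threshold=0.9)
model.fit(X, y_partial)

# transduction_ holds the labels used in the final fit, pseudo-labels included.
pseudo_labeled = int((model.transduction_ != -1).sum()) - len(labeled)
print("pseudo-labels added:", pseudo_labeled)
print("accuracy:", (model.predict(X) == y_true).mean())
```

Here a single model both produces and consumes the pseudo-labels, so nothing in the loop can contradict a confident mistake.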
Since Blum and Mitchell's original paper, numerous variants have been developed to address the limitations of standard co-training or to extend it to new settings.
| Variant | Year | Key idea |
|---|---|---|
| Co-EM | 2000 | Soft pseudo-labels updated jointly via EM |
| Democratic Co-Learning | 2004 | Multiple algorithms on the same view, majority vote |
| Tri-Training | 2005 | Three classifiers; when any two agree on an example, their label becomes a pseudo-label for the third |
| Co-Forest | 2007 | Random Forest-style ensemble with co-training updates |
| Co-Regression | 2007 | Extends co-training to regression with consistency loss |
| Multi-view co-regularization | 2008 | Penalize disagreement between views during joint training |
| Tri-training with disagreement | 2010 | Targets examples where the third learner disagrees |
| Deep Co-Training | 2018 | Deep nets with adversarial view-divergence pressure |
| Multi-Head Co-Training | 2021 | Single backbone with multiple output heads as views |
Co-EM, introduced by Nigam and Ghani (2000), combines the multi-view framework of co-training with the Expectation-Maximization (EM) algorithm. Unlike standard co-training, which adds only the most confident predictions, co-EM operates on all unlabeled samples simultaneously in an iterative batch mode. Each classifier provides probabilistic labels for the unlabeled data, and these soft labels are used to retrain the other classifier. Co-EM has been shown to outperform standard co-training on several text classification benchmarks, but it requires classifiers that output class probabilities and can suffer from local maxima.
Tri-training, proposed by Zhou and Li (2005), uses three classifiers instead of two and eliminates the requirement for multiple views of the data. The three classifiers are initialized by training on different bootstrap samples of the labeled data, which introduces diversity among them. In each round, an unlabeled example is labeled for a given classifier if and only if the other two classifiers agree on its label. This majority-vote mechanism provides a natural safeguard against noisy pseudo-labels.
Tri-training does not require the feature space to be split into redundant and independent views, imposes no constraints on the supervised learning algorithm, and has broader applicability than classical co-training. Despite being proposed nearly two decades ago, empirical research by Ruder and Plank (2018) confirmed that tri-training remains a strong baseline for semi-supervised learning in natural language processing.
A notable variant is tri-training with disagreement, which tightens the labeling condition: an example is passed to the third classifier only when the other two agree on its label and the third classifier currently predicts something different. This concentrates learning on the examples where the receiving classifier is weakest and keeps the added data from being skewed toward cases it already handles.
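The two selection rules differ by a single condition, which the following sketch makes explicit. It covers only the selection step for one receiving classifier, not the full tri-training procedure with its error-rate checks; the function name and the toy predictions are hypothetical.

```python
import numpy as np


def tri_training_candidates(pred_a, pred_b, pred_c, require_disagreement=False):
    """Indices and pseudo-labels that teachers a and b propose for learner c."""
    selected = pred_a == pred_b  # the two teachers must agree
    if require_disagreement:
        selected &= pred_c != pred_a  # and the learner must currently disagree
    idx = np.flatnonzero(selected)
    return idx, pred_a[idx]


# Toy predictions of three classifiers on five unlabeled examples.
pred_a = np.array([1, 0, 1, 1, 0])
pred_b = np.array([1, 0, 0, 1, 0])
pred_c = np.array([1, 1, 1, 0, 0])

plain_idx, plain_labels = tri_training_candidates(pred_a, pred_b, pred_c)
dis_idx, dis_labels = tri_training_candidates(pred_a, pred_b, pred_c,
                                              require_disagreement=True)
print(plain_idx, plain_labels)
print(dis_idx, dis_labels)
```

Plain tri-training hands four of the five examples to the third classifier, while the disagreement variant keeps only the two on which that classifier is currently wrong according to its teachers.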
Democratic co-learning, proposed by Zhou and Goldman (2004), replaces the requirement for multiple feature views with multiple learning algorithms. Instead of splitting features, it trains several classifiers using different algorithms (for example, a decision tree, a naive Bayes classifier, and a support vector machine) on the same complete feature set. These classifiers have different inductive biases, which provides the diversity needed for effective mutual teaching.
For each unlabeled example, a prediction is made by majority vote of the classifiers. If the majority confidently agrees, the example is added to the training sets of the dissenting classifiers. The final prediction for new instances uses weighted majority voting, where each classifier's vote is weighted by its estimated accuracy. This approach is particularly useful in single-view settings where natural feature splits do not exist.
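The weighted vote used at prediction time can be sketched as follows. This is a simplified illustration: the published method derives its weights from confidence intervals on each classifier's accuracy, whereas the sketch assumes plain accuracy estimates are already available. The function name and the numbers are hypothetical.

```python
import numpy as np


def weighted_majority_vote(predictions, accuracies, n_classes):
    """predictions: (classifiers, examples) array of integer class labels."""
    n_examples = predictions.shape[1]
    votes = np.zeros((n_examples, n_classes))
    for preds, weight in zip(predictions, accuracies):
        # Each classifier adds its accuracy estimate to the class it predicts.
        votes[np.arange(n_examples), preds] += weight
    return votes.argmax(axis=1)


# Three classifiers voting on three examples.
predictions = np.array([[0, 1, 1],
                        [0, 0, 1],
                        [1, 0, 1]])
accuracies = [0.95, 0.5, 0.4]
result = weighted_majority_vote(predictions, accuracies, n_classes=2)
print(result)
```

On the second example an unweighted vote would pick class 0 by two votes to one, but the single high-accuracy classifier outweighs the other two and the weighted vote picks class 1.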
Several researchers have developed methods to apply co-training principles when only a single view of the data is available. Goldman and Zhou (2000) proposed using two different learning algorithms on the same feature set, relying on algorithmic diversity rather than feature diversity. Other approaches include random feature splits, bootstrap-sampled classifiers as in tri-training, and multi-head networks that share a single backbone.
Multi-Head Co-Training is a more recent development that consolidates multiple classifiers into a single neural network with multiple output heads. This approach adds minimal extra parameters compared to maintaining entirely separate models, making it more computationally efficient. Each head provides pseudo-labels for the others, maintaining the co-training principle while sharing lower-level feature representations. Research has demonstrated accuracy improvements of up to 3.1% on semi-supervised benchmarks like CIFAR-10 compared to other recent methods.
Deep Co-Training, proposed by Qiao, Shen, Zhang, Wang, and Yuille (2018), adapts the original framework to deep neural networks for image classification. The authors observed that simply training two networks on different pixel subsets does not satisfy the conditional independence assumption, because deep features tend to converge to similar representations. To enforce diversity, they used adversarial examples: for each unlabeled image, an adversarial perturbation that fools one network is computed, and the other network is trained to remain robust to it. This view-difference loss explicitly pushes the two networks to make different mistakes.
Deep Co-Training reported strong results on CIFAR-10, CIFAR-100, and ImageNet under semi-supervised settings, and was extended to four and eight networks for further gains. The work bridged classical co-training theory and modern deep learning practice, although consistency-based methods like FixMatch later outperformed it on standard image SSL benchmarks.
The canonical co-training example uses web pages, mirroring the original 1998 setup. Suppose the goal is to identify course home pages on a university website.
| Step | What happens |
|---|---|
| 1. Collect data | Crawl the university domain. Tag 12 example pages by hand. Leave 1,000 untagged. |
| 2. Build views | View 1 is the bag-of-words representation of the page text. View 2 is the bag-of-words representation of anchor text on incoming links. |
| 3. Initial models | Train naive Bayes classifier C1 on view 1, and another C2 on view 2, using the 12 labels. |
| 4. Pool | Sample 75 random unlabeled pages into a working pool. |
| 5. Score | Each classifier ranks pool pages by predicted probability of being a course page. |
| 6. Pseudo-label | Each classifier picks its top 1 most confident positive and top 3 most confident negatives. |
| 7. Cross-add | C1's picks join C2's training set; C2's picks join C1's training set. |
| 8. Refill | Replace the chosen pages with new random samples from the unlabeled pool. |
| 9. Iterate | Repeat steps 5 to 8 for 30 rounds. |
| 10. Predict | At test time, average the probabilities from C1 and C2 to classify a new page. |
In the original experiment, this loop pushed accuracy from roughly 89% (using only the 12 labels) to about 95% on a held-out test set. The improvement comes from the unlabeled pool, which provides useful structure even without human annotation.
Co-training and its descendants have a long history in natural language processing.
| NLP task | Views used | Reference |
|---|---|---|
| Named entity recognition | Spelling features and contextual features | Collins and Singer (1999) |
| Word sense disambiguation | Local context and topic context | Yarowsky (1995), as a precursor |
| Statistical parsing | Different parser models on the same sentences | Sarkar (2001), Steedman et al. (2003) |
| Statistical machine translation | Source-side and target-side features | Callison-Burch and Osborne (2003) |
| Sentiment analysis | Personal pronouns + adjectives vs. n-gram features | Wan (2009) |
| Cross-lingual classification | Source-language and target-language features | Wan (2009), Amini et al. (2010) |
| Relation extraction | Lexical and syntactic patterns | Niu et al. (2017) |
Collins and Singer's 1999 work on named entity recognition is particularly notable. They showed that splitting the features for each candidate entity into spelling cues (such as capitalization patterns) and contextual cues (the surrounding words) gave two views that approximated the sufficiency condition well. Their co-training-style algorithm, sometimes called DL-CoTrain, was a reference baseline for low-resource entity tagging for nearly a decade.
The semi-supervised learning landscape changed sharply with the rise of deep learning in the 2010s. New methods such as Virtual Adversarial Training (Miyato et al., 2017), Mean Teacher (Tarvainen and Valpola, 2017), MixMatch (Berthelot et al., 2019), FixMatch (Sohn et al., 2020), and FlexMatch (Zhang et al., 2021) replaced co-training as the methods of choice on standard image benchmarks. These methods rely on consistency regularization or pseudo-labeling with strong data augmentation rather than explicit multi-view structure.
| Method | Year | Core idea | Multi-view? |
|---|---|---|---|
| Co-Training | 1998 | Two classifiers on disjoint views label data for each other | Yes |
| Yarowsky algorithm | 1995 | Iterative self-training with seed rules | No |
| Self-training | classical | Single classifier labels its own training data | No |
| Tri-Training | 2005 | Three classifiers, majority pseudo-labels | Sometimes |
| Pi-Model | 2017 | Two stochastic forward passes should agree | Implicit (augmentation) |
| Mean Teacher | 2017 | EMA teacher network supervises the student | Implicit |
| VAT | 2017 | Adversarial perturbation should not change prediction | No |
| MixMatch | 2019 | Sharpened pseudo-labels plus MixUp | No |
| FixMatch | 2020 | Weak augmentation produces a label that supervises a strong augmentation | Implicit (augmentation) |
| FlexMatch | 2021 | FixMatch with curriculum-style class-aware thresholds | Implicit |
| Noisy Student | 2020 | Iterative self-training with progressively larger student models | No |
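While co-training is no longer the highest-performing SSL method on benchmarks like CIFAR-10 and STL-10, its conceptual contribution remains visible in current research. The idea that two related but different views of the same data can supervise each other is at the heart of contrastive learning methods such as SimCLR, image-text models such as CLIP, and consistency-based methods such as FixMatch and Mean Teacher.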
In this sense, co-training did not disappear; it dispersed into the broader ecosystem of representation learning.
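Co-training is most beneficial when labeled data is very scarce, unlabeled data is plentiful, and the features split naturally into two views that are each informative on their own.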
Conversely, co-training may fail when the independence assumption is severely violated, when the initial classifiers are too weak, or when the unlabeled data distribution differs significantly from the labeled data distribution.
The following table summarizes the most common ways co-training breaks down in practice and how to recognize each failure.
| Failure mode | Symptom | Mitigation |
|---|---|---|
| Correlated views | One view dominates and the other agrees blindly | Engineer better view splits or switch to tri-training |
| Confidence miscalibration | Confident wrong predictions enter the training set | Use top-k ranking instead of probability thresholds |
| Class drift | Class proportions in pseudo-labels diverge from the true prior | Cap positive and negative additions per round to match prior |
| Distribution shift | Unlabeled data comes from a different domain | Apply domain adaptation or filter the unlabeled pool |
| Weak base learners | Initial classifiers do barely better than random | Bootstrap with seed rules or pretrain on related data |
| Convergence stagnation | Pseudo-labels stop changing after a few rounds | Increase pool size or perturb classifiers between rounds |
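Class drift is the easiest of these failures to monitor automatically. The sketch below compares the share of positive pseudo-labels with an assumed class prior; the function name `prior_drift` and the example values are hypothetical, and what counts as too much drift is left to the practitioner.

```python
from collections import Counter


def prior_drift(pseudo_labels, positive_prior):
    """Observed positive share among pseudo-labels minus the expected prior."""
    counts = Counter(pseudo_labels)
    observed = counts[1] / len(pseudo_labels)
    return observed - positive_prior


# Pseudo-labels gathered so far, with an assumed prior of 25% positives.
drift = prior_drift([1, 1, 1, 0], positive_prior=0.25)
print(drift)
```

A value far from zero suggests that the per-round caps on positive and negative additions no longer reflect the class prior.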
Beyond the original PAC-learning analysis by Blum and Mitchell, several theoretical contributions have deepened the understanding of co-training.
Dasgupta, Littman, and McAllester (2001) provided PAC generalization bounds for co-training, proving that, under the conditional independence assumption, the disagreement rate of the two classifiers on unlabeled data can serve as an upper bound on each classifier's error rate. This result formalized the intuition that maximizing agreement between diverse classifiers on unlabeled data is a principled learning strategy.
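The quantity these bounds are built on, the fraction of unlabeled examples on which the two view classifiers disagree, needs no labels to compute. A minimal sketch, with a hypothetical function name and toy predictions:

```python
import numpy as np


def disagreement_rate(pred_view1, pred_view2):
    """Fraction of examples on which the two view classifiers disagree."""
    return float(np.mean(np.asarray(pred_view1) != np.asarray(pred_view2)))


# Predictions of the two classifiers on eight unlabeled examples.
rate = disagreement_rate([1, 0, 1, 1, 0, 0, 1, 0],
                         [1, 0, 0, 1, 0, 1, 1, 0])
print(rate)
```

Tracking this number across co-training rounds gives a label-free progress signal, although it is only informative to the extent that the views are close to conditionally independent.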
Balcan, Blum, and Yang (2004) relaxed the conditional independence assumption, showing that co-training can succeed under weaker conditions. Specifically, they proved that an "expansion" property of the data distribution, combined with sufficient views, is enough for co-training to work. Roughly, the expansion condition asks that whenever the classifiers are confident on some set of examples, there is a non-negligible probability of examples on which exactly one of the two views is confident, so that each round lets one view teach the other something new. This explains why co-training often succeeds in practice even when strict conditional independence does not hold.
Wang and Zhou (2007, 2010) analyzed co-training from the perspective of large-margin classifiers and connected it to the maximum entropy principle. They also derived bounds that depend on diversity between the two classifiers, formalizing the intuition that more diverse classifiers extract more information from unlabeled data.
More recently, theoretical work has connected co-training to information-theoretic objectives. Federici et al. (2020) showed that minimizing a multi-view information bottleneck recovers a co-training-like objective in which the two views discard information not relevant to the shared content. This perspective links co-training to modern self-supervised representation learning.
The following checklist summarizes the practical decisions that arise when applying co-training to a new task.
| Decision | Guidance |
|---|---|
| Choose two views | Look for natural splits: structure vs. content, lexical vs. context, audio vs. video. Avoid splitting features that are highly correlated. |
| Pick the base learner | Naive Bayes and logistic regression are common because their probability outputs help with ranking pseudo-labels. |
| Set the labeled budget | Co-training pays the most when labels are very scarce, often fewer than 100 labeled examples per class. |
| Determine pool size | Working pools of 50 to 200 examples work well. Larger pools slow convergence; smaller pools risk noise. |
| Cap pseudo-labels per round | Adding too many pseudo-labels per round amplifies errors. One to five per class per round is typical. |
| Match the class prior | Add positives and negatives in proportion to the true class ratio to avoid prior drift. |
| Use top-k selection | Top-k ranked pseudo-labels avoid issues with miscalibrated confidence values. |
| Stop early | Hold out a small validation set and stop when performance plateaus. |
| Combine views at test time | Average or multiply per-class probabilities; fall back to a single view if necessary. |
| Audit the unlabeled pool | Inspect samples for label noise, duplication, and distribution shift. |
Several open-source projects offer ready-made implementations of co-training and related multi-view methods.
| Library | Language | Notes |
|---|---|---|
| scikit-learn | Python | Provides self-training and label propagation; co-training requires custom code on top of base classifiers. |
| Sslearn | Python | Open-source SSL library that includes co-training, tri-training, and democratic co-learning. |
| MEKA | Java | Multi-label and multi-view extensions to the Weka framework. |
| RapidMiner | GUI | Provides drag-and-drop co-training operators for users without coding experience. |
For most practical projects, a custom implementation of the original Blum and Mitchell loop using scikit-learn classifiers is straightforward, and the pseudocode above translates almost directly into Python.
Several datasets have served as standard benchmarks for co-training research.
| Dataset | Domain | Notes |
|---|---|---|
| WebKB | Web pages | Used in the original Blum and Mitchell paper; pages from four universities labeled by category. |
| CiteSeer | Citation network | Document classification with text view and citation view. |
| Cora | Citation network | Similar to CiteSeer with seven research subject categories. |
| 20 Newsgroups | Text | Posts from 20 Usenet newsgroups, often split into different feature subsets. |
| MUC-6 / MUC-7 | Named entities | Spelling view and contextual view used in NER co-training. |
| CIFAR-10 (SSL split) | Images | 4,000 labels out of 50,000 training images is a common SSL setup. |
| STL-10 | Images | 5,000 labels and 100,000 unlabeled images, designed for SSL evaluation. |
| ImageNet (1% labeled) | Images | Used by Deep Co-Training and later SSL methods. |
| FashionMNIST (SSL) | Images | Smaller-scale benchmark popular for educational implementations. |
Reporting on these datasets allows fair comparison across SSL methods, although co-training-specific results are most often reported on text and citation graphs, where two natural views are easier to construct.
Co-training sits inside a broader network of ideas about exploiting unlabeled data and multiple views.
| Related concept | Relation to co-training |
|---|---|
| Multi-view learning | Co-training is the canonical example of a multi-view algorithm; multi-view learning generalizes to many objectives beyond pseudo-labeling. |
| Contrastive learning | Two augmented views of the same example serve as anchor and positive; the two-view structure is inherited from co-training. |
| CLIP | Image and caption are two views of the same content; the contrastive loss aligns their representations. |
| Mean Teacher | The teacher and student form a soft two-view system in which the teacher's predictions act as pseudo-labels for the student. |
| FixMatch | Weakly and strongly augmented copies of an image play the role of two views, and the weak prediction supervises the strong one. |
| Knowledge distillation | Distillation transfers a teacher's predictions to a student in a way that resembles one round of co-training. |
| Active learning | Co-training selects high-confidence pseudo-labels; active learning selects high-uncertainty queries for human labeling. The two are sometimes combined. |
| Domain adaptation | Co-training has been adapted to bridge labeled source data and unlabeled target data, treating each domain as a view. |
| Graph-based SSL | Label propagation and co-training are complementary: one uses graph smoothness, the other uses view agreement. |
| Aspect | Strengths | Limitations |
|---|---|---|
| Theoretical foundation | PAC-style guarantees under stated assumptions | Assumptions rarely fully hold in practice |
| Sample efficiency | Strong gains in extreme low-label regimes | Gains shrink as labeled data grows |
| Flexibility of base learner | Works with naive Bayes, logistic regression, decision trees, and many others | Needs confidence scores that rank pseudo-labels reliably, which not every learner provides |
| Compatibility with views | Naturally suited to multi-modal and structured data | Requires meaningful view splits, which are not always available |
| Robustness | Cross-labeling reduces error reinforcement | Highly correlated views reintroduce confirmation bias |
| Computational cost | Modest; two classifiers and an outer loop | Can be slow when retraining heavy models per round |
| Modern relevance | Influences contrastive and consistency methods | Rarely state of the art on image SSL benchmarks |
Co-training is not simply an ensemble method. Ensembles such as random forests train many models on the same labeled data and combine their predictions at test time. Co-training trains two models that explicitly teach each other on unlabeled data. The goal is a better model in a low-label setting, not just a better aggregate prediction.
Co-training can also be applied to regression. Co-Regression methods extend co-training to continuous targets by adding pseudo-targets when the two regressors predict similar values. Zhou and Li (2007) describe one such variant called COREG, which uses k-nearest-neighbor regressors as the base learners.
When the data has no natural feature split, several options exist. Tri-training uses three classifiers initialized on bootstrap samples of the labeled data, removing the need for a feature split. Democratic Co-Learning uses several different algorithms on the same features. Random feature splits have also been shown to work surprisingly well in some empirical studies.
The key reason co-training can outperform self-training is error correction. A single self-training classifier cannot detect its own systematic mistakes: the same biases that cause an error during prediction also cause that error to appear confidently in the pseudo-labels. Two classifiers on different views are likely to make different mistakes, so each one's confident predictions correct the other's blind spots.
Co-training is not obsolete, although it is no longer the default choice. Methods like FixMatch and Mean Teacher dominate image SSL benchmarks today, but the underlying principle of using two related views to supervise each other is still active. CLIP, SimCLR, and consistency regularization can all be viewed as descendants of co-training. The classical algorithm itself remains useful for tabular and text data with naturally distinct feature views.
Imagine you have a bag of differently shaped and colored toys, and you want to teach your two friends, Alice and Bob, how to sort them into two groups. However, you only have a few toys with stickers telling which group they belong to.
Alice will learn to sort toys based on their shape, while Bob will learn to sort them based on their color. First, you show Alice and Bob the labeled toys and explain how to sort them. Then, you give them some unlabeled toys to practice on. They start sorting the toys, and if they are very sure about a toy's group, they put a sticker on it. Then, they swap the newly stickered toys with each other to learn from each other's confident decisions.
Alice might notice that a round toy belongs in Group A based on its shape, and she tells Bob. Bob did not know that because he was only looking at colors. Now Bob learns something new. In the next round, Bob might confidently sort a blue toy into Group B based on its color, and he tells Alice. This process continues until Alice and Bob agree on how to sort most of the toys.
This is how co-training works in machine learning. Two classifiers learn from a small amount of labeled data and then teach each other using unlabeled data that they are confident about. Because they look at different features, they catch things the other one misses, helping both of them get better over time.