Co-training is a semi-supervised learning algorithm that leverages both labeled and unlabeled data by training two classifiers on two distinct "views" of the data, allowing them to teach each other iteratively. Introduced by Avrim Blum and Tom Mitchell in 1998, co-training is one of the foundational techniques in multi-view learning and remains widely used in settings where labeled data is scarce but unlabeled data is plentiful.
In many real-world machine learning problems, obtaining labeled data is expensive, time-consuming, or requires domain expertise. At the same time, unlabeled data is often available in large quantities. Co-training addresses this imbalance by exploiting the structure of the feature space itself.
The core idea behind co-training is that the features describing each data instance can be naturally divided into two separate subsets, called "views." Each view should contain enough information on its own to make accurate predictions. Two separate classifiers are trained, one on each view. After initial training on a small labeled dataset, each classifier labels some of the unlabeled data. The most confident predictions from one classifier are then added to the training set of the other classifier. This process repeats for several rounds, with both classifiers gradually improving as they receive more training data from each other.
Co-training belongs to the broader family of "wrapper" or "proxy-label" methods in semi-supervised learning, where pseudo-labels generated by one model are used to expand the training set for another.
Co-training was first proposed by Avrim Blum and Tom Mitchell in their 1998 paper "Combining Labeled and Unlabeled Data with Co-Training," presented at the 11th Annual Conference on Computational Learning Theory (COLT). The paper received the 10-Year Best Paper Award at the International Conference on Machine Learning (ICML) in 2008, recognizing its lasting impact on the field. As of the mid-2020s, the original paper has been cited over 5,000 times.
Blum and Mitchell's motivating application was classifying web pages as "academic course home pages" or not. They observed that web pages have two natural views: (1) the text content of the page itself, and (2) the anchor text of hyperlinks on other pages that point to it. Using just 12 labeled web pages as initial training examples, their co-training algorithm correctly classified 95% of 788 web pages. This result demonstrated the power of leveraging unlabeled data with multiple views.
The standard co-training algorithm proceeds through the following steps:
| Step | Action |
|---|---|
| 1. Feature split | Divide the input features into two disjoint sets, View 1 (V1) and View 2 (V2) |
| 2. Initialization | Train two classifiers, C1 on V1 and C2 on V2, using the small set of labeled examples |
| 3. Labeling | Apply C1 and C2 to a pool of unlabeled examples; each classifier produces predictions with confidence scores |
| 4. Selection | From C1's predictions, select the most confident positive and negative examples; do the same for C2 |
| 5. Augmentation | Add C1's confident predictions to C2's training set, and C2's confident predictions to C1's training set |
| 6. Retraining | Retrain both classifiers on their expanded training sets |
| 7. Iteration | Repeat steps 3 through 6 for a fixed number of rounds or until no more confident predictions can be made |
A critical aspect of the algorithm is that each classifier labels data for the other classifier, not for itself. This cross-labeling mechanism helps prevent the reinforcement of errors that can occur in self-training, where a single classifier labels data for its own retraining.
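The procedure can be sketched in a few lines of code. The following Python sketch is illustrative only: the naive Bayes base learners, the number of rounds, and the number of examples selected per round are arbitrary choices, and the bookkeeping is simplified relative to Blum and Mitchell's original pool-based procedure. It assumes the two views are supplied as separate feature matrices.

```python
# Minimal co-training sketch (illustrative, not the exact Blum & Mitchell procedure).
# X1_* / X2_* are the two views of the labeled and unlabeled data as NumPy arrays.
import numpy as np
from sklearn.naive_bayes import GaussianNB


def co_train(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab,
             n_rounds=10, n_confident=5):
    """Return two classifiers trained by cross-labeling confident predictions."""
    c1, c2 = GaussianNB(), GaussianNB()
    # Separate training sets so each classifier learns from the other's labels.
    train1 = (X1_lab.copy(), y_lab.copy())
    train2 = (X2_lab.copy(), y_lab.copy())
    pool = np.arange(len(X1_unlab))            # indices of still-unlabeled examples

    for _ in range(n_rounds):
        if len(pool) == 0:
            break
        c1.fit(*train1)
        c2.fit(*train2)

        # Each classifier scores the unlabeled pool using its own view only.
        conf1 = c1.predict_proba(X1_unlab[pool]).max(axis=1)
        conf2 = c2.predict_proba(X2_unlab[pool]).max(axis=1)

        # The most confident predictions from C1 go to C2's training set, and vice versa.
        top1 = pool[np.argsort(conf1)[-n_confident:]]
        top2 = pool[np.argsort(conf2)[-n_confident:]]
        train2 = (np.vstack([train2[0], X2_unlab[top1]]),
                  np.concatenate([train2[1], c1.predict(X1_unlab[top1])]))
        train1 = (np.vstack([train1[0], X1_unlab[top2]]),
                  np.concatenate([train1[1], c2.predict(X2_unlab[top2])]))

        # Remove the newly pseudo-labeled examples from the unlabeled pool.
        pool = np.setdiff1d(pool, np.concatenate([top1, top2]))

    c1.fit(*train1)
    c2.fit(*train2)
    return c1, c2
```

Note that in this sketch each classifier's confident predictions augment the other classifier's training set, never its own, which is exactly the cross-labeling property described above.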
The theoretical guarantees of co-training rely on two main assumptions about the feature views:
| Assumption | Description |
|---|---|
| Sufficiency | Each view alone contains enough information to train an accurate classification model. In other words, the class label can be reliably predicted from either view independently. |
| Conditional independence | Given the class label, the two views are conditionally independent of each other. This means that knowing one view provides no additional information about the other view once the class is known. |
A third assumption, sometimes called compatibility, requires that the target functions learned from both views assign the same label to any given instance with high probability.
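Stated a little more formally (the notation here is only illustrative), write each instance as a pair of views $x = (x_1, x_2)$ with label $y$. Conditional independence requires

$$P(x_1, x_2 \mid y) \;=\; P(x_1 \mid y)\, P(x_2 \mid y),$$

while sufficiency and compatibility together require target functions $f_1$ and $f_2$ such that $f_1(x_1) = f_2(x_2) = y$ for (almost) all instances drawn from the distribution.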
Blum and Mitchell provided a PAC-style (Probably Approximately Correct) theoretical analysis showing that if these assumptions hold and each view has a weakly useful classifier, co-training can learn arbitrarily accurate classifiers from unlabeled data. In practice, however, these assumptions are rarely satisfied perfectly. Research by Krogel and Scheffer (2004) demonstrated that when classifier dependence exceeded approximately 60%, co-training performance actually worsened compared to using labeled data alone. Despite this, co-training often works well even when the assumptions are only approximately satisfied, as long as the views provide somewhat complementary information.
Co-training is closely related to self-training, but there are important differences between the two approaches.
| Aspect | Self-Training | Co-Training |
|---|---|---|
| Number of classifiers | One | Two (or more) |
| Feature views | Single view (all features) | Multiple views (disjoint feature sets) |
| Labeling mechanism | The classifier labels data for itself | Each classifier labels data for the other |
| Error correction | Cannot correct its own mistakes; errors may compound | Cross-labeling provides a form of error correction |
| Requirements | No special assumptions about features | Requires naturally separable, sufficient, and conditionally independent views |
| Applicability | Any supervised learning task | Tasks with naturally separable feature representations |
Self-training uses a single model that iteratively adds its own most confident predictions on unlabeled data to the training set. The main risk is that the model cannot correct its own mistakes: if it confidently assigns a wrong label, that error becomes part of the training data and may compound over subsequent iterations.
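For comparison, single-view self-training is available off the shelf in scikit-learn. In the minimal example below (the synthetic dataset, base learner, and confidence threshold are arbitrary choices), unlabeled points are marked with the label -1 and a single model iteratively pseudo-labels data for itself.

```python
# Self-training for comparison: one classifier pseudo-labels data for itself.
# Illustrative sketch using scikit-learn; unlabeled points carry the label -1.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=1000, random_state=0)
y_partial = y.copy()
y_partial[50:] = -1                      # keep only 50 labeled examples

model = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
model.fit(X, y_partial)                  # iteratively adds its own confident labels
print(model.score(X, y))
```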
Co-training mitigates this risk through diversification. Because the two classifiers operate on different feature sets, they tend to make different errors. When one classifier is wrong about an example, the other may correctly label it, providing a natural mechanism for error correction. This complementary relationship is the primary advantage of co-training over self-training.
Since Blum and Mitchell's original paper, numerous variants have been developed to address the limitations of standard co-training or to extend it to new settings.
Co-EM, introduced by Nigam and Ghani (2000), combines the multi-view framework of co-training with the Expectation-Maximization (EM) algorithm. Unlike standard co-training, which adds only the most confident predictions in a bootstrap fashion, co-EM operates on all unlabeled samples simultaneously in an iterative batch mode. Each classifier provides probabilistic labels for the unlabeled data, and these soft labels are used to retrain the other classifier. Co-EM has been shown to outperform standard co-training on several text classification benchmarks. However, it requires classifiers that can handle probabilistically labeled data and output class probabilities, and it can suffer from local maxima problems.
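A rough sketch of the idea is given below, assuming a binary problem with labels 0 and 1. The base learners and the soft-label encoding (duplicating each unlabeled example with both classes, weighted by predicted probability) are illustrative choices, not the exact formulation of Nigam and Ghani.

```python
# Co-EM sketch: every unlabeled example is soft-labeled on each round, and the
# soft labels produced from one view retrain the classifier for the other view.
# Assumes binary labels {0, 1}; simplified relative to the published algorithm.
import numpy as np
from sklearn.naive_bayes import GaussianNB


def soft_fit(clf, X_lab, y_lab, X_unlab, proba_unlab):
    """Fit clf on labeled data plus probabilistically labeled unlabeled data."""
    X = np.vstack([X_lab, X_unlab, X_unlab])
    y = np.concatenate([y_lab,
                        np.zeros(len(X_unlab), dtype=int),
                        np.ones(len(X_unlab), dtype=int)])
    w = np.concatenate([np.ones(len(X_lab)),
                        proba_unlab[:, 0],      # weight for class 0
                        proba_unlab[:, 1]])     # weight for class 1
    return clf.fit(X, y, sample_weight=w)


def co_em(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab, n_rounds=10):
    c1, c2 = GaussianNB(), GaussianNB()
    c1.fit(X1_lab, y_lab)
    for _ in range(n_rounds):
        # C1's soft labels on ALL unlabeled data retrain C2, and vice versa.
        c2 = soft_fit(c2, X2_lab, y_lab, X2_unlab, c1.predict_proba(X1_unlab))
        c1 = soft_fit(c1, X1_lab, y_lab, X1_unlab, c2.predict_proba(X2_unlab))
    return c1, c2
```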
Tri-training, proposed by Zhou and Li (2005), uses three classifiers instead of two and eliminates the requirement for multiple views of the data. The three classifiers are initialized by training on different bootstrap samples of the labeled data, which introduces diversity among them. In each round, an unlabeled example is labeled for a given classifier if and only if the other two classifiers agree on its label. This majority-vote mechanism provides a natural safeguard against noisy pseudo-labels.
Tri-training has several advantages over standard co-training: it does not require the feature space to be split into redundant and independent views, it imposes no constraints on the type of supervised learning algorithm used, and it has broader applicability. Experiments on UCI datasets and web page classification tasks have shown that tri-training effectively exploits unlabeled data. Although it was proposed nearly two decades ago, empirical studies have repeatedly confirmed that tri-training remains a strong baseline for semi-supervised learning in natural language processing.
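A simplified sketch of the agreement rule is shown below. The decision-tree base learners and fixed number of rounds are arbitrary, and the error-rate safeguards of the full Zhou and Li algorithm are omitted; integer class labels are assumed.

```python
# Tri-training sketch: three classifiers trained on bootstrap samples of the
# labeled data; an unlabeled example is pseudo-labeled for one classifier only
# when the other two agree on it. Illustrative simplification of Zhou & Li (2005).
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def tri_train(X_lab, y_lab, X_unlab, n_rounds=5):
    rng = np.random.default_rng(0)
    clfs = []
    for _ in range(3):
        idx = rng.integers(0, len(X_lab), len(X_lab))   # bootstrap sample
        clfs.append(DecisionTreeClassifier().fit(X_lab[idx], y_lab[idx]))

    for _ in range(n_rounds):
        preds = [c.predict(X_unlab) for c in clfs]
        new_clfs = []
        for i in range(3):
            j, k = [m for m in range(3) if m != i]
            agree = preds[j] == preds[k]                # the other two agree
            X_i = np.vstack([X_lab, X_unlab[agree]])
            y_i = np.concatenate([y_lab, preds[j][agree]])
            new_clfs.append(DecisionTreeClassifier().fit(X_i, y_i))
        clfs = new_clfs
    return clfs


def predict(clfs, X):
    # Final prediction by majority vote of the three classifiers (integer labels).
    votes = np.stack([c.predict(X) for c in clfs])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```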
A notable variant is tri-training with disagreement, which tightens the labeling condition: an example is added to a classifier's training set only if the other two classifiers agree on its label and the receiving classifier disagrees. This targets learning toward the examples where each classifier is weakest, preventing the labeled data from becoming skewed.
Democratic co-learning, proposed by Zhou and Goldman (2004), replaces the requirement for multiple feature views with multiple learning algorithms. Instead of splitting features, it trains several classifiers using different algorithms (for example, a decision tree, a naive Bayes classifier, and a support vector machine) on the same complete feature set. These classifiers have different inductive biases, which provides the diversity needed for effective mutual teaching.
For each unlabeled example, a prediction is made by majority vote of the classifiers. If the majority confidently agrees, the example is added to the training sets of the dissenting classifiers. The final prediction for new instances uses weighted majority voting, where each classifier's vote is weighted by its estimated accuracy. This approach is particularly useful in single-view settings where natural feature splits do not exist.
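A condensed sketch of this scheme is shown below, with plain (unweighted) majority voting standing in for the confidence-interval machinery of the original algorithm; the three base learners are arbitrary choices and integer class labels are assumed.

```python
# Democratic co-learning sketch: different algorithms share the full feature set;
# when a majority agrees on an unlabeled example, it is added to the training
# sets of the classifiers that dissented. Simplified relative to Zhou & Goldman (2004).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def democratic_co_learn(X_lab, y_lab, X_unlab, n_rounds=5):
    clfs = [DecisionTreeClassifier(), GaussianNB(), LogisticRegression(max_iter=1000)]
    train = [(X_lab, y_lab) for _ in clfs]          # one training set per learner

    for _ in range(n_rounds):
        for c, (X_i, y_i) in zip(clfs, train):
            c.fit(X_i, y_i)
        preds = np.stack([c.predict(X_unlab) for c in clfs])   # shape (3, n_unlab)
        # Majority label for each unlabeled example (assumes integer labels).
        majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, preds)
        confident = (preds == majority).sum(axis=0) >= 2       # at least 2 of 3 agree
        for i in range(len(clfs)):
            # Give the majority-labeled examples to the learners that dissented.
            dissent = confident & (preds[i] != majority)
            if dissent.any():
                X_i, y_i = train[i]
                train[i] = (np.vstack([X_i, X_unlab[dissent]]),
                            np.concatenate([y_i, majority[dissent]]))
    return clfs
```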
Several researchers have developed methods to apply co-training principles when only a single view of the data is available. Goldman and Zhou (2000) proposed using two different learning algorithms on the same feature set, relying on algorithmic diversity rather than feature diversity to generate complementary classifiers.
Multi-Head Co-Training is a more recent development that consolidates multiple classifiers into a single neural network with multiple output heads. This approach adds minimal extra parameters compared to maintaining entirely separate models, making it more computationally efficient. Each head provides pseudo-labels for the others, maintaining the co-training principle while sharing lower-level feature representations. Research has demonstrated accuracy improvements of up to 3.1% on semi-supervised benchmarks like CIFAR compared to other recent methods.
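The mechanism can be sketched as follows in PyTorch. The two-layer encoder, the confidence threshold, and the loss structure are placeholder choices for illustration, not the published configuration; `x_lab`, `y_lab`, and `x_unlab` are assumed to be labeled and unlabeled mini-batches.

```python
# Multi-head co-training sketch (illustrative): one shared encoder with two
# classification heads; each head's confident predictions supervise the other
# head on unlabeled data. Placeholder architecture and hyperparameters.
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadNet(nn.Module):
    def __init__(self, in_dim, n_classes, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(hidden, n_classes) for _ in range(2)])

    def forward(self, x):
        z = self.encoder(x)                       # shared lower-level representation
        return [head(z) for head in self.heads]


def co_training_loss(model, x_lab, y_lab, x_unlab, threshold=0.95):
    # Supervised loss: every head sees the labeled batch.
    loss = sum(F.cross_entropy(logits, y_lab) for logits in model(x_lab))

    logits_unlab = model(x_unlab)
    for i, j in [(0, 1), (1, 0)]:
        # Head i's confident predictions become pseudo-labels for head j.
        probs = F.softmax(logits_unlab[i].detach(), dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = conf > threshold
        if mask.any():
            loss = loss + F.cross_entropy(logits_unlab[j][mask], pseudo[mask])
    return loss
```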
Co-training and its variants have been applied successfully across many domains.
| Domain | Views / Setup | Description |
|---|---|---|
| Web page classification | Page text + anchor text from incoming hyperlinks | The original application by Blum and Mitchell (1998); achieved 95% accuracy with only 12 labeled examples |
| Natural language processing | Different feature representations of text (e.g., lexical features + syntactic features) | Used for named entity recognition, sentiment analysis, and statistical parsing |
| Bioinformatics | Sequence features + structural features of proteins | Applied to protein function prediction, gene expression analysis, and biomedical text mining |
| Computer vision | Visual features from different modalities (e.g., RGB + depth, or image + text caption) | Used in image classification, object detection, and medical image segmentation |
| Speech recognition | Audio features + visual features (lip movements) | Multi-modal speech processing leverages audio and visual streams as natural views |
| Information retrieval | Document content + query-click data | Web search ranking where page content and user interaction signals form separate views |
Co-training was also used in commercial applications, including FlipDog.com (a job search site) and the U.S. Department of Labor's directory of continuing and distance education programs.
Co-training is most beneficial in the following situations:

- Labeled examples are scarce or expensive to obtain, while unlabeled examples are plentiful.
- The features divide naturally into two views, each of which is sufficient on its own to predict the label.
- The two views are at least approximately conditionally independent given the class, so the classifiers tend to make complementary errors.
- The base classifiers can produce reliable confidence estimates for their predictions on unlabeled data.
Conversely, co-training may fail when the independence assumption is severely violated, when the initial classifiers are too weak, or when the unlabeled data distribution differs significantly from the labeled data distribution.
Beyond the original PAC-learning analysis by Blum and Mitchell, several theoretical contributions have deepened the understanding of co-training.
Dasgupta, Littman, and McAllester (2001) provided PAC generalization bounds for co-training, proving that, under the conditional independence assumption, the rate at which the two classifiers disagree on unlabeled data upper-bounds each classifier's error rate. This result formalized the intuition that maximizing agreement between diverse classifiers on unlabeled data is a principled learning strategy.
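Informally, and omitting the finite-sample terms, the result has the flavor

$$\Pr\big[h_1(x) \neq y\big] \;\le\; \Pr\big[h_1(x) \neq h_2(x)\big],$$

so the unobservable error of each classifier can be controlled by the observable rate at which the two classifiers disagree on unlabeled data.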
Balcan, Blum, and Yang (2004) relaxed the conditional independence assumption, showing that co-training can succeed under weaker conditions. Specifically, they proved that an "expansion" property of the data distribution, combined with sufficient views, is enough for co-training to work. This explained why co-training often succeeds in practice even when strict conditional independence does not hold.
Wang and Zhou (2007) analyzed co-training from the perspective of large-margin classifiers and showed connections between co-training and the maximum entropy principle, providing further theoretical justification for the algorithm's effectiveness.
Imagine you have a bag of differently shaped and colored toys, and you want to teach your two friends, Alice and Bob, how to sort them into two groups. However, you only have a few toys with stickers telling which group they belong to.
Alice will learn to sort toys based on their shape, while Bob will learn to sort them based on their color. First, you show Alice and Bob the labeled toys and explain how to sort them. Then, you give them some unlabeled toys to practice on. They start sorting the toys, and if they are very sure about a toy's group, they put a sticker on it. Then, they swap the newly stickered toys with each other to learn from each other's confident decisions.
Alice might notice that a round toy belongs in Group A based on its shape, and she tells Bob. Bob did not know that because he was only looking at colors. Now Bob learns something new. In the next round, Bob might confidently sort a blue toy into Group B based on its color, and he tells Alice. This process continues until Alice and Bob agree on how to sort most of the toys.
This is how co-training works in machine learning. Two classifiers learn from a small amount of labeled data and then teach each other using unlabeled data that they are confident about. Because they look at different features, they catch things the other one misses, helping both of them get better over time.