Confident Learning (CL) is a subfield of supervised learning and weak supervision aimed at characterizing label noise, finding label errors, learning with noisy labels, and detecting ontological issues in datasets. CL is based on three principles: pruning noisy data, counting to estimate noise, and ranking examples to train with confidence. It generalizes Angluin and Laird's classification noise process to directly estimate the joint distribution between given and unknown labels. CL requires two inputs: out-of-sample predicted probabilities and noisy labels. The three steps for weak supervision with CL are estimating the joint distribution, pruning noisy examples, and re-weighting examples by the estimated class prior.
The framework was introduced by Curtis G. Northcutt, Lu Jiang, and Isaac L. Chuang in the paper Confident Learning: Estimating Uncertainty in Dataset Labels, first posted to arXiv in October 2019 and published in the Journal of Artificial Intelligence Research (JAIR) in April 2021. CL has become one of the most widely used techniques in data-centric AI, the school of practice that treats dataset quality, rather than model architecture, as the main lever for improving machine learning systems.
Most machine learning research assumes that training and test labels are correct. In reality, large benchmark datasets contain a non-trivial fraction of mislabeled examples. Even MNIST, often treated as the gold standard for clean data, has confirmed label errors in its test set. ImageNet, CIFAR and Amazon Reviews all carry meaningful noise rates. When the noise is class-conditional, models can memorize the wrong answers, benchmarks can mis-rank architectures, and downstream products can inherit silent failure modes.
Early theoretical work by Dana Angluin and Philip Laird in 1988 introduced the Classification Noise Process, a model in which each true label may be flipped to a wrong label with some probability that depends only on the true class. Their result showed that, under this process, choosing the most consistent rule from a sample suffices for learning, provided the noise rate stays below one half. Subsequent work by Natarajan, Dhillon, Ravikumar, and Tewari (NIPS 2013) extended the analysis to surrogate losses, and a long line of methods, including bootstrapping, MentorNet, Co-teaching, and Mixup, proposed model-centric techniques that aim to be robust to noise without explicitly identifying which labels are wrong.
Confident learning takes a different posture. Instead of designing a noise-tolerant loss or curriculum, it tries to recover the joint distribution between observed (possibly wrong) labels and latent true labels, then uses that distribution to surgically remove or down-weight the bad examples before training a final model on the cleaned data.
Let X be the input space and m the number of classes. For each example x, let y_true denote the unknown latent true class and y_obs the observed noisy label. CL assumes a class-conditional noise (CCN) model:
p(y_obs = i | y_true = j, x) = p(y_obs = i | y_true = j)
That is, the probability that an example whose true class is j is mislabeled as i depends on j and i but not on the features x. This assumption is appropriate when classes that are easy to confuse get confused systematically, for example missile and projectile on ImageNet, and it is the same assumption used in the Angluin and Laird formulation that CL generalizes.
CL aims to estimate the joint distribution matrix Q[y_obs, y_true], an m by m matrix whose entry (i, j) gives the joint probability that an example was labeled i but truly belongs to class j. With Q in hand, label errors and class-pair confusions follow directly.
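As a concrete illustration, consider a hypothetical three-class problem with a joint matrix whose values are chosen arbitrarily for this sketch. The off-diagonal mass of Q is the estimated fraction of mislabeled examples, its row sums recover the observed label distribution, and its column sums estimate the latent prior over true classes:

```python
import numpy as np

# Hypothetical joint Q[y_obs, y_true] for a 3-class problem; entries sum to 1.
# Rows index the observed label, columns the latent true label.
Q = np.array([
    [0.25, 0.02, 0.00],   # examples labeled class 0
    [0.03, 0.30, 0.01],   # examples labeled class 1
    [0.00, 0.04, 0.35],   # examples labeled class 2
])

label_error_rate = 1.0 - np.trace(Q)   # off-diagonal mass: 0.10 here
p_obs = Q.sum(axis=1)                  # marginal of observed labels
p_true = Q.sum(axis=0)                 # estimated latent prior over true labels
noise_rates = Q / p_true               # column j gives p(y_obs = i | y_true = j)
```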
CL needs only two things from the user, and is otherwise model-agnostic and modality-agnostic:
| Input | Description |
|---|---|
| Out-of-sample predicted probabilities | An n by m matrix of class probabilities for each of the n examples, produced by a probabilistic classifier trained on data the example was not part of, typically via cross-validation. |
| Noisy labels | The given (possibly wrong) integer labels assigned to each example. |
Using out-of-sample probabilities is essential. If a classifier scored its own training data, it would be over-confident on memorized noisy labels and would never flag them as errors. Cross-validated probabilities give an honest second opinion.
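With scikit-learn, one common way to obtain such probabilities is cross_val_predict. The following is a sketch with synthetic stand-in data; in practice the feature matrix and given labels come from the user's dataset, and any probabilistic classifier can replace the logistic regression:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Stand-in data; in practice X and noisy_labels come from your dataset.
X, noisy_labels = make_classification(n_samples=500, n_classes=3,
                                      n_informative=5, random_state=0)

# Every row of pred_probs is produced by a model trained on folds that did
# not contain that example, so the probabilities are out-of-sample.
model = LogisticRegression(max_iter=1000)
pred_probs = cross_val_predict(model, X, noisy_labels,
                               cv=5, method="predict_proba")  # shape (n, m)
```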
The central object in CL is the confident joint, written C[y_obs, y_true]. For each class j, CL first computes a per-class threshold t_j equal to the average self-confidence the model assigns to examples that were labeled j:
t_j = mean of p_hat(y = j | x) over all x with given label j
For each example x with given label i, CL then increments C[i, j] for every class j such that the predicted probability p_hat(y = j | x) is greater than or equal to t_j. When more than one class clears its threshold, CL assigns the example to the class with the largest predicted probability. The result is an m by m matrix that counts, for each observed class i and each candidate true class j, the number of examples we are confident are really j even though they were labeled i.
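The counting rule can be written down in a few lines of numpy. This is a minimal sketch of the construction just described, not the cleanlab implementation:

```python
import numpy as np

def confident_joint(pred_probs, labels, m):
    """Count C[i, j]: examples labeled i that the model confidently places in j.

    pred_probs: (n, m) out-of-sample predicted probabilities
    labels:     (n,) given (possibly noisy) integer labels
    """
    # Per-class threshold t_j: average self-confidence over examples labeled j.
    thresholds = np.array([pred_probs[labels == j, j].mean() for j in range(m)])

    C = np.zeros((m, m), dtype=int)
    for probs, i in zip(pred_probs, labels):
        above = np.flatnonzero(probs >= thresholds)   # classes clearing t_j
        if above.size == 0:
            continue                                  # counted in no bin
        j = above[np.argmax(probs[above])]            # largest probability wins
        C[i, j] += 1
    return C
```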
CL then calibrates C so its row sums match the empirical class counts and divides by the total number of examples to produce the estimated joint distribution Q[y_obs, y_true]. From Q one obtains the latent prior over true classes (the column sums), the noise rates p(y_obs = i | y_true = j), the inverse noise rates p(y_true = j | y_obs = i), and an estimate of the total number of label errors, namely the off-diagonal mass of Q multiplied by n.
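Continuing the counting sketch above, the calibration and normalization step can be written as follows (an illustration of the description, not cleanlab's exact code):

```python
import numpy as np

def estimate_joint(C, labels, m):
    """Calibrate C so row sums match empirical label counts, then normalize."""
    counts = np.bincount(labels, minlength=m).astype(float)
    row_sums = C.sum(axis=1, keepdims=True).astype(float)
    row_sums[row_sums == 0] = 1.0                  # guard against empty rows
    calibrated = (C / row_sums) * counts[:, None]  # row i now sums to count of i
    return calibrated / calibrated.sum()           # Q[y_obs, y_true], sums to 1
```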
Under reasonable conditions on the model's per-class accuracy, Q is a consistent estimator of the true joint distribution even when the predicted probabilities themselves are imperfect. The key guarantee is robustness to non-uniform calibration error: as long as the per-class threshold ordering is preserved, CL recovers the right joint, which is the property that distinguishes it from naive thresholding on raw probabilities.
CL is usually presented as a small pipeline. Cleanlab and most reproductions implement the same three steps.
| Step | What happens | Output |
|---|---|---|
| 1. Estimate the joint distribution | Compute the confident joint C from out-of-sample predicted probabilities and given labels, calibrate, and normalize to get Q[y_obs, y_true]. | An m by m joint distribution matrix and per-class noise rates. |
| 2. Find and prune label errors | Use Q to identify examples most likely to be mislabeled. Several pruning rules exist, including pruning by class (PBC), pruning by noise rate (PBNR), and the combined C+NR rule. | A cleaned dataset with suspect labels removed or set aside for relabeling. |
| 3. Retrain with reweighting | Retrain the model on the cleaned data, optionally reweighting each remaining example by the inverse of its estimated latent prior so that rare classes are not crowded out. | A final model trained on a higher quality subset. |
Unlike methods that try to learn noise rates jointly with model parameters, CL decouples noise estimation from training. That decoupling is what gives it its hyperparameter-free character: there are no robust loss tuning constants, no scheduled curricula, and no warm-up phases.
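As an illustration of step 2, the following is a minimal numpy sketch of one pruning rule, prune by noise rate (PBNR): for each off-diagonal pair (i, j), flag the n·Q[i, j] examples labeled i that the model most confidently assigns to class j. This is a sketch under the definitions above, not cleanlab's exact implementation:

```python
import numpy as np

def prune_by_noise_rate(pred_probs, labels, Q):
    """Flag likely label errors using the estimated joint Q (PBNR sketch)."""
    n, m = pred_probs.shape
    flagged = np.zeros(n, dtype=bool)
    for i in range(m):
        idx_i = np.flatnonzero(labels == i)        # examples given label i
        for j in range(m):
            if i == j:
                continue
            k = int(round(Q[i, j] * n))            # estimated count of i->j errors
            if k == 0 or idx_i.size == 0:
                continue
            # Among examples labeled i, take the k most confident in class j.
            order = np.argsort(pred_probs[idx_i, j])[::-1]
            flagged[idx_i[order[:k]]] = True
    return flagged
```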
CL is implemented in the cleanlab Python library, originally released by Northcutt in November 2019 alongside the arXiv preprint and now maintained by Cleanlab Inc., a company spun out of MIT in 2021. The library is licensed under Apache 2.0 and supports Python 3.10 and later on Linux, macOS, and Windows.
Cleanlab has grown well beyond the original confident learning algorithm. Notable components include:
| Component | Purpose |
|---|---|
| cleanlab.classification.CleanLearning | Wraps any scikit-learn compatible classifier and applies the three-step CL pipeline end to end. |
| cleanlab.dataset | Provides the joint distribution, noise rates, and dataset-level health scores. |
| cleanlab.outlier | Detects out-of-distribution and atypical examples using model embeddings or scores. |
| cleanlab.multiannotator | Infers a consensus label and per-annotator quality scores from datasets with multiple human raters. |
| ActiveLab | Active learning policy that decides whether to collect a fresh label for a new example or re-label an existing example with a possibly wrong consensus. |
| Datalab | Higher level tool that audits text, image, audio, and tabular datasets for many issue types at once, including label errors, near-duplicates, outliers, drift, and class imbalance. |
Cleanlab supports classification, regression, token classification, image segmentation, object detection, and multi-label tasks. It is commonly used as a preprocessing step in production data pipelines and as a teaching tool in courses on data-centric AI, including the MIT IAP course taught by Northcutt and colleagues.
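A typical end-to-end use looks like the following sketch (synthetic stand-in data; exact signatures may differ slightly across cleanlab versions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from cleanlab.classification import CleanLearning

# Stand-in data; in practice X and noisy_labels come from your dataset.
X, noisy_labels = make_classification(n_samples=500, n_classes=3,
                                      n_informative=5, random_state=0)

# CleanLearning runs the full pipeline inside .fit(): cross-validated
# probabilities, confident-joint estimation, pruning, and retraining.
cl = CleanLearning(LogisticRegression(max_iter=1000))
cl.fit(X, noisy_labels)            # trains on examples CL did not flag
issues = cl.get_label_issues()     # per-example label-quality diagnostics
predictions = cl.predict(X)
```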
The most visible application of CL is the 2021 NeurIPS paper Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks by Northcutt, Anish Athalye (MIT), and Jonas Mueller (then at Amazon, later Cleanlab). The authors used CL to flag candidate label errors in the test sets of ten of the most widely used computer vision, natural language, and audio benchmarks, then asked Mechanical Turk workers to validate each candidate. About 51 percent of the algorithmic candidates were confirmed by humans as actual errors.
| Dataset | Modality | Estimated test-set error rate |
|---|---|---|
| MNIST | Image (handwritten digits) | About 0.15 percent (15 confirmed errors out of 10,000). |
| CIFAR-10 | Image | About 0.5 percent. |
| CIFAR-100 | Image | About 5.8 percent. |
| ImageNet (ILSVRC validation) | Image | About 5.8 percent. |
| Caltech-256 | Image | About 1.5 percent. |
| QuickDraw | Image (sketches) | About 10.1 percent. |
| 20news | Text | About 1.1 percent. |
| IMDB | Text (sentiment) | About 2.9 percent. |
| Amazon Reviews | Text (sentiment) | About 4.0 percent. |
| AudioSet | Audio | About 1.4 percent. |
Across the ten datasets, the average test-set error rate was about 3.4 percent. A companion site at labelerrors.com lets visitors browse every confirmed error.
The paper's punchline was that benchmark rankings can flip when test labels are corrected. On ImageNet, ResNet-18 outperforms ResNet-50 once the prevalence of mislabeled examples in the test set rises by just 6 percent. On CIFAR-10, a 5 percent shift is enough to make VGG-11 beat VGG-19. The result is widely cited as evidence that higher capacity does not always mean better generalization and that benchmark progress can be partially an artifact of model architectures fitting the noise.
Learning with noisy labels is a crowded field. CL sits in the data-centric corner, while many earlier methods are model-centric.
| Approach | Idea | Relationship to CL |
|---|---|---|
| Bootstrapping (Reed et al., 2014) | Mix the model's predictions with the given labels to soften wrong labels during training. | Model-centric; does not explicitly identify which labels are wrong. |
| Forward and backward loss correction (Patrini et al., 2017) | Multiply the loss by an estimated noise transition matrix. | Requires a similar transition matrix to Q, but estimates it differently and uses it inside the loss instead of for pruning. |
| MentorNet (Jiang et al., 2018) | A teacher network learns a curriculum that selects easy examples for the student. | Model-centric and requires training a second network. |
| Co-teaching (Han et al., 2018) | Two networks teach each other by exchanging the small-loss examples each iteration. | No explicit noise model; sensitive to symmetric noise assumptions. |
| Label smoothing | Replace one-hot labels with a slightly softer distribution. | Reduces overconfidence but does not target identified errors. |
| Mixup (Zhang et al., 2018) | Train on convex combinations of examples and labels. | Improves regularization; does not identify mislabeled examples. |
| Confident learning (Northcutt et al., 2021) | Estimate the joint distribution of noisy and true labels, then prune. | Hyperparameter-free, model-agnostic, and produces interpretable error lists. |
The original CL paper benchmarked the method on CIFAR-10 with synthetically injected class-conditional noise and reported that CL outperformed seven contemporary methods, including INCV, Mixup, MentorNet, Co-teaching, S-Model, Reed, and SCE-loss, particularly under sparse and asymmetric noise where the noise concentrates between specific class pairs.
The CL paper proves several consistency results. The headline guarantee is that, under the class-conditional noise assumption and a mild per-class accuracy condition, the confident joint C is an exact estimator of the true joint distribution in expectation, and a consistent estimator as the dataset grows. The per-class accuracy condition essentially asks that the model is, on average, more right than wrong on each class, which is a much weaker requirement than asking for well-calibrated probabilities.
A second result, sometimes called the thresholding robustness property, shows that CL recovers the correct joint even when the predicted probabilities are biased, as long as the bias does not change the relative ordering of the per-class thresholds. In practice this means CL can use modestly miscalibrated neural network softmax outputs without large degradation. The theory does not, however, cover instance-dependent noise, where the probability of a flip depends on the example's features.
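The threshold-shift argument behind this property can be verified numerically. In the sketch below (synthetic data and arbitrary per-class biases), adding a constant eps_j to every predicted probability for class j moves the threshold t_j by exactly eps_j, so the comparison p_hat(y = j | x) >= t_j selects the same examples before and after the bias:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 3
labels = rng.integers(0, m, size=n)
pred_probs = rng.dirichlet(np.ones(m), size=n)  # synthetic (n, m) probabilities
eps = np.array([0.05, -0.02, 0.03])             # arbitrary per-class bias

thresholds = np.array([pred_probs[labels == j, j].mean() for j in range(m)])
biased = pred_probs + eps                       # per-class miscalibration
biased_thresholds = np.array([biased[labels == j, j].mean() for j in range(m)])

# Thresholds shift by exactly eps_j, so p_hat(j|x) + eps_j >= t_j + eps_j
# holds for exactly the same set of examples as p_hat(j|x) >= t_j.
assert np.allclose(biased_thresholds, thresholds + eps)
```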
CL and the cleanlab implementation are used in a range of practical settings, including auditing benchmark test sets, cleaning training data before model development, scoring annotator quality in multi-rater labeling projects, and flagging examples for review or relabeling in production data pipelines.
CL is not a universal fix. The most important caveats are:
- The theory assumes class-conditional noise; instance-dependent noise, where the flip probability depends on an example's features, is not covered.
- The estimates are only as good as the out-of-sample predicted probabilities; a weak or improperly cross-validated model produces a noisy worklist.
- Flagged examples are candidates, not confirmed errors; in the benchmark study, only about half of the algorithmic candidates were confirmed by human raters.
Users are usually advised to treat CL output as a prioritized worklist for review rather than as an automatic deletion oracle, especially in domains where a wrong label can be expensive to recover.
The CL paper has accumulated thousands of citations since publication and is taught in graduate courses at MIT, Stanford, and elsewhere as a canonical introduction to data-centric label cleaning. Cleanlab, the company, commercializes the technique through a hosted platform called Cleanlab Studio, which adds a UI, scalable inference, and connectors to common data warehouses.
The framework has also influenced adjacent areas. The labelerrors.com release prompted maintainers of several benchmarks to publish corrected splits, and several papers since 2021 report results on both the original and cleaned versions of CIFAR and ImageNet to demonstrate that improvements are not noise artifacts. CL primitives have been re-implemented in MATLAB's Statistics and Machine Learning Toolbox and ported to Julia and R by community contributors.