# Confident Learning (CL)

> Source: https://aiwiki.ai/wiki/confident_learning_cl
> Updated: 2026-06-25
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

Confident Learning (CL) is a data-centric machine learning framework for characterizing, finding, and learning with label errors in datasets. It works by directly estimating the joint distribution between noisy (observed) labels and latent (true) labels, using per-class confident thresholds to decide which examples are probably mislabeled, then pruning those examples and retraining. CL was introduced by Curtis G. Northcutt, Lu Jiang, and Isaac L. Chuang in the paper *Confident Learning: Estimating Uncertainty in Dataset Labels* (arXiv:1911.00068, October 2019), published in the *Journal of Artificial Intelligence Research* (JAIR) in 2021, and it is implemented in the open-source [cleanlab](/wiki/cleanlab) Python library. The authors describe CL as "an alternative approach which focuses instead on label quality by characterizing and identifying label errors in datasets, based on the principles of pruning noisy data, counting with probabilistic thresholds to estimate noise, and ranking examples to train with confidence." [1]

## What is confident learning?

Confident Learning (CL) is a subfield of [supervised learning](/wiki/supervised_learning) and weak supervision aimed at characterizing [label](/wiki/label) [noise](/wiki/noise), finding label errors, learning with noisy labels, and finding ontological issues. CL is based on the principles of pruning noisy data, counting to estimate noise, and ranking examples to train with confidence. CL generalizes Angluin and Laird's classification noise process to directly estimate the joint distribution between given and unknown labels. CL requires two inputs: out-of-sample predicted probabilities and noisy labels. The three steps for weak supervision using CL are estimating the joint distribution, pruning noisy examples, and re-weighting examples by estimated prior. [1]

The framework was introduced by Curtis G. Northcutt, Lu Jiang, and Isaac L. Chuang in the paper *Confident Learning: Estimating Uncertainty in Dataset Labels*, first posted to arXiv in October 2019 and published in the *Journal of Artificial Intelligence Research* (JAIR) in 2021 (volume 70, pp. 1373 to 1411). [1] The paper opens with the line that frames the whole field: "Learning exists in the context of data, yet notions of confidence typically focus on model predictions, not label quality." [1] CL has become one of the most widely used techniques in [data-centric AI](/wiki/data-centric_ai), the school of practice that treats dataset quality, rather than model architecture, as the main lever for improving machine learning systems.

## Why does confident learning matter?

Most [machine learning](/wiki/machine_learning) research assumes that training and test labels are correct. In reality, large benchmark datasets contain a non-trivial fraction of mislabeled examples. Even [MNIST](/wiki/mnist), often treated as the gold standard for clean data, has confirmed label errors in its test set, and the test sets of ImageNet, [CIFAR-100](/wiki/cifar), and Amazon Reviews have estimated error rates of roughly 4 to 6 percent. [2] When such errors go undetected, models can memorize the wrong answers, benchmarks can mis-rank architectures, and downstream products can inherit silent failure modes.

Early theoretical work by Dana Angluin and Philip Laird in 1988 introduced the *Classification Noise Process*, a model in which each true label may be flipped to a wrong label with some probability that depends only on the true class. [3] Their result showed that, under this process, choosing the most consistent rule from a sample is enough to learn provided the noise rate stays below one half. Subsequent work by Natarajan, Dhillon, Ravikumar, and Tewari (NeurIPS 2013) extended the analysis to surrogate losses, [4] and a long line of methods, including [bootstrapping](/wiki/bootstrapping), MentorNet, Co-teaching, and Mixup, proposed model-centric tricks that try to be robust to the noise without explicitly identifying which labels are wrong.

Confident learning takes a different posture. Instead of designing a noise-tolerant loss or curriculum, it tries to recover the *joint distribution* between observed (possibly wrong) labels and latent true labels, then uses that distribution to surgically remove or down-weight the bad examples before training a final model on the cleaned data. [1]

## How does confident learning work? The formal framework

Let *X* be the input space and *m* the number of classes. For each example *x*, let *y_true* denote the unknown latent true class and *y_obs* the observed noisy label. CL assumes a *class-conditional noise process* (CNP):

```
p(y_obs = i | y_true = j, x) = p(y_obs = i | y_true = j)
```

That is, the probability that an example whose true class is *j* is mislabeled as *i* depends on *j* and *i* but not on the features *x*. This assumption is appropriate when classes that are easy to confuse get confused systematically, for example *missile* and *projectile* on ImageNet, and it is the same assumption used in the Angluin and Laird formulation that CL generalizes. [1][3]
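
To make the assumption concrete, the following is a minimal sketch of how noisy labels could be sampled under a class-conditional model. The function name `flip_labels`, the 3-class `noise_matrix`, and the label counts are hypothetical and chosen only for illustration; they do not come from the CL paper. The point to notice is that the features *x* never enter the sampling step.

```python
import random

def flip_labels(true_labels, noise_matrix, seed=0):
    """Sample noisy labels from a class-conditional noise model.

    noise_matrix[i][j] = p(y_obs = i | y_true = j); each column sums to 1.
    Only the true class j is used, never the features x.
    """
    rng = random.Random(seed)
    num_classes = len(noise_matrix)
    noisy = []
    for j in true_labels:
        column = [noise_matrix[i][j] for i in range(num_classes)]
        noisy.append(rng.choices(range(num_classes), weights=column, k=1)[0])
    return noisy

# Hypothetical noise: class 0 is labeled 1 with probability 0.1,
# class 1 is never flipped, class 2 is labeled 1 with probability 0.3.
noise_matrix = [
    [0.9, 0.0, 0.0],
    [0.1, 1.0, 0.3],
    [0.0, 0.0, 0.7],
]
true_labels = [0] * 1000 + [1] * 1000 + [2] * 1000
noisy_labels = flip_labels(true_labels, noise_matrix)

flipped = sum(t != o for t, o in zip(true_labels, noisy_labels))
print(f"{flipped} of {len(true_labels)} labels were flipped")
```

Because the flips concentrate in specific class pairs (0 to 1 and 2 to 1), this is the kind of sparse, asymmetric noise that a joint distribution over (y_obs, y_true) can describe and a single global noise rate cannot.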

CL aims to estimate the *joint distribution* matrix Q[y_obs, y_true], an *m* by *m* matrix whose entry (i, j) gives the joint probability that an example was labeled *i* but truly belongs to class *j*. The paper's stated goal is to "directly estimate the joint distribution between noisy (given) labels and uncorrupted (unknown) labels." [1] With Q in hand, label errors and class-pair confusions follow directly.

### What inputs does confident learning need?

CL needs only two things from the user, and is otherwise model-agnostic and modality-agnostic:

| Input | Description |
|---|---|
| Out-of-sample predicted probabilities | An *n* by *m* matrix of class probabilities for each of the *n* examples, produced by a [probabilistic classifier](/wiki/probabilistic_classifier) trained on data the example was not part of, typically via [cross-validation](/wiki/cross-validation). |
| Noisy labels | The given (possibly wrong) integer labels assigned to each example. |

Using out-of-sample probabilities is essential. If a classifier scored its own training data, it would be over-confident on memorized noisy labels and would never flag them as errors. Cross-validated probabilities give an honest second opinion.
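
As an illustration of what "out-of-sample" means in practice, here is a small sketch that uses only the Python standard library. The one-dimensional nearest-centroid classifier, the function names, and the ten data points are all hypothetical stand-ins; any probabilistic classifier and any cross-validation routine could play the same role. Each example is scored by a model that was fit on the other folds only.

```python
import math

def fit_centroids(xs, ys, num_classes):
    """Toy 1-D nearest-centroid model: one mean per class."""
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for x, y in zip(xs, ys):
        sums[y] += x
        counts[y] += 1
    # Assumes every class appears at least once in the training folds.
    return [sums[k] / counts[k] for k in range(num_classes)]

def predict_proba(centroids, x):
    """Softmax over negative squared distances to each centroid."""
    scores = [-(x - c) ** 2 for c in centroids]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def out_of_sample_probs(xs, ys, num_classes, num_folds=5):
    """Score every example with a model fit on the other folds only."""
    n = len(xs)
    probs = [None] * n
    for fold in range(num_folds):
        held_out = set(range(fold, n, num_folds))
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        centroids = fit_centroids(train_x, train_y, num_classes)
        for i in held_out:
            probs[i] = predict_proba(centroids, xs[i])
    return probs

# Hypothetical data: class 0 sits near 0.0 and class 1 near 5.0.
# Index 3 has the value 4.9 but was given label 0.
xs = [0.1, -0.2, 0.3, 4.9, 0.0, 5.1, 4.8, 5.2, 5.0, 4.7]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
pred_probs = out_of_sample_probs(xs, ys, num_classes=2)
print([round(p, 3) for p in pred_probs[3]])
```

The resulting `pred_probs` list has one row per example and one column per class, which is the *n* by *m* shape described in the table above.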

### What are confident thresholds and the confident joint?

The central object in CL is the *confident joint*, written C[y_obs, y_true]. For each class *j*, CL first computes a per-class threshold *t_j* equal to the average self-confidence the model assigns to examples that were labeled *j*: [1]

```
t_j = mean of p_hat(y = j | x) over all x with given label j
```

These per-class average self-confidences are the *confident thresholds*. Using a separate threshold per class is what lets CL handle datasets where some classes are inherently easier to predict than others, instead of imposing one global probability cutoff. For each example *x* with given label *i*, CL then finds the classes *j* whose predicted probability *p_hat(y = j | x)* is greater than or equal to *t_j*. If exactly one class clears its threshold, CL increments C[i, j] for that class. If several classes clear their thresholds, CL increments only the entry for the class with the largest predicted probability. If no class clears its threshold, the example is not counted. The result is an *m* by *m* matrix that counts, for each observed class *i* and each candidate true class *j*, the number of examples we are confident are really *j* even though they were labeled *i*.

### From the confident joint to the joint distribution

CL then calibrates C so its row sums match the empirical class counts and divides by the total number of examples to produce the estimated joint distribution Q[y_obs, y_true]. From Q one obtains:

- The marginal of the latent true classes p(y_true).
- The class-conditional noise rates p(y_obs | y_true).
- The inverse noise rates p(y_true | y_obs).
- A ranking of suspected label errors.

Under reasonable conditions on the model's per-class accuracy, Q is a consistent estimator of the true joint distribution even when the predicted probabilities themselves are imperfect. The key guarantee is robustness to non-uniform calibration error: as long as the per-class threshold ordering is preserved, CL recovers the right joint, which is the property that distinguishes it from naive thresholding on raw probabilities. [1]
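
The calibration and the derived quantities listed above amount to a few lines of arithmetic. The sketch below assumes a confident joint that has already been computed; the matrix `conf_joint`, the label counts, and the function names are hypothetical, and the code assumes every row of the confident joint is nonzero.

```python
def estimate_joint(conf_joint, label_counts):
    """Rescale each row of C to the observed label count, then normalize to sum to 1."""
    calibrated = []
    for row, count in zip(conf_joint, label_counts):
        row_total = sum(row)  # assumed to be nonzero
        calibrated.append([c / row_total * count for c in row])
    grand_total = sum(sum(row) for row in calibrated)
    return [[v / grand_total for v in row] for row in calibrated]

def summarize(joint):
    """Derive marginals and conditional noise rates from Q[y_obs][y_true]."""
    m = len(joint)
    p_obs = [sum(joint[i]) for i in range(m)]                        # p(y_obs = i)
    p_true = [sum(joint[i][j] for i in range(m)) for j in range(m)]  # p(y_true = j)
    # noise[i][j] = p(y_obs = i | y_true = j)
    noise = [[joint[i][j] / p_true[j] for j in range(m)] for i in range(m)]
    # inverse[i][j] = p(y_true = j | y_obs = i)
    inverse = [[joint[i][j] / p_obs[i] for j in range(m)] for i in range(m)]
    return p_true, noise, inverse

# Hypothetical confident joint for 100 examples: 60 labeled 0 and 40 labeled 1.
conf_joint = [[40, 10], [5, 20]]
label_counts = [60, 40]

Q = estimate_joint(conf_joint, label_counts)
p_true, noise, inverse = summarize(Q)
error_fraction = sum(Q[i][j] for i in range(2) for j in range(2) if i != j)

print([[round(v, 2) for v in row] for row in Q])
print([round(p, 2) for p in p_true], round(error_fraction, 2))
```

In this toy case Q works out to roughly [[0.48, 0.12], [0.08, 0.32]], so the estimated share of mislabeled examples (the off-diagonal mass) is about 20 percent, and the latent prior for class 1 (0.44) is higher than its observed frequency (0.40).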

## What are the three steps of confident learning?

CL is usually presented as a small pipeline. Cleanlab and most reproductions implement the same three steps.

| Step | What happens | Output |
|---|---|---|
| 1. Estimate the joint distribution | Compute the confident joint C from out-of-sample predicted probabilities and given labels, calibrate, and normalize to get Q[y_obs, y_true]. | An *m* by *m* joint distribution matrix and per-class noise rates. |
| 2. Find and prune label errors | Use Q to identify examples most likely to be mislabeled. Several pruning rules exist, including pruning by class (PBC), pruning by noise rate (PBNR), and the combined C+NR rule. | A cleaned dataset with suspect labels removed or set aside for relabeling. |
| 3. Retrain with reweighting | Retrain the model on the cleaned data, optionally reweighting each remaining example by the inverse of its estimated latent prior so that rare classes are not crowded out. | A final model trained on a higher quality subset. |

Unlike methods that try to learn noise rates jointly with model parameters, CL decouples noise estimation from training. That decoupling is what gives it its hyperparameter-free character: there are no robust loss tuning constants, no scheduled curricula, and no warm-up phases. [1]
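
Step 2 can be illustrated with the simplest pruning rule: flag every example that falls in an off-diagonal cell of the confident joint, then rank the flagged examples by how little probability the model gives to their given label. This is a sketch under that assumption only. It does not implement PBC, PBNR, or C+NR, and the function name `flag_off_diagonal` and the data are hypothetical.

```python
def flag_off_diagonal(labels, pred_probs, num_classes):
    """Return indices of suspected label errors, most suspicious first."""
    # Per-class confident thresholds: mean self-confidence of each class.
    thresholds = []
    for j in range(num_classes):
        own = [p[j] for y, p in zip(labels, pred_probs) if y == j]
        thresholds.append(sum(own) / len(own))

    suspects = []
    for idx, (label, probs) in enumerate(zip(labels, pred_probs)):
        candidates = [j for j in range(num_classes) if probs[j] >= thresholds[j]]
        # Off-diagonal: the confident class differs from the given label.
        if candidates and max(candidates, key=lambda j: probs[j]) != label:
            suspects.append(idx)
    # Lowest self-confidence (probability of the given label) comes first.
    return sorted(suspects, key=lambda idx: pred_probs[idx][labels[idx]])

# Hypothetical out-of-sample probabilities for six examples and two classes.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1], [0.8, 0.2], [0.2, 0.8],
    [0.1, 0.9], [0.3, 0.7], [0.7, 0.3],
]
suspects = flag_off_diagonal(labels, pred_probs, num_classes=2)
suspect_set = set(suspects)
kept = [i for i in range(len(labels)) if i not in suspect_set]

print(suspects)  # → [2, 5]
print(kept)      # → [0, 1, 3, 4]
```

The `kept` indices define the cleaned subset for step 3. Because the ranking is explicit, the same list can instead be handed to a reviewer as a worklist before anything is deleted.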

## What is cleanlab?

CL is implemented in the [cleanlab](/wiki/cleanlab) Python library, originally released by Northcutt in 2019 alongside the arXiv preprint and now developed as the standard data-centric AI package for data quality. [9][10] The library is licensed under Apache 2.0 and supports modern Python releases on Linux, macOS, and Windows. Cleanlab Inc., the company that maintains it, was incorporated in late 2021 as a spinout of roughly a decade (2013 to 2021) of Northcutt's PhD research at MIT with Isaac Chuang, and was co-founded by Northcutt, Jonas Mueller, and Anish Athalye. [13]

Cleanlab has grown well beyond the original confident learning algorithm. Notable components include:

| Component | Purpose |
|---|---|
| `cleanlab.classification.CleanLearning` | Wraps any scikit-learn compatible classifier and applies the three-step CL pipeline end to end. |
| `cleanlab.dataset` | Provides the joint distribution, noise rates, and dataset-level health scores. |
| `cleanlab.outlier` | Detects out-of-distribution and atypical examples using model embeddings or scores. |
| `cleanlab.multiannotator` | Infers a consensus label and per-annotator quality scores from datasets with multiple human raters. |
| ActiveLab | Active learning policy that decides whether to collect a fresh label for a new example or re-label an existing example with a possibly wrong consensus. |
| Datalab | Higher level tool that audits text, image, audio, and tabular datasets for many issue types at once, including label errors, near-duplicates, outliers, drift, and class imbalance. |

Cleanlab supports classification, regression, token classification, image segmentation, object detection, and multi-label tasks. It is commonly used as a preprocessing step in production data pipelines and as a teaching tool in courses on data-centric AI, including the MIT IAP course taught by Northcutt and colleagues. [12]

## How many label errors did confident learning find in benchmark datasets?

The most visible application of CL is the 2021 NeurIPS paper *Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks* by Northcutt, Anish Athalye (MIT), and Jonas Mueller (then at Amazon, later Cleanlab). [2] The authors used CL to flag candidate label errors in the *test sets* of ten of the most widely used computer vision, natural language, and audio benchmarks, then asked Mechanical Turk workers to validate each candidate. The paper reports that "51% of the algorithmically-flagged candidates are indeed erroneously labeled, on average across the datasets." [2]

| Dataset | Modality | Estimated test-set errors | Error rate |
|---|---|---|---|
| [MNIST](/wiki/mnist) | Image (handwritten digits) | 15 (confirmed) | 0.15 percent |
| [CIFAR-10](/wiki/cifar) | Image | 54 | 0.54 percent |
| CIFAR-100 | Image | 585 | 5.85 percent |
| Caltech-256 | Image | 458 | 1.54 percent |
| [ImageNet](/wiki/imagenet) (ILSVRC validation) | Image | 2,916 | 5.83 percent |
| QuickDraw | Image (sketches) | 5,105,386 | 10.12 percent |
| 20news | Text | 82 | 1.09 percent |
| IMDB | Text (sentiment) | 725 | 2.90 percent |
| Amazon Reviews | Text (sentiment) | 390,338 | 3.90 percent |
| AudioSet | Audio | 275 | 1.35 percent |

Across the ten datasets, the average test-set error rate was about 3.4 percent. [2] The single largest absolute count was QuickDraw, where CL estimated over 5 million mislabeled sketches (about 10 percent of the set), while ImageNet's validation set carried 2,916 confirmed errors (about 6 percent). A companion site at labelerrors.com lets visitors browse every confirmed error. [11]

The paper's punchline was that benchmark rankings can flip when test labels are corrected. On ImageNet with corrected labels, ResNet-18 outperforms ResNet-50 once the prevalence of originally mislabeled examples in the test set rises by just 6 percent. [2] On CIFAR-10 with corrected labels, a 5 percent increase is enough to make VGG-11 beat VGG-19. [2] The result is widely cited as evidence that *higher capacity does not always mean better generalization* and that benchmark progress can be partially an artifact of model architectures fitting the noise.
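
The mechanics of such a flip can be seen with simple mixture arithmetic. If a fraction *p* of a test set consists of originally mislabeled examples that are now scored against corrected labels, overall accuracy is (1 - *p*) times the accuracy on the unchanged examples plus *p* times the accuracy on the corrected ones. The accuracies below are invented purely for illustration and are not taken from the paper; the function names are likewise hypothetical.

```python
def mixed_accuracy(acc_unchanged, acc_corrected, p):
    """Accuracy on a test set where a fraction p has corrected labels."""
    return (1 - p) * acc_unchanged + p * acc_corrected

def crossover_prevalence(big, small):
    """Prevalence p at which two models tie.

    Each model is a pair (acc_unchanged, acc_corrected). Assumes the big
    model leads on unchanged examples and trails on corrected ones.
    """
    gap_unchanged = big[0] - small[0]
    gap_corrected = small[1] - big[1]
    return gap_unchanged / (gap_unchanged + gap_corrected)

# Hypothetical accuracies: the larger model fits the original (noisy) labels
# better, so it does worse once those labels are corrected.
big_model = (0.80, 0.50)
small_model = (0.78, 0.70)

p_star = crossover_prevalence(big_model, small_model)
print(f"models tie at p = {p_star:.3f}")
for p in (0.05, 0.15):
    big = mixed_accuracy(*big_model, p)
    small = mixed_accuracy(*small_model, p)
    print(f"p = {p:.2f}: big = {big:.3f}, small = {small:.3f}")
```

With these made-up numbers the two models tie at a prevalence of about 9 percent: below it the larger model ranks first, above it the smaller one does.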

## How does confident learning compare with related approaches?

Learning with noisy labels is a crowded field. CL sits in the data-centric corner, while many earlier methods are model-centric.

| Approach | Idea | Relationship to CL |
|---|---|---|
| [Bootstrapping](/wiki/bootstrapping) (Reed et al., 2014) | Mix the model's predictions with the given labels to soften wrong labels during training. | Model-centric; does not explicitly identify which labels are wrong. |
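| Forward and backward loss correction (Patrini et al., 2017) | Correct the loss with an estimated noise transition matrix: forward correction applies the matrix to the model's predictions, and backward correction applies its inverse to the per-class losses. | Relies on a noise transition matrix, the conditional counterpart of CL's joint Q, but estimates it differently and uses it inside the loss instead of for pruning. [5] |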
| MentorNet (Jiang et al., 2018) | A teacher network learns a curriculum that selects easy examples for the student. | Model-centric and requires training a second network. [7] |
| Co-teaching (Han et al., 2018) | Two networks teach each other by exchanging the small-loss examples each iteration. | No explicit noise model; sensitive to symmetric noise assumptions. [6] |
| [Label smoothing](/wiki/label_smoothing) | Replace one-hot labels with a slightly softer distribution. | Reduces overconfidence but does not target identified errors. |
| Mixup (Zhang et al., 2018) | Train on convex combinations of examples and labels. | Improves regularization; does not identify mislabeled examples. |
| Confident learning (Northcutt et al., 2021) | Estimate the joint distribution of noisy and true labels, then prune. | Hyperparameter-free, model-agnostic, and produces interpretable error lists. [1] |

The original CL paper benchmarked the method on CIFAR-10 with synthetically injected class-conditional noise and reported that CL outperformed seven contemporary methods (INCV, Mixup, MentorNet, Co-teaching, S-Model, Reed, and SCE-loss), particularly under sparse and asymmetric noise, where the noise concentrates between specific class pairs. [1]

## What does confident learning guarantee theoretically?

The CL paper proves several consistency results. The headline guarantee is that, under the class-conditional noise assumption and a mild *per-class accuracy* condition, the confident joint C is an exact estimator of the true joint distribution in expectation, and a consistent estimator as the dataset grows. [1] The per-class accuracy condition essentially asks that the model is, on average, more right than wrong on each class, which is a much weaker requirement than asking for well-calibrated probabilities.

A second result, sometimes called the *thresholding robustness* property, shows that CL recovers the correct joint even when the predicted probabilities are biased, as long as the bias does not change the relative ordering of the per-class thresholds. In practice this means CL can use modestly miscalibrated neural network softmax outputs without large degradation. The theory does not, however, cover instance-dependent noise, where the probability of a flip depends on the example's features. [1]

## What is confident learning used for?

CL and the cleanlab implementation are used in a number of practical settings:

- *Dataset cleaning before training.* Apply CL to a labeled dataset to surface and remove mislabeled rows before fitting a production model.
- *Audit of crowd-sourced labels.* Combine CL with cleanlab's multiannotator tooling to score human labelers and flag systematically unreliable ones.
- *[Active learning](/wiki/active_learning).* Use ActiveLab to choose which unlabeled examples to send to annotators next, balancing exploration of new samples with re-labeling of low-quality existing labels.
- *Benchmark integrity.* Re-evaluate model rankings on cleaned versions of standard test sets to check whether reported gains are robust to label noise.
- *Regulated industries.* Healthcare, finance, and legal teams use CL to surface mislabeled records that would otherwise propagate into model decisions and audit trails.
- *Foundation model fine-tuning.* When fine-tuning a [large language model](/wiki/large_language_model) or vision encoder on domain data, CL can detect noisy supervision before expensive training runs.

## What are the limitations of confident learning?

CL is not a universal fix. The most important caveats are:

- It assumes class-conditional noise. When noise depends on the features (for example, a hard-to-photograph subset of objects always being mislabeled), CL's joint distribution estimator can be biased.
- It needs a model that is reasonable on the task. If the classifier used to generate out-of-sample probabilities is barely better than chance, the per-class thresholds carry little information and CL flags the wrong things.
- Pruning shrinks the dataset. On small or imbalanced datasets, removing suspected errors can hurt minority classes more than it helps.
- The framework targets classification-style problems. Extensions to regression, segmentation, and structured prediction exist in cleanlab but are less mature than the multiclass classification case.
- Validated label errors found by CL are still subject to human disagreement. The 2021 pervasive errors paper relied on Mechanical Turk consensus to confirm errors, which itself can be noisy on subjective tasks. [2]

Users are usually advised to treat CL output as a *prioritized worklist* for review rather than as an automatic deletion oracle, especially in domains where a wrong label can be expensive to recover.

## What is confident learning's influence and adoption?

The CL paper has accumulated thousands of citations since publication and is taught in graduate courses at MIT, Stanford, and elsewhere as the canonical introduction to data-centric label cleaning. Cleanlab the company commercializes the technique through a hosted platform called Cleanlab Studio, which adds a UI, scalable inference, and connectors to common data warehouses. In October 2023, Cleanlab raised a $25 million Series A co-led by Menlo Ventures and TQ Ventures, with participation from Bain Capital Ventures and Databricks Ventures, bringing total funding to about $30 million. [14]

The framework has also influenced adjacent areas. The labelerrors.com release prompted maintainers of several benchmarks to publish corrected splits, and several papers since 2021 report results on both the original and cleaned versions of CIFAR and ImageNet to demonstrate that improvements are not noise artifacts. CL primitives have been re-implemented in MATLAB's Statistics and Machine Learning Toolbox and ported to Julia and R by community contributors.

In January 2026, Cleanlab was acquired by Handshake, a company that began in 2013 as a college recruiting platform and launched a human data-labeling business for foundation model companies. The acquisition was announced on January 28, 2026; the terms of the transaction were not disclosed. [15] Handshake was valued at $3.3 billion in 2022 and had forecast roughly $300 million in annualized revenue by the end of 2025. [15]

## See also

- [Data-centric AI](/wiki/data-centric_ai)
- [Cleanlab](/wiki/cleanlab)
- [Label noise](/wiki/label_noise)
- [Active learning](/wiki/active_learning)
- [Cross-validation](/wiki/cross-validation)
- [Probabilistic classifier](/wiki/probabilistic_classifier)
- [Weak supervision](/wiki/weak_supervision)

## References

1. Northcutt, C. G., Jiang, L., and Chuang, I. L. (2021). "Confident Learning: Estimating Uncertainty in Dataset Labels." *Journal of Artificial Intelligence Research*, 70, pp. 1373 to 1411. arXiv:1911.00068. https://jair.org/index.php/jair/article/view/12125
2. Northcutt, C. G., Athalye, A., and Mueller, J. (2021). "Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks." *NeurIPS 2021 Track on Datasets and Benchmarks*. arXiv:2103.14749. https://arxiv.org/abs/2103.14749
3. Angluin, D., and Laird, P. (1988). "Learning From Noisy Examples." *Machine Learning*, 2(4), pp. 343 to 370. https://link.springer.com/article/10.1023/A:1022873112823
4. Natarajan, N., Dhillon, I. S., Ravikumar, P., and Tewari, A. (2013). "Learning with Noisy Labels." *Advances in Neural Information Processing Systems*, 26. https://proceedings.neurips.cc/paper/5073-learning-with-noisy-labels.pdf
5. Patrini, G., Rozza, A., Menon, A. K., Nock, R., and Qu, L. (2017). "Making Deep Neural Networks Robust to Label Noise: A Loss Correction Approach." *CVPR 2017*. https://arxiv.org/abs/1609.03683
6. Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., and Sugiyama, M. (2018). "Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels." *NeurIPS 2018*. https://arxiv.org/abs/1804.06872
7. Jiang, L., Zhou, Z., Leung, T., Li, L. J., and Fei-Fei, L. (2018). "MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels." *ICML 2018*. https://arxiv.org/abs/1712.05055
8. Northcutt, C. G. (2019). "An Introduction to Confident Learning: Finding and Learning with Label Errors in Datasets." Personal blog. https://l7.curtisnorthcutt.com/confident-learning
9. Northcutt, C. G. (2019). "Announcing cleanlab: a Python Package for ML and Deep Learning on Datasets with Label Errors." Personal blog. https://l7.curtisnorthcutt.com/cleanlab-python-package
10. Cleanlab Inc. "cleanlab open-source library." GitHub repository. https://github.com/cleanlab/cleanlab
11. Cleanlab Inc. "label-errors: Corrected Test Sets for ImageNet, MNIST, CIFAR, Caltech-256, QuickDraw, IMDB, Amazon Reviews, 20News, and AudioSet." GitHub repository. https://github.com/cleanlab/label-errors
12. MIT CSAIL. "Label Errors and Confident Learning." *Introduction to Data-Centric AI*, course notes. https://dcai.csail.mit.edu/2023/label-errors/
13. Cleanlab. "Cleanlab: The History, Present, and Future." Company blog. https://cleanlab.ai/blog/learn/cleanlab-history/
14. Cleanlab / Business Wire (October 10, 2023). "Cleanlab Raises $25M Series A to Automatically Increase the Value and Accuracy of the World's Enterprise Data Used by AI, ML, and Analytics Solutions." https://www.businesswire.com/news/home/20231010484401/en/
15. TechCrunch (January 28, 2026). "AI data labeler Handshake buys Cleanlab, an acquisition target of multiple others." https://techcrunch.com/2026/01/28/ai-data-labeler-handshake-buys-cleanlab-an-acquisition-target-of-multiple-others/