Confident Learning (CL) is a subfield of supervised learning and weak supervision aimed at characterizing label noise, finding label errors, learning with noisy labels, and detecting ontological issues in datasets. CL is based on three principles: pruning noisy data, counting to estimate noise, and ranking examples to train with confidence. It generalizes Angluin and Laird's classification noise process to directly estimate the joint distribution between given and unknown labels. CL requires two inputs: out-of-sample predicted probabilities and noisy labels. The three steps for weak supervision with CL are estimating the joint distribution, pruning noisy examples, and re-weighting examples by the estimated class prior.
The framework was introduced by Curtis G. Northcutt, Lu Jiang, and Isaac L. Chuang in the paper Confident Learning: Estimating Uncertainty in Dataset Labels, first posted to arXiv in October 2019 and published in the Journal of Artificial Intelligence Research (JAIR) in April 2021. CL has become one of the most widely used techniques in data-centric AI, the school of practice that treats dataset quality, rather than model architecture, as the main lever for improving machine learning systems.
Most machine learning research assumes that training and test labels are correct. In reality, large benchmark datasets contain a non-trivial fraction of mislabeled examples. Even MNIST, often treated as the gold standard for clean data, has confirmed label errors in its test set. ImageNet, CIFAR and Amazon Reviews all carry meaningful noise rates. When the noise is class-conditional, models can memorize the wrong answers, benchmarks can mis-rank architectures, and downstream products can inherit silent failure modes.
Early theoretical work by Dana Angluin and Philip Laird in 1988 introduced the Classification Noise Process, a model in which each true label may be flipped to a wrong label with some probability that depends only on the true class. Their result showed that, under this process, choosing the most consistent rule from a sample suffices for learning, provided the noise rate stays below one half. Subsequent work by Natarajan, Dhillon, Ravikumar, and Tewari (NIPS 2013) extended the analysis to surrogate losses, and a long line of methods, including bootstrapping, MentorNet, Co-teaching, and Mixup, proposed model-centric techniques that aim to be robust to noise without explicitly identifying which labels are wrong.
Confident learning takes a different posture. Instead of designing a noise-tolerant loss or curriculum, it tries to recover the joint distribution between observed (possibly wrong) labels and latent true labels, then uses that distribution to surgically remove or down-weight the bad examples before training a final model on the cleaned data.
Let X be the input space and m the number of classes. For each example x, let y_true denote the unknown latent true class and y_obs the observed noisy label. CL assumes a class-conditional noise (CCN) model:
p(y_obs = i | y_true = j, x) = p(y_obs = i | y_true = j)
That is, the probability that an example whose true class is j is mislabeled as i depends on j and i but not on the features x. This assumption is appropriate when classes that are easy to confuse get confused systematically, for example missile and projectile on ImageNet, and it is the same assumption used in the Angluin and Laird formulation that CL generalizes.
CL aims to estimate the joint distribution matrix Q[y_obs, y_true], an m by m matrix whose entry (i, j) gives the joint probability that an example was labeled i but truly belongs to class j. With Q in hand, label errors and class-pair confusions follow directly.
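As a concrete illustration, consider a hypothetical three-class problem with a joint matrix whose values are chosen arbitrarily for this sketch. The off-diagonal mass of Q is the estimated fraction of mislabeled examples, its row sums recover the observed label distribution, and its column sums estimate the latent prior over true classes:

```python
import numpy as np

# Hypothetical joint Q[y_obs, y_true] for a 3-class problem; entries sum to 1.
# Rows index the observed label, columns the latent true label.
Q = np.array([
    [0.25, 0.02, 0.00],   # examples labeled class 0
    [0.03, 0.30, 0.01],   # examples labeled class 1
    [0.00, 0.04, 0.35],   # examples labeled class 2
])

label_error_rate = 1.0 - np.trace(Q)   # off-diagonal mass: 0.10 here
p_obs = Q.sum(axis=1)                  # marginal of observed labels
p_true = Q.sum(axis=0)                 # estimated latent prior over true labels
noise_rates = Q / p_true               # column j gives p(y_obs = i | y_true = j)
```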
CL needs only two things from the user, and is otherwise model-agnostic and modality-agnostic:
| Input | Description |
|---|---|
| Out-of-sample predicted probabilities | An n by m matrix of class probabilities for each of the n examples, produced by a probabilistic classifier trained on data the example was not part of, typically via cross-validation. |
| Noisy labels | The given (possibly wrong) integer labels assigned to each example. |
Using out-of-sample probabilities is essential. If a classifier scored its own training data, it would be over-confident on memorized noisy labels and would never flag them as errors. Cross-validated probabilities give an honest second opinion.
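With scikit-learn, one common way to obtain such probabilities is cross_val_predict. The following is a sketch with synthetic stand-in data; in practice the feature matrix and given labels come from the user's dataset, and any probabilistic classifier can replace the logistic regression:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Stand-in data; in practice X and noisy_labels come from your dataset.
X, noisy_labels = make_classification(n_samples=500, n_classes=3,
                                      n_informative=5, random_state=0)

# Every row of pred_probs is produced by a model trained on folds that did
# not contain that example, so the probabilities are out-of-sample.
model = LogisticRegression(max_iter=1000)
pred_probs = cross_val_predict(model, X, noisy_labels,
                               cv=5, method="predict_proba")  # shape (n, m)
```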
The central object in CL is the confident joint, written C[y_obs, y_true]. For each class j, CL first computes a per-class threshold t_j equal to the average self-confidence the model assigns to examples that were labeled j:
t_j = mean of p_hat(y = j | x) over all x with given label j
For each example x with given label i, CL then increments C[i, j] for every class j such that the predicted probability p_hat(y = j | x) is greater than or equal to t_j. When more than one class clears its threshold, CL assigns the example to the class with the largest predicted probability. The result is an m by m matrix that counts, for each observed class i and each candidate true class j, the number of examples we are confident are really j even though they were labeled i.
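The counting rule can be written down in a few lines of numpy. This is a minimal sketch of the construction just described, not the cleanlab implementation:

```python
import numpy as np

def confident_joint(pred_probs, labels, m):
    """Count C[i, j]: examples labeled i that the model confidently places in j.

    pred_probs: (n, m) out-of-sample predicted probabilities
    labels:     (n,) given (possibly noisy) integer labels
    """
    # Per-class threshold t_j: average self-confidence over examples labeled j.
    thresholds = np.array([pred_probs[labels == j, j].mean() for j in range(m)])

    C = np.zeros((m, m), dtype=int)
    for probs, i in zip(pred_probs, labels):
        above = np.flatnonzero(probs >= thresholds)   # classes clearing t_j
        if above.size == 0:
            continue                                  # counted in no bin
        j = above[np.argmax(probs[above])]            # largest probability wins
        C[i, j] += 1
    return C
```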
CL then calibrates C so its row sums match the empirical class counts and divides by the total number of examples to produce the estimated joint distribution Q[y_obs, y_true]. From Q one obtains the latent prior over true classes (the column sums), the noise rates p(y_obs = i | y_true = j), the inverse noise rates p(y_true = j | y_obs = i), and an estimate of the total number of label errors, namely the off-diagonal mass of Q multiplied by n.
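Continuing the counting sketch above, the calibration and normalization step can be written as follows (an illustration of the description, not cleanlab's exact code):

```python
import numpy as np

def estimate_joint(C, labels, m):
    """Calibrate C so row sums match empirical label counts, then normalize."""
    counts = np.bincount(labels, minlength=m).astype(float)
    row_sums = C.sum(axis=1, keepdims=True).astype(float)
    row_sums[row_sums == 0] = 1.0                  # guard against empty rows
    calibrated = (C / row_sums) * counts[:, None]  # row i now sums to count of i
    return calibrated / calibrated.sum()           # Q[y_obs, y_true], sums to 1
```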
Under reasonable conditions on the model's per-class accuracy, Q is a consistent estimator of the true joint distribution even when the predicted probabilities themselves are imperfect. The key guarantee is robustness to non-uniform calibration error: as long as the per-class threshold ordering is preserved, CL recovers the right joint, which is the property that distinguishes it from naive thresholding on raw probabilities.
CL is usually presented as a small pipeline. Cleanlab and most reproductions implement the same three steps.
| Step | What happens | Output |
|---|---|---|
| 1. Estimate the joint distribution | Compute the confident joint C from out-of-sample predicted probabilities and given labels, calibrate, and normalize to get Q[y_obs, y_true]. | An m by m joint distribution matrix and per-class noise rates. |
| 2. Find and prune label errors | Use Q to identify examples most likely to be mislabeled. Several pruning rules exist, including pruning by class (PBC), pruning by noise rate (PBNR), and the combined C+NR rule. | A cleaned dataset with suspect labels removed or set aside for relabeling. |
| 3. Retrain with reweighting | Retrain the model on the cleaned data, optionally reweighting each remaining example by the inverse of its estimated latent prior so that rare classes are not crowded out. | A final model trained on a higher quality subset. |
Unlike methods that try to learn noise rates jointly with model parameters, CL decouples noise estimation from training. That decoupling is what gives it its hyperparameter-free character: there are no robust loss tuning constants, no scheduled curricula, and no warm-up phases.
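As an illustration of step 2, the following is a minimal numpy sketch of one pruning rule, prune by noise rate (PBNR): for each off-diagonal pair (i, j), flag the n·Q[i, j] examples labeled i that the model most confidently assigns to class j. This is a sketch under the definitions above, not cleanlab's exact implementation:

```python
import numpy as np

def prune_by_noise_rate(pred_probs, labels, Q):
    """Flag likely label errors using the estimated joint Q (PBNR sketch)."""
    n, m = pred_probs.shape
    flagged = np.zeros(n, dtype=bool)
    for i in range(m):
        idx_i = np.flatnonzero(labels == i)        # examples given label i
        for j in range(m):
            if i == j:
                continue
            k = int(round(Q[i, j] * n))            # estimated count of i->j errors
            if k == 0 or idx_i.size == 0:
                continue
            # Among examples labeled i, take the k most confident in class j.
            order = np.argsort(pred_probs[idx_i, j])[::-1]
            flagged[idx_i[order[:k]]] = True
    return flagged
```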
CL is implemented in the cleanlab Python library, originally released by Northcutt in November 2019 alongside the arXiv preprint and now maintained by Cleanlab Inc., a company spun out of MIT in 2021. The library is licensed under Apache 2.0 and supports Python 3.10 and later on Linux, macOS, and Windows.
Cleanlab has grown well beyond the original confident learning algorithm. Notable components include:
| Component | Purpose |
|---|---|
| cleanlab.classification.CleanLearning | Wraps any scikit-learn compatible classifier and applies the three-step CL pipeline end to end. |
| cleanlab.dataset | Provides the joint distribution, noise rates, and dataset-level health scores. |
| cleanlab.outlier | Detects out-of-distribution and atypical examples using model embeddings or scores. |
| cleanlab.multiannotator | Infers a consensus label and per-annotator quality scores from datasets with multiple human raters. |
| ActiveLab | Active learning policy that decides whether to collect a fresh label for a new example or re-label an existing example with a possibly wrong consensus. |
| Datalab | Higher level tool that audits text, image, audio, and tabular datasets for many issue types at once, including label errors, near-duplicates, outliers, drift, and class imbalance. |
Cleanlab supports classification, regression, token classification, image segmentation, object detection, and multi-label tasks. It is commonly used as a preprocessing step in production data pipelines and as a teaching tool in courses on data-centric AI, including the MIT IAP course taught by Northcutt and colleagues.
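A typical end-to-end use looks like the following sketch (synthetic stand-in data; exact signatures may differ slightly across cleanlab versions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from cleanlab.classification import CleanLearning

# Stand-in data; in practice X and noisy_labels come from your dataset.
X, noisy_labels = make_classification(n_samples=500, n_classes=3,
                                      n_informative=5, random_state=0)

# CleanLearning runs the full pipeline inside .fit(): cross-validated
# probabilities, confident-joint estimation, pruning, and retraining.
cl = CleanLearning(LogisticRegression(max_iter=1000))
cl.fit(X, noisy_labels)            # trains on examples CL did not flag
issues = cl.get_label_issues()     # per-example label-quality diagnostics
predictions = cl.predict(X)
```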
The most visible application of CL is the 2021 NeurIPS paper Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks by Northcutt, Anish Athalye (MIT), and Jonas Mueller (then at Amazon, later Cleanlab). The authors used CL to flag candidate label errors in the test sets of ten of the most widely used computer vision, natural language, and audio benchmarks, then asked Mechanical Turk workers to validate each candidate. About 51 percent of the algorithmic candidates were confirmed by humans as actual errors.
| Dataset | Modality | Estimated test-set error rate |
|---|---|---|
| MNIST | Image (handwritten digits) | About 0.15 percent (15 confirmed errors out of 10,000). |
| CIFAR-10 | Image | About 0.5 percent. |
| CIFAR-100 | Image | About 5.8 percent. |
| ImageNet (ILSVRC validation) | Image | About 5.8 percent. |
| Caltech-256 | Image | About 1.5 percent. |
| QuickDraw | Image (sketches) | About 10.1 percent. |
| 20news | Text | About 1.1 percent. |
| IMDB | Text (sentiment) | About 2.9 percent. |
| Amazon Reviews | Text (sentiment) | About 4.0 percent. |
| AudioSet | Audio | About 1.4 percent. |
Across the ten datasets, the average test-set error rate was about 3.4 percent. A companion site at labelerrors.com lets visitors browse every confirmed error.
The paper's punchline was that benchmark rankings can flip when test labels are corrected. On ImageNet, ResNet-18 outperforms ResNet-50 once the prevalence of mislabeled examples in the test set rises by just 6 percent. On CIFAR-10, a 5 percent shift is enough to make VGG-11 beat VGG-19. The result is widely cited as evidence that higher capacity does not always mean better generalization and that benchmark progress can be partially an artifact of model architectures fitting the noise.
Learning with noisy labels is a crowded field. CL sits in the data-centric corner, while many earlier methods are model-centric.
| Approach | Idea | Relationship to CL |
|---|---|---|
| Bootstrapping (Reed et al., 2014) | Mix the model's predictions with the given labels to soften wrong labels during training. | Model-centric; does not explicitly identify which labels are wrong. |
| Forward and backward loss correction (Patrini et al., 2017) | Multiply the loss by an estimated noise transition matrix. | Requires a similar transition matrix to Q, but estimates it differently and uses it inside the loss instead of for pruning. |
| MentorNet (Jiang et al., 2018) | A teacher network learns a curriculum that selects easy examples for the student. | Model-centric and requires training a second network. |
| Co-teaching (Han et al., 2018) | Two networks teach each other by exchanging the small-loss examples each iteration. | No explicit noise model; sensitive to symmetric noise assumptions. |
| Label smoothing | Replace one-hot labels with a slightly softer distribution. | Reduces overconfidence but does not target identified errors. |
| Mixup (Zhang et al., 2018) | Train on convex combinations of examples and labels. | Improves regularization; does not identify mislabeled examples. |
| Confident learning (Northcutt et al., 2021) | Estimate the joint distribution of noisy and true labels, then prune. | Hyperparameter-free, model-agnostic, and produces interpretable error lists. |
The original CL paper benchmarked the method on CIFAR-10 with synthetically injected class-conditional noise and reported that CL outperformed seven contemporary methods, including INCV, Mixup, MentorNet, Co-teaching, S-Model, Reed, and SCE-loss, particularly under sparse and asymmetric noise where the noise concentrates between specific class pairs.
The CL paper proves several consistency results. The headline guarantee is that, under the class-conditional noise assumption and a mild per-class accuracy condition, the confident joint C is an exact estimator of the true joint distribution in expectation, and a consistent estimator as the dataset grows. The per-class accuracy condition essentially asks that the model is, on average, more right than wrong on each class, which is a much weaker requirement than asking for well-calibrated probabilities.
A second result, sometimes called the thresholding robustness property, shows that CL recovers the correct joint even when the predicted probabilities are biased, as long as the bias does not change the relative ordering of the per-class thresholds. In practice this means CL can use modestly miscalibrated neural network softmax outputs without large degradation. The theory does not, however, cover instance-dependent noise, where the probability of a flip depends on the example's features.
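The threshold-shift argument behind this property can be verified numerically. In the sketch below (synthetic data and arbitrary per-class biases), adding a constant eps_j to every predicted probability for class j moves the threshold t_j by exactly eps_j, so the comparison p_hat(y = j | x) >= t_j selects the same examples before and after the bias:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 3
labels = rng.integers(0, m, size=n)
pred_probs = rng.dirichlet(np.ones(m), size=n)  # synthetic (n, m) probabilities
eps = np.array([0.05, -0.02, 0.03])             # arbitrary per-class bias

thresholds = np.array([pred_probs[labels == j, j].mean() for j in range(m)])
biased = pred_probs + eps                       # per-class miscalibration
biased_thresholds = np.array([biased[labels == j, j].mean() for j in range(m)])

# Thresholds shift by exactly eps_j, so p_hat(j|x) + eps_j >= t_j + eps_j
# holds for exactly the same set of examples as p_hat(j|x) >= t_j.
assert np.allclose(biased_thresholds, thresholds + eps)
```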
CL and the cleanlab implementation are used in a range of practical settings, including auditing benchmark test sets, cleaning training data before model development, scoring annotator quality in multi-rater labeling projects, and flagging examples for review or relabeling in production data pipelines.
CL is not a universal fix. The most important caveats are:
- The theory assumes class-conditional noise; instance-dependent noise, where the flip probability depends on an example's features, is not covered.
- The estimates are only as good as the out-of-sample predicted probabilities; a weak or improperly cross-validated model produces a noisy worklist.
- Flagged examples are candidates, not confirmed errors; in the benchmark study, only about half of the algorithmic candidates were confirmed by human raters.
Users are usually advised to treat CL output as a prioritized worklist for review rather than as an automatic deletion oracle, especially in domains where a wrong label can be expensive to recover.
The CL paper has accumulated thousands of citations since publication and is taught in graduate courses at MIT, Stanford, and elsewhere as a canonical introduction to data-centric label cleaning. Cleanlab, the company, commercializes the technique through a hosted platform called Cleanlab Studio, which adds a UI, scalable inference, and connectors to common data warehouses.
The framework has also influenced adjacent areas. The labelerrors.com release prompted maintainers of several benchmarks to publish corrected splits, and several papers since 2021 report results on both the original and cleaned versions of CIFAR and ImageNet to demonstrate that improvements are not noise artifacts. CL primitives have been re-implemented in MATLAB's Statistics and Machine Learning Toolbox and ported to Julia and R by community contributors.