Cleanlab
Last reviewed
May 2, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 · 3,498 words
Cleanlab is both an open source Python library for finding and fixing problems in machine learning datasets, and the San Francisco startup that grew out of it. The library was first open sourced in 2017 by Curtis Northcutt, then a graduate student at the Massachusetts Institute of Technology working with Isaac Chuang. It implements confident learning, an algorithm Northcutt developed with Lu Jiang and Chuang for estimating which training examples in a labeled dataset have been given the wrong label. The company Cleanlab Inc. was incorporated in late 2021 by Northcutt together with Anish Athalye and Jonas Mueller, all three of them MIT computer science PhDs.
The project is one of the more visible flag bearers of data-centric AI, the school of thought that says fixing your dataset usually beats tweaking your model. The library is used across supervised learning tasks, and in 2021 confident learning was applied to show that the test sets behind ten widely used machine learning benchmarks, including ImageNet, MNIST, and CIFAR-10, collectively contained thousands of mislabeled examples. That study made it onto the front page of Hacker News, was picked up by VentureBeat and MIT News, and put a number on something a lot of practitioners had suspected for years: the gold-standard datasets aren't gold.
In 2023, Cleanlab Inc. raised $5 million in seed funding led by Bain Capital Ventures and launched Cleanlab Studio for Enterprise, a hosted version of the open source tool. A $25 million Series A followed in October 2023, co-led by Menlo Ventures and TQ Ventures, bringing total funding to $30 million. In April 2024 the company introduced the Trustworthy Language Model (TLM), a wrapper that scores how reliable any LLM response is, aimed at hallucination detection in retrieval-augmented generation systems. In January 2026, Cleanlab was acquired by Handshake AI in what was reported as primarily an acqui-hire; nine Cleanlab employees including all three founders moved to Handshake's research organization.
The theoretical core of Cleanlab is confident learning (CL), introduced in the paper "Confident Learning: Estimating Uncertainty in Dataset Labels" by Northcutt, Lu Jiang, and Isaac Chuang. The paper was first posted to arXiv as 1911.00068 in October 2019 and published in the Journal of Artificial Intelligence Research in 2021. It won the IJCAI-JAIR Best Paper Prize in 2024.
The basic idea is almost embarrassingly simple once you've seen it. Suppose you have a labeled dataset, and you have a model (any model) that produces predicted class probabilities for each example. Some of the labels in the dataset are wrong. Confident learning asks: which labels does the model disagree with confidently enough that we should suspect the label rather than the model?
The algorithm builds something called the confident joint matrix. Imagine a square matrix where rows are observed (given) labels and columns are latent (true) labels. For each example, you check the predicted probability of every class. If the predicted probability for some class j meets or exceeds a per-class threshold (the model's average predicted probability for class j over the examples labeled j, that is, its average self-confidence for that class), the example is counted as a member of the confident set for class j; if more than one class clears its threshold, the most probable one is used. The confident joint then records, for each pair (given label i, confidently predicted label j), how many examples the model confidently places in class j but the annotator labeled i.
The off-diagonal entries of this matrix are where the action is. An entry at position (i, j) where i is not equal to j counts examples that the model is confident belong to class j, but a human said are class i. Those are the candidate label errors. After calibrating the matrix so its row sums match the actual class counts, you get an estimate of the joint distribution between observed and true labels, which lets you back out the true class prior and the noise transition matrix without ever needing access to the true labels.
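The counting and calibration steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the library's implementation: the function names `confident_joint` and `calibrate` and the toy data are made up for the example, and the calibration shown is only the row-rescaling step the text describes.

```python
import numpy as np

def confident_joint(labels, pred_probs):
    """Count examples by (given label i, confidently predicted label j)."""
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs, dtype=float)
    n_classes = pred_probs.shape[1]

    # Per-class threshold: mean predicted probability of class j over the
    # examples whose given label is j (the average self-confidence).
    thresholds = np.array(
        [pred_probs[labels == j, j].mean() for j in range(n_classes)]
    )

    joint = np.zeros((n_classes, n_classes), dtype=int)
    for given, probs in zip(labels, pred_probs):
        above = np.flatnonzero(probs >= thresholds)
        if above.size == 0:
            continue  # not confidently in any class, so it is not counted
        # If several classes clear their threshold, keep the most probable.
        confident = above[np.argmax(probs[above])]
        joint[given, confident] += 1
    return joint, thresholds

def calibrate(joint, labels):
    """Rescale rows to the observed class counts, then normalize to sum to 1."""
    class_counts = np.bincount(labels, minlength=joint.shape[0])
    row_sums = joint.sum(axis=1, keepdims=True)
    scaled = joint / np.maximum(row_sums, 1) * class_counts[:, None]
    return scaled / scaled.sum()

# Toy data (illustrative only): example 3 is labeled 0 but looks like class 1.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
pred_probs = [
    [0.90, 0.10], [0.80, 0.20], [0.70, 0.30], [0.10, 0.90],
    [0.10, 0.90], [0.20, 0.80], [0.40, 0.60], [0.15, 0.85],
]

joint, thresholds = confident_joint(labels, pred_probs)
estimated_joint = calibrate(joint, labels)
true_class_prior = estimated_joint.sum(axis=0)  # column sums
print(joint)
print(estimated_joint)
```

On this toy input the single off-diagonal count sits at position (0, 1), which is the one example the model confidently places in class 1 despite its given label of 0.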
From there CL does three things: prune the suspicious examples (drop them from training), reweight the remaining ones to correct for the estimated class imbalance, and rerank candidates by a confidence-based score so the worst label issues come first. Northcutt, Jiang, and Chuang showed that under reasonable assumptions about the noise process (class-conditional noise, plus a model that is at least as accurate as random for each class), confident learning gives provably consistent estimates of the noise rate and the true label distribution. They also showed that on CIFAR with synthetic noise, this approach beat seven competitive prior methods including MentorNet, Co-Teaching, and S-Model, and that it scaled to ImageNet, where it found around 645 missile images mislabeled as the broader class "projectile."
A key practical point is that CL is model-agnostic. The algorithm needs only out-of-sample predicted probabilities, which you can produce from any classifier you like through cross-validation. That includes scikit-learn estimators, PyTorch networks, gradient-boosted trees, transformers, anything that can output a softmax. This is why the same library that ships with image classification examples can also clean up text, audio, and tabular data.
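One common way to get out-of-sample predicted probabilities is scikit-learn's `cross_val_predict` with `method="predict_proba"`. The sketch below assumes a synthetic dataset from `make_classification` as a stand-in for real data; the choice of logistic regression and five folds is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for a real labeled dataset (assumption for illustration).
X, labels = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Each row's probabilities come from a model fit on the other folds,
# so no example is scored by a model that saw it during training.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=400), X, labels, cv=5, method="predict_proba"
)
print(pred_probs.shape)
```

Any estimator exposing `predict_proba` can be swapped in for the logistic regression; the resulting array has one row per example and one column per class.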
The open source library lives at github.com/cleanlab/cleanlab. It is pure Python, MIT-licensed (the README also lists Apache-2.0 in some places; both have appeared across the package's history), and built on NumPy, scikit-learn, and PyTorch. The latest stable release as of early 2026 is v2.9.0, shipped on January 13, 2026. The package has more than 11,000 stars on GitHub and is installed via pip or conda.
The entry points most people start with are these.
```python
from cleanlab.classification import CleanLearning
from sklearn.linear_model import LogisticRegression

# X_train, labels, and X_test are assumed to be defined already.
classifier = LogisticRegression(max_iter=400)
cl = CleanLearning(classifier)

# train a classifier on noisy labels, automatically filtering bad ones
cl.fit(X_train, labels)

# get predictions on new data
preds = cl.predict(X_test)

# get a table of per-example issue flags and label quality scores for the training set
label_issues = cl.find_label_issues(X_train, labels)
```
CleanLearning is the modern name for what used to be called LearningWithNoisyLabels. It wraps any sklearn-style classifier, runs cross-validation to get out-of-sample probabilities, applies confident learning to identify label issues, and refits on the cleaned data. You can use it as a drop-in robust trainer.
For users who only want the diagnostic, cleanlab.filter.find_label_issues takes the labels and predicted probabilities directly and returns a boolean mask or a ranked list of suspicious examples. Several scoring methods are available, including normalized margin and self-confidence ranking.
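The two scoring methods named above are simple to state. The following is a hedged NumPy sketch of the ideas rather than the library's code; the function names and toy inputs are chosen for illustration. Lower scores mean a more suspicious label.

```python
import numpy as np

def self_confidence(labels, pred_probs):
    """Predicted probability assigned to each example's given label."""
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs, dtype=float)
    return pred_probs[np.arange(len(labels)), labels]

def normalized_margin(labels, pred_probs):
    """Given-label probability minus the best other class, mapped to [0, 1]."""
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs, dtype=float)
    given = self_confidence(labels, pred_probs)
    others = pred_probs.copy()
    others[np.arange(len(labels)), labels] = -np.inf  # mask the given label
    margin = given - others.max(axis=1)
    return (margin + 1) / 2

# Toy inputs (illustrative only): the last example's label disagrees with the model.
labels = [0, 1, 2, 0]
pred_probs = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.1, 0.3, 0.6],
]

sc = self_confidence(labels, pred_probs)
nm = normalized_margin(labels, pred_probs)
ranked = np.argsort(nm)  # most suspicious example first
print(sc, nm, ranked)
```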
The cleanlab.dataset module gives a higher-level health summary of a labeled dataset: which classes have the worst label quality, where classes overlap, how many issues are estimated to be in each class. It is useful for triage on a large dataset where you don't yet know where to look.
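To make the kind of summary concrete, here is a small sketch that derives per-class noise and the most confusable class pair from a matrix of counts indexed by (given label, confidently predicted label). The function `class_health`, the class names, and the counts are all hypothetical; this is not how `cleanlab.dataset` is implemented.

```python
import numpy as np

def class_health(joint, class_names):
    """Summarize per-class label noise and the most overlapping class pair."""
    joint = np.asarray(joint, dtype=float)
    # Share of each given class whose examples look like some other class.
    label_noise = 1 - np.diag(joint) / joint.sum(axis=1)
    # Symmetric overlap: counts confused in either direction between two classes.
    overlap = joint + joint.T
    np.fill_diagonal(overlap, 0)
    i, j = np.unravel_index(np.argmax(overlap), overlap.shape)
    worst_class = class_names[int(np.argmax(label_noise))]
    return label_noise, worst_class, (class_names[i], class_names[j])

# Hypothetical counts: rows are given labels, columns are confident predictions.
names = ["cat", "dog", "wolf"]
counts = [
    [90, 8, 2],
    [5, 80, 15],
    [1, 20, 79],
]

noise, worst, pair = class_health(counts, names)
print(noise, worst, pair)
```

A summary like this points the reviewer at the class and the class pair where manual inspection is likely to pay off first.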
The newer interface, introduced in cleanlab 2.4 (2023) and expanded since, is Datalab. Where the original API focused tightly on label errors in classification, Datalab is a general data audit. It looks for several issue types at once.
```python
from cleanlab import Datalab

lab = Datalab(data=df, label_name="y")
lab.find_issues(features=embeddings, pred_probs=predicted_probs)
lab.report()
```
The issues Datalab can detect include label errors (the original confident learning use case), out-of-distribution outliers, near-duplicate examples, ambiguous examples that sit on class boundaries, low-information examples that are essentially noise, examples that are too similar to test data (data leakage), and class imbalance. It works on tabular, text, image, and audio data as long as you can give it features (raw or embedded) and predicted probabilities.
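Two of these checks, near-duplicates and outliers, can be illustrated with nearest-neighbor distances in feature space. The sketch below is a simplified stand-in, not Datalab's algorithm: the helper name, the brute-force distance computation, the toy embeddings, and the 0.01 threshold are all assumptions made for the example.

```python
import numpy as np

def nearest_neighbor_distances(features):
    """Distance from each row to its closest other row, and that row's index."""
    features = np.asarray(features, dtype=float)
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # ignore self-matches
    nearest = dists.argmin(axis=1)
    return dists[np.arange(len(features)), nearest], nearest

# Toy embeddings (illustrative only).
embeddings = np.array([
    [0.0, 0.0],
    [0.0, 0.001],  # almost identical to row 0
    [1.0, 1.0],
    [1.2, 0.9],
    [9.0, 9.0],    # far from everything else
])

nn_dist, nn_index = nearest_neighbor_distances(embeddings)
near_duplicates = np.flatnonzero(nn_dist < 0.01)  # arbitrary threshold
most_isolated = int(np.argmax(nn_dist))           # outlier candidate
print(near_duplicates, most_isolated)
```

The brute-force pairwise distance matrix is quadratic in the number of examples, so a real audit of a large dataset would use an approximate nearest-neighbor index instead.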
The library has gradually grown beyond standard multi-class classification. The table below summarizes the major task-level modules.
| Task | Module | What it detects |
|---|---|---|
| Multi-class classification | cleanlab.classification, cleanlab.filter | Mislabeled examples in standard classification |
| Multi-label classification | cleanlab.multilabel_classification | Wrong tag sets in image or document tagging |
| Regression | cleanlab.regression | Examples whose target value looks wrong |
| Token classification | cleanlab.token_classification | Bad token-level tags in NER and similar sequence labeling |
| Image segmentation | cleanlab.segmentation | Per-pixel label errors in segmentation masks |
| Object detection | cleanlab.object_detection | Mislabeled or missed bounding boxes |
| Outlier detection | cleanlab.outlier | Out-of-distribution examples |
| Multi-annotator agreement | cleanlab.multiannotator | Quality of crowdsourced labels |
| Active learning | cleanlab.experimental | Which unlabeled examples are most worth labeling |
Each of these is a separate research thread, and the package release notes track which paper backed each module. The multi-annotator support, for example, is closer to the literature on truth inference from crowd annotations than to the original confident learning work.
The paper that put Cleanlab on the public radar was "Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks," by Northcutt, Athalye, and Mueller, presented at the NeurIPS 2021 Datasets and Benchmarks track. The arXiv identifier is 2103.14749. It was nominated for best paper.
The team ran confident learning on the test sets of ten of the most widely used machine learning benchmarks, then sent the algorithmically flagged candidates to crowd workers on Mechanical Turk for human verification. About 51% of the algorithmic candidates were confirmed by humans as actual label errors. The headline number from the paper: an estimated average of at least 3.3% errors across the ten benchmarks studied.
| Benchmark | Modality | Estimated label error rate |
|---|---|---|
| MNIST | Image | About 0.15% |
| CIFAR-10 | Image | About 0.54% |
| CIFAR-100 | Image | About 5.85% |
| Caltech-256 | Image | About 1.54% |
| ImageNet (validation set) | Image | About 5.83% (roughly 6%) |
| QuickDraw | Image | About 10.12% |
| Amazon Reviews | Text | About 4% |
| IMDB | Text | About 2.9% |
| 20news | Text | About 1.1% |
| AudioSet | Audio | About 1.3% |
(Exact percentages and absolute counts are tabulated in the NeurIPS paper and the labelerrors.com website.)
The paper went a step beyond "these test sets are dirty." The authors corrected the test sets and then re-evaluated a wide range of pretrained models on the cleaned data. The result was the more interesting finding: when test sets are corrected, the relative ranking of models can flip. On corrected ImageNet, ResNet-18 outperformed ResNet-50 in some settings. On corrected CIFAR-10, VGG-11 surpassed VGG-19. In other words, when a benchmark is more than a few percent noisy, a higher-capacity model that fits the test errors more aggressively can score better on the dirty version of the benchmark while actually being a worse model. The implication: leaderboards can be measuring the wrong thing if nobody audits the test set.
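A toy example shows how such a flip can arise. The numbers below are invented for illustration and are not taken from the paper: one hypothetical model reproduces the two wrong test labels, the other predicts the true class on those examples but makes one unrelated mistake.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Ten test examples; the last two original labels are wrong.
corrected = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
original = [0, 1, 0, 1, 0, 1, 0, 1, 1, 0]

# Hypothetical predictions from two models.
big_model = [1, 1, 0, 1, 0, 1, 0, 1, 1, 0]    # matches the wrong labels
small_model = [0, 1, 0, 1, 0, 1, 1, 1, 0, 1]  # matches the true labels

for name, preds in [("big", big_model), ("small", small_model)]:
    print(name, accuracy(preds, original), accuracy(preds, corrected))
```

On the original labels the larger model scores 0.9 against 0.7; on the corrected labels the order reverses, with the smaller model at 0.9 and the larger at 0.7.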
The paper got picked up by MIT News, VentureBeat, Wired, and others. It also gave the broader data-centric AI community a concrete artifact to point to whenever someone argued that benchmark numbers were the only thing that mattered.
Alongside the paper, the team launched labelerrors.com, a public site that lets you scroll through the actual label errors found in each of the ten benchmarks. The site shows a thumbnail of an image, the original label, the alternative label crowd workers preferred, and the corrected label. There is a noticeable comedic element. ImageNet has things like a "projectile" that is a missile, a "chimpanzee" that is in fact a gorilla, a tabby cat labeled as an ox. The MNIST examples include digits where the writer clearly wrote one number and the dataset annotation says another. The QuickDraw examples are doodles labeled as completely different objects.
The site is part of the open source cleanlab/label-errors GitHub repository, which also publishes the corrected test sets so other researchers can run their own re-evaluations.
Cleanlab Studio is the company's hosted commercial product. It launched in July 2023 alongside the seed funding announcement. Studio packages the open source algorithms behind a no-code web interface and adds enterprise features the open source library doesn't try to provide.
Its core function is a data quality audit: upload a labeled dataset (text, image, or tabular), and Studio runs Datalab-style checks against it, flagging label errors, outliers, near-duplicates, and ambiguous examples. The interface lets non-ML users review the flagged items, accept or reject suggested corrections, and export a cleaned version of the dataset. There is also an AutoML side: Studio will train a model under the hood to produce the predicted probabilities that confident learning needs, so users don't have to bring their own classifier.
The positioning is deliberately for enterprise data teams who don't have an ML research org to build label cleaning pipelines from scratch. The customer list disclosed by the company includes BBVA, Tencent, Amazon, Oracle, Google, and Databricks. Cleanlab also reports use across more than 100 Fortune 500 companies, though most are not named publicly.
The Trustworthy Language Model, introduced on April 25, 2024, is Cleanlab's pivot toward the generative AI era. The thesis: as LLMs get embedded in production systems, the bottleneck moves from training data quality to inference reliability. Hallucinations, retrieval failures, and confidently wrong answers are the new label errors.
TLM is an API wrapper that takes any prompt-and-response pair (or just a prompt, in which case it generates a response itself) and returns a trustworthiness score between 0 and 1. A low score means the response is likely wrong. The score is built from several signals: token-level uncertainty from the underlying LLM, ensembling across multiple LLM calls with prompt variations, self-reflection prompts that ask the model to evaluate its own answer, and a specialized aggregator that learns to combine these into a single calibrated reliability number.
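The general idea of combining signals can be sketched with two of them: agreement across resampled answers and a self-reflection rating. This is purely illustrative and is not TLM's method. TLM is described as using a learned aggregator, while the sketch uses a fixed weighted average, and every name and number here is hypothetical.

```python
def _normalize(text):
    """Lowercase and trim so trivially different answers compare equal."""
    return text.strip().lower()

def agreement_score(answer, sampled_answers):
    """Fraction of resampled answers that match the answer being scored."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    target = _normalize(answer)
    matches = sum(_normalize(s) == target for s in sampled_answers)
    return matches / len(sampled_answers)

def combine_signals(agreement, self_reflection, weight=0.5):
    """Fixed weighted average of two signals, each already in [0, 1]."""
    return weight * agreement + (1 - weight) * self_reflection

agreement = agreement_score("Paris", ["Paris", "paris ", "Lyon", "Paris"])
score = combine_signals(
    agreement,
    self_reflection=0.9,  # hypothetical self-rating returned by the model
)
print(agreement, score)
```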
TLM is most often deployed as a hallucination detector for retrieval-augmented generation. In a RAG system, the LLM has to combine retrieved documents with a user query. If the retrieval is bad, the answer is wrong; if the retrieval is good but the model overgenerates, the answer is wrong. TLM scores each response and lets the application escalate low-confidence answers to a human, return a fallback message, or trigger another retrieval pass. Cleanlab benchmarks reported in 2024 and 2025 claim that TLM detects incorrect answers with substantially better precision than other hallucination detectors and self-evaluation methods, often around 3x better in their RAG-specific benchmarks. Those claims are the company's own and should be read accordingly.
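The escalation logic described here amounts to a threshold check on the score. A minimal sketch follows, assuming a score in [0, 1] has already been obtained from some scorer; the function name, the 0.7 threshold, and the fallback text are placeholders, not part of any Cleanlab API.

```python
def route_response(response, trust_score, threshold=0.7,
                   fallback="I'm not sure about this one; passing it to a human."):
    """Return the answer if it is trusted enough, otherwise escalate."""
    if not 0.0 <= trust_score <= 1.0:
        raise ValueError("trust_score must be between 0 and 1")
    if trust_score >= threshold:
        return {"action": "answer", "text": response}
    return {"action": "escalate", "text": fallback}

trusted = route_response("The warranty lasts two years.", trust_score=0.92)
doubtful = route_response("The warranty lasts ten years.", trust_score=0.31)
print(trusted["action"], doubtful["action"])
```

The threshold is a product decision: raising it sends more answers to the fallback path in exchange for fewer wrong answers reaching users.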
The TLM client library, cleanlab-tlm, is a separate Python package on PyPI and GitHub. It works with OpenAI, Anthropic, and most major model APIs. NVIDIA integrated TLM into NeMo Guardrails as one of the supported hallucination check options.
The three founders met at MIT.
Curtis Northcutt (CEO and co-founder) earned his computer science PhD from MIT in May 2021 with a thesis titled "Confident Learning for Machines and Humans." His advisor was Isaac Chuang, who is more famous in the quantum computing world for building one of the first working quantum computers. Northcutt's thesis won the Google PhD Fellowship in machine learning. Before MIT he was at Vanderbilt and was a Fulbright Scholar in Ecuador. After the Handshake acquisition in early 2026, his title became Director of AI Research at Handshake.
Jonas Mueller (Chief Scientist and co-founder) holds a CS PhD from MIT, where he worked on machine learning for noisy data and statistical inference. Before founding Cleanlab he was a senior scientist at Amazon Web Services, where he co-built AutoGluon, the AutoML framework, and contributed to several other AWS AI services. Mueller's published work spans causal inference, deep learning for tabular data, and uncertainty estimation.
Anish Athalye (CTO and co-founder) holds a CS PhD from MIT, focused on systems and security. He is widely cited in adversarial machine learning for the 2018 ICML best-paper-award work "Obfuscated Gradients Give a False Sense of Security," which circumvented seven of the nine then-state-of-the-art adversarial defenses accepted at ICLR 2018. He is also the author of several popular open source projects including the dotbot dotfiles manager.
The company was incorporated in late 2021 after the cleanlab open source library and the two foundational papers had already established the technical brand. It operated quietly for about 18 months before announcing its $5 million seed round on July 20, 2023. The Series A came less than three months later, on October 10, 2023, at $25 million, co-led by Menlo Ventures and TQ Ventures with participation from Bain Capital Ventures and Databricks Ventures. Other angel investors included Naveen Rao (founder of MosaicML and now at Databricks), Frederic Kerrest (co-founder of Okta), and founders from GitHub and Yahoo.
Cleanlab was named to the Forbes AI 50 in 2024 and the CB Insights AI 100 in 2024.
On January 28, 2026, TechCrunch reported that Handshake AI had acquired Cleanlab. Terms were not disclosed. The deal was structured primarily as an acqui-hire: nine Cleanlab employees, including the three co-founders, joined Handshake's research organization. Handshake AI is the AI division of the career-services platform Handshake, and runs a data labeling operation that competes with companies like Scale AI, Surge, and Labelbox. Northcutt's framing of the deal in the TechCrunch piece was that since Handshake provides labeling infrastructure to many of those competitors anyway, joining "the source" rather than a downstream tool was the right move. The status of the open source cleanlab repository post-acquisition was not specified in the announcement.
The open source library is widely used in academic data-centric AI work. It is the recommended tool in the MIT Introduction to Data-Centric AI course, taught by Curtis Northcutt and others, which made its full lecture materials public at dcai.csail.mit.edu. The library has been integrated into educational tutorials at NeurIPS, ICML, and KDD.
On the commercial side, Cleanlab has reported usage by more than 100 Fortune 500 companies. Specifically named customers include BBVA, Tencent, Amazon, Oracle, Google, and Databricks. Andrew Ng has publicly cited the original confident learning paper as one of his favorite recent breakthroughs in machine learning, which probably did more for the brand than any single piece of marketing the company ever produced.
The library has also been cited in thousands of subsequent papers on label noise, robust learning, and data quality. The cumulative citation count for the original confident learning JAIR paper passed 1,500 in 2024 according to Google Scholar.
The data quality and labeled-data tooling space is fragmented. A few systems are commonly compared with Cleanlab.
Snorkel, out of Stanford and now Snorkel AI, focuses on weak supervision: programmatically generating training labels from heuristic functions when you don't have ground truth labels at all. Cleanlab and Snorkel solve different parts of the labeling problem. Snorkel produces noisy labels from rules, Cleanlab cleans noisy labels you already have. They can be used together: write Snorkel labeling functions to bootstrap a dataset, then run Cleanlab to find the worst remaining issues.
Aquarium and Lightly are computer-vision-specific data quality platforms with strong support for embedding-based exploration of image datasets. They overlap with Cleanlab on outlier and near-duplicate detection but tend to be more interactive and visual, and less explicitly tied to label noise theory.
Galileo (now part of GuideLabs) and Arize Phoenix sit closer to MLOps observability for production models, including LLMs. They share with TLM the goal of catching bad outputs at inference, but typically rely on different scoring approaches.
For LLM-specific hallucination detection, TLM competes with offerings from Patronus AI, Arize, NVIDIA NeMo Guardrails (which itself integrates TLM as one option), and the various open source self-consistency and self-evaluation methods in the literature.
Compared to all of these, Cleanlab's distinguishing feature has been a principled grounding in published, peer-reviewed algorithms with provable noise-rate estimation guarantees, rather than a heuristic toolkit. Whether that pedigree is what users actually pay for, or whether the open source library and the public benchmark study were the real growth drivers, is a question companies don't tend to answer in public.