Cleanlab

Updated Jun 25, 2026

Cleanlab is an open source Python library for automatically finding and fixing label errors and other data quality problems in machine learning datasets, and the data-centric AI startup, incorporated in 2021, that grew out of it. ^[1]^[3] The library implements confident learning, an algorithm developed by Curtis Northcutt during his PhD at the Massachusetts Institute of Technology that estimates which examples in a labeled dataset have been given the wrong label, and it has more than 11,000 stars on GitHub. ^[1]^[3] The company Cleanlab Inc. was founded by Northcutt together with Anish Athalye and Jonas Mueller, all three MIT computer science PhDs, raised roughly $30 million in venture funding, and was acquired by the AI data-labeling firm Handshake in late January 2026. ^[9]^[13]

The library was first open sourced in 2017 by Northcutt, then a graduate student working with Isaac Chuang; Northcutt, Lu Jiang, and Chuang later formalized the method in the confident learning paper. ^[1] Cleanlab Inc. followed in late 2021. ^[6]^[9]

The project is one of the more visible flag bearers of data-centric AI, the school of thought that says fixing your dataset usually beats tweaking your model. The library is used across supervised learning practice, and the underlying method was applied in 2021 to show that the test sets of ten widely used machine learning benchmarks, including ImageNet, MNIST, and CIFAR-10, contain mislabeled examples, with thousands in ImageNet alone. ^[2] That study made it onto the front page of Hacker News, was picked up by VentureBeat and MIT News, and put a number on something a lot of practitioners had suspected for years: the gold-standard datasets aren't gold.

In 2023, Cleanlab Inc. raised $5 million in seed funding led by Bain Capital Ventures and launched Cleanlab Studio for Enterprise, a hosted version of the open source tool. ^[6] A $25 million Series A followed in October 2023, co-led by Menlo Ventures and TQ Ventures, bringing total funding to $30 million. ^[7]^[13] In April 2024 the company introduced the Trustworthy Language Model (TLM), a wrapper that scores how reliable any LLM response is, aimed at hallucination detection in retrieval-augmented generation systems. ^[8] In January 2026, Cleanlab was acquired by Handshake AI in what was reported as primarily an acqui-hire; nine Cleanlab employees including all three founders moved to Handshake's research organization. ^[9]^[13]

What is Cleanlab?

Cleanlab is, at its core, a way to point a model at its own training data and ask which labels are probably wrong. Give the library a labeled dataset and a set of predicted class probabilities (from any model), and it returns a ranked list of the examples most likely to be mislabeled, plus a broader audit of outliers, near-duplicates, and other data issues. The same name covers two things: the open source package (pip install cleanlab), released under AGPL-3.0 in earlier versions and slated to move to Apache-2.0, and the venture-backed company that built Cleanlab Studio and the Trustworthy Language Model on top of it. ^[3]^[9]

The library's own description, from the GitHub README, is that it is "the standard data-centric AI package for data quality and machine learning with messy, real-world data and labels." ^[3] Its selling point is that the entire workflow is model-agnostic and can be invoked in roughly one line of code, working with any dataset (text, image, tabular, audio) and any model (scikit-learn, PyTorch, XGBoost, OpenAI). ^[3]

How does Cleanlab work? The confident learning algorithm

The theoretical core of Cleanlab is confident learning (CL), introduced in the paper "Confident Learning: Estimating Uncertainty in Dataset Labels" by Northcutt, Lu Jiang, and Isaac Chuang. The paper was first posted to arXiv as 1911.00068 in October 2019 and published in the Journal of Artificial Intelligence Research in 2021. ^[1] It won the IJCAI-JAIR Best Paper Prize in 2024.

The basic idea is almost embarrassingly simple once you've seen it. Suppose you have a labeled dataset, and you have a model (any model) that produces predicted class probabilities for each example. Some of the labels in the dataset are wrong. Confident learning asks: which labels does the model disagree with confidently enough that we should suspect the label rather than the model?

The algorithm builds something called the confident joint matrix. Imagine a square matrix where rows are observed (given) labels and columns are latent (true) labels. For each example, you check the predicted probability of every class. If the predicted probability for some class j exceeds a per-class threshold (typically the average self-confidence the model has for that class), the example is counted as a member of a confident set for class j. The confident joint then records, for each pair (given label i, predicted label j), how many examples in the dataset are confidently in class j according to the model but were labeled i by the annotator. ^[1]
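
The construction described above can be sketched in a few lines of NumPy. This is a minimal illustration written for this article, not the library's implementation; the function name, the toy data, and the tie-breaking rule (keep the highest-probability class when several exceed their thresholds) are assumptions made for the example.

```python
import numpy as np

def confident_joint(labels: np.ndarray, pred_probs: np.ndarray) -> np.ndarray:
    """Count examples by (given label, confidently predicted label)."""
    n_classes = pred_probs.shape[1]
    # Per-class threshold: mean predicted probability of class j over the
    # examples whose given label is j (the model's average self-confidence).
    thresholds = np.array(
        [pred_probs[labels == j, j].mean() for j in range(n_classes)]
    )
    # An example is a confident member of class j if its probability for j
    # reaches the threshold for j.
    above = pred_probs >= thresholds
    # If several classes qualify, keep the one with the highest probability.
    confident_class = np.where(above, pred_probs, -np.inf).argmax(axis=1)
    has_confident_class = above.any(axis=1)

    joint = np.zeros((n_classes, n_classes), dtype=int)
    for given, guess in zip(labels[has_confident_class],
                            confident_class[has_confident_class]):
        joint[given, guess] += 1
    return joint

# Toy data (invented): six examples, two classes.
labels = np.array([0, 0, 0, 1, 1, 1])
pred_probs = np.array([
    [0.90, 0.10],
    [0.80, 0.20],
    [0.05, 0.95],  # labeled 0, but the model is confident it is class 1
    [0.10, 0.90],
    [0.40, 0.60],  # not confident enough for either class, so not counted
    [0.10, 0.90],
])
joint = confident_joint(labels, pred_probs)
print(joint)
```

The single off-diagonal count in row 0, column 1 corresponds to the third example, which is the kind of candidate label error the next paragraph discusses.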

The off-diagonal entries of this matrix are where the action is. An entry at position (i, j) where i is not equal to j counts examples that the model is confident belong to class j, but a human said are class i. Those are the candidate label errors. After calibrating the matrix so its row sums match the actual class counts, you get an estimate of the joint distribution between observed and true labels, which lets you back out the true class prior and the noise transition matrix without ever needing access to the true labels. ^[1]
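
A worked example of the calibration step may help. The sketch below starts from an invented confident joint and invented per-class label counts, rescales each row to match the observed counts, and then derives the true-class prior and the noise transition matrix. Variable names are chosen for this illustration only.

```python
import numpy as np

# Toy confident joint: rows = given label, columns = confidently predicted label.
# These counts are invented for illustration.
confident_joint = np.array([[40, 10],
                            [5, 45]])
# Number of examples carrying each given label in the full dataset
# (larger than the row sums because some examples are never confidently counted).
label_counts = np.array([60, 60])

# Calibrate: rescale each row so it sums to the observed count for that label.
row_sums = confident_joint.sum(axis=1, keepdims=True)
calibrated = confident_joint / row_sums * label_counts[:, None]

# Normalize to a joint distribution over (given label, true label).
joint = calibrated / calibrated.sum()

# Marginalize over given labels to estimate the prior of the true label.
true_prior = joint.sum(axis=0)

# Noise matrix: P(given = i | true = j); each column sums to 1.
noise_matrix = joint / true_prior

# Estimated number of label errors: off-diagonal mass times dataset size.
n_label_errors = (1.0 - np.trace(joint)) * label_counts.sum()

print(true_prior)       # estimated share of each true class
print(noise_matrix)     # estimated flipping probabilities
print(n_label_errors)   # estimated count of mislabeled examples
```

With these toy numbers the off-diagonal mass is 15% of 120 examples, so the estimate is 18 label errors, obtained without consulting any ground-truth labels.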

From there CL does three things: prune the suspicious examples (drop them from training), reweight the remaining ones to correct for the estimated class imbalance, and rerank candidates by a confidence-based score so the worst label issues come first. Northcutt, Jiang, and Chuang showed that under reasonable assumptions about the noise process (class-conditional noise, plus a model that is at least as accurate as random for each class), confident learning gives provably consistent estimates of the noise rate and the true label distribution. ^[1] They also showed that on CIFAR with synthetic noise, this approach beat seven competitive prior methods including MentorNet, Co-Teaching, and S-Model, outperforming the top methods by over 30% at high noise rates, and that it scaled to ImageNet, where it estimated that around 645 "missile" images were mislabeled as the broader parent class "projectile." ^[1]

A key practical point is that CL is model-agnostic. The algorithm needs only out-of-sample predicted probabilities, which you can produce from any classifier you like through cross-validation. That includes scikit-learn estimators, PyTorch networks, gradient-boosted trees, transformers, anything that can output a softmax. ^[3] This is why the same library that ships with image classification examples can also clean up text, audio, and tabular data.
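
As a sketch of how out-of-sample probabilities are typically produced, the example below uses scikit-learn's `cross_val_predict` with `method="predict_proba"`. The synthetic dataset and the choice of logistic regression are placeholders for this illustration; any classifier that exposes `predict_proba` could be substituted.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for a real labeled dataset.
X, labels = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Each example's probabilities come from a fold in which it was held out,
# so the model never scores a label it was trained on.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, labels, cv=5, method="predict_proba"
)

print(pred_probs.shape)  # one row per example, one column per class
```

Held-out predictions matter because a model scored on its own training examples tends to echo their labels, including the wrong ones.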

What is the cleanlab Python package?

The open source library lives at github.com/cleanlab/cleanlab. It is pure Python and built on NumPy and scikit-learn; a deep learning framework such as PyTorch is needed only if the user's own model requires one. Following the January 2026 Handshake acquisition, the company stated that the package would remain open source under the more permissive Apache-2.0 license (earlier releases carried an AGPL-3.0 license). ^[9] The package runs on Python 3.10 and up across Linux, macOS, and Windows. ^[3] The latest stable release as of early 2026 is v2.9.0, shipped on January 13, 2026, and the package had around 11,400 stars on GitHub at that time. ^[3] It is installed via pip or conda.

The entry points most people start with are these.

from cleanlab.classification import CleanLearning
from sklearn.linear_model import LogisticRegression

# X_train, X_test: feature arrays; labels: integer class labels 0..K-1 (assumed already loaded)
classifier = LogisticRegression(max_iter=400)
cl = CleanLearning(classifier)

# train a classifier on noisy labels, automatically filtering bad ones
cl.fit(X_train, labels)

# get predictions on new data
preds = cl.predict(X_test)

# get the per-example label-issue table computed during fit (a pandas DataFrame
# with an is_label_issue flag and a label_quality score for each training example)
label_issues = cl.get_label_issues()

CleanLearning is the modern name for what used to be called LearningWithNoisyLabels. It wraps any sklearn-style classifier, runs cross-validation to get out-of-sample probabilities, applies confident learning to identify label issues, and refits on the cleaned data. You can use it as a drop-in robust trainer.

For users who only want the diagnostic, cleanlab.filter.find_label_issues takes the labels and predicted probabilities directly and returns a boolean mask or a ranked list of suspicious examples. Several scoring methods are available, including normalized margin and self-confidence ranking.
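
To make the two scoring methods concrete, here is a small NumPy re-implementation written for this article. It is not the package's code, and the library may rescale the values; only the resulting ordering matters here. Self-confidence is the predicted probability of the given label, and normalized margin subtracts the highest probability among the other classes.

```python
import numpy as np

def self_confidence(labels, pred_probs):
    """Predicted probability of each example's given label."""
    return pred_probs[np.arange(len(labels)), labels]

def normalized_margin(labels, pred_probs):
    """Given-label probability minus the best competing class probability."""
    given = self_confidence(labels, pred_probs)
    others = pred_probs.copy()
    others[np.arange(len(labels)), labels] = -np.inf  # mask the given label
    return given - others.max(axis=1)

# Toy data (invented): four examples, three classes.
labels = np.array([0, 1, 2, 0])
pred_probs = np.array([
    [0.80, 0.10, 0.10],
    [0.20, 0.70, 0.10],
    [0.60, 0.30, 0.10],  # given label 2 receives only 0.10
    [0.40, 0.35, 0.25],
])

# Lower score = more suspicious, so sort ascending to put likely errors first.
ranked_by_self_confidence = np.argsort(self_confidence(labels, pred_probs))
ranked_by_margin = np.argsort(normalized_margin(labels, pred_probs))

print(ranked_by_self_confidence)
print(ranked_by_margin)
```

Both rankings put the third example first, since its given label has low probability and a competing class has high probability.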

The cleanlab.dataset module gives a higher-level health summary of a labeled dataset: which classes have the worst label quality, where classes overlap, how many issues are estimated to be in each class. It is useful for triage on a large dataset where you don't yet know where to look.

Datalab

The newer interface, introduced in cleanlab 2.4 (2023) and expanded since, is Datalab. Where the original API focused tightly on label errors in classification, Datalab is a general data audit. It looks for several issue types at once.

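from cleanlab import Datalab

# df: the dataset, with its labels stored in a column named "y" (assumed already loaded)
# embeddings: numeric feature vectors, one row per example
# predicted_probs: out-of-sample class probabilities, one row per example
lab = Datalab(data=df, label_name="y")
lab.find_issues(features=embeddings, pred_probs=predicted_probs)
lab.report()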

The issues Datalab can detect include label errors (the original confident learning use case), out-of-distribution outliers, near-duplicate examples, ambiguous examples that sit on class boundaries, low-information examples that are essentially noise, examples that are too similar to test data (data leakage), and class imbalance. It works on tabular, text, image, and audio data as long as you can give it features (raw or embedded) and predicted probabilities.
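
To give a feel for how embedding-based checks such as near-duplicate and outlier detection can work, the sketch below flags points by their nearest-neighbor distance. This is a simplified stand-in technique, not Datalab's implementation; the toy embeddings and both thresholds are arbitrary choices made for the illustration.

```python
import numpy as np

# Hypothetical 2-D embeddings: a regular 5x5 grid of "normal" examples,
# one near-copy of example 0, and one point far from everything else.
grid = np.array([(x, y) for x in range(5) for y in range(5)], dtype=float)
near_copy = grid[0] + 1e-3
far_point = np.array([50.0, 50.0])
embeddings = np.vstack([grid, near_copy, far_point])

# Pairwise Euclidean distances, ignoring each point's distance to itself.
diffs = embeddings[:, None, :] - embeddings[None, :, :]
dists = np.sqrt((diffs ** 2).sum(axis=-1))
np.fill_diagonal(dists, np.inf)
nearest = dists.min(axis=1)

# Compare each nearest-neighbor distance with the typical one.
median_nn = np.median(nearest)
near_duplicates = np.flatnonzero(nearest < 0.01 * median_nn)  # arbitrary cutoff
outliers = np.flatnonzero(nearest > 10 * median_nn)           # arbitrary cutoff

print(near_duplicates)  # indices of examples with an almost identical neighbor
print(outliers)         # indices of examples far from all others
```

The all-pairs distance matrix is fine for a toy example but grows quadratically; a real audit over a large dataset would use an approximate nearest-neighbor index instead.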

Supported tasks

The library has gradually grown beyond standard multi-class classification. The table below summarizes the major task-level modules.

Task	Module	What it detects
Multi-class classification	`cleanlab.classification`, `cleanlab.filter`	Mislabeled examples in standard classification
Multi-label classification	`cleanlab.multilabel_classification`	Wrong tag sets in image or document tagging
Regression	`cleanlab.regression`	Examples whose target value looks wrong
Token classification	`cleanlab.token_classification`	Bad token-level tags in NER and similar sequence labeling
Image segmentation	`cleanlab.segmentation`	Per-pixel label errors in segmentation masks
Object detection	`cleanlab.object_detection`	Mislabeled or missed bounding boxes
Outlier detection	`cleanlab.outlier`	Out-of-distribution examples
Multi-annotator agreement	`cleanlab.multiannotator`	Quality of crowdsourced labels
Active learning	`cleanlab.multiannotator`	Which examples are most worth labeling or relabeling next

Each of these is a separate research thread, and the package release notes track which paper backed each module. The multi-annotator support, for example, is closer to the literature on truth inference from crowd annotations than to the original confident learning work.

How dirty are popular ML benchmarks? The 2021 label-errors study

The paper that put Cleanlab on the public radar was "Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks," by Northcutt, Athalye, and Mueller, presented at the NeurIPS 2021 Datasets and Benchmarks track. The arXiv identifier is 2103.14749. ^[2] It was nominated for best paper.

The team ran confident learning on the test sets of ten of the most widely used machine learning benchmarks, then sent the algorithmically flagged candidates to crowd workers on Mechanical Turk for human verification. About 51% of the algorithmic candidates were confirmed by humans as actual label errors. ^[2] The headline number from the paper: an estimated average of 3.4% errors across the ten benchmarks studied, with at least 2,916 errors (about 6%) in the ImageNet validation set alone and an estimated 5 million-plus errors (about 10%) in QuickDraw. ^[2]

Benchmark	Modality	Estimated label error rate
MNIST	Image	About 0.15%
CIFAR-10	Image	About 0.54%
CIFAR-100	Image	About 5.85%
Caltech-256	Image	About 1.54%
ImageNet (validation set)	Image	About 5.83% (2,916 of 50,000 images, rounded to 6% in the text)
QuickDraw	Image	About 10.12%
Amazon Reviews	Text	About 4%
IMDB	Text	About 2.9%
20news	Text	About 1.1%
AudioSet	Audio	About 1.3%

(Exact percentages and absolute counts are tabulated in the NeurIPS paper and the labelerrors.com website.) ^[2]^[5]

The paper went a step beyond "these test sets are dirty." The authors corrected the test sets and then re-evaluated a wide range of pretrained models on the cleaned data. The result was the more interesting finding: when test sets are corrected, the relative ranking of models can flip. On ImageNet with corrected labels, ResNet-18 outperformed ResNet-50 once the prevalence of originally mislabeled test examples rose by about 6%. On CIFAR-10 with corrected labels, VGG-11 surpassed VGG-19 under a similar increase of about 5%. ^[2] In other words, when a benchmark is more than a few percent noisy, a higher-capacity model that fits the test errors more aggressively can score better on the dirty version of the benchmark while actually being a worse model. The implication: leaderboards can be measuring the wrong thing if nobody audits the test set.
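
A toy calculation shows how such a flip can arise. All numbers below are invented for illustration and are not taken from the paper: a ten-example test set has two wrong labels, model A reproduces those wrong labels, and model B predicts the true classes for them.

```python
import numpy as np

# Toy 10-example test set. Examples 8 and 9 carry wrong labels in the
# original benchmark; `corrected` holds the fixed labels.
original = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
corrected = np.array([0, 1, 0, 1, 0, 1, 0, 1, 1, 0])

# Model A agrees with the mislabeled examples but misses two clean ones.
model_a = np.array([0, 1, 0, 1, 0, 1, 1, 0, 0, 1])
# Model B misses one clean example but predicts the true class of 8 and 9.
model_b = np.array([0, 1, 0, 1, 0, 1, 0, 0, 1, 0])

def accuracy(predictions, targets):
    """Fraction of predictions that match the target labels."""
    return float((predictions == targets).mean())

acc_a_original = accuracy(model_a, original)
acc_b_original = accuracy(model_b, original)
acc_a_corrected = accuracy(model_a, corrected)
acc_b_corrected = accuracy(model_b, corrected)

print(acc_a_original, acc_b_original)    # A ahead on the noisy labels
print(acc_a_corrected, acc_b_corrected)  # B ahead once labels are fixed
```

On the original labels A scores 0.8 against B's 0.7; on the corrected labels A drops to 0.6 while B rises to 0.9, so the leaderboard order reverses.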

The paper got picked up by MIT News, VentureBeat, Wired, and others. It also gave the broader data-centric AI community a concrete artifact to point to whenever someone argued that benchmark numbers were the only thing that mattered.

labelerrors.com

Alongside the paper, the team launched labelerrors.com, a public site that lets you scroll through the actual label errors found in each of the ten benchmarks. ^[5] The site shows a thumbnail of an image, the original label, the alternative label crowd workers preferred, and the corrected label. There is a noticeable comedic element. ImageNet has things like a "projectile" that is a missile, a "chimpanzee" that is in fact a gorilla, a tabby cat labeled as an ox. The MNIST examples include digits where the writer clearly drew one number and the dataset label says another. The QuickDraw examples are doodles labeled as completely different objects.

The site is part of the open source cleanlab/label-errors GitHub repository, which also publishes the corrected test sets so other researchers can run their own re-evaluations. ^[5]

What is Cleanlab Studio?

Cleanlab Studio is the company's hosted commercial product. It launched in July 2023 alongside the seed funding announcement. ^[6] Studio packages the open source algorithms behind a no-code web interface and adds enterprise features the open source library doesn't try to provide.

Its core function is a data quality audit: upload a labeled dataset (text, image, or tabular), and Studio runs Datalab-style checks against it, flagging label errors, outliers, near-duplicates, and ambiguous examples. The interface lets non-ML users review the flagged items, accept or reject suggested corrections, and export a cleaned version of the dataset. There is also an AutoML side: Studio will train a model under the hood to produce the predicted probabilities that confident learning needs, so users don't have to bring their own classifier.

Studio is positioned for enterprise data teams that lack an ML research org to build label cleaning pipelines from scratch. The customer list disclosed by the company includes BBVA, Tencent, Amazon, Oracle, Google, and Databricks. ^[7] Cleanlab also reports use across more than 100 Fortune 500 companies, though most are not named publicly.

What is the Trustworthy Language Model (TLM)?

The Trustworthy Language Model, introduced on April 25, 2024, is Cleanlab's pivot toward the generative AI era. ^[8] The thesis: as LLMs get embedded in production systems, the bottleneck moves from training data quality to inference reliability. Hallucinations, retrieval failures, and confidently wrong answers are the new label errors.

TLM is an API wrapper that takes any prompt-and-response pair (or just a prompt, in which case it generates a response itself) and returns a trustworthiness score between 0 and 1. A low score means the response is likely wrong. The score is built from several signals: token-level uncertainty from the underlying LLM, ensembling across multiple LLM calls with prompt variations, self-reflection prompts that ask the model to evaluate its own answer, and a specialized aggregator that learns to combine these into a single calibrated reliability number. ^[8]

TLM is most often deployed as a hallucination detector for retrieval-augmented generation. In a RAG system, the LLM has to combine retrieved documents with a user query. If the retrieval is bad, the answer is wrong; if the retrieval is good but the model overgenerates, the answer is wrong. TLM scores each response and lets the application escalate low-confidence answers to a human, return a fallback message, or trigger another retrieval pass. Cleanlab benchmarks reported in 2024 and 2025 claim that TLM detects incorrect answers with substantially better precision than other hallucination detectors and self-evaluation methods, often around 3x better in their RAG-specific benchmarks. Those claims are the company's own and should be read accordingly.
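
The escalation pattern described above can be sketched as a small routing function. Everything here is hypothetical: the function names, the two thresholds, and the stand-in scorer are assumptions for illustration, and the real TLM client is not called.

```python
from typing import Callable

def route_response(
    prompt: str,
    response: str,
    score_fn: Callable[[str, str], float],
    escalate_below: float = 0.5,
    retry_below: float = 0.8,
) -> str:
    """Pick an action from a trustworthiness score in [0, 1]."""
    score = score_fn(prompt, response)
    if score < escalate_below:
        return "escalate_to_human"
    if score < retry_below:
        return "retry_retrieval"
    return "return_response"

# Stand-in scorer for demonstration only; a real system would call a
# trust-scoring service here instead of matching on a phrase.
def fake_score(prompt: str, response: str) -> float:
    return 0.2 if "not sure" in response else 0.95

question = "What is the refund window?"
action_low = route_response(question, "I'm not sure, maybe 90 days.", fake_score)
action_high = route_response(question, "30 days, per the policy document.", fake_score)
action_mid = route_response(question, "Probably 30 days.", lambda p, r: 0.6)

print(action_low, action_high, action_mid)
```

Passing the scorer in as a function keeps the routing logic testable without network access; in practice the thresholds would be tuned on labeled examples from the application.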

The TLM client library, cleanlab-tlm, is a separate Python package on PyPI and GitHub. ^[12] It works with OpenAI, Anthropic, and most major model APIs. NVIDIA integrated TLM into NeMo Guardrails as one of the supported hallucination check options.

Who created Cleanlab? Founders and company history

The three founders met at MIT, where all three earned PhDs in computer science. ^[9]^[13]

Curtis Northcutt (CEO and co-founder) earned his computer science PhD from MIT in May 2021 with a thesis titled "Confident Learning for Machines and Humans." ^[10] His advisor was Isaac Chuang, who is more famous in the quantum computing world for building one of the first working quantum computers. Northcutt's thesis won the Google PhD Fellowship in machine learning. Before MIT he was at Vanderbilt and was a Fulbright Scholar in Ecuador. After the Handshake acquisition in early 2026, he became Director of AI Research at Handshake. ^[9]

Jonas Mueller (Chief Scientist and co-founder) holds a CS PhD from MIT, where he worked on machine learning for noisy data and statistical inference. ^[13] Before founding Cleanlab he was a senior scientist at Amazon Web Services, where he co-built AutoGluon, the AutoML framework, and contributed to several other AWS AI services. Mueller's published work spans causal inference, deep learning for tabular data, and uncertainty estimation.

Anish Athalye (CTO and co-founder) holds a CS PhD from MIT, focused on systems and security. ^[13] He is widely cited in adversarial machine learning for "Obfuscated Gradients Give a False Sense of Security," which won a best paper award at ICML 2018 and circumvented, fully or in part, seven of nine then-state-of-the-art adversarial defenses accepted to ICLR 2018. He is also the author of several popular open source projects including the dotbot dotfiles manager.

The company was incorporated in late 2021 after the cleanlab open source library and the two foundational papers had already established the technical brand. ^[6] It operated quietly for about 18 months before announcing its $5 million seed round on July 20, 2023. ^[6] The Series A came less than three months later, on October 10, 2023, at $25 million, co-led by Menlo Ventures and TQ Ventures with participation from Bain Capital Ventures and Databricks Ventures. ^[7] Other angel investors included Naveen Rao (founder of MosaicML and now at Databricks), Frederic Kerrest (co-founder of Okta), and founders from GitHub and Yahoo. ^[6]

In his letter announcing the Handshake acquisition, Northcutt described the company's mission as building "the cleanest, highest quality AI data on Earth" and said the team's "singular focus will be to lead Handshake AI research in producing the highest quality data to train frontier AI models." ^[9]

Cleanlab was named to the Forbes AI 50 in 2024 and the CB Insights AI 100 in 2024.

When was Cleanlab acquired? The Handshake deal

On January 28, 2026, TechCrunch reported that Handshake had acquired Cleanlab. ^[13] The company confirmed the deal the same day. ^[9] Terms were not disclosed. The deal was structured primarily as an acqui-hire: nine Cleanlab employees, including the three co-founders, joined Handshake's research organization. ^[13]

Handshake AI is the AI division of the career-services platform Handshake (last valued at $3.3 billion in 2022), and runs a data labeling operation that sources human experts for AI labs and that competes with companies like Scale AI, Surge, and Mercor. ^[13] TechCrunch reported that Cleanlab had received acquisition interest from multiple other AI data-labeling companies before choosing Handshake. ^[13] Northcutt's framing of the deal was that since Handshake provides labeling infrastructure to many of those competitors anyway, joining the source rather than a downstream tool was the right move: "If you're going to pick one, you should probably pick the source, not the middleman." ^[13]

Cleanlab said the open source cleanlab library would continue to be maintained and would move to the more permissive Apache-2.0 license. ^[9]

Where is Cleanlab used? Adoption

The open source library is widely used in academic data-centric AI work. It is the recommended tool in the MIT Introduction to Data-Centric AI course, taught by Curtis Northcutt and others, which made its full lecture materials public at dcai.csail.mit.edu. ^[11] The library has been integrated into educational tutorials at NeurIPS, ICML, and KDD.

On the commercial side, Cleanlab has reported usage by more than 100 Fortune 500 companies. Specifically named customers include BBVA, Tencent, Amazon, Oracle, Google, and Databricks. ^[7] Andrew Ng has publicly cited the original confident learning paper as one of his favorite recent breakthroughs in machine learning, which probably did more for the brand than any single piece of marketing the company ever produced.

The library has also been cited in thousands of subsequent papers on label noise, robust learning, and data quality. The cumulative citation count for the original confident learning JAIR paper passed 1,500 in 2024 according to Google Scholar. ^[1]

How does Cleanlab compare to alternatives?

The data quality and labeled-data tooling space is fragmented. A few systems are commonly compared with Cleanlab.

Snorkel, out of Stanford and now Snorkel AI, focuses on weak supervision: programmatically generating training labels from heuristic functions when you don't have ground truth labels at all. Cleanlab and Snorkel solve different parts of the labeling problem. Snorkel produces noisy labels from rules, Cleanlab cleans noisy labels you already have. They can be used together: write Snorkel labeling functions to bootstrap a dataset, then run Cleanlab to find the worst remaining issues.

Aquarium and Lightly are computer-vision-specific data quality platforms with strong support for embedding-based exploration of image datasets. They overlap with Cleanlab on outlier and near-duplicate detection but tend to be more interactive and visual, and less explicitly tied to label noise theory.

Galileo (now part of GuideLabs) and Arize Phoenix sit closer to MLOps observability for production models, including LLMs. They share with TLM the goal of catching bad outputs at inference, but typically rely on different scoring approaches.

For LLM-specific hallucination detection, TLM competes with offerings from Patronus AI, Arize, NVIDIA NeMo Guardrails (which itself integrates TLM as one option), and the various open source self-consistency and self-evaluation methods in the literature.

Compared to all of these, Cleanlab's distinguishing feature has been a principled grounding in published, peer-reviewed algorithms with provable noise-rate estimation guarantees, rather than a heuristic toolkit. Whether that pedigree is what users actually pay for, or whether the open source library and the public benchmark study were the real growth drivers, is a question companies don't tend to answer in public.

References

1. Northcutt, Curtis G., Lu Jiang, and Isaac L. Chuang. "Confident Learning: Estimating Uncertainty in Dataset Labels." *Journal of Artificial Intelligence Research* 70 (2021): 1373-1411. arXiv:1911.00068. https://arxiv.org/abs/1911.00068
2. Northcutt, Curtis G., Anish Athalye, and Jonas Mueller. "Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks." *NeurIPS 2021 Track on Datasets and Benchmarks*. arXiv:2103.14749. https://arxiv.org/abs/2103.14749
3. Cleanlab GitHub repository. https://github.com/cleanlab/cleanlab
4. Cleanlab documentation. https://docs.cleanlab.ai/
5. labelerrors.com. https://labelerrors.com/
6. Cleanlab. "Cleanlab Emerges with $5 million to Automate Data Curation for LLMs and the Modern AI Stack" (seed funding and Cleanlab Studio launch). Business Wire, July 20, 2023. https://www.businesswire.com/news/home/20230720360972/en/
7. Cleanlab. "Cleanlab Raises $25M Series A to Automatically Increase the Value and Accuracy of the World's Enterprise Data Used by AI, ML, and Analytics Solutions." Business Wire, October 10, 2023. https://www.businesswire.com/news/home/20231010484401/en/
8. Cleanlab. "Cleanlab Announces Billion-Dollar Breakthrough in Detecting AI Hallucinations" (Trustworthy Language Model). April 25, 2024. https://cleanlab.ai/blog/
9. Cleanlab. "Letter from the CEO: Handshake acquires Cleanlab." January 28, 2026. https://cleanlab.ai/blog/handshake-acquires-cleanlab/
10. Northcutt, Curtis G. "Confident Learning for Machines and Humans." PhD thesis, MIT, 2021. https://dspace.mit.edu/handle/1721.1/139321
11. MIT Introduction to Data-Centric AI course. https://dcai.csail.mit.edu/
12. Cleanlab TLM client library. https://github.com/cleanlab/cleanlab-tlm
13. Wiggers, Kyle. "AI data labeler Handshake buys Cleanlab, an acquisition target of multiple others." TechCrunch, January 28, 2026. https://techcrunch.com/2026/01/28/ai-data-labeler-handshake-buys-cleanlab-an-acquisition-target-of-multiple-others/
