Data-centric AI (DCAI) is an approach to building machine learning systems that puts systematic improvement of the training data at the center of the workflow, rather than treating data as a fixed input and iterating mainly on model architecture or training algorithms. The discipline was popularized by Andrew Ng in 2021 and has since grown into a research area with its own NeurIPS workshops, dedicated tooling vendors, and a freely available MIT course.
The core claim is straightforward: in many real applications, the model has long since stopped being the bottleneck. Two engineers using the same off-the-shelf architecture on the same dataset will get nearly identical results; the same engineers, handed a noisy dataset and asked to clean it, will produce wildly different models. So the leverage, the argument goes, is in the data work that almost nobody wants to do.
DCAI is sometimes presented as the opposite of model-centric AI, but in practice the two are complementary. Most production teams iterate on both. The reason DCAI gets a name at all is that the data side of that loop was, for years, treated as a one-time setup task rather than an engineering discipline with its own tools, metrics, and best practices.
The term entered wide use in March 2021, when Andrew Ng gave a talk titled "A Chat with Andrew on MLOps: From Model-centric to Data-centric AI." He argued that academic ML research had spent roughly a decade fixating on model architecture while holding benchmark datasets fixed, and that this had trained a generation of practitioners to reach for a new model whenever a project stalled. In industry, where datasets are often small, noisy, and domain-specific, that habit fails. Ng claimed that in the projects he had seen at Landing AI, perhaps 80 percent of the actual work was data work. That number is not a measured statistic so much as a rule of thumb, but it captured a real frustration that many practitioners shared.
Later that year, Ng and a group of collaborators organized the first NeurIPS Data-Centric AI Workshop. The workshop framed DCAI as "the discipline of systematically engineering the data needed to successfully build an AI system," and grouped its scope into six broad areas: data collection and generation, data labeling, data preprocessing and augmentation, data quality evaluation, data debt, and data governance. Workshops have continued at NeurIPS in subsequent years, alongside courses and competitions.
The contrast with model-centric work is easiest to see in a side-by-side view.
| Aspect | Model-centric approach | Data-centric approach |
|---|---|---|
| What is held fixed | Dataset is fixed; model is varied | Model is fixed; dataset is varied |
| Primary lever | Architecture, hyperparameters, optimizer | Label quality, coverage, augmentation, filtering |
| Typical metric to improve | Test accuracy on a held-out benchmark | Real-world reliability, robustness, label error rate |
| Typical failure mode | Diminishing returns on bigger models | Garbage in, garbage out; data cascades |
| Where it dominates | Frontier research, benchmark leaderboards | Production ML, regulated and high-stakes domains |
| Representative tools | PyTorch, JAX, TensorFlow, Hugging Face Transformers | cleanlab, Snorkel, Label Studio, DVC, Great Expectations |
In frontier LLM research, both approaches now happen at once. Architecture work continues, but the most measurable gains in 2024 and 2025 came from changes to pretraining data, not from new attention variants.
DCAI is less a single technique than a stance about where ML effort should go. A few principles show up across most descriptions of the field.
Iterate on data, not just models. The training set is treated as a deliverable that goes through versions, reviews, and tests, the same way code does. If the model is underperforming on a particular slice, the first response is to look at the data for that slice rather than to retrain a larger network.
Treat data as code. Datasets should be versioned, validated against schemas, monitored in production, and rolled back when something breaks. Tools like DVC, lakeFS, and Great Expectations come from this principle.
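As a minimal illustration of the idea, here is a hand-rolled batch validation check; the column names, dtypes, and bounds are invented for the example, and a real pipeline would express the same assertions in Great Expectations or a similar tool and run them in CI before any retraining job.

```python
# Hand-rolled sketch of data-as-code validation. Schema and bounds are
# illustrative assumptions, not a real production contract.
import pandas as pd

EXPECTED_DTYPES = {"user_id": "int64", "age": "int64", "label": "object"}

def validate_batch(df: pd.DataFrame) -> list[str]:
    errors = []
    for col, dtype in EXPECTED_DTYPES.items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            errors.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    if "age" in df.columns and not df["age"].between(0, 120).all():
        errors.append("age outside [0, 120]")
    if df.duplicated().any():
        errors.append("duplicate rows in batch")
    return errors

batch = pd.DataFrame({"user_id": [1, 2], "age": [34, 150], "label": ["a", "b"]})
if (problems := validate_batch(batch)):
    raise ValueError(f"reject batch: {problems}")  # fail fast, like a unit test
```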
Quality over quantity. A smaller, cleaner dataset often beats a larger, noisier one. Curtis Northcutt and colleagues showed in 2021 that about 6 percent of the ImageNet validation set is mislabeled and that correcting these labels can flip the ranking of widely used architectures: ResNet-18 outperforms ResNet-50 once enough mislabels are removed. The implication is that a chunk of architectural progress on noisy benchmarks may be measurement noise.
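Confident learning, the technique behind that study, is available in the open-source cleanlab package. The sketch below injects synthetic label noise into a toy dataset and asks cleanlab to surface the likely errors; the dataset, model, and noise level are illustrative choices, not a recommended recipe.

```python
# Sketch: surfacing likely label errors with cleanlab's confident learning.
# We corrupt 100 labels in sklearn's digits dataset so there are real
# errors to find.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

X, y = load_digits(return_X_y=True)
labels = y.copy()
rng = np.random.default_rng(0)
flip = rng.choice(len(labels), size=100, replace=False)
labels[flip] = rng.integers(0, 10, size=100)   # inject synthetic label noise

# Out-of-sample predicted probabilities via cross-validation, so no example
# is scored by a model that trained on it.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=2000), X, labels,
    cv=5, method="predict_proba",
)

# Indices whose given label conflicts with the model's confident prediction
# -- candidates for human review, not automatic relabeling.
issues = find_label_issues(
    labels=labels, pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(f"{len(issues)} suspected issues; "
      f"{np.isin(issues, flip).mean():.0%} of them are injected flips")
```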
Systematic error analysis. Rather than reporting a single accuracy number, DCAI workflows slice errors by class, by region, by collection date, and by any other axis that might reveal a data problem. The fix is then a targeted data change: relabeling a confusing class, collecting more examples from an underrepresented slice, removing a contaminated source.
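A minimal version of this needs nothing more than a predictions table and a groupby; the slice key and toy data below are invented for the example.

```python
# Sketch of slice-level error analysis with pandas. Real pipelines would
# join model outputs with metadata such as region or collection date.
import pandas as pd

df = pd.DataFrame({
    "true_label": ["cat", "dog", "dog", "cat", "dog", "cat"],
    "pred_label": ["cat", "dog", "cat", "cat", "cat", "dog"],
    "region":     ["us",  "us",  "eu",  "eu",  "eu",  "eu"],
})
df["correct"] = df["true_label"] == df["pred_label"]

# One aggregate accuracy hides the failure; per-slice accuracy exposes it.
per_slice = (
    df.groupby("region")["correct"]
      .agg(accuracy="mean", n="size")
      .sort_values("accuracy")
)
print(per_slice)  # the low-accuracy slice is where to relabel or collect data
```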
Documentation and provenance. Datasets should travel with their context. Datasheets for Datasets, proposed by Timnit Gebru and coauthors in 2018, are a structured template for recording why a dataset was created, how it was collected, what its known limitations are, and who maintains it. Model cards play the same role for trained models.
The field covers a wide set of techniques, most of which existed before the term "data-centric AI" did. The novelty is in pulling them under one umbrella and treating them as parts of a single workflow.
| Technique | What it does | Example tools |
|---|---|---|
| Label cleaning | Detects and fixes mislabeled training examples using model disagreement, confident learning, or rater consensus | cleanlab, Cleanlab Studio |
| Inter-rater agreement | Measures whether independent annotators agree, then resolves disputes through adjudication or majority voting | Label Studio, Prodigy, custom QA pipelines |
| Data labeling | Creates supervision signals through human annotation, often with active queue routing and review workflows | Label Studio, Scale AI, Surge AI, Encord |
| Weak supervision | Replaces hand labels with programmatic labeling functions whose noise is modeled and aggregated | Snorkel, Snorkel Flow |
| Active learning | Selects the most informative unlabeled examples for human review, reducing labeling cost | modAL, baal, Prodigy |
| Data augmentation | Generates new training examples by transforming existing ones (cropping, paraphrasing, mixup, RandAugment) | Albumentations, AugLy, nlpaug |
| Synthetic data | Generates training data from simulators, GANs, diffusion models, or LLMs | Self-Instruct, Alpaca pipeline, Gretel, Mostly AI |
| Data slicing and error analysis | Breaks evaluation into subgroups to find systematic failures the aggregate metric hides | Snorkel slicing functions, Domino, custom dashboards |
| Data validation | Asserts schemas, distributions, ranges, and freshness on incoming data | Great Expectations, Deequ, TFDV |
| Data versioning | Tracks dataset revisions alongside code so experiments are reproducible | DVC, Git LFS, lakeFS, Pachyderm |
| Documentation | Records dataset purpose, collection method, and known biases | Datasheets for Datasets, Data Cards, Model Cards |
| Data quality monitoring | Watches input distributions and label quality after deployment | WhyLabs, Arize, Evidently |
No serious DCAI workflow uses every one of these. Teams pick what fits the failure modes they actually see.
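To make one of those rows concrete, here is a hand-rolled sketch of the core loop behind active learning: train on what is labeled, score the pool, and send the least-confident examples to annotators. Libraries such as modAL package this loop; the dataset, seed size, batch size, and round count below are all illustrative.

```python
# Pool-based active learning with uncertainty (least-confident) sampling.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
labeled = list(range(20))              # tiny initial labeled seed
pool = list(range(20, 500))            # unlabeled pool

for round_ in range(5):
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    probs = clf.predict_proba(X[pool])
    uncertainty = 1 - probs.max(axis=1)            # least confident first
    pick = [pool[i] for i in np.argsort(-uncertainty)[:10]]
    labeled += pick                    # in reality: route these to annotators
    pool = [i for i in pool if i not in pick]
    print(f"round {round_}: {len(labeled)} labeled, "
          f"pool accuracy {clf.score(X[pool], y[pool]):.3f}")
```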
A cluster of vendors has built businesses around the data-centric workflow. The list below is not exhaustive and changes quickly.
| Tool or platform | Focus | Origin |
|---|---|---|
| cleanlab (open source and Studio) | Label error detection, model-derived data quality scores | Spun out of MIT, founded by Curtis Northcutt |
| Snorkel and Snorkel Flow | Programmatic labeling, weak supervision at scale | Spun out of Stanford DAWN |
| Label Studio | Open-source annotation UI for text, image, audio, video | HumanSignal |
| Scale AI | Managed labeling services and data engine for autonomous driving and LLMs | Founded 2016 |
| Surge AI | High-quality human annotation for LLM RLHF | Founded 2020 |
| Encord | Annotation, curation, and model evaluation for vision | Founded 2020 |
| Galileo | LLM and ML data quality, error analysis, hallucination detection | Founded 2021 |
| Lightly | Active learning and curation for vision datasets | Spin-out from ETH Zurich |
| DVC | Git-style versioning for datasets and models | Iterative.ai |
| Great Expectations | Data validation and pipeline testing | Open source |
| Landing AI | End-to-end visual inspection platform built around DCAI principles | Founded by Andrew Ng |
Many of these companies explicitly market themselves with the data-centric label, which has fueled the criticism that DCAI is partly a positioning exercise for tooling vendors.
A short list of the work that gave the field its current shape.
| Year | Item | Why it matters |
|---|---|---|
| 2018 | "Datasheets for Datasets," Gebru et al. | Proposed structured documentation for datasets, modeled on electronics component datasheets |
| 2020 | "Shortcut Learning in Deep Neural Networks," Geirhos et al. | Catalogued how models exploit dataset artifacts rather than learning intended concepts |
| 2021 | "Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks," Northcutt, Athalye, Mueller | Found an average of 3.3 percent label errors across ten major benchmarks; about 6 percent on ImageNet |
| 2021 | "On the Dangers of Stochastic Parrots," Bender, Gebru, McMillan-Major, Shmitchell | Foregrounded data quality, scale, and curation costs in large language models |
| 2021 | "Everyone wants to do the model work, not the data work: Data Cascades in High-Stakes AI," Sambasivan et al. (CHI) | Interview study of 53 AI practitioners; 92 percent reported one or more data cascades, 45.3 percent reported two or more |
| 2021 | NeurIPS Data-Centric AI Workshop | First major academic venue for the field; organized by Andrew Ng and collaborators |
| 2021 | DeepLearning.AI Data-Centric AI Competition | Inverted Kaggle: model architecture was fixed, participants competed on improving a 1,500-image Roman numeral dataset capped at 10,000 examples |
| 2023+ | MIT IAP "Introduction to Data-Centric AI" | Free public course taught by Anish Athalye, Curtis Northcutt, and Jonas Mueller; lectures and labs are openly available |
| 2024 | FineWeb and FineWeb-Edu, Hugging Face | 15 trillion token Common Crawl pretraining corpus produced by aggressive filtering and MinHash deduplication; FineWeb-Edu uses an LLM-trained quality classifier to extract a 1.3T-token educational subset |
The Sambasivan et al. paper in particular sharpened the field's vocabulary. "Data cascades" names the way a small upstream data problem can quietly compound through a pipeline and only show up as a model failure months later, by which point it is hard to trace.
Large language model pretraining has turned out to be a strikingly data-centric activity, even though most of the public attention goes to model size and architecture. The major recipe shifts of the last few years have been about which tokens go into training, in what order, with what filters.
Web-scale pretraining corpora are built through long filtering pipelines: language identification, URL blocklists, perplexity filters, classifier-based quality scoring, line-level heuristics for boilerplate, and aggressive deduplication. FineWeb is a clear public example. Its team processed 96 Common Crawl dumps, ran MinHash deduplication, and ablated each filtering choice against downstream evaluation. The educational subset FineWeb-Edu uses a quality classifier trained on Llama-3-70B-generated annotations and produces noticeably better MMLU and ARC scores at the same compute budget. None of this is a model change.
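The deduplication step can be sketched compactly. The example below uses the open-source datasketch library to drop near-duplicate documents via MinHash signatures and locality-sensitive hashing; the shingle size and similarity threshold are illustrative, not FineWeb's actual settings.

```python
# Sketch of MinHash + LSH near-duplicate filtering, the rough shape of the
# dedup stage in web-scale pipelines. Parameters are illustrative.
from datasketch import MinHash, MinHashLSH

def doc_minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    tokens = text.lower().split()
    for i in range(len(tokens) - 2):               # 3-word shingles
        m.update(" ".join(tokens[i:i + 3]).encode("utf-8"))
    return m

docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick brown fox jumps over the lazy dog near the river shore",
    "c": "completely unrelated text about pretraining data quality and filters",
}

lsh = MinHashLSH(threshold=0.5, num_perm=128)
kept = {}
for key, text in docs.items():
    m = doc_minhash(text)
    if lsh.query(m):          # near-duplicate of something already kept
        continue
    lsh.insert(key, m)
    kept[key] = text

print(sorted(kept))           # typically ['a', 'c']: 'b' dropped as a near-dup
```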
Synthetic data has become a standard ingredient. Self-Instruct, Alpaca, WizardLM, and similar pipelines use a strong model to generate instruction-following data for a smaller one. Constitutional AI extends this idea to safety: models critique and revise their own outputs against a written set of principles, producing alignment data without human raters in the loop for every example. LLM-as-judge setups now play a similar role for evaluation, though the failure modes of using a model to judge itself are still being worked out.
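The shape of a Self-Instruct-style loop is easy to sketch. In the version below, `llm_complete` is a hypothetical stand-in for a real LLM API call (it returns canned text so the sketch runs end to end), and the unigram-overlap filter is a crude proxy for the ROUGE-based diversity filter the original pipeline uses.

```python
# Sketch of Self-Instruct-style bootstrapping of instruction data.
import random

def llm_complete(prompt: str) -> str:
    # Hypothetical stand-in: replace with a call to your model provider.
    return random.choice([
        "Explain the difference between a list and a tuple in Python.",
        "Write a haiku about deduplicated web text.",
        "Convert this CSV row into a JSON object.",
    ])

def too_similar(candidate: str, pool: list[str], threshold: float = 0.7) -> bool:
    # Crude unigram-Jaccard filter to keep the generated pool diverse.
    cand = set(candidate.lower().split())
    return any(
        len(cand & set(t.lower().split())) / len(cand | set(t.lower().split()))
        > threshold
        for t in pool
    )

seed_tasks = [
    "Summarize the following paragraph in one sentence.",
    "Translate this sentence into French.",
    "Write a regex that matches ISO-8601 dates.",
]

task_pool = list(seed_tasks)
for _ in range(50):
    demos = "\n".join(random.sample(task_pool, k=min(3, len(task_pool))))
    new_task = llm_complete(
        f"Here are some task instructions:\n{demos}\nWrite one new, different task:"
    )
    if not too_similar(new_task, task_pool):
        task_pool.append(new_task)   # next steps: generate responses, filter again

print(f"{len(task_pool)} tasks after bootstrapping")
```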
For post-training, the data-centric mindset shows up in careful curation of preference data for RLHF and DPO, deduplication and contamination checks against evaluation benchmarks, and red-teaming datasets that stress safety properties. The shift to test-time compute and reasoning models has not displaced any of this; if anything, it has put more weight on having clean, high-quality reasoning traces in the training mix.
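A basic contamination check is just an n-gram intersection between the training corpus and the benchmark. The sketch below uses an 8-token window and toy documents; real recipes vary the window size and normalization, and run at corpus scale.

```python
# Sketch of an n-gram contamination check against an eval benchmark.
# Window size and documents are illustrative.
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

eval_set = [
    "what is the capital of france paris is the capital of france",
]
train_docs = [
    "trivia dump: what is the capital of france paris is the capital of france",
    "an unrelated training document about filtering and data quality",
]

# Union of benchmark n-grams; any training doc that shares one is suspect
# and should be dropped or manually reviewed before training.
eval_grams = set().union(*(ngrams(q) for q in eval_set))
flagged = [d for d in train_docs if ngrams(d) & eval_grams]
print(f"{len(flagged)} training doc(s) flagged for possible contamination")
```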
DCAI overlaps heavily with MLOps. The two fields grew up at roughly the same time and share most of their tooling. The difference, to the extent there is one, is emphasis. MLOps tends to focus on the lifecycle of models, including deployment, monitoring, and rollback. DCAI focuses on the lifecycle of datasets and the workflows that produce them. In practice, a mature ML platform will have both: dataset versioning sits next to model versioning, data validation runs in the same CI as model unit tests, and dashboards track input drift alongside output quality.
Feature stores, data contracts, and modern data quality tooling sit on the boundary between data engineering and DCAI. The DCAI lens treats them as machine learning concerns rather than purely analytics concerns, since the consumer of the data is a model that will be retrained on it.
The sharpest criticism of DCAI is that the model-centric versus data-centric framing is a false dichotomy. Every working ML team iterates on both. Pretending they are opposed makes for good talks and conference workshops, but the actual practice is mixed. A second criticism is that some of the marketing around DCAI is straightforwardly self-interested: vendors selling labeling platforms, data quality tools, or annotation services have an obvious incentive to promote a worldview in which their product category is where the leverage lives.
There is also a more substantive concern. Calling something "data-centric" can paper over hard questions about who collects the data, under what conditions, and with what consent. The Sambasivan paper is partly a study of how invisible labor in data work, much of it offshored, gets discounted in AI development. A version of DCAI that focuses purely on tools and metrics without engaging with the labor and ethics of data collection misses the point of why the data work is hard in the first place.
None of this means the underlying ideas are wrong. Mislabels in benchmarks are real. Data cascades are real. Filtering recipes drive a large share of LLM quality. The field has produced concrete, replicable techniques. But the slogan is doing more work than the science sometimes warrants, and that is worth keeping in mind when reading vendor blog posts.