# Data-centric AI (DCAI)

> Source: https://aiwiki.ai/wiki/data-centric_ai_dcai
> Updated: 2026-06-27
> Categories: Data & Datasets, MLOps
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

Data-centric AI (DCAI) is the discipline of systematically engineering and improving the data used to train a [machine learning](/wiki/machine_learning) model, rather than holding the data fixed and iterating mainly on model architecture or training algorithms. The approach was popularized in March 2021 by [Andrew Ng](/wiki/andrew_ng), who summarized it with the line "Data is food for AI" and argued that on real projects roughly 80 percent of the work is data work, not modeling.[1][4] The most-cited evidence behind the movement is a 2021 study that found an average of at least 3.3 percent label errors across ten widely used machine learning benchmarks, and at least 6 percent in the ImageNet validation set. Error rates of that size are large enough to change which model architecture looks better.[6]

The core claim is straightforward: in many real applications, the model has long since stopped being the bottleneck. Two engineers using the same off-the-shelf architecture on the same dataset will get nearly identical results. The same engineers handed a noisy dataset and asked to clean it will produce wildly different models. So the leverage, the argument goes, is in the data work that almost nobody wants to do.

DCAI is sometimes presented as the opposite of model-centric AI, but in practice the two are complementary. Most production teams iterate on both. The reason DCAI gets a name at all is that the data side of that loop was, for years, treated as a one-time setup task rather than an engineering discipline with its own tools, metrics, and best practices.

## What is data-centric AI?

The term entered wide use on 24 March 2021, when Andrew Ng gave a talk titled "A Chat with Andrew on MLOps: From Model-centric to Data-centric AI."[1] He argued that academic ML research had spent roughly a decade fixating on model architecture while holding benchmark datasets fixed, and that this had trained a generation of practitioners to reach for a new model whenever a project stalled. In industry, where datasets are often small, noisy, and domain-specific, that habit fails. Ng claimed that in the projects he had seen at [Landing AI](/wiki/landingai), perhaps 80 percent of the actual work was data work.[4] That number is not a measured statistic so much as a rule of thumb, but it captured a real frustration that many practitioners shared.

Later that year, Ng and a group of collaborators organized the first NeurIPS Data-Centric AI Workshop.[2] The workshop framed DCAI, in a definition now hosted at the field's resource hub, as "the discipline of systematically engineering the data used to build an AI system," and grouped its scope into six areas: data collection and generation, data labeling, data preprocessing and augmentation, data quality evaluation, data debt, and data governance.[3] Workshops have continued at NeurIPS in subsequent years, alongside courses and competitions.

## How is it different from model-centric AI?

The contrast with model-centric work is easiest to see in a side-by-side view.

| Aspect | Model-centric approach | Data-centric approach |
| --- | --- | --- |
| What is held fixed | Dataset is fixed; model is varied | Model is fixed; dataset is varied |
| Primary lever | Architecture, hyperparameters, optimizer | Label quality, coverage, augmentation, filtering |
| Typical metric to improve | Test accuracy on a held-out benchmark | Real-world reliability, robustness, label noise |
| Typical failure mode | Diminishing returns on bigger models | Garbage in, garbage out; data cascades |
| Where it dominates | Frontier research, benchmark leaderboards | Production ML, regulated and high-stakes domains |
| Representative tools | PyTorch, JAX, TensorFlow, Hugging Face Transformers | cleanlab, Snorkel, Label Studio, DVC, Great Expectations |

In one example Ng presented from steel-sheet defect inspection, a data-centric pass that cleaned and standardized the labels lifted accuracy meaningfully, while a model-centric pass on the same baseline produced almost no improvement.[1] In frontier LLM research, both approaches now happen at once. Architecture work continues, but many of the publicly documented gains in 2024 and 2025 came from changes to pretraining data rather than from new attention variants.

## What are the key principles of data-centric AI?

DCAI is less a single technique than a stance about where ML effort should go. A few principles show up across most descriptions of the field.

Iterate on data, not just models. The training set is treated as a deliverable that goes through versions, reviews, and tests, the same way code does. If the model is underperforming on a particular slice, the first response is to look at the data for that slice rather than to retrain a larger network.

Treat data as code. Datasets should be versioned, validated against schemas, monitored in production, and rolled back when something breaks. Tools like DVC, lakeFS, and Great Expectations come from this principle.
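
As a rough illustration of what "validated against schemas" can mean in practice, the sketch below checks a batch of records against a small hand-written schema using only the Python standard library. The `Column` class, the `validate` function, and the example columns are all hypothetical names chosen for this example; they are not the API of Great Expectations or any other tool named here.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Column:
    """One schema entry: a column name, an expected type, and an optional value check."""
    name: str
    dtype: type
    check: Callable[[Any], bool] = lambda value: True


def validate(rows, schema):
    """Return a list of human-readable violations; an empty list means the batch passes."""
    problems = []
    for i, row in enumerate(rows):
        for col in schema:
            if col.name not in row:
                problems.append(f"row {i}: missing column {col.name!r}")
                continue
            value = row[col.name]
            if not isinstance(value, col.dtype):
                problems.append(
                    f"row {i}: {col.name!r} has type {type(value).__name__}, "
                    f"expected {col.dtype.__name__}"
                )
            elif not col.check(value):
                problems.append(f"row {i}: {col.name!r} failed check with value {value!r}")
    return problems


# Hypothetical schema for an image-classification dataset.
SCHEMA = [
    Column("image_id", str),
    Column("label", str, lambda v: v in {"cat", "dog"}),
    Column("width", int, lambda v: 0 < v <= 10_000),
]

batch = [
    {"image_id": "a1", "label": "cat", "width": 640},
    {"image_id": "a2", "label": "cta", "width": 640},  # typo in the label
    {"image_id": "a3", "label": "dog"},                # width is missing
]

problems = validate(batch, SCHEMA)
for line in problems:
    print(line)
```

A check like this could run in CI or at ingestion time, failing the pipeline when `problems` is non-empty, in the same way a failing unit test blocks a code change.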

Quality over quantity. A smaller, cleaner dataset often beats a larger, noisier one. Curtis Northcutt and colleagues showed in 2021 that at least 6 percent of the ImageNet validation set is mislabeled and that evaluating on corrected labels can flip the ranking of widely used architectures: on corrected labels, ResNet-18 outperforms ResNet-50 once the share of originally mislabeled test examples rises by about 6 percent.[6] The implication is that a chunk of architectural progress on noisy benchmarks may be measurement noise.
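
To give a sense of how label errors like these can be surfaced, here is a deliberately simplified sketch in the spirit of confident learning. It assumes out-of-sample predicted probabilities (for example from cross-validation) are already available and that every class appears at least once among the given labels. The function name and toy data are invented for this example; cleanlab's actual implementation does considerably more.

```python
def find_label_issues(labels, pred_probs):
    """Flag examples whose given label disagrees with a confidently predicted class.

    labels:     list of integer class ids (the possibly noisy given labels)
    pred_probs: list of per-class probability lists, ideally out-of-sample
    Returns a list of (index, given_label, suggested_label) tuples.
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: the mean predicted probability of class k
    # over the examples that were labeled k.
    thresholds = []
    for k in range(n_classes):
        probs_k = [p[k] for p, y in zip(pred_probs, labels) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k))

    issues = []
    for i, (p, y) in enumerate(zip(pred_probs, labels)):
        # Classes the model is "confident" about for this example.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class; leave it alone
        guess = max(confident, key=lambda k: p[k])
        if guess != y:
            issues.append((i, y, guess))
    return issues


# Toy data: example 2 is labeled class 0 but the model strongly predicts class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],
    [0.15, 0.85],
    [0.10, 0.90],
    [0.40, 0.60],
]

issues = find_label_issues(labels, pred_probs)
print(issues)
```

Flagged examples are candidates for human review rather than automatic relabeling, since the model's confidence can itself be wrong.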

Systematic error analysis. Rather than reporting a single accuracy number, DCAI workflows slice errors by class, by region, by collection date, and by any other axis that might reveal a data problem. The fix is then a targeted data change: relabeling a confusing class, collecting more examples from an underrepresented slice, removing a contaminated source.
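
A minimal sketch of slicing, assuming each evaluation record carries its label, the model's prediction, and some metadata field to slice on. The field names (`camera`, `label`, `prediction`) and the records are made up for illustration.

```python
from collections import defaultdict


def accuracy_by_slice(records, slice_key):
    """Compute accuracy separately for each distinct value of records[i][slice_key]."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        slice_value = record[slice_key]
        totals[slice_value] += 1
        correct[slice_value] += int(record["label"] == record["prediction"])
    return {s: correct[s] / totals[s] for s in totals}


# Hypothetical evaluation records: the model fails only on night-time images.
records = [
    {"camera": "day", "label": "cat", "prediction": "cat"},
    {"camera": "day", "label": "dog", "prediction": "dog"},
    {"camera": "day", "label": "cat", "prediction": "cat"},
    {"camera": "day", "label": "dog", "prediction": "dog"},
    {"camera": "day", "label": "cat", "prediction": "cat"},
    {"camera": "day", "label": "dog", "prediction": "dog"},
    {"camera": "night", "label": "cat", "prediction": "dog"},
    {"camera": "night", "label": "dog", "prediction": "cat"},
]

overall = sum(r["label"] == r["prediction"] for r in records) / len(records)
by_slice = accuracy_by_slice(records, "camera")
print(overall, by_slice)
```

In this toy case the aggregate accuracy looks acceptable while one slice fails completely, which is the kind of pattern that points to a data fix such as collecting or relabeling night-time examples.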

Documentation and provenance. Datasets should travel with their context. Datasheets for Datasets, proposed by Timnit Gebru and coauthors in 2018, are a structured template for recording why a dataset was created, how it was collected, what its known limitations are, and who maintains it.[7] Model cards play the same role for trained models.

## What are the techniques of data-centric AI?

The field covers a wide set of techniques, most of which existed before the term "data-centric AI" did. The novelty is in pulling them under one umbrella and treating them as parts of a single workflow.

| Technique | What it does | Example tools |
| --- | --- | --- |
| Label cleaning | Detects and fixes mislabeled training examples using model disagreement, [confident learning](/wiki/confident_learning_cl), or rater consensus | [cleanlab](/wiki/cleanlab), Cleanlab Studio |
| Inter-rater agreement | Measures whether independent annotators agree, then resolves disputes through adjudication or majority voting | Label Studio, Prodigy, custom QA pipelines |
| [Data labeling](/wiki/data_labeling) | Creates supervision signals through human annotation, often with active queue routing and review workflows | Label Studio, Scale AI, Surge AI, Encord |
| [Weak supervision](/wiki/weak_supervision) | Replaces hand labels with programmatic labeling functions whose noise is modeled and aggregated | [Snorkel](/wiki/snorkel), Snorkel Flow |
| [Active learning](/wiki/active_learning) | Selects the most informative unlabeled examples for human review, reducing labeling cost | modAL, baal, Prodigy |
| [Data augmentation](/wiki/data_augmentation) | Generates new training examples by transforming existing ones (cropping, paraphrasing, mixup, RandAugment) | Albumentations, AugLy, nlpaug |
| [Synthetic data](/wiki/synthetic_data) | Generates training data from simulators, GANs, diffusion models, or LLMs | Self-Instruct, Alpaca pipeline, Gretel, Mostly AI |
| Data slicing and error analysis | Breaks evaluation into subgroups to find systematic failures the aggregate metric hides | Snorkel SliceNets, Domino, custom dashboards |
| Data validation | Asserts schemas, distributions, ranges, and freshness on incoming data | Great Expectations, Deequ, TFDV |
| Data versioning | Tracks dataset revisions alongside code so experiments are reproducible | DVC, Git LFS, lakeFS, Pachyderm |
| Documentation | Records dataset purpose, collection method, and known biases | Datasheets for Datasets, [Data Cards](/wiki/datasheet), [Model Cards](/wiki/model_card) |
| [Data quality](/wiki/data_quality) monitoring | Watches input distributions and label quality after deployment | WhyLabs, Arize, Evidently |

No serious DCAI workflow uses every one of these. Teams pick what fits the failure modes they actually see.
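
As one concrete example from the table, inter-rater agreement between two annotators is commonly summarized with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain-Python version with made-up annotations; the function name is chosen for this example and does not come from any of the tools listed.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected if both raters labeled at random
    according to their own label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_e = sum((counts_a[k] / n) * (counts_b[k] / n) for k in counts_a)

    if p_e == 1.0:
        return 1.0  # both raters used a single identical label for every item
    return (p_o - p_e) / (1 - p_e)


# Made-up annotations: the raters agree on 6 of 8 items.
rater_a = ["cat", "cat", "dog", "dog", "cat", "dog", "cat", "dog"]
rater_b = ["cat", "cat", "dog", "cat", "cat", "dog", "dog", "dog"]

kappa = cohens_kappa(rater_a, rater_b)
print(kappa)
```

Here the raw agreement is 0.75, but because two balanced classes would agree half the time by chance, kappa comes out lower. Items where the raters disagree are natural candidates for adjudication.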

## What tools and platforms support data-centric AI?

A cluster of vendors has built businesses around the data-centric workflow. The list below is not exhaustive and changes quickly.

| Tool or platform | Focus | Origin |
| --- | --- | --- |
| cleanlab (open source and Studio) | Label error detection, model-derived data quality scores | Spun out of MIT, founded by Curtis Northcutt |
| Snorkel and Snorkel Flow | Programmatic labeling, weak supervision at scale | Spun out of Stanford DAWN |
| Label Studio | Open-source annotation UI for text, image, audio, video | HumanSignal |
| Scale AI | Managed labeling services and data engine for autonomous driving and LLMs | Founded 2016 |
| Surge AI | High-quality human annotation for LLM RLHF | Founded 2020 |
| Encord | Annotation, curation, and model evaluation for vision | Founded 2020 |
| Galileo | LLM and ML data quality, error analysis, hallucination detection | Founded 2021 |
| Lightly | Active learning and curation for vision datasets | Spin-out from ETH Zurich |
| DVC | Git-style versioning for datasets and models | Iterative.ai |
| Great Expectations | Data validation and pipeline testing | Open source |
| Landing AI | End-to-end visual inspection platform built around DCAI principles | Founded by Andrew Ng |

Many of these companies explicitly market themselves with the data-centric label, which has fueled the criticism that DCAI is partly a positioning exercise for tooling vendors.

## What papers and events shaped the field?

A short list of the work that gave the field its current shape.

| Year | Item | Why it matters |
| --- | --- | --- |
| 2018 | "Datasheets for Datasets," [Gebru](/wiki/datasheet) et al. | Proposed structured documentation for datasets, modeled on electronics component datasheets[7] |
| 2020 | "Shortcut Learning in Deep Neural Networks," Geirhos et al. | Catalogued how models exploit dataset artifacts rather than learning intended concepts[9] |
| 2021 | "Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks," Northcutt, Athalye, Mueller | Found an average of at least 3.3 percent label errors across ten major benchmarks; at least 6 percent on ImageNet[6] |
| 2021 | "On the Dangers of Stochastic Parrots," Bender, Gebru, McMillan-Major, Shmitchell | Foregrounded data quality, scale, and curation costs in large language models[8] |
| 2021 | "Everyone wants to do the model work, not the data work: Data Cascades in High-Stakes AI," Sambasivan et al. (CHI) | Interview study of 53 AI practitioners; 92 percent reported one or more data cascades, 45.3 percent reported two or more[5] |
| 2021 | NeurIPS Data-Centric AI Workshop | First major academic venue for the field; organized by Andrew Ng and collaborators[2] |
| 2021 | DeepLearning.AI Data-Centric AI Competition | Launched 17 June 2021 with a 4 September 2021 deadline; an inverted Kaggle-style contest in which the model architecture was fixed and participants competed by improving a 1,500-image Roman numeral dataset, with submissions capped at 10,000 examples[14] |
| 2023+ | MIT IAP "Introduction to Data-Centric AI" | Free public course taught by Anish Athalye, Curtis Northcutt, and Jonas Mueller; lectures and labs are openly available[11] |
| 2024 | FineWeb and FineWeb-Edu, Hugging Face | 15-trillion-token Common Crawl pretraining corpus produced by aggressive filtering and MinHash deduplication; FineWeb-Edu uses a quality classifier trained on LLM-generated annotations to extract a 1.3T-token educational subset[12] |

The Sambasivan et al. paper in particular sharpened the field's vocabulary. "Data cascades" names the way a small upstream data problem can quietly compound through a pipeline and only show up as a model failure months later, by which point it is hard to trace.[5]

## How does data-centric AI apply to large language models?

Large language model pretraining has turned out to be a strikingly data-centric activity, even though most of the public attention goes to model size and architecture. The major recipe shifts of the last few years have been about which tokens go into training, in what order, with what filters.

Web-scale pretraining corpora are built through long filtering pipelines: language identification, URL blocklists, perplexity filters, classifier-based quality scoring, line-level heuristics for boilerplate, and aggressive deduplication. FineWeb is a clear public example. Its team processed 96 Common Crawl dumps, ran MinHash deduplication, and ablated each filtering choice against downstream evaluation.[12] The educational subset FineWeb-Edu uses a quality classifier trained on Llama-3-70B-generated annotations and produces noticeably better MMLU and ARC scores at the same compute budget.[12] None of this is a model change.
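
To make the deduplication step more concrete, the following is a small sketch of MinHash-based near-duplicate detection using only the standard library. The shingle size, number of hash functions, similarity threshold, and the pairwise comparison loop are all illustrative assumptions; they are not FineWeb's settings, and production pipelines typically add locality-sensitive hashing so that not every pair has to be compared.

```python
import hashlib


def shingles(text, n=3):
    """Set of word n-grams; short texts yield a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.5):
    """Greedy pass: keep a document only if it is not too similar to any kept one."""
    kept = []  # list of (document, signature) pairs
    for doc in docs:
        sig = minhash_signature(shingles(doc))
        if all(estimate_jaccard(sig, other) < threshold for _, other in kept):
            kept.append((doc, sig))
    return [doc for doc, _ in kept]


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
doc_c = "data validation asserts schemas and ranges on incoming batches"

unique = deduplicate([doc_a, doc_b, doc_c])
print(unique)
```

In this toy run the second document differs from the first by a single word and is dropped, while the unrelated third document is kept.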

Synthetic data has become a standard ingredient. Self-Instruct, Alpaca, WizardLM, and similar pipelines use a strong model to generate instruction-following data for a smaller one. Constitutional AI extends this idea to safety: models critique and revise their own outputs against a written set of principles, producing alignment data without human raters in the loop for every example. LLM-as-judge setups now play a similar role for evaluation, though the failure modes of using a model to judge itself are still being worked out.

For post-training, the data-centric mindset shows up in careful curation of preference data for RLHF and DPO, deduplication and contamination checks against evaluation benchmarks, and red-teaming datasets that stress safety properties. The shift to test-time compute and reasoning models has not displaced any of this; if anything, it has put more weight on having clean, high-quality reasoning traces in the training mix.
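
One simple form a contamination check can take is exact n-gram overlap between training documents and benchmark items. The sketch below assumes whitespace tokenization and a small n suited to the toy strings; real checks vary in tokenization, n-gram length, and how much overlap counts as contamination, and the function names here are hypothetical.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in the text (empty if the text is shorter than n)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def find_contaminated(train_docs, benchmark_items, n=5):
    """Return indices of training documents sharing any n-gram with a benchmark item."""
    benchmark_ngrams = set()
    for item in benchmark_items:
        benchmark_ngrams |= ngrams(item, n)
    return [
        i for i, doc in enumerate(train_docs)
        if ngrams(doc, n) & benchmark_ngrams
    ]


# Toy data: the first training document quotes a benchmark question verbatim.
benchmark_items = ["what is the capital city of france and why"]
train_docs = [
    "trivia dump: what is the capital city of france and why does it matter",
    "a recipe for sourdough bread with a long overnight proof",
]

flagged = find_contaminated(train_docs, benchmark_items)
print(flagged)
```

Flagged documents can then be removed from the training mix or reviewed by hand, depending on how strict the evaluation needs to be.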

## How does data-centric AI relate to MLOps and data engineering?

DCAI overlaps heavily with MLOps. The two fields grew up at roughly the same time and share most of their tooling. The difference, to the extent there is one, is emphasis. MLOps tends to focus on the lifecycle of models, including deployment, monitoring, and rollback. DCAI focuses on the lifecycle of datasets and the workflows that produce them. In practice, a mature ML platform will have both: dataset versioning sits next to model versioning, data validation runs in the same CI as model unit tests, and dashboards track input drift alongside output quality.

Feature stores, data contracts, and modern data quality tooling sit on the boundary between data engineering and DCAI. The DCAI lens treats them as machine learning concerns rather than purely analytics concerns, since the consumer of the data is a model that will be retrained on it.

## What are the criticisms of data-centric AI?

The sharpest criticism of DCAI is that the model-centric versus data-centric framing is a false dichotomy. Every working ML team iterates on both. Pretending they are opposed makes for good talks and conference workshops, but the actual practice is mixed. A second criticism is that some of the marketing around DCAI is straightforwardly self-interested: vendors selling labeling platforms, data quality tools, or annotation services have an obvious incentive to promote a worldview in which their product category is where the leverage lives.

There is also a more substantive concern. Calling something "data-centric" can paper over hard questions about who collects the data, under what conditions, and with what consent. The Sambasivan paper is partly a study of how invisible labor in data work, much of it offshored, gets discounted in AI development.[5] A version of DCAI that focuses purely on tools and metrics without engaging with the labor and ethics of data collection misses the point of why the data work is hard in the first place.

None of this means the underlying ideas are wrong. Mislabels in benchmarks are real. Data cascades are real. Filtering recipes drive a large share of LLM quality. The field has produced concrete, replicable techniques. But the slogan is doing more work than the science sometimes warrants, and that is worth keeping in mind when reading vendor blog posts.

## ELI5

Imagine you are teaching a friend to tell cats from dogs by showing them a stack of flashcards. If some of the cards are blurry, or a few cats are accidentally labeled "dog," your friend will get confused no matter how smart they are. Data-centric AI is the idea that cleaning up the flashcards, making them clear, correctly labeled, and covering all the kinds of cats and dogs, usually helps more than just hoping for a smarter friend. As Andrew Ng puts it, "Data is food for AI": feed it good food and it grows strong, feed it junk and it gets sick.[1]

## See also

- [confident learning (CL)](/wiki/confident_learning_cl)
- [cleanlab](/wiki/cleanlab)
- [Snorkel](/wiki/snorkel)
- [data quality](/wiki/data_quality)
- [data labeling](/wiki/data_labeling)
- [data augmentation](/wiki/data_augmentation)
- [active learning](/wiki/active_learning)
- [weak supervision](/wiki/weak_supervision)
- [synthetic data](/wiki/synthetic_data)
- [model card](/wiki/model_card)
- [datasheet](/wiki/datasheet)
- [Andrew Ng](/wiki/andrew_ng)
- [Landing AI](/wiki/landingai)
- [machine learning](/wiki/machine_learning)

## References

1. Ng, A. (2021). "A Chat with Andrew on MLOps: From Model-centric to Data-centric AI." DeepLearning.AI talk, 24 March 2021.
2. NeurIPS Data-Centric AI Workshop (2021). https://neurips.cc/virtual/2021/workshop/21860
3. Data-Centric AI Resource Hub. https://www.datacentricai.org/
4. Strickland, E. (2022). "Andrew Ng: Unbiggen AI." IEEE Spectrum interview.
5. Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., and Aroyo, L. (2021). "Everyone wants to do the model work, not the data work: Data Cascades in High-Stakes AI." CHI 2021.
6. Northcutt, C. G., Athalye, A., and Mueller, J. (2021). "Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks." NeurIPS Datasets and Benchmarks Track. arXiv:2103.14749.
7. Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., and Crawford, K. (2018, updated 2021). "Datasheets for Datasets." Communications of the ACM. arXiv:1803.09010.
8. Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. (2021). "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" FAccT 2021.
9. Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., and Wichmann, F. A. (2020). "Shortcut Learning in Deep Neural Networks." Nature Machine Intelligence.
10. Ratner, A., Bach, S. H., Ehrenberg, H., Fries, J., Wu, S., and Ré, C. (2017). "Snorkel: Rapid Training Data Creation with Weak Supervision." VLDB. arXiv:1711.10160.
11. Athalye, A., Northcutt, C., and Mueller, J. (2024). "Introduction to Data-Centric AI." MIT IAP. https://dcai.csail.mit.edu/
12. Penedo, G., Kydlíček, H., Lozhkov, A., Mitchell, M., Raffel, C., von Werra, L., and Wolf, T. (2024). "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale." arXiv:2406.17557.
13. Zha, D., Bhat, Z. P., Lai, K.-H., Yang, F., Jiang, Z., Zhong, S., and Hu, X. (2023). "Data-centric Artificial Intelligence: A Survey." arXiv:2212.11854.
14. DeepLearning.AI and Landing AI (2021). "Data-Centric AI Competition." Launched 17 June 2021, submission deadline 4 September 2021. https://https-deeplearning-ai.github.io/data-centric-comp/

