Data-centric AI (DCAI) is an approach to building machine learning systems that puts systematic improvement of the training data at the center of the workflow, rather than treating data as a fixed input and iterating mainly on model architecture or training algorithms. The discipline was popularized by Andrew Ng in 2021 and has since grown into a research area with its own NeurIPS workshops, dedicated tooling vendors, and a freely available MIT course.
The core claim is straightforward: in many real applications, the model has long since stopped being the bottleneck. Two engineers using the same off-the-shelf architecture on the same dataset will get nearly identical results; the same engineers, handed a noisy dataset and asked to clean it, will produce wildly different models. So the leverage, the argument goes, is in the data work that almost nobody wants to do.
DCAI is sometimes presented as the opposite of model-centric AI, but in practice the two are complementary. Most production teams iterate on both. The reason DCAI gets a name at all is that the data side of that loop was, for years, treated as a one-time setup task rather than an engineering discipline with its own tools, metrics, and best practices.
The term entered wide use in March 2021, when Andrew Ng gave a talk titled "A Chat with Andrew on MLOps: From Model-centric to Data-centric AI." He argued that academic ML research had spent roughly a decade fixating on model architecture while holding benchmark datasets fixed, and that this had trained a generation of practitioners to reach for a new model whenever a project stalled. In industry, where datasets are often small, noisy, and domain-specific, that habit fails. Ng claimed that in the projects he had seen at Landing AI, perhaps 80 percent of the actual work was data work. That number is not a measured statistic so much as a rule of thumb, but it captured a real frustration that many practitioners shared.
Later that year, Ng and a group of collaborators organized the first NeurIPS Data-Centric AI Workshop. The workshop framed DCAI as "the discipline of systematically engineering the data needed to successfully build an AI system," and grouped its scope into six broad areas: data collection and generation, data labeling, data preprocessing and augmentation, data quality evaluation, data debt, and data governance. Workshops have continued at NeurIPS in subsequent years, alongside courses and competitions.
The contrast with model-centric work is easiest to see in a side-by-side view.
| Aspect | Model-centric approach | Data-centric approach |
|---|---|---|
| What is held fixed | Dataset is fixed; model is varied | Model is fixed; dataset is varied |
| Primary lever | Architecture, hyperparameters, optimizer | Label quality, coverage, augmentation, filtering |
| Typical metric to improve | Test accuracy on a held-out benchmark | Real-world reliability, robustness, label error rate |
| Typical failure mode | Diminishing returns on bigger models | Garbage in, garbage out; data cascades |
| Where it dominates | Frontier research, benchmark leaderboards | Production ML, regulated and high-stakes domains |
| Representative tools | PyTorch, JAX, TensorFlow, Hugging Face Transformers | cleanlab, Snorkel, Label Studio, DVC, Great Expectations |
In frontier LLM research, both approaches now happen at once. Architecture work continues, but the most measurable gains in 2024 and 2025 came from changes to pretraining data, not from new attention variants.
DCAI is less a single technique than a stance about where ML effort should go. A few principles show up across most descriptions of the field.
Iterate on data, not just models. The training set is treated as a deliverable that goes through versions, reviews, and tests, the same way code does. If the model is underperforming on a particular slice, the first response is to look at the data for that slice rather than to retrain a larger network.
Treat data as code. Datasets should be versioned, validated against schemas, monitored in production, and rolled back when something breaks. Tools like DVC, lakeFS, and Great Expectations come from this principle.
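As a minimal illustration of the idea, here is a hand-rolled batch validation check; the column names, dtypes, and bounds are invented for the example, and a real pipeline would express the same assertions in Great Expectations or a similar tool and run them in CI before any retraining job.

```python
# Hand-rolled sketch of data-as-code validation. Schema and bounds are
# illustrative assumptions, not a real production contract.
import pandas as pd

EXPECTED_DTYPES = {"user_id": "int64", "age": "int64", "label": "object"}

def validate_batch(df: pd.DataFrame) -> list[str]:
    errors = []
    for col, dtype in EXPECTED_DTYPES.items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            errors.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    if "age" in df.columns and not df["age"].between(0, 120).all():
        errors.append("age outside [0, 120]")
    if df.duplicated().any():
        errors.append("duplicate rows in batch")
    return errors

batch = pd.DataFrame({"user_id": [1, 2], "age": [34, 150], "label": ["a", "b"]})
if (problems := validate_batch(batch)):
    raise ValueError(f"reject batch: {problems}")  # fail fast, like a unit test
```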
Quality over quantity. A smaller, cleaner dataset often beats a larger, noisier one. Curtis Northcutt and colleagues showed in 2021 that about 6 percent of the ImageNet validation set is mislabeled and that correcting these labels can flip the ranking of widely used architectures: ResNet-18 outperforms ResNet-50 once enough mislabels are removed. The implication is that a chunk of architectural progress on noisy benchmarks may be measurement noise.
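Confident learning, the technique behind that study, is available in the open-source cleanlab package. The sketch below injects synthetic label noise into a toy dataset and asks cleanlab to surface the likely errors; the dataset, model, and noise level are illustrative choices, not a recommended recipe.

```python
# Sketch: surfacing likely label errors with cleanlab's confident learning.
# We corrupt 100 labels in sklearn's digits dataset so there are real
# errors to find.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

X, y = load_digits(return_X_y=True)
labels = y.copy()
rng = np.random.default_rng(0)
flip = rng.choice(len(labels), size=100, replace=False)
labels[flip] = rng.integers(0, 10, size=100)   # inject synthetic label noise

# Out-of-sample predicted probabilities via cross-validation, so no example
# is scored by a model that trained on it.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=2000), X, labels,
    cv=5, method="predict_proba",
)

# Indices whose given label conflicts with the model's confident prediction
# -- candidates for human review, not automatic relabeling.
issues = find_label_issues(
    labels=labels, pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(f"{len(issues)} suspected issues; "
      f"{np.isin(issues, flip).mean():.0%} of them are injected flips")
```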
Systematic error analysis. Rather than reporting a single accuracy number, DCAI workflows slice errors by class, by region, by collection date, and by any other axis that might reveal a data problem. The fix is then a targeted data change: relabeling a confusing class, collecting more examples from an underrepresented slice, removing a contaminated source.
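A minimal version of this needs nothing more than a predictions table and a groupby; the slice key and toy data below are invented for the example.

```python
# Sketch of slice-level error analysis with pandas. Real pipelines would
# join model outputs with metadata such as region or collection date.
import pandas as pd

df = pd.DataFrame({
    "true_label": ["cat", "dog", "dog", "cat", "dog", "cat"],
    "pred_label": ["cat", "dog", "cat", "cat", "cat", "dog"],
    "region":     ["us",  "us",  "eu",  "eu",  "eu",  "eu"],
})
df["correct"] = df["true_label"] == df["pred_label"]

# One aggregate accuracy hides the failure; per-slice accuracy exposes it.
per_slice = (
    df.groupby("region")["correct"]
      .agg(accuracy="mean", n="size")
      .sort_values("accuracy")
)
print(per_slice)  # the low-accuracy slice is where to relabel or collect data
```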
Documentation and provenance. Datasets should travel with their context. Datasheets for Datasets, proposed by Timnit Gebru and coauthors in 2018, are a structured template for recording why a dataset was created, how it was collected, what its known limitations are, and who maintains it. Model cards play the same role for trained models.
The field covers a wide set of techniques, most of which existed before the term "data-centric AI" did. The novelty is in pulling them under one umbrella and treating them as parts of a single workflow.
| Technique | What it does | Example tools |
|---|---|---|
| Label cleaning | Detects and fixes mislabeled training examples using model disagreement, confident learning, or rater consensus | cleanlab, Cleanlab Studio |
| Inter-rater agreement | Measures whether independent annotators agree, then resolves disputes through adjudication or majority voting | Label Studio, Prodigy, custom QA pipelines |
| Data labeling | Creates supervision signals through human annotation, often with active queue routing and review workflows | Label Studio, Scale AI, Surge AI, Encord |
| Weak supervision | Replaces hand labels with programmatic labeling functions whose noise is modeled and aggregated | Snorkel, Snorkel Flow |
| Active learning | Selects the most informative unlabeled examples for human review, reducing labeling cost | modAL, baal, Prodigy |
| Data augmentation | Generates new training examples by transforming existing ones (cropping, paraphrasing, mixup, RandAugment) | Albumentations, AugLy, nlpaug |
| Synthetic data | Generates training data from simulators, GANs, diffusion models, or LLMs | Self-Instruct, Alpaca pipeline, Gretel, Mostly AI |
| Data slicing and error analysis | Breaks evaluation into subgroups to find systematic failures the aggregate metric hides | Snorkel slicing functions, Domino, custom dashboards |
| Data validation | Asserts schemas, distributions, ranges, and freshness on incoming data | Great Expectations, Deequ, TFDV |
| Data versioning | Tracks dataset revisions alongside code so experiments are reproducible | DVC, Git LFS, lakeFS, Pachyderm |
| Documentation | Records dataset purpose, collection method, and known biases | Datasheets for Datasets, Data Cards, Model Cards |
| Data quality monitoring | Watches input distributions and label quality after deployment | WhyLabs, Arize, Evidently |
No serious DCAI workflow uses every one of these. Teams pick what fits the failure modes they actually see.
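To make one of those rows concrete, here is a hand-rolled sketch of the core loop behind active learning: train on what is labeled, score the pool, and send the least-confident examples to annotators. Libraries such as modAL package this loop; the dataset, seed size, batch size, and round count below are all illustrative.

```python
# Pool-based active learning with uncertainty (least-confident) sampling.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
labeled = list(range(20))              # tiny initial labeled seed
pool = list(range(20, 500))            # unlabeled pool

for round_ in range(5):
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    probs = clf.predict_proba(X[pool])
    uncertainty = 1 - probs.max(axis=1)            # least confident first
    pick = [pool[i] for i in np.argsort(-uncertainty)[:10]]
    labeled += pick                    # in reality: route these to annotators
    pool = [i for i in pool if i not in pick]
    print(f"round {round_}: {len(labeled)} labeled, "
          f"pool accuracy {clf.score(X[pool], y[pool]):.3f}")
```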
A cluster of vendors has built businesses around the data-centric workflow. The list below is not exhaustive and changes quickly.
| Tool or platform | Focus | Origin |
|---|---|---|
| cleanlab (open source and Studio) | Label error detection, model-derived data quality scores | Spun out of MIT, founded by Curtis Northcutt |
| Snorkel and Snorkel Flow | Programmatic labeling, weak supervision at scale | Spun out of Stanford DAWN |
| Label Studio | Open-source annotation UI for text, image, audio, video | HumanSignal |
| Scale AI | Managed labeling services and data engine for autonomous driving and LLMs | Founded 2016 |
| Surge AI | High-quality human annotation for LLM RLHF | Founded 2020 |
| Encord | Annotation, curation, and model evaluation for vision | Founded 2020 |
| Galileo | LLM and ML data quality, error analysis, hallucination detection | Founded 2021 |
| Lightly | Active learning and curation for vision datasets | Spin-out from ETH Zurich |
| DVC | Git-style versioning for datasets and models | Iterative.ai |
| Great Expectations | Data validation and pipeline testing | Open source |
| Landing AI | End-to-end visual inspection platform built around DCAI principles | Founded by Andrew Ng |
Many of these companies explicitly market themselves with the data-centric label, which has fueled the criticism that DCAI is partly a positioning exercise for tooling vendors.
A short list of the work that gave the field its current shape.
| Year | Item | Why it matters |
|---|---|---|
| 2018 | "Datasheets for Datasets," Gebru et al. | Proposed structured documentation for datasets, modeled on electronics component datasheets |
| 2020 | "Shortcut Learning in Deep Neural Networks," Geirhos et al. | Catalogued how models exploit dataset artifacts rather than learning intended concepts |
| 2021 | "Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks," Northcutt, Athalye, Mueller | Found an average of 3.3 percent label errors across ten major benchmarks; about 6 percent on ImageNet |
| 2021 | "On the Dangers of Stochastic Parrots," Bender, Gebru, McMillan-Major, Shmitchell | Foregrounded data quality, scale, and curation costs in large language models |
| 2021 | "Everyone wants to do the model work, not the data work: Data Cascades in High-Stakes AI," Sambasivan et al. (CHI) | Interview study of 53 AI practitioners; 92 percent reported one or more data cascades, 45.3 percent reported two or more |
| 2021 | NeurIPS Data-Centric AI Workshop | First major academic venue for the field; organized by Andrew Ng and collaborators |
| 2021 | DeepLearning.AI Data-Centric AI Competition | Inverted Kaggle: model architecture was fixed, participants competed on improving a 1,500-image Roman numeral dataset capped at 10,000 examples |
| 2023+ | MIT IAP "Introduction to Data-Centric AI" | Free public course taught by Anish Athalye, Curtis Northcutt, and Jonas Mueller; lectures and labs are openly available |
| 2024 | FineWeb and FineWeb-Edu, Hugging Face | 15 trillion token Common Crawl pretraining corpus produced by aggressive filtering and MinHash deduplication; FineWeb-Edu uses an LLM-trained quality classifier to extract a 1.3T-token educational subset |
The Sambasivan et al. paper in particular sharpened the field's vocabulary. "Data cascades" names the way a small upstream data problem can quietly compound through a pipeline and only show up as a model failure months later, by which point it is hard to trace.
Large language model pretraining has turned out to be a strikingly data-centric activity, even though most of the public attention goes to model size and architecture. The major recipe shifts of the last few years have been about which tokens go into training, in what order, with what filters.
Web-scale pretraining corpora are built through long filtering pipelines: language identification, URL blocklists, perplexity filters, classifier-based quality scoring, line-level heuristics for boilerplate, and aggressive deduplication. FineWeb is a clear public example. Its team processed 96 Common Crawl dumps, ran MinHash deduplication, and ablated each filtering choice against downstream evaluation. The educational subset FineWeb-Edu uses a quality classifier trained on Llama-3-70B-generated annotations and produces noticeably better MMLU and ARC scores at the same compute budget. None of this is a model change.
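The deduplication step can be sketched compactly. The example below uses the open-source datasketch library to drop near-duplicate documents via MinHash signatures and locality-sensitive hashing; the shingle size and similarity threshold are illustrative, not FineWeb's actual settings.

```python
# Sketch of MinHash + LSH near-duplicate filtering, the rough shape of the
# dedup stage in web-scale pipelines. Parameters are illustrative.
from datasketch import MinHash, MinHashLSH

def doc_minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    tokens = text.lower().split()
    for i in range(len(tokens) - 2):               # 3-word shingles
        m.update(" ".join(tokens[i:i + 3]).encode("utf-8"))
    return m

docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick brown fox jumps over the lazy dog near the river shore",
    "c": "completely unrelated text about pretraining data quality and filters",
}

lsh = MinHashLSH(threshold=0.5, num_perm=128)
kept = {}
for key, text in docs.items():
    m = doc_minhash(text)
    if lsh.query(m):          # near-duplicate of something already kept
        continue
    lsh.insert(key, m)
    kept[key] = text

print(sorted(kept))           # typically ['a', 'c']: 'b' dropped as a near-dup
```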
Synthetic data has become a standard ingredient. Self-Instruct, Alpaca, WizardLM, and similar pipelines use a strong model to generate instruction-following data for a smaller one. Constitutional AI extends this idea to safety: models critique and revise their own outputs against a written set of principles, producing alignment data without human raters in the loop for every example. LLM-as-judge setups now play a similar role for evaluation, though the failure modes of using a model to judge itself are still being worked out.
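The shape of a Self-Instruct-style loop is easy to sketch. In the version below, `llm_complete` is a hypothetical stand-in for a real LLM API call (it returns canned text so the sketch runs end to end), and the unigram-overlap filter is a crude proxy for the ROUGE-based diversity filter the original pipeline uses.

```python
# Sketch of Self-Instruct-style bootstrapping of instruction data.
import random

def llm_complete(prompt: str) -> str:
    # Hypothetical stand-in: replace with a call to your model provider.
    return random.choice([
        "Explain the difference between a list and a tuple in Python.",
        "Write a haiku about deduplicated web text.",
        "Convert this CSV row into a JSON object.",
    ])

def too_similar(candidate: str, pool: list[str], threshold: float = 0.7) -> bool:
    # Crude unigram-Jaccard filter to keep the generated pool diverse.
    cand = set(candidate.lower().split())
    return any(
        len(cand & set(t.lower().split())) / len(cand | set(t.lower().split()))
        > threshold
        for t in pool
    )

seed_tasks = [
    "Summarize the following paragraph in one sentence.",
    "Translate this sentence into French.",
    "Write a regex that matches ISO-8601 dates.",
]

task_pool = list(seed_tasks)
for _ in range(50):
    demos = "\n".join(random.sample(task_pool, k=min(3, len(task_pool))))
    new_task = llm_complete(
        f"Here are some task instructions:\n{demos}\nWrite one new, different task:"
    )
    if not too_similar(new_task, task_pool):
        task_pool.append(new_task)   # next steps: generate responses, filter again

print(f"{len(task_pool)} tasks after bootstrapping")
```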
For post-training, the data-centric mindset shows up in careful curation of preference data for RLHF and DPO, deduplication and contamination checks against evaluation benchmarks, and red-teaming datasets that stress safety properties. The shift to test-time compute and reasoning models has not displaced any of this; if anything, it has put more weight on having clean, high-quality reasoning traces in the training mix.
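A basic contamination check is just an n-gram intersection between the training corpus and the benchmark. The sketch below uses an 8-token window and toy documents; real recipes vary the window size and normalization, and run at corpus scale.

```python
# Sketch of an n-gram contamination check against an eval benchmark.
# Window size and documents are illustrative.
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

eval_set = [
    "what is the capital of france paris is the capital of france",
]
train_docs = [
    "trivia dump: what is the capital of france paris is the capital of france",
    "an unrelated training document about filtering and data quality",
]

# Union of benchmark n-grams; any training doc that shares one is suspect
# and should be dropped or manually reviewed before training.
eval_grams = set().union(*(ngrams(q) for q in eval_set))
flagged = [d for d in train_docs if ngrams(d) & eval_grams]
print(f"{len(flagged)} training doc(s) flagged for possible contamination")
```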
DCAI overlaps heavily with MLOps. The two fields grew up at roughly the same time and share most of their tooling. The difference, to the extent there is one, is emphasis. MLOps tends to focus on the lifecycle of models, including deployment, monitoring, and rollback. DCAI focuses on the lifecycle of datasets and the workflows that produce them. In practice, a mature ML platform will have both: dataset versioning sits next to model versioning, data validation runs in the same CI as model unit tests, and dashboards track input drift alongside output quality.
Feature stores, data contracts, and modern data quality tooling sit on the boundary between data engineering and DCAI. The DCAI lens treats them as machine learning concerns rather than purely analytics concerns, since the consumer of the data is a model that will be retrained on it.
The sharpest criticism of DCAI is that the model-centric versus data-centric framing is a false dichotomy. Every working ML team iterates on both. Pretending they are opposed makes for good talks and conference workshops, but the actual practice is mixed. A second criticism is that some of the marketing around DCAI is straightforwardly self-interested: vendors selling labeling platforms, data quality tools, or annotation services have an obvious incentive to promote a worldview in which their product category is where the leverage lives.
There is also a more substantive concern. Calling something "data-centric" can paper over hard questions about who collects the data, under what conditions, and with what consent. The Sambasivan paper is partly a study of how invisible labor in data work, much of it offshored, gets discounted in AI development. A version of DCAI that focuses purely on tools and metrics without engaging with the labor and ethics of data collection misses the point of why the data work is hard in the first place.
None of this means the underlying ideas are wrong. Mislabels in benchmarks are real. Data cascades are real. Filtering recipes drive a large share of LLM quality. The field has produced concrete, replicable techniques. But the slogan is doing more work than the science sometimes warrants, and that is worth keeping in mind when reading vendor blog posts.