DatologyAI
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,117 words
DatologyAI is a Redwood City, California artificial-intelligence startup that builds automated tools for curating, deduplicating, and composing the training datasets used by foundation models. The company was founded in 2023 by Ari Morcos (CEO), Matthew Leavitt (Chief Science Officer) and Bogdan Gaza (Chief Technology Officer). It productizes a line of research, beginning with the NeurIPS 2022 paper Beyond Neural Scaling Laws: Beating Power Law Scaling via Data Pruning, which argues that training-data selection can be a stronger lever on model quality than additional parameters or compute.[^1][^2][^3] DatologyAI announced an $11.65 million seed round led by Amplify Partners in February 2024 (the round had reportedly closed in late 2023), with angel checks from Jeff Dean, Yann LeCun, Geoffrey Hinton and others, followed by a $46 million Series A led by Felicis Ventures in May 2024.[^4][^5][^6] Its platform is deployed on customer infrastructure to filter, mix, augment and order pre-training corpora across text, image, video, audio, tabular and specialized modalities, and the company has published a sequence of technical reports comparing models trained on its curated data to those trained on popular open corpora such as RedPajama, RefinedWeb, FineWeb and DCLM.[^3][^7][^8]
| Field | Value |
|---|---|
| Type | Private company |
| Founded | 2023 |
| Headquarters | Redwood City, California, United States |
| Founders | Ari Morcos, Matthew Leavitt, Bogdan Gaza |
| CEO | Ari Morcos |
| Industry | Data curation for AI; machine-learning infrastructure |
| Total funding | ~$57.6 million (as of May 2024) |
| Lead seed investor | Amplify Partners |
| Lead Series A investor | Felicis Ventures |
| Product | Automated data-curation platform for foundation-model training |
| Modalities supported | Text, image, video, audio, tabular, genomic, geospatial |
The intellectual core of DatologyAI predates the company. In June 2022 Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli and Ari Morcos posted the paper Beyond Neural Scaling Laws: Beating Power Law Scaling via Data Pruning to arXiv; the work received an Outstanding Paper Award at NeurIPS 2022. The authors argued that the conventional empirical scaling law, in which test error falls as a power of training-set size, is not a hard limit: with a sufficiently informative pruning metric, power-law scaling in dataset size can be replaced by exponential scaling. They benchmarked ten data-pruning metrics on ImageNet and introduced a self-supervised pruning score that approached the performance of supervised oracle metrics, training ResNet models on as little as 75% of the dataset without accuracy loss.[^1][^9]
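The contrast between the two regimes can be written schematically. The symbols below are illustrative assumptions rather than the paper's notation: $\varepsilon$ is test error, $P$ is the number of training examples kept, and $\nu$ and $c$ are positive constants.

```latex
% Schematic forms only; symbols and constants are illustrative, not taken from the paper.
\varepsilon_{\text{power}}(P) \propto P^{-\nu}, \quad \nu > 0
\qquad \text{versus} \qquad
\varepsilon_{\text{exp}}(P) \propto e^{-cP}, \quad c > 0
```

Under the power-law form, halving the error requires multiplying the dataset size by $2^{1/\nu}$; under the exponential form it requires only adding a fixed $\ln 2 / c$ examples, which is why a good pruning metric would matter so much if the exponential regime can be reached in practice.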
That paper, along with related work on data selection and pre-training data quality such as Morcos and colleagues' NeurIPS 2023 study D4: Improving LLM Pretraining via Document De-Duplication and Diversification, framed an emerging "data is the bottleneck" thesis. As the supply of additional unique web text plateaued and training budgets for frontier models grew into the hundreds of millions of dollars, a number of groups argued that the highest-leverage interventions in pretraining would come from selecting which examples a model sees, not just feeding it more.[^9][^10]
Morcos, who holds a PhD in neuroscience from Harvard, was a postdoctoral researcher at Google DeepMind and then spent roughly five years at Meta AI (FAIR) before leaving in 2023 to start DatologyAI. He was joined by Matthew Leavitt, a McGill neuroscience PhD who had been an AI resident at FAIR under Morcos and subsequently led data research at MosaicML, and by Bogdan Gaza, a computer scientist whose prior roles included engineering work at Amazon, a senior engineering management position covering search and language infrastructure at Twitter, and a co-founding role at fraud-detection startup Moonsense.[^2][^11][^12] Morcos has described the founding rationale in interviews as bringing the techniques applied internally at frontier labs to the rest of the model-training market, on the premise that careful curation is the only way to keep extracting capability gains as raw token supply tightens.[^3][^7]
DatologyAI announced an $11.65 million seed round on February 22, 2024. The lead was Amplify Partners, with participation from Radical Ventures, Conviction Capital, Outset Capital and Quiet Capital. The round attracted a list of angel investors unusually weighted toward prominent AI researchers and founders: Jeff Dean, Yann LeCun, Geoffrey Hinton, Cohere co-founders Aidan Gomez and Ivan Zhang, Quora founder Adam D'Angelo, and former Intel AI vice-president Naveen Rao.[^4][^5] Outset Capital, one of the participating funds, described the round as having been closed in late 2023 and held quietly while the team built out infrastructure before going public.[^13]
Just under three months later, on May 7, 2024, DatologyAI announced a $46 million Series A. The round was led by Astasia Myers and Viv Faga of Felicis Ventures, with participation from existing seed investors Radical Ventures and Amplify Partners and new investors including Elad Gil, Microsoft's M12 venture arm and the Amazon Alexa Fund.[^6][^14] The Series A took the company's cumulative announced funding to roughly $57.6 million, with the proceeds earmarked, according to Morcos, for hiring more researchers and engineers and expanding compute for internal curation experiments.[^15] Astasia Myers subsequently took a board-observer role.[^16]
Felicis publicly framed its thesis in terms compatible with Morcos's: training data is the dominant lever for frontier-model quality, the supply of high-quality public web text is finite (with some forecasts suggesting it could be exhausted as a usable resource within a small number of years), and automated curation is therefore a category of infrastructure rather than a tooling line-item.[^16]
| Round | Date | Amount | Lead(s) | Other participants |
|---|---|---|---|---|
| Seed | Feb 2024 (closed late 2023) | $11.65 M | Amplify Partners | Radical Ventures, Conviction Capital, Outset Capital, Quiet Capital; angels Jeff Dean, Yann LeCun, Geoffrey Hinton, Aidan Gomez, Ivan Zhang, Adam D'Angelo, Naveen Rao[^4][^5][^13] |
| Series A | May 7, 2024 | $46 M | Felicis Ventures (Astasia Myers, Viv Faga) | Radical Ventures, Amplify Partners, Elad Gil, M12, Amazon Alexa Fund[^6][^14][^16] |
DatologyAI sells its platform as an automated data-curation pipeline rather than a labeling service. The system is designed to ingest petabyte-scale raw corpora and emit a training-ready dataset that has been deduplicated, filtered, augmented and ordered for a given downstream training run. According to the company's own description and external reporting, the pipeline combines several algorithmic families, among them deduplication, filtering, data mixing, augmentation with synthetic data, and ordering of examples.[^3][^7][^17]
The product is deployed inside the customer's own environment (on-premises or in a customer-controlled cloud account) so that the underlying training data does not leave the customer's infrastructure. The platform is modality-agnostic in design: DatologyAI's marketing and technical writeups describe instances for text, image-text pairs, video, audio, tabular data and "exotic" modalities such as genomic and geospatial data.[^3][^7][^17]
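To make the idea of a staged curation pipeline concrete, the following is a minimal sketch of two such stages, exact deduplication followed by a simple filter. It is a toy illustration and not DatologyAI's method: the function names, the hash-based duplicate check and the `min_words` heuristic are all assumptions introduced here, and production systems typically rely on far more sophisticated, model-based signals.

```python
import hashlib
import re
from typing import Iterable, Iterator, List, Set


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_dedup(docs: Iterable[str]) -> Iterator[str]:
    """Yield each document the first time its normalized content is seen."""
    seen: Set[str] = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def quality_filter(docs: Iterable[str], min_words: int = 5) -> Iterator[str]:
    """Hypothetical heuristic: keep documents with at least `min_words` words."""
    for doc in docs:
        if len(doc.split()) >= min_words:
            yield doc


def curate(docs: Iterable[str]) -> List[str]:
    """Chain the stages: deduplicate first, then filter."""
    return list(quality_filter(exact_dedup(docs)))


# Illustrative inputs only.
raw = [
    "Data curation selects which examples a model sees.",
    "data curation   selects which examples a model sees.",
    "click here",
    "Deduplication removes repeated documents from a corpus.",
]
curated = curate(raw)
print(curated)
```

Because each stage is a generator, documents stream through the pipeline one at a time, which is the usual shape for corpora too large to hold in memory; further stages such as mixing or ordering would be chained in the same way.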
DatologyAI does not publish a public list of all customers but has issued case studies for two: Arcee AI, a builder of frontier open foundation models, and Thomson Reuters, which used the platform for legal-domain models.[^17][^18]
DatologyAI publishes a sequence of technical blog posts that read like applied research papers, comparing models trained on internally curated corpora to models trained on widely used open datasets at matched parameter counts and token budgets. The reports cover both pre-training and downstream evaluation; they are the primary public source of evidence for the platform's claims.
In its initial text deep-dive, DatologyAI applied its pipeline to the RedPajama-v1 (RPJv1) corpus, producing a curated variant the company calls DAIT (DatologyAI Text). The team trained transformer language models at 1.3 B and 2.7 B parameters for token budgets of 20 B-180 B tokens and evaluated on 15 standard tasks including MMLU, HellaSwag, LAMBADA, BoolQ and the ARC subsets.[^8]
At 2.7 B parameters and 180 B training tokens, the company reported:[^8]
These numbers were positioned as a "data-only" intervention: architecture, optimizer and tokenization were held constant across runs, and only the training corpus changed.[^8]
Parallel work targets multimodal training. In an image-text post the company reported producing a CLIP-style model on a retrieval-optimized curation of the DataComp pool: a CLIP-ViT-S/32 (~63 M parameters) trained on 512 M curated samples matched or beat a CLIP-ViT-B/32 (~151 M parameters) trained with the same compute on the raw baseline, with a 13-percentage-point absolute gain on the retrieval benchmark suite.[^19]
A follow-up, CLIP Gets a Data Upgrade, compared DatologyAI-curated training data to two industry baselines, SigLIP 2 (trained on 40 B image-text pairs) and MetaCLIP (12.8 B pairs), across 24 zero-shot classification benchmarks anchored on ImageNet-1k and ImageNet-v2 and three retrieval benchmarks anchored on MSCOCO and Flickr. At ViT-B/32 the report claimed ImageNet-1k zero-shot accuracy of 76.91% versus 74.0% for SigLIP 2, while training on roughly an order of magnitude less data, and reported larger gains over MetaCLIP.[^20] As with the text work, the framing is that careful data curation alone, without changes to the contrastive objective or the architecture, can match or surpass state-of-the-art results.[^20]
In August 2025 DatologyAI published the report BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining on arXiv (2508.10975), authored by Pratyush Maini and 28 collaborators including Matthew Leavitt. BeyondWeb is a synthetic data generation framework that rephrases and augments real web documents to produce additional pre-training data, trained and evaluated at trillion-token scale.[^21][^22]
The report describes training 1 B, 3 B and 8 B parameter language models on BeyondWeb tokens and compares against Cosmopedia and the synthetic subset of Nemotron-CC (Nemotron-Synth). Across a fourteen-benchmark suite, BeyondWeb is reported to outperform Cosmopedia by up to 5.1 percentage points and Nemotron-Synth by up to 2.6 percentage points on average, with up to 7.7x faster training relative to open web data and 2.7x faster than Nemotron-Synth at matched accuracy. A particularly striking claim is that a 3 B model trained for 180 B tokens on BeyondWeb data outperforms an 8 B model trained on the same token budget over Cosmopedia.[^21][^22]
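Claims of the form "N times faster at matched accuracy" are commonly read as a ratio of training tokens: how many tokens the baseline needs to reach a target score, divided by how many the curated data needs. The sketch below illustrates that reading under stated assumptions. The linear interpolation between checkpoints, the function names and the learning-curve numbers are all hypothetical and are not taken from the BeyondWeb report, which may compute its speedups differently.

```python
from typing import List, Optional, Tuple

# A learning curve as (training tokens in billions, accuracy) checkpoints.
Curve = List[Tuple[float, float]]


def tokens_to_reach(curve: Curve, target: float) -> Optional[float]:
    """Linearly interpolate the token count at which `curve` first reaches `target`."""
    prev_tokens, prev_acc = curve[0]
    if prev_acc >= target:
        return prev_tokens
    for tokens, acc in curve[1:]:
        if acc >= target:
            frac = (target - prev_acc) / (acc - prev_acc)
            return prev_tokens + frac * (tokens - prev_tokens)
        prev_tokens, prev_acc = tokens, acc
    return None  # target never reached within the measured range


def speedup(baseline: Curve, curated: Curve, target: float) -> Optional[float]:
    """Ratio of baseline tokens to curated tokens needed to hit `target`."""
    b = tokens_to_reach(baseline, target)
    c = tokens_to_reach(curated, target)
    if b is None or c is None:
        return None
    return b / c


# Hypothetical numbers for illustration only; not from any published report.
baseline_curve = [(20.0, 0.40), (60.0, 0.45), (180.0, 0.50)]
curated_curve = [(20.0, 0.44), (60.0, 0.50), (180.0, 0.55)]

result = speedup(baseline_curve, curated_curve, target=0.50)
print(result)
```

With these made-up curves the baseline needs 180 B tokens and the curated run 60 B tokens to reach 0.50, so the sketch reports a speedup of 3.0. The result depends heavily on the chosen target, which is one reason such figures are usually quoted as "up to" values.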
BeyondWeb is positioned by the company as a building block of its full curation pipeline rather than a standalone product: the synthetic data sits alongside the company's filtering, deduplication and mixing logic.[^22]
Subsequent posts have extended the same methodology to additional questions:[^7]
These are blog-format reports rather than peer-reviewed papers, but they include controlled comparisons, full benchmark tables and ablations of pipeline components.[^7]
DatologyAI's first publicly disclosed customer was Arcee AI, which used the platform to curate the pre-training corpus for its Trinity-Large-Thinking model, a 400 B-parameter mixture-of-experts language model with 13 B active parameters per token. According to DatologyAI's case study, Trinity-Large-Thinking was trained on 20 trillion tokens spread across three pre-training phases, including more than 8 trillion synthetic tokens generated on clusters of up to 2,048 NVIDIA H100 GPUs and 12 trillion curated web tokens. The case study cites top-tier agentic benchmark scores for the resulting model (#1 on Tau2-Airline and #2 on PinchBench and AIME25 among the open models evaluated at the time of writing) and notes that the model served more than 3 trillion tokens in its first two months on the OpenRouter inference marketplace.[^18]
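Taking the case study's figures at face value, the model's sparsity and token budget work out as follows. This is simple arithmetic on the quoted numbers, not an additional claim from the source; note that the synthetic share is stated only as "more than 8 trillion", so the split is approximate.

```latex
% Figures as quoted in the case study; B = billion parameters, T = trillion tokens.
\frac{13\ \text{B active}}{400\ \text{B total}} = 0.0325 \approx 3.3\%,
\qquad
\underbrace{8\ \text{T}}_{\text{synthetic}} + \underbrace{12\ \text{T}}_{\text{curated web}} = 20\ \text{T tokens}
```

In other words, roughly one parameter in thirty is active for any given token, and about 40% of the pre-training tokens were synthetic.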
A second case study describes a Thomson Reuters partnership for domain-specific legal models, in which the customer reported "clear, measurable improvements across both public and proprietary legal evaluations" for legal reasoning, retrieval and downstream-task accuracy, achieved under a constrained data budget. DatologyAI has also disclosed a broader product-level collaboration with Thomson Reuters.[^7][^17]
DatologyAI is one of the most visible companies operationalizing the idea that the marginal return on additional parameters or additional raw tokens for foundation models is falling, and that systematic interventions on the training corpus offer a higher-leverage path to capability gains. This thesis was popularized in academic work by, among others, the Sorscher et al. paper on data pruning, the Hoffmann et al. Chinchilla paper on compute-optimal scaling, and the Beyond the Imitation Game benchmark (BIG-bench) and similar benchmarking efforts; in industry it has been argued by groups such as MosaicML (which was acquired by Databricks in 2023), the FineWeb and DCLM communities, and individual frontier labs.[^1][^7][^9]
Within this thesis, DatologyAI's commercial role is closest to providing an "outsourced data team" for organizations that do not have the in-house research capacity to run their own large-scale curation experiments but still want to train custom models on domain-specific corpora. Felicis's investment write-up explicitly framed it that way: data curation is "the missing piece of every AI strategy" because most enterprises will not build internal teams comparable to those at OpenAI, Anthropic, Google DeepMind or Meta AI but will still want to train or fine-tune custom models in the coming years.[^16]
The data-for-AI space contains several adjacent categories whose offerings overlap with parts of DatologyAI's pipeline without fully covering it. The table below summarizes the comparison.
| Company | Primary offering | Overlap with DatologyAI |
|---|---|---|
| Scale AI | Human-in-the-loop labeling, RLHF, evaluations | Both serve foundation-model training, but Scale's core competency is labeled-data production and evaluations rather than automated curation of pretraining-scale unlabeled corpora.[^23] |
| Snorkel AI | Programmatic data labeling and weak supervision via Snorkel Flow | Overlaps in "data-centric AI" framing; Snorkel emphasizes weak supervision for supervised tasks, where DatologyAI emphasizes unsupervised pretraining curation.[^23][^24] |
| Cleanlab | Automated detection and correction of label errors in tabular and labeled datasets | Cleanlab focuses on labeled supervised datasets and label noise; DatologyAI operates earlier in the pipeline, on pretraining-scale unlabeled corpora.[^25] |
| Galileo | Observability and evaluation tools for ML data and models | Galileo is more diagnostic than curative; DatologyAI overlaps in the "is your data good" question but answers it with an automated rewrite of the dataset.[^23] |
DatologyAI is most often grouped with Scale AI, Snorkel AI, Cleanlab and similar companies as part of the broader machine-learning training-data-curation market, while differentiating itself on petabyte-scale pretraining and multimodality.[^23]
Several limitations of DatologyAI's public position are worth noting.
The company's response to several of these issues, expressed in interviews and writeups, has been to argue that publishing controlled blog reports with full benchmark tables is more transparent than typical commercial alternatives and that the work of independent groups, including the DCLM project and the FineWeb team, increasingly provides external reference points against which its results can be calibrated.[^3][^7]
Ari S. Morcos is a co-founder and CEO of DatologyAI. He completed his undergraduate work at the University of California, San Diego before earning a PhD in neuroscience from Harvard University, where he studied neuronal circuits underlying decision-making under Chris Harvey. He was a postdoctoral researcher at Google DeepMind in London for two years and then spent approximately five years at Meta AI (FAIR) in Menlo Park, where he held the title of senior staff research scientist. At FAIR his research centered on the mechanisms underlying neural-network computation, including work on data pruning, self-supervised learning, the lottery-ticket hypothesis, regularization and representation learning. He is a co-author of the NeurIPS 2022 Outstanding Paper on data pruning and has additional Outstanding Paper recognition at ICLR.[^2][^9][^26]
Matthew L. Leavitt is co-founder and Chief Science Officer of DatologyAI. He earned a PhD in neuroscience from McGill University in Montreal, studying cognitive neurobiology with Julio Martinez-Trujillo, before joining FAIR as an AI resident under Morcos and Sergey Edunov for roughly one and a half years. He then became Head of Data Research at MosaicML, the company that built the MPT model family and was acquired by Databricks in 2023. His research interests span both neuroscience and machine learning and focus on understanding and explaining biological and synthetic intelligence.[^11][^27]
Bogdan Gaza is co-founder and Chief Technology Officer of DatologyAI. He studied computer science at Universitatea Alexandru Ioan Cuza Iași and the University of Lille and accumulated more than a decade of experience in distributed systems and large-scale production infrastructure. He held engineering roles at Amazon and was a senior engineering manager at Twitter responsible for search and language infrastructure, before co-founding the behavioral-fraud-detection startup Moonsense. He brings the systems engineering and production-infrastructure background that complements the research backgrounds of Morcos and Leavitt.[^12][^28]
If DatologyAI's commercial bet is correct, automated data curation becomes a standard layer of the pre-training stack alongside data ingestion, transformer training frameworks and inference serving. The company has assembled a comparatively unusual combination of academic recognition (the NeurIPS Outstanding Paper award for the founding paper), high-profile angel support (LeCun, Hinton, Dean), institutional investment from Amplify, Radical, Felicis, M12 and the Amazon Alexa Fund, and a small but visible early customer set in Arcee AI and Thomson Reuters. The combination is a useful test case for whether the "data is the new compute" thesis can support a standalone infrastructure business as opposed to remaining a feature of the largest frontier labs.[^4][^6][^16][^18]
Whether automated curation evolves into a winner-take-most market or into a feature of broader ML platforms remains an open question. As of mid-2026, DatologyAI is the highest-profile pure-play in this niche, but it competes for budget with internal data teams at frontier labs, with cloud providers (Amazon and Microsoft, the parents of AWS and Azure, are also investors through the Amazon Alexa Fund and M12), and with broader data-quality and labeling companies such as Scale AI, Snorkel and Cleanlab.[^23]
A second open question concerns how the company's research-heavy public posture interacts with its commercial position. DatologyAI publishes detailed technical reports, releases the recipe-level structure of its pipeline through blog posts, and contributes to the academic literature through papers such as BeyondWeb. The pattern is reminiscent of the early MosaicML strategy, in which extensive technical writing both attracted hires and served as a moat against pure-services competitors. Whether this is sustainable as the underlying methods become more widely known, and as adjacent open-source efforts such as DCLM, FineWeb and the Hugging Face data ecosystem continue to publish strong baselines, will help determine whether the company's edge is principally its algorithms, its productionized infrastructure, or its accumulated curation know-how applied across customer deployments.[^7][^10][^22]