DatologyAI

Updated Jul 16, 2026

DatologyAI is a Redwood City, California artificial-intelligence startup that builds automated tools for curating, deduplicating, and composing the training datasets used by foundation models. The company was founded in 2023 by Ari Morcos (CEO), Matthew Leavitt (Chief Science Officer) and Bogdan Gaza (Chief Technology Officer). It productizes a line of research, beginning with the NeurIPS 2022 paper Beyond Neural Scaling Laws: Beating Power Law Scaling via Data Pruning, which argues that training-data selection can be a stronger lever on model quality than additional parameters or compute.^[1]^[2]^[3] DatologyAI announced an $11.65 million seed round led by Amplify Partners in February 2024 (the round had reportedly closed in late 2023), with angel checks from Jeff Dean, Yann LeCun, Geoffrey Hinton and others, followed by a $46 million Series A led by Felicis Ventures in May 2024.^[4]^[5]^[6] Its platform is deployed on customer infrastructure to filter, mix, augment and order pre-training corpora across text, image, video, audio, tabular and specialized modalities, and the company has published a sequence of technical reports comparing models trained on its curated data to those trained on popular open corpora such as RedPajama, RefinedWeb, FineWeb and DCLM.^[3]^[7]^[8]

Field	Value
Type	Private company
Founded	2023
Headquarters	Redwood City, California, United States
Founders	Ari Morcos, Matthew Leavitt, Bogdan Gaza
CEO	Ari Morcos
Industry	Data curation for AI; machine-learning infrastructure
Total funding	~$57.6 million (as of May 2024)
Lead seed investor	Amplify Partners
Lead Series A investor	Felicis Ventures
Product	Automated data-curation platform for foundation-model training
Modalities supported	Text, image, video, audio, tabular, genomic, geospatial

Background

The intellectual core of DatologyAI predates the company. In June 2022 Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli and Ari Morcos posted the paper Beyond Neural Scaling Laws: Beating Power Law Scaling via Data Pruning to arXiv; the work received an Outstanding Paper Award at NeurIPS 2022. The authors argued that the conventional empirical scaling law for test error as a power of training-set size is not a hard ceiling: with a sufficiently informative pruning metric, the power-law in dataset size can be replaced by exponential scaling. They benchmarked ten data-pruning metrics on ImageNet and introduced a self-supervised pruning score that approached the performance of supervised oracle metrics, training ResNet models on as little as 75% of the dataset without accuracy loss.^[1]^[9]
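The contrast between the two regimes can be written schematically. The notation here is assumed for illustration and is not taken from the paper: $P$ is the number of training examples actually used, $\nu > 0$ is the power-law exponent, and $a$, $b$, $c > 0$ are fitted constants.

```latex
\varepsilon_{\mathrm{power}}(P) \approx a\,P^{-\nu}
\qquad\text{versus}\qquad
\varepsilon_{\mathrm{exp}}(P) \approx b\,e^{-cP}

% Data needed to halve the error in each regime:
\frac{\varepsilon_{\mathrm{power}}(kP)}{\varepsilon_{\mathrm{power}}(P)} = k^{-\nu} = \tfrac{1}{2}
\;\Longrightarrow\; k = 2^{1/\nu},
\qquad
\frac{\varepsilon_{\mathrm{exp}}(P + \Delta)}{\varepsilon_{\mathrm{exp}}(P)} = e^{-c\Delta} = \tfrac{1}{2}
\;\Longrightarrow\; \Delta = \frac{\ln 2}{c}
```

Under a power law with a hypothetical exponent of $\nu = 0.1$, halving the error requires $2^{10} = 1024$ times more data, whereas under exponential scaling it requires only a fixed additive increment $\Delta$. That gap is why the choice of pruning metric carries so much weight in this framing.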

That paper, along with related work on data selection and pre-training data quality such as Morcos and colleagues' NeurIPS 2023 study D4: Improving LLM Pretraining via Document De-Duplication and Diversification, framed an emerging "data is the bottleneck" thesis. As the supply of additional unique web text plateaued and training budgets for frontier models grew into the hundreds of millions of dollars, a number of groups argued that the highest-leverage interventions in pretraining would come from selecting which examples a model sees, not just feeding it more.^[9]^[10]

Morcos, who holds a PhD in neuroscience from Harvard, worked at Google DeepMind as a postdoctoral researcher and then spent roughly five years at Meta's FAIR research lab (Meta AI) before leaving in 2023 to start DatologyAI. He was joined by Matthew Leavitt, a McGill neuroscience PhD who had been an AI resident at FAIR under Morcos and subsequently led data research at MosaicML, and Bogdan Gaza, a computer scientist whose prior roles included engineering work at Amazon, a senior engineering management position covering search and language infrastructure at Twitter, and a co-founding role at fraud-detection startup Moonsense.^[2]^[11]^[12] Morcos has described the founding rationale in interviews as bringing the techniques being applied internally at frontier labs to the rest of the model-training market, on the premise that careful curation is the only way to keep extracting capability gains as raw token supply tightens.^[3]^[7]

Founding and funding history

Seed round (announced February 2024)

DatologyAI announced an $11.65 million seed round on 22 February 2024. The lead was Amplify Partners, with participation from Radical Ventures, Conviction Capital, Outset Capital and Quiet Capital. The round attracted a list of angel investors unusually weighted toward prominent AI researchers and founders: Jeff Dean, Yann LeCun, Geoffrey Hinton, Cohere co-founders Aidan Gomez and Ivan Zhang, Quora founder Adam D'Angelo, and former Intel AI vice-president Naveen Rao.^[4]^[5] Outset Capital, one of the participating funds, described the round as having been closed in late 2023 and held quietly while the team built out infrastructure before announcing it publicly.^[13]

Series A (May 2024)

About two and a half months later, on 7 May 2024, DatologyAI announced a $46 million Series A. The round was led by Astasia Myers and Viv Faga of Felicis Ventures, with participation from existing seed investors Radical Ventures and Amplify Partners and new investors including Elad Gil, Microsoft's M12 venture arm and the Amazon Alexa Fund.^[6]^[14] The Series A took the company's cumulative announced funding to roughly $57.6 million, with the proceeds earmarked, according to Morcos, for hiring more researchers and engineers and expanding compute for internal curation experiments.^[15] Astasia Myers subsequently took a board-observer role.^[16]

Felicis publicly framed its thesis in terms compatible with Morcos's: training data is the dominant lever for frontier-model quality, the supply of high-quality public web text is finite (with some forecasts suggesting it could be exhausted as a usable resource within a small number of years), and automated curation is therefore a category of infrastructure rather than a tooling line-item.^[16]

Investor lineup summary

Round	Date	Amount	Lead(s)	Other participants
Seed	Feb 2024 (closed late 2023)	$11.65 M	Amplify Partners	Radical Ventures, Conviction Capital, Outset Capital, Quiet Capital; angels Jeff Dean, Yann LeCun, Geoffrey Hinton, Aidan Gomez, Ivan Zhang, Adam D'Angelo, Naveen Rao^[4]^[5]^[13]
Series A	May 7, 2024	$46 M	Felicis Ventures (Astasia Myers, Viv Faga)	Radical Ventures, Amplify Partners, Elad Gil, M12, Amazon Alexa Fund^[6]^[14]^[16]

Product

DatologyAI sells its platform as an automated data-curation pipeline rather than a labeling service. The system is designed to ingest petabyte-scale raw corpora and emit a training-ready dataset that has been deduplicated, filtered, augmented and ordered for a given downstream training run. According to the company's own description and external reporting, the pipeline combines several algorithmic families:^[3]^[7]^[17]

Lexical deduplication. SHA-512 document hashing and n-gram matching to remove exact and near-exact duplicates.
Heuristic filters. Rule-based removal of malformed, whitespace-heavy or otherwise low-utility documents.
Model-based quality filtering. Lightweight classifiers trained on high-quality reference sets to score the value of pretraining samples.
Embedding-based curation. Use of geometric structure in embedding space to detect semantic near-duplicates and high-value examples and to balance long-tailed concept distributions.
Target-distribution matching. Retrieval and upsampling of data whose distribution aligns with a target evaluation or downstream task.
Synthetic-data generation. Rephrasing and elaboration of source documents using language models to extend coverage and improve density of useful content.
Source mixing. Optimization of relative weights across heterogeneous sub-corpora (for example, the multiple sources inside RedPajama).
Curriculum and batch construction. Choice of ordering and batching of selected examples to improve training efficiency.
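The first two families in the list above (lexical deduplication and heuristic filtering) can be illustrated with a minimal sketch. This is not DatologyAI's implementation: the function names, the word-trigram shingles, the Jaccard threshold of 0.8 and the whitespace and length cut-offs are all assumptions chosen for illustration.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def doc_hash(text):
    """SHA-512 digest of the normalized document (exact-duplicate key)."""
    return hashlib.sha512(normalize(text).encode("utf-8")).hexdigest()


def word_ngrams(text, n=3):
    """Set of word n-grams (shingles) of the normalized document."""
    words = normalize(text).split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def passes_heuristics(text, min_words=5, max_whitespace_frac=0.5):
    """Hypothetical rule-based filter: drop very short or whitespace-heavy documents."""
    if not text:
        return False
    whitespace_frac = sum(ch.isspace() for ch in text) / len(text)
    return len(text.split()) >= min_words and whitespace_frac <= max_whitespace_frac


def curate(docs, near_dup_threshold=0.8):
    """Keep the first occurrence of each document that survives all three checks."""
    seen_hashes = set()
    kept, kept_shingles = [], []
    for doc in docs:
        if not passes_heuristics(doc):
            continue  # heuristic filter
        digest = doc_hash(doc)
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        shingles = word_ngrams(doc)
        if any(jaccard(shingles, s) >= near_dup_threshold for s in kept_shingles):
            continue  # near-duplicate by n-gram overlap
        seen_hashes.add(digest)
        kept.append(doc)
        kept_shingles.append(shingles)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog near the river bank today.",
    "the quick  brown fox jumps over the lazy dog near the river bank today.",
    "The quick brown fox jumps over the lazy dog near the river bank tonight.",
    "Data curation selects which examples a model sees during training runs.",
    "a   \n\n   b",
]
print(curate(docs))  # keeps the first and fourth documents
```

The pairwise comparison here is quadratic in the number of kept documents, so it only suits small examples; at corpus scale the same idea is usually approximated with techniques such as MinHash and locality-sensitive hashing.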

The product is deployed inside the customer's own environment (on-premises or in a customer-controlled cloud account) so that the underlying training data does not leave the customer's infrastructure. The platform is modality-agnostic in design: DatologyAI's marketing and technical writeups describe instances for text, image-text pairs, video, audio, tabular data and "exotic" modalities such as genomic and geospatial data.^[3]^[7]^[17]

DatologyAI does not publish a public list of all customers but has issued case studies for two: Arcee AI, a builder of frontier open foundation models, and Thomson Reuters, which used the platform for legal-domain models.^[17]^[18]

Technical details and reports

DatologyAI publishes a sequence of technical blog posts that read like applied research papers, comparing models trained on internally curated corpora to models trained on widely used open datasets at matched parameter counts and token budgets. The reports cover both pre-training and downstream evaluation; they are the primary public source of evidence for the platform's claims.

Text pre-training (2024 deep dive)

In its initial text deep-dive, DatologyAI applied its pipeline to the RedPajama-v1 (RPJv1) corpus, producing a curated variant the company calls DAIT (DatologyAI Text). The team trained transformer language models at 1.3 B and 2.7 B parameters for token budgets of 20 B-180 B tokens and evaluated on 15 standard tasks including MMLU, HellaSwag, LAMBADA, BoolQ and the ARC subsets.^[8]

At 2.7 B parameters and 180 B training tokens, the company reported:^[8]

DAIT 60.5% mean 5-shot accuracy versus 52.0% for RPJv1, an 8.5 percentage-point gain.
DAIT 60.5% versus 56.1% for the same-size model trained on DCLM, a 4.4 percentage-point gain.
DAIT versus FineWeb-Edu, a 6.1 percentage-point gain.
A 7.7x speedup to reach the RPJv1 baseline accuracy, and a 3.4x speedup to reach the DCLM baseline.
A 1.3 B model trained on DAIT for 180 B tokens reached 56.5% mean 5-shot accuracy, beating all 2.7 B baselines and offering roughly 2.1x lower inference cost per query.

These numbers were positioned as a "data-only" intervention: architecture, optimizer and tokenization were held constant across runs, and only the training corpus changed.^[8]
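One common way to read a "speedup to reach baseline accuracy" figure is as the ratio of tokens the baseline needed to the tokens the curated run needed to hit the same score. The sketch below assumes that definition and linear interpolation between checkpoints; the report's exact procedure may differ, and the learning-curve numbers are hypothetical placeholders, not values from the report.

```python
def tokens_to_reach(curve, target):
    """Interpolated token count at which `curve` first reaches `target` accuracy.

    `curve` is a list of (tokens, accuracy) checkpoints sorted by tokens.
    Returns None if the target is never reached.
    """
    prev_tokens, prev_acc = None, None
    for tokens, acc in curve:
        if acc >= target:
            if prev_tokens is None:
                return float(tokens)  # already at target at the first checkpoint
            # Linear interpolation between the two bracketing checkpoints.
            frac = (target - prev_acc) / (acc - prev_acc)
            return prev_tokens + frac * (tokens - prev_tokens)
        prev_tokens, prev_acc = tokens, acc
    return None


def speedup(baseline_curve, curated_curve):
    """Baseline token budget divided by the curated tokens needed to match it."""
    baseline_tokens, baseline_acc = baseline_curve[-1]
    curated_tokens = tokens_to_reach(curated_curve, baseline_acc)
    if curated_tokens is None:
        return None
    return baseline_tokens / curated_tokens


# Hypothetical checkpoints: (billions of tokens, mean accuracy in percent).
baseline = [(20, 40.0), (60, 46.0), (180, 50.0)]
curated = [(20, 48.0), (60, 56.0), (180, 60.0)]

print(tokens_to_reach(curated, 50.0))  # 30.0
print(speedup(baseline, curated))      # 6.0
```

Because the result depends on where the target accuracy falls on the curated curve, the same pair of runs can yield quite different speedups under a different evaluation suite or token budget.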

Image-text and CLIP

Parallel work targets multimodal training. In an image-text post the company reported producing a CLIP-style model on a retrieval-optimized curation of the DataComp pool: a CLIP-ViT-S/32 (~63 M parameters) trained on 512 M curated samples matched or beat a CLIP-ViT-B/32 (~151 M parameters) trained with the same compute on the raw baseline, with a 13-percentage-point absolute gain on the retrieval benchmark suite.^[19]

A follow-up, CLIP Gets a Data Upgrade, compared DatologyAI-curated training data to two industry baselines, SigLIP 2 (trained on 40 B image-text pairs) and MetaCLIP (12.8 B pairs), across 24 zero-shot classification benchmarks anchored on ImageNet-1k and ImageNet-v2 and three retrieval benchmarks anchored on MSCOCO and Flickr. At ViT-B/32 the report claimed ImageNet-1k zero-shot accuracy of 76.91% versus 74.0% for SigLIP 2, while training on roughly an order of magnitude less data, and reported larger gains over MetaCLIP.^[20] As with the text work, the framing is that careful data curation alone, without changes to the contrastive objective or the architecture, can match or surpass state-of-the-art results.^[20]

Synthetic pre-training data (BeyondWeb, 2025)

In August 2025 DatologyAI published the report BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining on arXiv (2508.10975), authored by Pratyush Maini and 28 collaborators including Matthew Leavitt. BeyondWeb is a synthetic data generation framework that rephrases and augments real web documents to produce additional pre-training data, trained and evaluated at trillion-token scale.^[21]^[22]

The report describes training 1 B, 3 B and 8 B parameter language models on BeyondWeb tokens and compares against Cosmopedia and the synthetic subset of Nemotron-CC (Nemotron-Synth). Across a fourteen-benchmark suite, BeyondWeb is reported to outperform Cosmopedia by up to 5.1 percentage points and Nemotron-Synth by up to 2.6 percentage points on average, with up to 7.7x faster training relative to open web data and 2.7x faster than Nemotron-Synth at matched accuracy. A particularly striking claim is that a 3 B model trained for 180 B tokens on BeyondWeb data outperforms an 8 B model trained on the same token budget over Cosmopedia.^[21]^[22]

BeyondWeb is positioned by the company as a building block of its full curation pipeline rather than a standalone product: the synthetic data sits alongside the company's filtering, deduplication and mixing logic.^[22]
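The general shape of a rephrasing-based synthetic-data step can be sketched as follows. Everything in this block is hypothetical: the prompt templates, the `rephrase_corpus` function and the `generate` callable (a stand-in for whatever language model is used) are illustrative assumptions and do not reproduce BeyondWeb's actual prompts or pipeline.

```python
from typing import Callable, Dict, Iterable, Iterator, Sequence

# Hypothetical prompt templates, written for this example only.
STYLE_PROMPTS: Dict[str, str] = {
    "qa": "Rewrite the following text as a question-and-answer exchange:\n\n{doc}",
    "explainer": "Rewrite the following text as a clear, self-contained explanation:\n\n{doc}",
}


def rephrase_corpus(
    docs: Iterable[str],
    generate: Callable[[str], str],
    styles: Sequence[str] = ("qa", "explainer"),
    min_chars: int = 20,
) -> Iterator[dict]:
    """Yield one synthetic record per (source document, style) pair.

    `generate` maps a prompt string to model output; outputs shorter than
    `min_chars` are dropped as a crude quality gate.
    """
    for doc_id, doc in enumerate(docs):
        for style in styles:
            prompt = STYLE_PROMPTS[style].format(doc=doc)
            output = generate(prompt).strip()
            if len(output) < min_chars:
                continue
            yield {"source_id": doc_id, "style": style, "text": output}


def fake_generate(prompt: str) -> str:
    """Stand-in generator so the sketch runs without a model: echoes the source text."""
    return "SYNTHETIC: " + prompt.splitlines()[-1]


source_docs = ["Data curation selects which examples a model sees."]
for record in rephrase_corpus(source_docs, fake_generate):
    print(record["style"], "->", record["text"])
```

Keeping `source_id` on each record lets later stages of a pipeline (deduplication, filtering, mixing) treat a synthetic document and its web source as related rather than independent.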

Other reports

Subsequent posts have extended the same methodology to additional questions:^[7]

Luxical Embeddings (December 2025) introduced an embedding-based curation methodology.
DatBench (January 2026) described a framework for "discriminative, faithful and efficient" vision-language model evaluation aimed at reducing the compute cost of model selection during curation.
UberWeb (February 2026) reported on multilingual curation experiments for a 20-trillion-token dataset.
The Finetuner's Fallacy (March 2026) examined when supervised fine-tuning data is best mixed into pre-training rather than reserved for a post-training stage.
20/20 Vision Language Models (May 2026) extended the curation-only-for-improvement framing to vision-language models.

These are blog-format reports rather than peer-reviewed papers, but they include controlled comparisons, full benchmark tables and ablations of pipeline components.^[7]

Customers and case studies

DatologyAI's first publicly disclosed customer was Arcee AI, which used the platform to curate the pre-training corpus for its Trinity-Large-Thinking model, a 400 B-parameter mixture-of-experts language model with 13 B active parameters per token. According to DatologyAI's case study, Trinity-Large-Thinking was trained on 20 trillion tokens spread across three pre-training phases, including more than 8 trillion synthetic tokens generated on clusters of up to 2,048 NVIDIA H100 GPUs and 12 trillion curated web tokens. The case study cites top-tier agentic benchmark scores for the resulting model (#1 on Tau2-Airline and #2 on PinchBench and AIME25 among the open models evaluated at the time of writing) and notes that the model served more than 3 trillion tokens in its first two months on the OpenRouter inference marketplace.^[18]

A second case study describes a Thomson Reuters partnership for domain-specific legal models, in which the customer reported "clear, measurable improvements across both public and proprietary legal evaluations" for legal reasoning, retrieval and downstream-task accuracy, achieved under a constrained data budget. DatologyAI has also disclosed a broader product-level collaboration with Thomson Reuters.^[7]^[17]

Place in the data-is-the-bottleneck thesis

DatologyAI is one of the most visible companies operationalizing the idea that the marginal return on additional parameters or additional raw tokens for foundation models is falling, and that systematic interventions on the training corpus offer a higher-leverage path to capability gains. This thesis was popularized in academic work by, among others, the Sorscher et al. paper on data pruning, the Hoffmann et al. Chinchilla paper on compute-optimal scaling, and the Beyond the Imitation Game and similar benchmarking efforts; in industry it has been argued by groups such as MosaicML (which was acquired by Databricks in 2023), the FineWeb and DCLM communities, and individual frontier labs.^[1]^[7]^[9]

Within this thesis, DatologyAI's commercial role is closest to providing an "outsourced data team" for organizations that do not have the in-house research capacity to run their own large-scale curation experiments but still want to train custom models on domain-specific corpora. Felicis's investment write-up explicitly framed it that way: data curation is "the missing piece of every AI strategy" because most enterprises will not build internal teams comparable to those at OpenAI, Anthropic, Google DeepMind or Meta AI but will still want to train or fine-tune custom models in the coming years.^[16]

Comparison with adjacent companies

The data-for-AI space contains several adjacent categories whose offerings overlap with parts of DatologyAI's pipeline without fully covering it. The table below summarizes the comparison.

Company	Primary offering	Overlap with DatologyAI
Scale AI	Human-in-the-loop labeling, RLHF, evaluations	Both serve foundation-model training, but Scale's core competency is labeled-data production and evaluations rather than automated curation of pretraining-scale unlabeled corpora.^[23]
Snorkel AI	Programmatic data labeling and weak supervision via Snorkel Flow	Overlaps in "data-centric AI" framing; Snorkel emphasizes weak supervision for supervised tasks, where DatologyAI emphasizes unsupervised pretraining curation.^[23]^[24]
Cleanlab	Automated detection and correction of label errors in tabular and labeled datasets	Cleanlab focuses on labeled supervised datasets and label noise; DatologyAI operates earlier in the pipeline, on pretraining-scale unlabeled corpora.^[25]
Galileo	Observability and evaluation tools for ML data and models	Galileo is more diagnostic than curative; DatologyAI overlaps in the "is your data good" question but answers it with an automated rewrite of the dataset.^[23]

DatologyAI is most often grouped with Scale AI, Snorkel AI, Cleanlab and similar companies as part of the broader machine-learning training-data-curation market, while differentiating itself on petabyte-scale pretraining and multimodality.^[23]

Limitations and criticisms

Several limitations of DatologyAI's public position are worth noting.

Self-reported benchmarks. Most of the comparative numbers in the company's published reports are produced and evaluated by DatologyAI itself; while the methodology is described in some detail, the reports are not peer-reviewed and independent third-party replications at matched compute are scarce. Some of the strongest claims (for example, training-time speedups of 7.7x against the RPJv1 baseline) are sensitive to the specific evaluation suite and to the comparison budget.^[8]
Opacity of the pipeline. Because the platform is delivered as a service running inside customer infrastructure, the specific filters, thresholds and synthetic-data prompts used in a given curation are not publicly inspectable. This is standard for commercial software but limits external auditability, including for downstream questions about copyright, sensitive-content filtering and de-biasing.
Sensitivity to the underlying corpora. The text deep-dive starts from RedPajama-v1; the multimodal work starts from DataComp. The reported gains are measured against these baselines and may not transfer directly to closed corpora with different source distributions.
Synthetic-data risks. BeyondWeb relies heavily on language-model-generated synthetic rephrasings, which raises well-known questions about model collapse, recursion of biases, and the validity of evaluations that themselves draw on web-derived benchmarks. The BeyondWeb report acknowledges there is "no silver bullet" for high-quality synthetic data.^[21]^[22]
Total addressable market. The viable customer base for petabyte-scale pretraining curation is structurally smaller than the customer base for, for example, labeling services, and growth depends on the assumption that a wide range of enterprises will continue to pretrain or heavily continue-pretrain foundation models rather than simply consuming closed-API frontier models.

The company's response to several of these issues, expressed in interviews and writeups, has been to argue that publishing controlled blog reports with full benchmark tables is more transparent than typical commercial alternatives and that the work of independent groups, including the DCLM project and the FineWeb team, increasingly provides external reference points against which its results can be calibrated.^[3]^[7]

Founders

Ari Morcos

Ari S. Morcos is a co-founder and CEO of DatologyAI. He completed his undergraduate work at the University of California, San Diego before earning a PhD in neuroscience from Harvard University, where he studied neuronal circuits underlying decision-making under Chris Harvey. He was a postdoctoral researcher at Google DeepMind in London for two years and then spent approximately five years at Meta AI (FAIR) in Menlo Park, where he held the title of senior staff research scientist. At FAIR his research centered on the mechanisms underlying neural-network computation, including work on data pruning, self-supervised learning, the lottery-ticket hypothesis, regularization and representation learning. He is a co-author of the NeurIPS 2022 Outstanding Paper on data pruning and has additional Outstanding Paper recognition at ICLR.^[2]^[9]^[26]

Matthew Leavitt

Matthew L. Leavitt is co-founder and Chief Science Officer of DatologyAI. He earned a PhD in neuroscience from McGill University in Montreal, studying cognitive neurobiology with Julio Martinez-Trujillo, before joining FAIR as an AI resident under Morcos and Sergey Edunov for roughly one and a half years. He then became Head of Data Research at MosaicML, the company that built the MPT model family and was acquired by Databricks in 2023. His research interests span both neuroscience and machine learning and focus on understanding and explaining biological and synthetic intelligence.^[11]^[27]

Bogdan Gaza

Bogdan Gaza is co-founder and Chief Technology Officer of DatologyAI. He studied computer science at Universitatea Alexandru Ioan Cuza Iași and the University of Lille and accumulated more than a decade of experience in distributed systems and large-scale production infrastructure. He held engineering roles at Amazon and was a senior engineering manager at Twitter responsible for search and language infrastructure, before co-founding the behavioral-fraud-detection startup Moonsense. He brings the systems engineering and production-infrastructure background that complements the research backgrounds of Morcos and Leavitt.^[12]^[28]

Significance

If DatologyAI's commercial bet is correct, automated data curation becomes a standard layer of the pre-training stack alongside data ingestion, transformer training frameworks and inference serving. The company has assembled a comparatively unusual combination of academic recognition (the NeurIPS Outstanding Paper award for the founding paper), high-profile angel support (LeCun, Hinton, Dean), institutional investment from Amplify, Radical, Felicis, M12 and the Amazon Alexa Fund, and a small but visible early customer set in Arcee AI and Thomson Reuters. The combination is a useful test case for whether the "data is the new compute" thesis can support a standalone infrastructure business as opposed to remaining a feature of the largest frontier labs.^[4]^[6]^[16]^[18]

Whether automated curation evolves into a winner-take-most market or into a feature of broader ML platforms remains an open question. As of mid-2026, DatologyAI is the highest-profile pure-play in this niche, but it competes for budget with internal data teams at frontier labs, with cloud providers (two of which, AWS and Azure, belong to parent companies that are also investors through the Amazon Alexa Fund and Microsoft's M12), and with broader data-quality and labeling companies such as Scale AI, Snorkel and Cleanlab.^[23]

A second open question concerns how the company's research-heavy public posture interacts with its commercial position. DatologyAI publishes detailed technical reports, releases the recipe-level structure of its pipeline through blog posts, and contributes to the academic literature through papers such as BeyondWeb. The pattern is reminiscent of the early MosaicML strategy, in which extensive technical writing both attracted hires and served as a moat against pure-services competitors. Whether this is sustainable as the underlying methods become more widely known, and as adjacent open-source efforts such as DCLM, FineWeb and the Hugging Face data ecosystem continue to publish strong baselines, will help determine whether the company's edge is principally its algorithms, its productionized infrastructure, or its accumulated curation know-how applied across customer deployments.^[7]^[10]^[22]

References

1. Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, Ari Morcos, "Beyond neural scaling laws: beating power law scaling via data pruning", arXiv, 2022-06-29. https://arxiv.org/abs/2206.14486. Accessed 2026-05-20.
2. Ari Morcos, "About Me", arimorcos.com, undated. http://www.arimorcos.com/. Accessed 2026-05-20.
3. Kyle Wiggers, "DatologyAI is building tech to automatically curate AI training data sets", TechCrunch, 2024-02-22. https://techcrunch.com/2024/02/22/datologyai-is-building-tech-to-automatically-curate-ai-training-data-sets/. Accessed 2026-05-20.
4. Mike Wheatley, "DatologyAI raises $11.65M to automate data curation for more efficient AI training", SiliconANGLE, 2024-02-22. https://siliconangle.com/2024/02/22/datologyai-raises-11-65m-automate-data-curation-efficient-ai-training/. Accessed 2026-05-20.
5. DatologyAI, "Introducing DatologyAI: Making models better through better data, automatically", DatologyAI blog, 2024-02-22. https://www.datologyai.com/blog/introducing-datologyai--making-models-better-through-better-data-automatically. Accessed 2026-05-20.
6. DatologyAI, "DatologyAI raises $46M Series A", DatologyAI blog, 2024-05-07. https://www.datologyai.com/blog/datologyai-raises-46m-series-a. Accessed 2026-05-20.
7. DatologyAI, "Data Curation, Research and Strategic Insights" (blog index), DatologyAI, 2026. https://www.datologyai.com/blog. Accessed 2026-05-20.
8. DatologyAI, "Technical Deep-Dive: Curating Our Way to a State-of-the-Art Text Dataset", DatologyAI blog, 2024. https://www.datologyai.com/blog/technical-deep-dive-curating-our-way-to-a-state-of-the-art-text-dataset. Accessed 2026-05-20.
9. Ari Morcos, "Publications", arimorcos.com, undated. https://www.arimorcos.com/publications/. Accessed 2026-05-20.
10. Jeffrey Li et al. / DataComp-LM team, "DataComp-LM: In search of the next generation of training sets for language models", arXiv, 2024-06-17. https://arxiv.org/abs/2406.11794. Accessed 2026-05-20.
11. Matthew Leavitt, "Matthew Leavitt", mleavitt.net, undated. https://mleavitt.net/. Accessed 2026-05-20.
12. Bogdan Gaza, "Bogdan Gaza, Co-Founder and CTO @ DatologyAI", Crunchbase, 2024. https://www.crunchbase.com/person/bogdan-gaza-0981. Accessed 2026-05-20.
13. Outset Capital, "TechCrunch announces Datology seed round", Outset Capital blog, 2024-02-22. https://www.outsetcapital.com/post/datology-ai-fundraise-announced-in-techcrunch. Accessed 2026-05-20.
14. Mike Wheatley, "DatologyAI raises $46M to streamline AI model training data diets", SiliconANGLE, 2024-05-08. https://siliconangle.com/2024/05/08/datologyai-raises-46m-streamline-ai-model-training-data-diets/. Accessed 2026-05-20.
15. AI Business, "Startup Raises $46M to Revolutionize AI Dataset Curation", AI Business, 2024-05-08. https://aibusiness.com/data/startup-raises-46m-to-revolutionize-ai-dataset-curation. Accessed 2026-05-20.
16. Felicis Ventures, "Felicis's Series A in Datology: Automated Data Curation as the Missing Piece of Every AI Strategy", Felicis blog, 2024-05-07. https://www.felicis.com/blog/datology-series-a. Accessed 2026-05-20.
17. DatologyAI, "Datology: Train Better Models, Faster and Smaller" (home page), DatologyAI, 2026. https://www.datologyai.com/. Accessed 2026-05-20.
18. DatologyAI, "Faster, Better, Smaller: Arcee AI works with DatologyAI to release their foundation model", DatologyAI blog, 2026-04-10. https://www.datologyai.com/blog/arcee-case-study. Accessed 2026-05-20.
19. DatologyAI, "Technical Deep-Dive: Image-Text Data Curation at the Billion-Sample Scale", DatologyAI blog, 2024. https://www.datologyai.com/blog/productionized-multimodal-data-curation-at-the-billion-sample-scale. Accessed 2026-05-20.
20. DatologyAI, "CLIP Gets a Data Upgrade: Outperforming SoTA with Improved Data Curation Only", DatologyAI blog, 2025. https://www.datologyai.com/blog/clip-gets-a-data-upgrade-outperforming-sota-with-improved-data-curation-only. Accessed 2026-05-20.
21. Pratyush Maini et al., "BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining", arXiv, 2025-08-14. https://arxiv.org/abs/2508.10975. Accessed 2026-05-20.
22. DatologyAI, "BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining", DatologyAI blog, 2025-08-18. https://www.datologyai.com/blog/beyondweb. Accessed 2026-05-20.
23. CB Insights, "DatologyAI Top Alternatives, Competitors", CB Insights company profile, 2026. https://www.cbinsights.com/company/datologyai/alternatives-competitors. Accessed 2026-05-20.
24. Madrona Venture Group, "Snorkel's Alex Ratner talks data-centric AI on Founded & Funded", Madrona blog, 2024. https://www.madrona.com/founded-and-funded-snorkel-alex-ratner-data-centric-ai/. Accessed 2026-05-20.
25. Cleanlab, "Cleanlab: The History, Present, and Future", Cleanlab blog, 2024. https://cleanlab.ai/blog/learn/cleanlab-history/. Accessed 2026-05-20.
26. Ari Morcos, "CV", arimorcos.com, undated. https://www.arimorcos.com/static/pdfs/morcos_as_cv.pdf. Accessed 2026-05-20.
27. Matthew Leavitt, "Matthew L Leavitt", Google Scholar profile, undated. https://scholar.google.ca/citations?user=S3-M5Z8AAAAJ&hl=en. Accessed 2026-05-20.
28. Bogdan Gaza, "Bogdan Gaza", LinkedIn profile, undated. https://www.linkedin.com/in/bogdangaza/. Accessed 2026-05-20.
