Data Provenance Initiative
Last reviewed: Jun 8, 2026
Sources: 11 citations
Review status: Source-backed
Revision: v1 · 1,738 words
The Data Provenance Initiative (DPI) is a volunteer-led, multi-institution research collective that audits and documents the licenses, sources, creators, and consent status of the datasets most widely used to train artificial intelligence systems. Led from the MIT Media Lab by researcher Shayne Longpre together with collaborators across academia and industry, the initiative pairs machine learning and legal expertise to trace the lineage of training data and to build open tools such as the Data Provenance Explorer. Active from 2023 onward, it has become one of the most cited efforts in AI data governance, data transparency, and the policy and legal debates surrounding AI copyright.[1][2][3]
The Data Provenance Initiative describes itself as a multidisciplinary, volunteer effort to improve the transparency, documentation, and responsible use of AI training datasets. It has grown to a collective of more than 50 contributors spanning many universities and companies, coordinated principally through the MIT Media Lab.[2][3] Rather than producing new models, the group conducts large-scale manual audits of existing datasets, releases the resulting structured metadata under open licenses, and publishes peer-reviewed analyses of what the data reveals about the AI ecosystem.
Its work spans three layers: empirical audits of dataset licensing and sourcing; the longitudinal tracking of how the open web is being closed off to AI crawlers; and the design of infrastructure, standards, and documentation practices intended to make data provenance verifiable. The initiative's outputs have appeared at venues including the International Conference on Machine Learning (ICML), the Conference on Neural Information Processing Systems (NeurIPS), and the International Conference on Learning Representations (ICLR), and in the journal Nature Machine Intelligence.[1][4][5][6]
Modern large language models are trained on enormous, heterogeneous corpora aggregated from thousands of underlying sources, often with little reliable documentation of where the data came from, who created it, or under what license it may be used. The motivation behind the Data Provenance Initiative is that this opacity creates compounding legal, ethical, and reproducibility problems. Practitioners assembling instruction-tuning or pretraining mixtures frequently cannot determine whether a given dataset is commercially usable, whether it was scraped without consent, or whether it was itself generated by another model.[1][3]
This ambiguity sits at the center of the AI copyright lawsuits filed against major developers, including cases such as The New York Times Company v. Microsoft and OpenAI. The DPI argues that documenting provenance is a prerequisite for resolving questions of fair use, consent, attribution, and creator compensation, and that the field's existing data practices were never designed for the scale at which the web is now being repurposed for AI.[3][7] The initiative positions its work alongside earlier dataset documentation efforts such as datasheets for datasets and model cards, extending those ideas from voluntary per-dataset disclosure toward systematic, auditable, machine-readable provenance.
The initiative's founding work, "The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing and Attribution in AI" (Longpre, Mahari, Chen, et al.), was released on arXiv on 25 October 2023 and later published in Nature Machine Intelligence in 2024.[1][8] A team of legal and machine learning experts manually traced more than 1,800 popular text-to-text finetuning datasets, drawn from 44 widely used data collections, annotating each for its original source, creators, license terms, and intended use.
The audit found that license information on major aggregator platforms is frequently missing or wrong: the authors reported license omission rates exceeding 70 percent and error rates exceeding 50 percent across the dataset-hosting sites they examined, meaning that a practitioner relying on the license tag shown by a host would often be misinformed.[1][8] The study also documented a sharp and widening divide between commercially open and commercially closed datasets, with closed datasets disproportionately covering lower-resource languages, more creative tasks, richer topical variety, and newer or more synthetic content. To accompany the audit, the team released a re-documented version of the underlying collection with corrected license annotations, so that downstream users could filter datasets by their actual provenance rather than by the host's tag.[1][2]
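The two headline rates can be illustrated with a small worked example. The sketch below is not the authors' code: the records are invented, and the choice of denominators (omission over all datasets, error over datasets that carry a host tag) is an assumption made for illustration, since the article does not state how the paper defines them.

```python
from typing import Optional

# Hypothetical records: (dataset name, license tag shown by the host,
# license found by manual audit). None means the host shows no license.
RECORDS: list[tuple[str, Optional[str], str]] = [
    ("dataset_a", None, "CC BY 4.0"),
    ("dataset_b", "MIT", "CC BY-NC 4.0"),
    ("dataset_c", "Apache-2.0", "Apache-2.0"),
    ("dataset_d", None, "Custom non-commercial"),
]


def omission_rate(records) -> float:
    """Share of all datasets whose host page shows no license tag."""
    missing = sum(1 for _, host_tag, _ in records if host_tag is None)
    return missing / len(records)


def error_rate(records) -> float:
    """Among datasets with a host tag, share where the tag disagrees with the audit."""
    tagged = [(host_tag, audited) for _, host_tag, audited in records if host_tag is not None]
    if not tagged:
        return 0.0
    wrong = sum(1 for host_tag, audited in tagged if host_tag != audited)
    return wrong / len(tagged)


print(f"omission rate: {omission_rate(RECORDS):.0%}")
print(f"error rate:    {error_rate(RECORDS):.0%}")
```

In this toy sample two of four datasets have no host tag and one of the two tagged datasets is mislabeled, so both rates come out to 50 percent.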
In 2024 the group broadened its argument with the position paper "Data Authenticity, Consent, and Provenance for AI are all broken: what will it take to fix them?" (Longpre, Mahari, Obeng-Marnu, et al.), presented as a spotlight at ICML 2024. The paper contends that the infrastructure for tracing authenticity, verifying consent, and respecting copyright is fundamentally inadequate, and it sketches the standards and tooling the authors believe responsible foundation-model development would require.[9]
A later study, "Bridging the Data Provenance Gap Across Text, Speech and Video" (presented at ICLR 2025), extended the methodology beyond text. The team manually analyzed close to 4,000 public datasets released between 1990 and 2024, spanning 608 languages, 798 sources, 659 organizations, and 67 countries.[6] The study reports three main findings. First, multimodal training has shifted overwhelmingly toward web-crawled, synthetic, and social-media content, with platforms such as YouTube dominating speech and video data since around 2019. Second, while fewer than a third of datasets carry restrictive licenses, more than 80 percent of the underlying source content in widely used text, speech, and video datasets is governed by non-commercial restrictions. Third, measures of geographic and linguistic representation have not meaningfully improved since roughly 2013, despite a rising count of nominally represented languages.[6]
The table below summarizes the initiative's principal works.
| Work | Year | Venue | Scope | Headline finding |
|---|---|---|---|---|
| A Large Scale Audit of Dataset Licensing and Attribution in AI | 2023 (arXiv); 2024 (journal) | Nature Machine Intelligence | ~1,800 finetuning datasets, 44 collections | License omission 70%+, error rates 50%+ on aggregator sites |
| Data Authenticity, Consent, and Provenance for AI are all broken | 2024 | ICML 2024 (spotlight) | Position / framework | Provenance and consent infrastructure is missing |
| Consent in Crisis: The Rapid Decline of the AI Data Commons | 2024 | NeurIPS 2024 | ~14,000 web domains | Rapid rise in AI-crawler blocking via robots.txt and ToS |
| Bridging the Data Provenance Gap Across Text, Speech and Video | 2024 (arXiv); 2025 (conf.) | ICLR 2025 | ~4,000 datasets, 608 languages | Multimodal data dominated by web, synthetic, and social sources |
The initiative's best-known tool is the Data Provenance Explorer, an open interactive interface hosted at dataprovenance.org and backed by the open-source Data-Provenance-Collection repository on GitHub.[2][10] The Explorer lets practitioners download, filter, and inspect the corrected provenance metadata for the audited finetuning datasets, selecting by attributes such as language coverage, presence of code, whether the data is model-generated, and whether the license is permissive. For each dataset the tool surfaces the source, creators, license conditions, and derivation chain that the initiative's annotators verified, and it can auto-generate "data provenance cards" intended as a reusable documentation standard.[1][2] The goal is to give teams a practical way to assemble training mixtures with known licensing status rather than relying on the unreliable tags shown by hosting platforms.
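The kind of attribute-based selection described above can be sketched in a few lines. This is a minimal illustration only: `DatasetRecord`, its field names, the `select` function, and the sample catalog are all hypothetical and do not reflect the actual schema or API of the Data Provenance Explorer or the Data-Provenance-Collection repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical provenance record; field names are illustrative assumptions."""
    name: str
    languages: frozenset
    contains_code: bool
    model_generated: bool
    commercial_use_allowed: bool


# Invented sample catalog standing in for audited metadata.
CATALOG = [
    DatasetRecord("qa_en", frozenset({"en"}), False, False, True),
    DatasetRecord("code_mix", frozenset({"en"}), True, False, True),
    DatasetRecord("synthetic_chat", frozenset({"en", "fr"}), False, True, False),
    DatasetRecord("news_fr", frozenset({"fr"}), False, False, False),
]


def select(catalog, *, language=None, require_code=False,
           exclude_model_generated=False, commercial_only=False):
    """Return the records that pass every requested filter, in catalog order."""
    chosen = []
    for record in catalog:
        if language is not None and language not in record.languages:
            continue
        if require_code and not record.contains_code:
            continue
        if exclude_model_generated and record.model_generated:
            continue
        if commercial_only and not record.commercial_use_allowed:
            continue
        chosen.append(record)
    return chosen


names = [r.name for r in select(CATALOG, language="en",
                                exclude_model_generated=True,
                                commercial_only=True)]
print(names)
```

Keeping each filter as an independent keyword argument means a team can tighten or relax one criterion (for example, allowing model-generated data) without touching the others.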
"Consent in Crisis: The Rapid Decline of the AI Data Commons," released in July 2024 and presented at NeurIPS 2024, is the initiative's most widely discussed publication and the work of a large author team (roughly 49 contributors led by Longpre).[5][11] It is described as the first large-scale longitudinal audit of the consent signals governing the web domains that underpin major AI training corpora, examining about 14,000 web domains drawn from the Common Crawl-based datasets C4, RefinedWeb, and Dolma.
The study documents a sudden tightening of access within a single year (April 2023 to April 2024). Using the robots.txt exclusion protocol, websites moved to block AI crawlers so quickly that an estimated 5 percent or more of all tokens in C4, and more than 28 percent of the tokens from the most actively maintained, critical sources in C4, became fully restricted.[5][11] When the authors also accounted for restrictions expressed in websites' terms of service, they found that about 45 percent of C4 was restricted from AI use. The paper further documents a proliferation of AI-specific clauses, frequent inconsistencies between a site's stated terms of service and its robots.txt, and uneven treatment of different AI developers. The authors frame these as symptoms of web protocols that were never designed to govern the repurposing of online content for AI, and they warn that the shrinking "data commons" disproportionately affects open and academic developers who lack licensing deals.[5][11]
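To make the robots.txt mechanism concrete, the sketch below uses Python's standard `urllib.robotparser` to test whether a crawler is blocked from a site's root and then computes a token-weighted restricted share. It is not the paper's methodology: the domains, robots.txt bodies, and token counts are invented, `GPTBot` and `CCBot` serve only as example user-agent strings, and treating "blocked at `/`" as "fully restricted" is a simplifying assumption.

```python
from urllib.robotparser import RobotFileParser

BLOCK_GPTBOT = "User-agent: GPTBot\nDisallow: /\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n"

# Hypothetical corpus: domain -> (robots.txt body, tokens contributed to the corpus).
DOMAINS = {
    "news.example": (BLOCK_GPTBOT, 600),
    "blog.example": (ALLOW_ALL, 300),
    "forum.example": (BLOCK_ALL, 100),
}


def blocks_agent(robots_txt: str, agent: str) -> bool:
    """True if this robots.txt disallows the agent from the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Simplification: only the root path is checked, not every URL in the corpus.
    return not parser.can_fetch(agent, "/")


def restricted_share(domains, agent: str) -> float:
    """Fraction of corpus tokens that come from domains blocking the agent."""
    total = sum(tokens for _, tokens in domains.values())
    blocked = sum(tokens for robots_txt, tokens in domains.values()
                  if blocks_agent(robots_txt, agent))
    return blocked / total


for agent in ("GPTBot", "CCBot"):
    print(agent, f"{restricted_share(DOMAINS, agent):.0%}")
```

Weighting by tokens rather than counting domains is what lets a small number of large, actively maintained sites account for a disproportionate share of the restricted data. The two example agents also get different results here, a toy version of the uneven treatment of crawlers that the paper describes.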
The Data Provenance Initiative has become a reference point in discussions of AI transparency, copyright, and governance. Its license-correction work is used directly by practitioners checking whether datasets are safe to train on, while its findings on web-crawler restrictions have been cited in policy commentary, journalism, and legal analysis of how the open web is being enclosed.[5][7] By quantifying the gap between documented and actual dataset provenance, and by releasing the corrected metadata and tooling openly, the initiative has helped shift dataset documentation from a voluntary, per-paper practice toward systematic, auditable infrastructure.
The project is also notable for its collaborative model: it brings together machine learning researchers and legal scholars under a volunteer structure coordinated by the MIT Media Lab, with contributors from organizations including Cohere For AI and EleutherAI, alongside many universities and companies.[3][5] Its central thesis, that responsible foundation-model development requires verifiable data provenance and consent, connects the technical question of dataset documentation to the broader legal and ethical debates over how AI systems are built.