OpenAI ChatGPT logs discovery dispute
Last reviewed
Jun 3, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,481 words
The OpenAI ChatGPT logs discovery dispute is a fight over user privacy inside the consolidated copyright litigation that news publishers and authors have brought against OpenAI. At the center of it is a single number: 20 million. In late 2025 a federal magistrate judge ordered OpenAI to hand over a sample of 20 million de-identified ChatGPT conversation logs to the plaintiffs, and in January 2026 a district judge affirmed that order. The decision turned what had been an abstract argument about generative AI training into a concrete question that millions of ChatGPT users never expected to face: could conversations they had with a chatbot end up in the hands of opposing lawyers in a lawsuit they have nothing to do with?
The dispute sits inside In re: OpenAI, Inc. Copyright Infringement Litigation, the multidistrict litigation pending in the U.S. District Court for the Southern District of New York under case number 1:25-md-03143.[1][2] That MDL gathers more than a dozen suits, the most prominent being the one filed by The New York Times. For the broader copyright claims, see New York Times v. OpenAI.
The discovery fight grew out of an earlier preservation order. In May 2025, Magistrate Judge Ona T. Wang directed OpenAI to "retain and segregate all output log data that would otherwise be deleted," including conversations users had already deleted and chats from temporary sessions.[3][4] The plaintiffs wanted those outputs preserved so they could later search for instances where the model reproduced their copyrighted articles. OpenAI fought the order hard in public. Chief Operating Officer Brad Lightcap called the requirement "inappropriate," and CEO Sam Altman used it to argue for what he called "AI privilege," the idea that conversations with an AI system should get protection similar to talking to a doctor or a lawyer.[4] After months of litigation, OpenAI's obligation to indefinitely retain consumer ChatGPT and API content ended in late September 2025, though it still had to keep a defined set of historical logs for the case.[3]
With logs preserved, the parties turned to how many the plaintiffs could actually inspect. In July 2025 the news plaintiffs moved to compel a sample of 120 million logs.[1][2] OpenAI opposed that and countered with 20 million, which it described as more than enough, and the publishers accepted the smaller figure while reserving the right to ask for more later. Twenty million represents roughly 0.5 percent of OpenAI's preserved logs.[2] Then in October 2025 OpenAI changed course. Rather than produce the full 20 million de-identified sample it had offered, it proposed running keyword searches and producing only the conversations that surfaced the plaintiffs' specific works.[1][2]
Magistrate Judge Wang rejected OpenAI's narrower proposal. On November 7, 2025 she granted the motion to compel production of the entire 20 million-log sample, and in December she denied OpenAI's request for reconsideration.[2][5] Her reasoning on relevance went to the heart of the case: output logs matter even when they do not contain the plaintiffs' articles, because the full range of what ChatGPT produces bears on OpenAI's fair use defense and on whether the model's outputs substitute for the original journalism.[2][5] Cherry-picking only the conversations that quote the Times, in other words, would give an incomplete picture of how the product actually behaves.
OpenAI appealed to the district court. On January 5, 2026, U.S. District Judge Sidney H. Stein affirmed Wang's orders, agreeing that she had adequately weighed the privacy interests against the relevance of the material.[1][2][6] Coverage of the ruling appeared across legal and technology press in early January.[6][7] The affirmance left the 20 million-log production in place.
OpenAI's central objection was about its users, not itself. The company argued that producing tens of millions of "irrelevant personal user conversations" would invade the privacy of people "who have no role, voice, or stake in these proceedings."[5] ChatGPT conversations, OpenAI and outside observers noted, often contain things people would never put in an email or a search box: medical worries, financial details, confidential business plans, personal confessions.
Judge Wang found those interests real but adequately protected by three safeguards, and Judge Stein agreed.[2][5] First, the sheer reduction in scale, from tens of billions of logs down to 20 million. Second, OpenAI's de-identification process, which the company carried out using a custom tool meant to strip out personally identifiable information before production.[2] Third, the standing protective order in the case, which restricts how the plaintiffs and their lawyers can use and share the material.
The court also drew a line between this situation and wiretap cases. Distinguishing SEC v. Rajaratnam, which involved secretly recorded and potentially illegal wiretaps, the court reasoned that ChatGPT users "voluntarily provided their data to OpenAI as part of ordinary platform usage."[1][6] That voluntariness, in the court's view, weakened the privacy claim. Some commentators dubbed the framework the "Stein standard" and questioned whether clicking through a consumer terms-of-service page is really the kind of voluntary, informed waiver that should expose private conversations to discovery.[6]
The hardest objection is that de-identification may not deliver real anonymity. Conversational text is unusually difficult to scrub, because the identifying details are woven into the substance of what people write rather than sitting in a neat metadata field. Legal analysts pointed to prior incidents to make the point. One review of roughly 1,000 leaked ChatGPT conversations reportedly found multiple chats that explicitly stated full names, addresses, and identification numbers, and an analysis of about 47,000 accidentally exposed logs surfaced email addresses, phone numbers, and intimate personal details that could re-identify individuals even with names removed.[5] Removing the obvious identifiers does not necessarily remove the ability to figure out who wrote something, especially when a conversation describes a specific job, a specific medical history, or a specific dispute.
This is the gap privacy advocates have focused on. The protective order constrains the lawyers in the case, but it cannot un-write the sensitive content inside the logs, and it cannot guarantee that the de-identification is irreversible. Techniques like differential privacy offer formal guarantees for aggregate statistics, but they are not what is happening here; this is raw conversational text passed through a redaction tool. For users, the practical lesson many writers drew was blunt: treat anything typed into a chatbot as potentially discoverable, because a large language model provider can be compelled to retain and produce it.
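The limits of pattern-based scrubbing described above can be illustrated with a minimal sketch. It is not OpenAI's de-identification tool, whose design is not described in the sources cited here; the patterns, the `redact` function, and the sample text are all hypothetical and chosen only to show the general problem. Structured identifiers such as email addresses and phone numbers are easy to match, while identifying context written in ordinary prose passes through untouched.

```python
import re

# Hypothetical patterns for illustration only; not the tool used in the case.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a bracketed placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


# Invented sample conversation text (not drawn from any real log).
sample = (
    "Email me at jane.doe@example.com or call 555-867-5309. "
    "I am the only pediatric cardiologist in a town of 900 people "
    "and I am being sued by my former clinic."
)

redacted = redact(sample)
print(redacted)
```

In this invented example the email address and phone number are replaced, but the description of a specific job, town size, and dispute survives redaction, which is the kind of residual detail that could still narrow the author down to one person.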
The logs dispute is one front in the larger war over whether training AI on copyrighted journalism is lawful. The New York Times sued OpenAI and Microsoft in December 2023, alleging that they copied millions of Times articles to build their language models. In March 2025, Judge Stein allowed the core copyright infringement claims to proceed past a motion to dismiss, which kept the case alive and made the discovery stakes much higher.[8] Whether ChatGPT's outputs compete with or substitute for the originals is central to OpenAI's fair use defense, and the 20 million logs are the plaintiffs' attempt to build an evidentiary record on exactly that point. The outcome will feed into the wider debate over AI copyright and the use of training data scraped from the open web.
The figure that has been consistently reported and confirmed across reputable coverage is 20 million de-identified logs, ordered in November 2025 and affirmed in January 2026.[1][2][6] Some secondary discussion has floated much larger totals, on the order of tens of millions more or roughly 100 million logs, sometimes attributed to additional 2026 orders. Those larger figures could not be substantiated in primary or reputable reporting as of June 2026, and at least one widely cited "78" figure appears to trace to ChatGPT's market share rather than to any production order. The verified record describes a single 20 million-log sample. Readers should treat claims of much larger court-ordered productions with caution until they are confirmed by the docket or by reliable reporting.