Model stealing

Updated Jul 12, 2026

Model stealing (also known as model extraction, model functionality extraction, or model theft) is an adversarial machine learning attack in which an adversary queries a black-box model through its prediction interface and uses the input/output pairs to construct a surrogate model that approximates the target's functionality, parameters, or behavior.^[1] The objective varies from copying decision boundaries cheaply (functionality stealing) to recovering exact weight matrices (parameter extraction) or even recovering data the target was trained on (training-data extraction).^[2] The threat was formalized by Tramèr and colleagues in their 2016 USENIX paper "Stealing Machine Learning Models via Prediction APIs", which demonstrated near-perfect extraction of commercial models hosted by BigML and Amazon Machine Learning.^[1] Interest in the topic surged again after Carlini and collaborators showed in 2024 that they could recover the embedding projection layer of OpenAI's Ada and Babbage production language models for under twenty US dollars in API queries, confirming that the two models have hidden dimensions of 1024 and 2048 respectively.^[2]

How is model stealing defined and categorized?

Model stealing belongs to the broader field of adversarial attacks against machine learning systems, alongside evasion, poisoning, model inversion, and membership inference. The umbrella term "model extraction" is often used as a synonym, although some authors reserve "extraction" for the precise recovery of internal parameters and use "stealing" or "functionality stealing" when the attacker only cares about behavioral imitation.^[3]

Jagielski and colleagues offered an influential taxonomy in 2020 that categorizes attacks along two axes.^[3] The first axis is accuracy, which the authors defined as "performing well on the underlying learning task" (for example, image classification accuracy on a held-out test set).^[3] The second axis is fidelity, defined as "matching the predictions of the remote victim classifier on any input", including inputs drawn from off-distribution data.^[3] At the fidelity extreme sits functionally-equivalent extraction, where the surrogate and the target return identical outputs on every possible input.^[3]
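The two axes can be made concrete with a small sketch. The toy classifiers, the inputs, and the function names below are hypothetical and chosen only for illustration; the point is that accuracy is measured against ground-truth labels while fidelity is measured against the victim's own predictions.

```python
def accuracy(model, inputs, labels):
    """Fraction of inputs on which the model matches the ground-truth label."""
    return sum(model(x) == y for x, y in zip(inputs, labels)) / len(inputs)

def fidelity(surrogate, victim, inputs):
    """Fraction of inputs on which the surrogate matches the victim's prediction."""
    return sum(surrogate(x) == victim(x) for x in inputs) / len(inputs)

# Hypothetical one-dimensional threshold classifiers.
victim = lambda x: int(x >= 5)
surrogate = lambda x: int(x >= 6)

inputs = list(range(10))
labels = [int(x >= 4) for x in inputs]   # assumed ground truth

victim_acc = accuracy(victim, inputs, labels)
surrogate_acc = accuracy(surrogate, inputs, labels)
surrogate_fid = fidelity(surrogate, victim, inputs)
```

In this toy setting the surrogate disagrees with the victim on one input and with the ground truth on two, so a copy can score differently on the two axes even when both numbers are high.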

A useful three-part taxonomy of model stealing has emerged from this body of work:

Category	Goal	Typical methods	Example reference
Functionality stealing	Build a surrogate that performs the same task with acceptable accuracy	Query-and-train, knowledge distillation, surrogate dataset selection	Orekondy et al. 2019 (Knockoff Nets)^[4]
Parameter extraction	Recover specific weight matrices or layers of the target	Equation-solving, cryptanalytic differential attacks, softmax-bottleneck attacks	Tramèr 2016^[1]; Carlini, Jagielski, Mironov 2020^[5]; Carlini et al. 2024^[2]
Training-data extraction	Recover individual training examples memorized by the model	Memorization probing, divergence attacks, membership inference	Carlini et al. 2021^[6]; Nasr et al. 2023^[7]

The boundaries between these categories are porous. A high-fidelity functionally-equivalent extractor is by construction also a near-perfect functionality stealer, and any attack capable of leaking parameters can in some cases be combined with model inversion to leak training data.^[3]

How did early model extraction attacks work (2016 to 2020)?

Equation-solving attacks: Tramèr et al. 2016

The 2016 USENIX Security paper by Florian Tramèr, Fan Zhang, Ari Juels, Michael K. Reiter, and Thomas Ristenpart laid the modern foundation of model extraction.^[1] The authors examined commercial prediction APIs of the era, among them BigML, Amazon Machine Learning, and PredictionIO, which exposed simple supervised models such as logistic regression, decision trees, and small neural networks behind prediction endpoints that returned class probabilities.^[1]

The attack relied on a deceptively elementary observation. For a logistic regression of the form p(y=1|x) = sigmoid(w*x + b), each returned probability p can be inverted through the logit function, log(p / (1 - p)), to recover the linear combination w*x + b. Each query therefore yields one linear equation in the d+1 unknowns (w, b), and after d+1 queries on inputs in general position the attacker can solve the resulting linear system for w and b directly. The paper labeled this an equation-solving attack because the API behaved like an oracle that hands out equations in the model's unknown parameters.^[1]
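A minimal sketch of the equation-solving idea for binary logistic regression follows, assuming NumPy is available. The victim here is a locally simulated model, and the names make_victim and extract_logistic_regression are illustrative rather than taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_victim(w, b):
    """Simulated black-box API: exposes only p(y=1|x), never w or b."""
    def predict_proba(x):
        return sigmoid(x @ w + b)
    return predict_proba

def extract_logistic_regression(predict_proba, d, rng):
    """Recover (w, b) from d+1 probability queries."""
    X = rng.normal(size=(d + 1, d))            # random queries, in general position
    p = np.array([predict_proba(x) for x in X])
    logits = np.log(p / (1.0 - p))             # invert the sigmoid: w.x + b
    A = np.hstack([X, np.ones((d + 1, 1))])    # one row [x, 1] per query
    theta = np.linalg.solve(A, logits)         # d+1 equations, d+1 unknowns
    return theta[:-1], theta[-1]

rng = np.random.default_rng(0)
d = 5
w_true = rng.normal(size=d)
b_true = 0.3
victim = make_victim(w_true, b_true)

w_hat, b_hat = extract_logistic_regression(victim, d, rng)
```

The sketch assumes the API returns probabilities at full floating-point precision; rounding or noise in the returned values would turn the exact solve into a least-squares fit over more queries.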

For decision trees the team developed a path-finding attack that walked the tree by issuing carefully chosen queries; for shallow neural networks they used iterative retraining with the victim as supervisor. Against the services they attacked, principally BigML and Amazon Machine Learning, the resulting surrogates matched the original models with what the authors described as "near-perfect fidelity for popular model classes", and in several cases the extraction required fewer than a few thousand queries.^[1] The paper also showed that simply removing confidence values from API responses did not stop the attack: as the authors put it, "the natural countermeasure of omitting confidence values from model outputs still admits potentially harmful model extraction attacks."^[1]

Knockoff Nets: stealing on the cheap

In 2019, Tribhuvanesh Orekondy, Bernt Schiele, and Mario Fritz introduced Knockoff Nets, which targeted deep image classifiers exposed as commercial APIs.^[4] Their adversary knew nothing about the victim's architecture, training data, or label semantics. The attack proceeded in two steps: query the victim with random images drawn from a different distribution (for instance, ImageNet images against a fine-grained bird classifier), then train a knockoff network on the resulting image-prediction pairs.^[4] The authors reported that they could "create a reasonable knockoff for as little as $30" of queries against a popular commercial image analysis API, foreshadowing the order-of-magnitude economics that would define later large language model attacks.^[4]
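The two-step recipe can be sketched on a toy problem, assuming NumPy. The victim below is a hypothetical linear softmax classifier standing in for a deep image model, and the query distribution, step size, and iteration count are arbitrary illustrative choices rather than values from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)       # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
d, n_classes = 10, 3

# Hypothetical victim: secret weights hidden behind a probability API.
W_secret = 0.5 * rng.normal(size=(d, n_classes))

def victim_api(X):
    return softmax(X @ W_secret)

# Step 1: query the victim with inputs drawn from an unrelated distribution.
X_query = rng.uniform(-2.0, 2.0, size=(2000, d))
P_victim = victim_api(X_query)

# Step 2: train the knockoff on the victim's soft labels
# (plain gradient descent on the cross-entropy loss).
W_knock = np.zeros((d, n_classes))
for _ in range(500):
    P_knock = softmax(X_query @ W_knock)
    grad = X_query.T @ (P_knock - P_victim) / len(X_query)
    W_knock -= 0.5 * grad

# Measure label agreement on fresh inputs the attacker never queried.
X_test = rng.normal(size=(1000, d))
agreement = np.mean(
    victim_api(X_test).argmax(axis=1) == softmax(X_test @ W_knock).argmax(axis=1)
)
```

Because the toy surrogate has the same functional form as the victim, agreement is high here; against a real deep classifier the attacker must also guess a suitable architecture, which is part of what the paper studies.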

Functionally-equivalent extraction: Jagielski et al. 2020

Matthew Jagielski, Nicholas Carlini, David Berthelot, Alex Kurakin, and Nicolas Papernot published "High Accuracy and High Fidelity Extraction of Neural Networks" at USENIX Security 2020.^[3] The paper drew the accuracy/fidelity distinction described above, then introduced two complementary attacks. The learning-based attack used the victim model to supervise the training of a copy, optimized for accuracy on the underlying task. The functionally-equivalent attack went further, directly recovering weights of a two-layer fully connected ReLU network such that the copy and the victim produced identical predictions on every possible input; the authors described it as "the first practical functionally-equivalent extraction attack for direct extraction (i.e., without training) of a model's weights."^[3] The team also reported extraction experiments against "a state-of-the-art image classifier trained with 1 billion proprietary images."^[3]

Cryptanalytic extraction

Also in 2020, Carlini, Jagielski, and Ilya Mironov presented "Cryptanalytic Extraction of Neural Network Models" at CRYPTO, arguing that "the machine learning problem of model extraction is actually a cryptanalytic problem in disguise, and should be studied as such."^[5] Because ReLU networks are exactly piecewise linear, queries placed near the kinks (the inputs where ReLU units switch from off to on) leak structural information about the weights. The authors used a differential attack that combined signature extraction (recovering the absolute values of weights in each layer) with sign extraction (recovering their signs), reporting extracted models that were "2^20 times more precise" (over a million-fold) while using "100x fewer queries" than prior work.^[5] They extracted a 100,000-parameter neural network trained on the MNIST digit recognition task with roughly 2^21.5 queries "in under an hour", such that the copy agreed with the oracle on all inputs up to a worst-case error of 2^-25.^[5] The sign-extraction step remained exponential in the worst case, leaving complete extraction of large networks out of reach.
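The role of the kinks can be illustrated with a simplified search, assuming NumPy. This is not the paper's algorithm: it only locates the points along one line through input space where a tiny, hypothetical ReLU network changes slope, using a midpoint linearity test that could miss a kink lying extremely close to a bisection point.

```python
import numpy as np

# Hypothetical tiny victim: 2 inputs, 3 hidden ReLU units, 1 output.
W1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b1 = np.array([-0.3, 0.45, -1.1])
w2 = np.array([1.0, -2.0, 0.5])
b2 = 0.1

def victim(x):
    return float(w2 @ np.maximum(W1 @ x + b1, 0.0) + b2)

x0 = np.array([0.0, 0.0])
direction = np.array([1.0, 0.5])

def f(t):
    """Victim output along the line x0 + t * direction (piecewise linear in t)."""
    return victim(x0 + t * direction)

def find_kinks(f, lo, hi, width_tol=1e-6, dev_tol=1e-10):
    """Recursively bisect [lo, hi], keeping only sub-intervals where f is not linear."""
    mid = 0.5 * (lo + hi)
    # On a linear segment the midpoint value equals the average of the endpoints.
    deviation = abs(f(mid) - 0.5 * (f(lo) + f(hi)))
    if deviation < dev_tol:
        return []
    if hi - lo < width_tol:
        return [mid]
    return (find_kinks(f, lo, mid, width_tol, dev_tol)
            + find_kinks(f, mid, hi, width_tol, dev_tol))

kinks = find_kinks(f, -2.0, 2.0)

# Ground truth, available only because the weights are known in this toy setup:
# unit i switches where W1[i] . (x0 + t * direction) + b1[i] = 0.
true_kinks = sorted(-(W1 @ x0 + b1) / (W1 @ direction))
```

Each recovered kink marks an input at which one hidden unit has a pre-activation of exactly zero, which is the kind of structural information the differential attack builds on.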

Imitation attacks on NLP

In 2020, Eric Wallace, Mitchell Stern, and Dawn Song demonstrated imitation attacks on production machine translation systems.^[8] By querying production translation systems with monolingual sentences and training on the returned translations, they built imitation models that came within roughly 0.6 BLEU of three production systems on both high-resource and low-resource language pairs. The work showed that high-quality model stealing was already feasible for sequence-to-sequence neural networks before the era of public LLM APIs.^[8]

Data-free model extraction

Truong, Maini, Walls, and Papernot proposed Data-Free Model Extraction at CVPR 2021, removing the assumption that the attacker has access to any surrogate data resembling the victim's domain.^[9] The method adapted ideas from data-free knowledge distillation, using a generator network trained adversarially to produce synthetic queries that maximize the disagreement between attacker and victim, then training the attacker on the resulting query-label pairs. The technique made model stealing more dangerous in domains where data is scarce or expensive (for example, medical imaging) because the attacker no longer needed any in-distribution dataset to mount an attack.^[9]

How does model stealing target large language models?

The shift from small classifiers to billion-parameter generative pretrained transformers changed the threat landscape. Modern LLMs are too large to extract weight-by-weight with classical techniques, but their public APIs leak structured information that more focused attacks can exploit.

Stealing Part of a Production Language Model (Carlini et al. 2024)

The landmark study of LLM-era model stealing is "Stealing Part of a Production Language Model" by Nicholas Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A. Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Itay Yona, Eric Wallace, David Rolnick, and Florian Tramèr, posted to arXiv in March 2024 and presented at ICML 2024.^[2] As the paper states, "We introduce the first model-stealing attack that extracts precise, nontrivial information from black-box production language models".^[2]

The targets were OpenAI's Ada, Babbage, and gpt-3.5-turbo, and Google's PaLM-2.^[2] Rather than trying to copy the entire model, the team aimed at the embedding projection layer, the final linear map from the hidden state to the output vocabulary distribution; the attack "recovers the embedding projection layer (up to symmetries) of a transformer model, given typical API access."^[2] It exploited a structural property of transformer LLMs called the softmax bottleneck: because the final layer maps an h-dimensional hidden state to a vocabulary distribution through a linear projection, all output logits for a given model lie on an h-dimensional linear subspace of the much higher-dimensional logit space. By collecting enough output samples and inferring this low-rank subspace, the attacker can recover both the hidden size h and (up to symmetry) the projection matrix itself.^[2]
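The low-rank argument can be checked numerically on a toy model, assuming NumPy. The random projection matrix and random hidden states below are stand-ins for a real transformer, and counting singular values above a relative threshold is a simplification of the gap-based estimate described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, vocab_size, n_queries = 16, 200, 64

# Secret projection from hidden state to logits (unknown to the attacker).
W = rng.normal(size=(vocab_size, hidden_dim))

def query_logits(n):
    """Simulated API: full logit vectors for n different prompts."""
    H = rng.normal(size=(n, hidden_dim))       # stand-in for final hidden states
    return H @ W.T                             # shape (n, vocab_size)

Q = query_logits(n_queries)

# Every logit vector lies in the column space of W, so Q has rank hidden_dim.
_, s, Vt = np.linalg.svd(Q, full_matrices=False)
h_est = int(np.sum(s > 1e-8 * s[0]))

# The top right-singular vectors span the same subspace as the columns of W,
# which pins down W only up to an unknown h x h linear transformation.
basis = Vt[:h_est].T                           # shape (vocab_size, h_est)
residual = W - basis @ (basis.T @ W)
relative_residual = np.linalg.norm(residual) / np.linalg.norm(W)
```

The attacker needs more queries than the hidden dimension for the rank to become visible, which is why the number of queries (64 here) is chosen larger than the assumed hidden size.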

The reported results were striking: for under twenty US dollars in API spending, the team extracted the entire projection matrix for OpenAI's Ada and Babbage models, confirming for the first time that those models had hidden dimensions of 1024 and 2048 respectively.^[2] They also recovered the hidden dimension of gpt-3.5-turbo and estimated that full extraction of its projection matrix would cost under two thousand US dollars in queries.^[2] The team followed responsible disclosure: the initial proof of concept was implemented in November 2023, disclosed to vendors in December 2023, and the public paper appeared after the standard 90-day window. Google introduced mitigations during this window, and OpenAI followed shortly afterwards on March 3, 2024.^[10]

Logits leak proprietary information (Finlayson et al. 2024)

In parallel, Matthew Finlayson, Xiang Ren, and Swabha Swayamdipta published "Logits of API-Protected LLMs Leak Proprietary Information" in March 2024.^[11] Their work also exploited the softmax bottleneck but focused on what an outsider can learn rather than on weight extraction. With only a conservative architectural assumption, the authors showed that queries "costing under $1000 USD for OpenAI's gpt-3.5-turbo" were enough to estimate its embedding dimension "to be about 4096", to audit when the underlying model had silently changed, to identify which model produced a given output, and to recover full-vocabulary logits even when the API exposed only top-k log probabilities.^[11]

Logit-only attacks and bypassing top-k constraints

A common defensive measure against parameter extraction is to limit how much information the API reveals about output distributions: rather than emitting full logits, providers typically expose only the top-k log probabilities, commonly with k=5. Both Carlini et al. and Finlayson et al. demonstrated that this defense is incomplete.^[2]^[11] By repeatedly biasing the logits of specific tokens through the logit_bias parameter and observing how the returned top-k log probabilities change, an attacker can reconstruct the entire output distribution token by token. The Carlini paper proposed multiple defenses including replacing logit bias with a token block-list, removing logit bias entirely, and architectural changes to break the low-rank structure of the projection layer.^[2]
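A minimal sketch of the simplest, one-token-per-query variant of this idea is shown below. The API is simulated locally, and make_api and recover_relative_logits are hypothetical names. The attacker pushes each token into the top-k with a large bias and reads off its logit relative to a fixed reference token.

```python
import math
import random

def make_api(logits, k=5):
    """Simulated endpoint: returns only the top-k log probabilities after applying logit_bias."""
    def top_k_logprobs(logit_bias=None):
        biased = list(logits)
        for token, bias in (logit_bias or {}).items():
            biased[token] += bias
        m = max(biased)
        log_z = m + math.log(sum(math.exp(v - m) for v in biased))
        ranked = sorted(range(len(biased)), key=lambda t: biased[t], reverse=True)[:k]
        return {t: biased[t] - log_z for t in ranked}
    return top_k_logprobs

def recover_relative_logits(api, vocab_size, bias=100.0):
    """Recover every logit relative to the unbiased top token."""
    base = api()
    ref = max(base, key=base.get)              # reference token: most likely one
    relative = [0.0] * vocab_size
    for token in range(vocab_size):
        if token == ref:
            continue
        out = api({token: bias})               # token is now rank 1, ref is rank 2
        # (z_token + bias) - z_ref = out[token] - out[ref]; the normalizer cancels.
        relative[token] = out[token] - out[ref] - bias
    return ref, relative

rng = random.Random(0)
vocab_size = 50
secret_logits = [rng.uniform(-5.0, 5.0) for _ in range(vocab_size)]
api = make_api(secret_logits, k=5)

ref, relative = recover_relative_logits(api, vocab_size)
expected = [z - secret_logits[ref] for z in secret_logits]
```

The sketch assumes the bias is larger than the spread of the logits so that the biased token and the reference both land in the top-k; the published attacks use more query-efficient variants of the same principle.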

What is training-data extraction?

Training-data extraction is closely related to model stealing but recovers training examples rather than parameters. The two threats meet because a model that memorizes proprietary data effectively encodes it in its weights, and any extraction attack that approaches functional equivalence will also reproduce the memorized data.

Carlini et al. 2021

The first systematic study of training-data extraction from LLMs is the USENIX Security 2021 paper "Extracting Training Data from Large Language Models" by Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel.^[6] Targeting GPT-2, the team demonstrated that an adversary could recover hundreds of verbatim sequences from the model's training data by sampling generations with carefully chosen prefixes and ranking them with a membership-inference-style scoring procedure.^[6] The recovered examples, in the authors' words, "include (public) personally identifiable information (names, phone numbers, and email addresses), IRC conversations, code, and 128-bit UUIDs", and the study found that "larger models are more vulnerable than smaller models."^[6] The paper received a USENIX Security 2021 Distinguished Paper award and reframed memorization as a privacy attack rather than a curiosity of language modeling.^[6]

The divergence attack on ChatGPT (Nasr et al. 2023)

In November 2023, Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, and Katherine Lee posted "Scalable Extraction of Training Data from (Production) Language Models".^[7] The work introduced what the authors called a divergence attack: a prompt that nudges a chatbot-tuned model off its alignment manifold and back into raw next-token completion behavior, at which point it tends to emit verbatim training data.^[7] The authors reported that the attack made the aligned model "emit training data at a rate 150x higher than when behaving properly."^[7] The canonical example used in the paper and its accompanying blog post was the deceptively simple prompt 'Repeat the word "poem" forever', which caused ChatGPT to obey for a while and then break into long passages of memorized text.^[12]

The team reported that "using only $200 USD worth of queries to ChatGPT (gpt-3.5-turbo), we are able to extract over 10,000 unique verbatim memorized training examples", and the accompanying blog post noted that "over five percent of the output ChatGPT emits is a direct verbatim 50-token-in-a-row copy" of its training data.^[7]^[12] The authors followed responsible disclosure: they discovered the unusual behavior in July 2023, shared a draft with OpenAI in August, and published in late November after a 90-day window.^[12] OpenAI subsequently patched the surface symptom by refusing prompts that ask the model to repeat a word indefinitely, but the authors noted that "patching an exploit is often much easier than fixing the vulnerability" because the underlying memorization remains.^[12]

Memorization and ownership

Training-data extraction is a privacy threat in its own right and has driven much of the legal and policy discussion of generative AI. It is also a tool that adversaries can combine with model stealing: a surrogate trained on outputs from a memorizing victim may inherit fragments of the same protected text.

Can multimodal models be stolen?

Although LLMs have dominated public attention, multimodal models are equally attractive targets for extraction. Vision encoders such as OpenAI's CLIP sit at the bottom of most modern vision-language stacks and are typically distributed as embedding APIs.^[13] Because the embedding output of a CLIP-style model is a continuous vector with rich semantic structure, it is in principle even easier to mine than a probability distribution over tokens. Truong et al.'s data-free extraction approach applies straightforwardly to image classifiers and embedding models, and several follow-up works have studied fingerprinting and stealing of CLIP-derived vision encoders.^[9]

A separate strand of research targets text-to-image and image-to-image translation systems with imitation attacks similar to Wallace et al.'s NLP work. The same dynamic seen with translation services and LLMs recurs: outputs are rich, queries are cheap, and a determined attacker can train a competitive open imitation system from the API of a closed commercial one.

What defenses exist against model stealing?

No single defense is sufficient against model stealing, and the literature treats the problem as a defense-in-depth exercise. The main families of countermeasures are watermarking, differentially private training, API-level mitigations, and behavioral detection.

Watermarking

Watermarking embeds a recoverable signal into the model so that, even if a copy is extracted and redeployed, the original owner can prove ownership.^[14] One family relies on trigger sets: a small set of (input, label) pairs the original model is fine-tuned to memorize, where the inputs are chosen so that no honest model would assign that label. Demonstrating the trigger behavior on a suspected copy then proves the copy was distilled from the original.^[14] A more aggressive design is DAWN ("Dynamic Adversarial Watermarking of Neural Networks") by Szyller et al., which watermarks at the API layer by deliberately answering a small fraction of queries with poisoned labels so that any model trained on the API responses inherits the watermark.^[15]
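Trigger-set verification can be framed as a simple statistical test. The sketch below is an illustration under stated assumptions, not a procedure taken from the cited survey: it assumes an unrelated model hits each trigger label independently with probability 1/num_classes, and the models, trigger set, and function names are hypothetical.

```python
from math import comb

def watermark_p_value(matches, n_triggers, chance):
    """P(X >= matches) for X ~ Binomial(n_triggers, chance)."""
    return sum(
        comb(n_triggers, j) * chance**j * (1.0 - chance) ** (n_triggers - j)
        for j in range(matches, n_triggers + 1)
    )

def verify_watermark(model, trigger_set, num_classes, alpha=1e-6):
    """Count trigger hits and test them against the chance-level null hypothesis."""
    matches = sum(1 for x, y in trigger_set if model(x) == y)
    p_value = watermark_p_value(matches, len(trigger_set), 1.0 / num_classes)
    return matches, p_value, p_value < alpha

# Hypothetical trigger set: 50 inputs with deliberately arbitrary labels.
trigger_set = [(i, (7 * i + 3) % 10) for i in range(50)]

suspected_copy = lambda x: (7 * x + 3) % 10    # reproduces the trigger behavior
independent_model = lambda x: x % 10           # unrelated model

copy_result = verify_watermark(suspected_copy, trigger_set, num_classes=10)
independent_result = verify_watermark(independent_model, trigger_set, num_classes=10)
```

A small p-value is evidence that the trigger behavior did not arise by chance; it does not by itself rule out other explanations such as a shared upstream model.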

Watermarking is a passive defense: it cannot stop a model from being stolen, only help establish ownership after the fact. Public-key style watermarking schemes integrated with AI watermarking efforts such as SynthID are an active research direction.

Differential privacy

Training with differential privacy (typically via DP-SGD) bounds the influence any single training example can have on the final weights, which limits both training-data extraction and certain forms of membership inference.^[6] Differentially private training does not directly stop functionality stealing, since the model's input-output behavior remains observable, but it reduces the value of what can be stolen by ensuring that memorization of individual examples is provably small.^[6]
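The mechanism behind that bound is per-example gradient clipping followed by calibrated Gaussian noise. A minimal sketch of one DP-SGD step for logistic regression follows, assuming NumPy; the hyperparameter values are arbitrary, and the sketch omits the privacy accounting that a real implementation needs in order to report an (epsilon, delta) guarantee.

```python
import numpy as np

def clip_per_example(grads, clip_norm):
    """Scale each row so its L2 norm is at most clip_norm."""
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    return grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))

def dp_sgd_step(w, X, y, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update: clip each example's gradient, sum, add noise, average."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    per_example_grads = (p - y)[:, None] * X   # logistic-loss gradient per example
    clipped = clip_per_example(per_example_grads, clip_norm)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
    return w - lr * (clipped.sum(axis=0) + noise) / len(X)

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4)) * 10.0            # large features, so clipping is active
y = rng.integers(0, 2, size=32).astype(float)
w0 = np.zeros(4)

w1 = dp_sgd_step(w0, X, y, lr=0.1, clip_norm=1.0, noise_multiplier=1.1, rng=rng)

# With the noise switched off, no step can move the weights further than lr * clip_norm.
w_noise_free = dp_sgd_step(w0, X, y, lr=0.1, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
```

Clipping caps how far any one example can move the weights in a single step, and the noise masks whatever contribution remains.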

API design

API design is the cheapest and most widely deployed family of defenses:

Output coarsening: returning only top-k predictions, only the argmax label, or rounded probabilities reduces the information leaked per query. Tramèr et al. showed in 2016 that even with confidence values removed an attacker can still extract a model, just at higher query cost.^[1] Finlayson et al. and Carlini et al. showed in 2024 that top-k log probabilities are still enough to reconstruct full logit vectors via logit-bias manipulation.^[2]^[11]
Removing logit bias: after the 2024 disclosures both OpenAI and Google adjusted their APIs to limit the use of logit_bias in ways that block the projection-layer attack.^[10]
Rate limiting and query budgets: capping the number of queries per account makes large-scale extraction more expensive and easier to detect.
Output randomization: injecting noise into returned probabilities raises the query cost of equation-solving attacks at the price of slightly reducing prediction quality.
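Two of the mitigations listed above, output coarsening and per-client query budgets, can be sketched in a few lines. The function and class names and the limits are hypothetical; a production service would also need persistence, time windows, and defenses against attackers who spread queries across many accounts.

```python
from collections import defaultdict

def coarsen(probs, k=3, decimals=2):
    """Return only the top-k classes, with probabilities rounded to a few decimals."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return {label: round(p, decimals) for label, p in top}

class QueryBudget:
    """Cap the total number of queries each client may issue."""

    def __init__(self, limit):
        self.limit = limit
        self.used = defaultdict(int)

    def allow(self, client_id):
        if self.used[client_id] >= self.limit:
            return False
        self.used[client_id] += 1
        return True

raw = {"cat": 0.61234, "dog": 0.30111, "fox": 0.05, "owl": 0.03655}
public = coarsen(raw, k=2, decimals=1)

budget = QueryBudget(limit=2)
decisions = [budget.allow("client-a") for _ in range(3)]
```

As the list above notes, coarsening raises the attacker's query cost rather than eliminating leakage, so it is usually combined with budgets and monitoring.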

Detection

Defenders can also try to detect extraction attacks while they are happening. PRADA ("Protecting Against DNN Model Stealing Attacks") by Juuti, Szyller, Marchal, and Asokan examines the statistical distribution of distances between successive API queries from a given client and flags clients whose distance distribution deviates from the near-normal distribution expected of benign users.^[16] The authors reported that PRADA detected all model extraction attacks known at the time of publication with no false positives.^[16] Subsequent work has extended this idea to stateful, per-client query monitoring.

Architectural defenses

The Carlini et al. 2024 paper observed that the softmax bottleneck itself is the root cause of the projection-layer attack, and suggested architectural changes that break the strict low-rank structure of the final layer.^[2] These remain an open research direction because the bottleneck is mathematically convenient and is shared by almost every modern LLM architecture.

Is model stealing illegal?

The legal status of model stealing is unsettled, and depends on jurisdiction, the precise method of extraction, and whether the extracted artifact is a near-clone or a coarser distillation.

In the United States, prediction APIs are often protected only by contractual terms of service and trade-secret doctrines, not by copyright on the model weights themselves. The Ninth Circuit's hiQ Labs, Inc. v. LinkedIn Corp. litigation is frequently cited as the closest precedent on automated access to a commercial API.^[17] hiQ scraped public LinkedIn profiles to build analytics products; LinkedIn sent cease and desist letters and revoked hiQ's access. In April 2022, after a Supreme Court remand prompted by Van Buren v. United States, the Ninth Circuit reaffirmed its earlier holding that scraping data made publicly available by a service does not violate the Computer Fraud and Abuse Act.^[17] In December 2022, the parties settled, with hiQ accepting a five hundred thousand US dollar judgment and admitting liability under California common-law torts of trespass to chattels and misappropriation.^[17] The case constrained the use of the CFAA against scraping but left open whether tort doctrines could still reach API abuse, a question directly relevant to model stealing.

OpenAI's Terms of Use prohibit users from using outputs to develop models that compete with OpenAI's services.^[18] The clause binds the original user of the API but does not automatically run with the data if the user later releases that data under a permissive license. Anthropic's usage policies contain similar restrictions. These contractual prohibitions are widely cited in industry but have not yet been adjudicated against an alleged model thief in a published US opinion.

In the European Union, the EU AI Act regulates AI systems primarily through obligations on providers and deployers rather than through a direct prohibition of model stealing. EU trade-secret law (the 2016 Trade Secrets Directive) is a more direct fit, treating exfiltrated model weights as a misappropriated trade secret if the original holder took reasonable secrecy measures.

What are notable real-world model stealing incidents?

The DeepSeek-OpenAI distillation dispute

In January 2025, the Chinese AI company DeepSeek released DeepSeek-R1, a reasoning-focused model whose published benchmark scores rivaled OpenAI's frontier reasoning models while reportedly using far fewer high-end accelerators.^[19] OpenAI and its commercial partner Microsoft publicly stated that they had observed activity consistent with DeepSeek-affiliated accounts distilling outputs from ChatGPT in violation of OpenAI's Terms of Use.^[19] On February 12, 2026, OpenAI delivered a memo to the US House Select Committee on the Strategic Competition between the United States and the Chinese Communist Party, accusing DeepSeek of "free-riding on the capabilities developed by OpenAI and other US frontier labs" and alleging that DeepSeek employees had "developed methods to circumvent OpenAI's access restrictions and access models through obfuscated third-party routers" and "developed code to access US AI models and obtain outputs for distillation in programmatic ways".^[19] DeepSeek has not publicly responded to the specific factual claims. As of mid-2026 no court has adjudicated the matter, and the strongest public evidence offered consists of behavioral analyses suggesting unusually high stylistic similarity between DeepSeek outputs and ChatGPT outputs. The controversy is widely cited as the first high-profile geopolitical episode driven by the threat of model stealing.^[19]

Other allegations

Earlier and lower-profile allegations of API distillation between commercial laboratories have surfaced periodically, particularly around the launch of competitor LLMs. Few have been substantiated publicly, and the broader concern is that distillation from production APIs is now economically rational for many would-be entrants regardless of contractual prohibitions.

What attacks are related to model stealing?

Model stealing sits alongside several other privacy and security attacks against deployed machine learning:

Model inversion reconstructs representative inputs of a target class from confidence outputs of a classifier. Fredrikson et al.'s 2015 attack famously reconstructed recognizable faces from a face-recognition API, with crowdworkers able to re-identify individuals from a lineup with reported accuracies around 95 per cent.^[20]
Membership inference decides whether a particular example was part of the model's training set, a finer-grained privacy probe than full extraction.
Prompt extraction and system-prompt exfiltration target the textual instructions used to configure an LLM application rather than its weights. These attacks overlap conceptually with prompt injection and jailbreaking, but the goal is to recover proprietary configuration rather than to override safety policy.^[21]
Adversarial examples and evasion craft inputs that cause misclassification without modifying the model. They share an adversarial threat model with extraction attacks but pursue a different goal.

What are the open problems in model stealing?

Despite a decade of research, several limitations and open questions remain:

Scaling parameter extraction to full LLMs is still infeasible. The Carlini et al. 2024 attack recovers only the embedding projection layer of an LLM; recovering the full transformer stack of a state-of-the-art model would require either query budgets many orders of magnitude larger or fundamentally new mathematical techniques.^[2]
Distillation versus stealing is hard to demarcate legally. Distilling a competitor's outputs to train a smaller model is technically indistinguishable from many legitimate forms of knowledge distillation, which makes contract enforcement difficult and makes neutral-sounding behavior hard to audit.
Watermarks can sometimes be removed. Several papers have shown that determined adversaries can detect or strip watermarks from extracted models, especially when the watermark is encoded in trigger-set behavior on unusual inputs.^[14]
Defenses interact with utility. Output coarsening, rate limiting, and randomization all reduce the precision or convenience of the API for legitimate users, and providers must balance these costs against the marginal security gain.
Detection is brittle. Sophisticated attackers can blend extraction queries with normal traffic, mimic benign query distributions, or distribute the attack across many accounts to evade detection systems such as PRADA.^[16]

References

[1] Florian Tramèr, Fan Zhang, Ari Juels, Michael K. Reiter, Thomas Ristenpart, "Stealing Machine Learning Models via Prediction APIs", arXiv:1609.02943 / 25th USENIX Security Symposium, 2016-09-09. https://arxiv.org/abs/1609.02943. Accessed 2026-05-20.
[2] Nicholas Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A. Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Itay Yona, Eric Wallace, David Rolnick, Florian Tramèr, "Stealing Part of a Production Language Model", arXiv:2403.06634 / ICML 2024, 2024-03-11. https://arxiv.org/abs/2403.06634. Accessed 2026-05-20.
[3] Matthew Jagielski, Nicholas Carlini, David Berthelot, Alex Kurakin, Nicolas Papernot, "High Accuracy and High Fidelity Extraction of Neural Networks", arXiv:1909.01838 / USENIX Security 2020, 2019-09-03. https://arxiv.org/abs/1909.01838. Accessed 2026-05-20.
[4] Tribhuvanesh Orekondy, Bernt Schiele, Mario Fritz, "Knockoff Nets: Stealing Functionality of Black-Box Models", CVPR 2019 / arXiv:1812.02766, 2018-12-06. https://arxiv.org/abs/1812.02766. Accessed 2026-05-20.
[5] Nicholas Carlini, Matthew Jagielski, Ilya Mironov, "Cryptanalytic Extraction of Neural Network Models", arXiv:2003.04884 / CRYPTO 2020, 2020-03-10. https://arxiv.org/abs/2003.04884. Accessed 2026-07-12.
[6] Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, Colin Raffel, "Extracting Training Data from Large Language Models", 30th USENIX Security Symposium, 2021-08-11. https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting. Accessed 2026-05-20.
[7] Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, Katherine Lee, "Scalable Extraction of Training Data from (Production) Language Models", arXiv:2311.17035, 2023-11-28. https://arxiv.org/abs/2311.17035. Accessed 2026-05-20.
[8] Eric Wallace, Mitchell Stern, Dawn Song, "Imitation Attacks and Defenses for Black-box Machine Translation Systems", arXiv:2004.15015 / EMNLP 2020, 2020-04-30. https://arxiv.org/abs/2004.15015. Accessed 2026-05-20.
[9] Jean-Baptiste Truong, Pratyush Maini, Robert J. Walls, Nicolas Papernot, "Data-Free Model Extraction", CVPR 2021 / arXiv:2011.14779, 2020-11-30. https://arxiv.org/abs/2011.14779. Accessed 2026-05-20.
[10] Nicholas Carlini and co-authors, "Stealing Part of a Production Language Model" (blog summary), Not Just Memorization, 2024-03-11. https://not-just-memorization.github.io/partial-model-stealing.html. Accessed 2026-05-20.
[11] Matthew Finlayson, Xiang Ren, Swabha Swayamdipta, "Logits of API-Protected LLMs Leak Proprietary Information", arXiv:2403.09539 / COLM 2024, 2024-03-14. https://arxiv.org/abs/2403.09539. Accessed 2026-05-20.
[12] Nicholas Carlini, "Extracting Training Data from ChatGPT", Not Just Memorization (blog), 2023-11-28. https://not-just-memorization.github.io/extracting-training-data-from-chatgpt.html. Accessed 2026-05-20.
[13] Alec Radford et al., "Learning Transferable Visual Models From Natural Language Supervision (CLIP)", OpenAI / arXiv:2103.00020, 2021-02-26. https://arxiv.org/abs/2103.00020. Accessed 2026-05-20.
[14] Franziska Boenisch, "A Systematic Review on Model Watermarking for Neural Networks", Frontiers in Big Data, 2021-11-29. https://www.frontiersin.org/journals/big-data/articles/10.3389/fdata.2021.729663/full. Accessed 2026-05-20.
[15] Sebastian Szyller, Buse Gul Atli, Samuel Marchal, N. Asokan, "DAWN: Dynamic Adversarial Watermarking of Neural Networks", arXiv:1906.00830, 2019-06-03. https://arxiv.org/abs/1906.00830. Accessed 2026-05-20.
[16] Mika Juuti, Sebastian Szyller, Samuel Marchal, N. Asokan, "PRADA: Protecting Against DNN Model Stealing Attacks", arXiv:1805.02628 / IEEE EuroS&P 2019, 2018-05-07. https://arxiv.org/abs/1805.02628. Accessed 2026-05-20.
[17] United States Court of Appeals for the Ninth Circuit, "hiQ Labs, Inc. v. LinkedIn Corp." (No. 17-16783), 2022-04-18. https://cdn.ca9.uscourts.gov/datastore/opinions/2022/04/18/17-16783.pdf. Accessed 2026-05-20.
[18] OpenAI, "Terms of Use", OpenAI, 2024. https://openai.com/policies/row-terms-of-use/. Accessed 2026-05-20.
[19] Rest of World, "OpenAI accuses DeepSeek of malpractice ahead of AI launch", Rest of World, 2026-02-13. https://restofworld.org/2026/openai-deepseek-distillation-dispute-us-china/. Accessed 2026-05-20.
[20] Matt Fredrikson, Somesh Jha, Thomas Ristenpart, "Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures", 22nd ACM SIGSAC Conference on Computer and Communications Security (CCS), 2015-10-12. https://rist.tech.cornell.edu/papers/mi-ccs.pdf. Accessed 2026-05-20.
[21] Yiming Zhang, Nicholas Carlini, Daphne Ippolito, "Effective Prompt Extraction from Language Models", arXiv:2307.06865, 2023-07-13. https://arxiv.org/abs/2307.06865. Accessed 2026-05-20.
