Model stealing
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,497 words
Model stealing (also known as model extraction, model functionality extraction, or model theft) is an adversarial machine learning attack in which an adversary queries a black-box model through its prediction interface and uses the input/output pairs to construct a surrogate model that approximates the target's functionality, parameters, or behavior.[^1] The objective varies from copying decision boundaries cheaply (functionality stealing) to recovering exact weight matrices (parameter extraction) or even recovering data the target was trained on (training-data extraction).[^2] The threat was formalized by Tramèr and colleagues in their 2016 USENIX paper "Stealing Machine Learning Models via Prediction APIs", which demonstrated near-perfect extraction of commercial models hosted by BigML and Amazon Machine Learning.[^1] Interest in the topic surged again after Carlini and collaborators showed in 2024 that they could recover the embedding projection layer of OpenAI's Ada and Babbage production language models for under twenty US dollars in API queries.[^2]
Model stealing belongs to the broader field of adversarial attacks against machine learning systems, alongside evasion, poisoning, model inversion, and membership inference. The term "model extraction" is often used as a synonym, although some of the literature reserves "extraction" for the precise recovery of internal parameters and uses "stealing" or "functionality stealing" when the attacker only cares about behavioral imitation.[^3]
Jagielski and colleagues offered an influential taxonomy in 2020 that categorizes attacks along two axes.[^3] The first axis is accuracy, meaning how well the surrogate performs on the underlying learning task (for example, image classification accuracy on a held-out test set). The second axis is fidelity, meaning how closely the surrogate's outputs match those of the victim on arbitrary inputs, including inputs drawn from off-distribution data. At the fidelity extreme sits functionally-equivalent extraction, where the surrogate and the target return identical outputs on every possible input.[^3]
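The two axes can be made concrete with a small sketch. The classifiers, thresholds, and data points below are invented for illustration and are not taken from the paper; the only point is that accuracy compares the surrogate with ground truth, while fidelity compares it with the victim.

```python
def accuracy(model, inputs, true_labels):
    """Fraction of inputs on which the model matches the ground-truth label."""
    hits = sum(model(x) == y for x, y in zip(inputs, true_labels))
    return hits / len(inputs)


def fidelity(surrogate, victim, inputs):
    """Fraction of inputs on which the surrogate matches the victim's output."""
    hits = sum(surrogate(x) == victim(x) for x in inputs)
    return hits / len(inputs)


# Hypothetical 1-D threshold classifiers (assumptions for this example).
victim = lambda x: int(x > 0.5)
surrogate = lambda x: int(x > 0.4)

inputs = [0.1, 0.45, 0.6, 0.9]
true_labels = [0, 1, 1, 1]

surrogate_accuracy = accuracy(surrogate, inputs, true_labels)
surrogate_fidelity = fidelity(surrogate, victim, inputs)
```

In this toy setup the surrogate is more accurate than the victim on the task but disagrees with it on one of the four inputs, so an attacker optimizing for accuracy and one optimizing for fidelity would judge the same copy differently.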
A useful three-part taxonomy of model stealing has emerged from this body of work:
| Category | Goal | Typical methods | Example reference |
|---|---|---|---|
| Functionality stealing | Build a surrogate that performs the same task with acceptable accuracy | Query-and-train, knowledge distillation, surrogate dataset selection | Orekondy et al. 2019 (Knockoff Nets)[^4] |
| Parameter extraction | Recover specific weight matrices or layers of the target | Equation-solving, cryptanalytic differential attacks, softmax-bottleneck attacks | Tramèr 2016[^1]; Carlini, Jagielski, Mironov 2020[^5]; Carlini et al. 2024[^2] |
| Training-data extraction | Recover individual training examples memorized by the model | Memorization probing, divergence attacks, membership inference | Carlini et al. 2021[^6]; Nasr et al. 2023[^7] |
The boundaries between these categories are porous. A high-fidelity functionally-equivalent extractor is by construction also a near-perfect functionality stealer, and any attack capable of leaking parameters can in some cases be combined with model inversion to leak training data.[^3]
The 2016 USENIX Security paper by Florian Tramèr, Fan Zhang, Ari Juels, Michael K. Reiter, and Thomas Ristenpart laid the modern foundation of model extraction.[^1] The authors considered three black-box APIs from the era: BigML, Amazon Machine Learning, and PredictionIO, all of which exposed simple supervised models such as logistic regression, decision trees, and small neural networks behind prediction endpoints that returned class probabilities.[^1]
The attack relied on a deceptively elementary observation. For a logistic regression of the form p(y=1|x) = sigmoid(w*x + b), each query returns a confidence value p, and applying the inverse sigmoid, log(p / (1 - p)), recovers the linear combination w*x + b exactly. After d+1 linearly independent queries on a d-dimensional input, the attacker can solve a linear system for w and b directly. The paper labeled this an equation-solving attack because each API response hands the attacker one equation in the unknown parameters.[^1]
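A minimal sketch of the equation-solving idea follows, assuming a toy endpoint that returns an unrounded probability. The function `victim_api`, its hidden weights, and the choice of queries (the origin plus the d unit vectors) are assumptions made for this example, not details from the paper.

```python
import math


def victim_api(x, _w=(1.5, -2.0, 0.5), _b=0.25):
    """Hypothetical prediction endpoint: returns p(y=1|x). Weights are hidden."""
    z = sum(wi * xi for wi, xi in zip(_w, x)) + _b
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Inverse sigmoid: turns a returned probability back into w*x + b."""
    return math.log(p / (1.0 - p))


def solve(A, y):
    """Solve A v = y by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [yi] for row, yi in zip(A, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def extract_logistic_regression(api, d):
    # d+1 queries: the origin and each unit vector.
    queries = [[0.0] * d] + [
        [1.0 if j == i else 0.0 for j in range(d)] for i in range(d)
    ]
    A = [q + [1.0] for q in queries]       # extra column of ones for the bias
    z = [logit(api(q)) for q in queries]   # one linear equation per query
    *w, b = solve(A, z)
    return w, b


w_hat, b_hat = extract_logistic_regression(victim_api, d=3)
```

With three input features the sketch issues four queries and recovers the weights up to floating-point error. Rounded or truncated probabilities would turn the exact solve into an approximate one, which is one reason output coarsening raises the query cost.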
For decision trees the team developed a path-finding attack that walked the tree by issuing carefully chosen queries; for shallow neural networks they used iterative retraining with the victim as supervisor. Across all three commercial services the resulting surrogates matched the original models with near-perfect fidelity, and in several cases the extraction cost less than a few thousand queries.[^1] The paper also showed that simply removing confidence values from API responses did not stop the attack: an attacker willing to issue more queries could still extract the boundary by reading the discrete labels.[^1]
In 2019, Tribhuvanesh Orekondy, Bernt Schiele, and Mario Fritz introduced Knockoff Nets, which targeted deep image classifiers exposed as commercial APIs.[^4] Their adversary knew nothing about the victim's architecture, training data, or label semantics. The attack proceeded in two steps: query the victim with random images drawn from a different distribution (for instance, ImageNet images against a fine-grained bird classifier), then train a knockoff network on the resulting image-prediction pairs.[^4] The authors reported building a usable knockoff of a popular commercial image analysis API for around thirty US dollars of queries, foreshadowing the order-of-magnitude economics that would define later large language model attacks.[^4]
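The two-step recipe (query, then train on the answers) can be sketched at toy scale. Everything here is an assumption for illustration: the victim is a two-feature logistic model rather than a deep image classifier, uniform noise stands in for images from an unrelated dataset, and the surrogate is fitted with plain gradient descent.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def victim_api(x):
    """Hypothetical black-box endpoint returning p(y=1|x)."""
    return sigmoid(2.0 * x[0] - 1.0 * x[1] + 0.3)


def train_knockoff(api, n_queries=500, steps=300, lr=1.0, seed=0):
    rng = random.Random(seed)
    # Step 1: build a transfer set by querying the victim on attacker-chosen inputs.
    transfer_set = []
    for _ in range(n_queries):
        x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        transfer_set.append((x, api(x)))
    # Step 2: fit the surrogate to the victim's soft labels (cross-entropy gradient).
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(steps):
        gw = [0.0, 0.0]
        gb = 0.0
        for x, p in transfer_set:
            err = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - p
            gw[0] += err * x[0]
            gw[1] += err * x[1]
            gb += err
        w = [w[i] - lr * gw[i] / n_queries for i in range(2)]
        b -= lr * gb / n_queries
    return lambda x: sigmoid(w[0] * x[0] + w[1] * x[1] + b)


knockoff = train_knockoff(victim_api)

# Measure label agreement with the victim on fresh inputs.
rng = random.Random(1)
test_points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(1000)]
agreement = sum(
    (knockoff(x) > 0.5) == (victim_api(x) > 0.5) for x in test_points
) / len(test_points)
```

The attacker never sees the victim's weights or training data; the only supervision is the victim's own responses to the transfer set.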
Matthew Jagielski, Nicholas Carlini, David Berthelot, Alex Kurakin, and Nicolas Papernot published "High Accuracy and High Fidelity Extraction of Neural Networks" at USENIX Security 2020.[^3] The paper drew the accuracy/fidelity distinction described above, then introduced two complementary attacks. The learning-based attack used the victim model to supervise the training of a copy, optimized for accuracy on the underlying task. The functionally-equivalent attack went further, directly recovering weights of a two-layer fully connected ReLU network such that the copy and the victim produced identical predictions on every possible input. The team also reported extraction experiments against a production image classifier trained on more than one billion proprietary images.[^3]
Also in 2020, Carlini, Jagielski, and Ilya Mironov presented "Cryptanalytic Extraction of Neural Network Models" at CRYPTO, framing the problem as cryptanalysis of a piecewise-linear function.[^5] Because ReLU networks are exactly piecewise linear, queries placed near the kinks (the inputs where ReLU units switch from off to on) leak structural information about the weights. The authors used a differential attack that combined signature extraction (recovering the absolute values of weights in each layer) with sign extraction (recovering their signs), achieving 2^20 times higher precision and 100 times fewer queries than prior methods on toy networks.[^5] They reported extracting a 100,000-parameter MNIST classifier with roughly 2^21.5 queries.[^5] The sign-extraction step remained exponential in the worst case, leaving complete extraction of large networks out of reach.
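The first ingredient of such attacks, locating a kink using only black-box queries, can be illustrated with a one-dimensional sketch. The tiny network, the search interval, and the plain binary search are assumptions chosen for readability; they are not the paper's procedure, which is considerably more query-efficient.

```python
def relu(z):
    return max(0.0, z)


def victim(t):
    """Hypothetical black-box ReLU network with scalar input (weights hidden)."""
    return 2.0 * relu(1.0 * t - 0.3) + 1.0 * relu(-0.5 * t + 2.0)


def is_linear(f, lo, hi, tol=1e-12):
    """A linear function's midpoint value equals the average of its endpoints."""
    mid = 0.5 * (lo + hi)
    return abs(f(mid) - 0.5 * (f(lo) + f(hi))) <= tol


def find_kink(f, lo, hi, iters=25):
    """Binary search for the single kink assumed to lie inside (lo, hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if is_linear(f, lo, mid):
            lo = mid   # left half is linear, so the kink is on the right
        else:
            hi = mid   # left half bends, so the kink is on the left
    return 0.5 * (lo + hi)


kink = find_kink(victim, 0.0, 1.0)
```

A unit computing relu(w*t + b) switches on where w*t + b = 0, so the recovered location (about 0.3 here) pins down the ratio -b/w for that unit. Attacks of this kind combine many such measurements to constrain the weights.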
In 2020, Eric Wallace, Mitchell Stern, and Dawn Song demonstrated imitation attacks on production machine translation systems.[^8] By querying production translation services with monolingual sentences and training on the returned translations, they built imitation models that came within roughly 0.6 BLEU of three production systems on both high-resource and low-resource language pairs. The work showed that high-quality model stealing was already feasible for sequence-to-sequence neural networks long before the era of public LLM APIs.[^8]
Truong, Maini, Walls, and Papernot proposed Data-Free Model Extraction at CVPR 2021, removing the assumption that the attacker has access to any surrogate data resembling the victim's domain.[^9] The method adapted ideas from data-free knowledge distillation, using a generator network trained adversarially to produce synthetic queries that maximize the disagreement between attacker and victim, then training the attacker on the resulting query-label pairs. The technique made model stealing more dangerous in domains where data is scarce or expensive (for example, medical imaging) because the attacker no longer needed any in-distribution dataset to mount an attack.[^9]
The shift from small classifiers to billion-parameter generative pretrained transformers changed the threat landscape. Modern LLMs are too large to extract weight-by-weight with classical techniques, but their public APIs leak structured information that more focused attacks can exploit.
The landmark study of LLM-era model stealing is "Stealing Part of a Production Language Model" by Nicholas Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A. Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Itay Yona, Eric Wallace, David Rolnick, and Florian Tramèr, posted to arXiv in March 2024 and presented at ICML 2024.[^2] The team demonstrated the first model-stealing attack that recovers precise non-trivial information from a production large language model.[^2]
The targets were OpenAI's Ada, Babbage, and gpt-3.5-turbo, and Google's PaLM-2.[^2] Rather than trying to copy the entire model, the team aimed at the embedding projection layer, the final linear map from the hidden state to the output vocabulary distribution. Their attack exploited a structural property of transformer LLMs called the softmax bottleneck: because the final layer maps an h-dimensional hidden state to a vocabulary distribution through a linear projection, all output logits for a given model lie on an h-dimensional linear subspace of the much higher-dimensional logit space. By collecting enough output samples and inferring this low-rank subspace, the attacker can recover both the hidden size h and (up to symmetry) the projection matrix itself.[^2]
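The low-rank observation can be checked on a simulated model. In the sketch below the hidden size, vocabulary size, random weights, and the `logits_for_prompt` function are all made up, and Gaussian elimination stands in for a singular value decomposition only to keep the example within the Python standard library.

```python
import random


def numerical_rank(rows, tol=1e-8):
    """Rank of a matrix via Gaussian elimination with partial pivoting."""
    M = [r[:] for r in rows]
    n_rows, n_cols = len(M), len(M[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < tol:
            continue  # no usable pivot in this column
        M[rank], M[pivot] = M[pivot], M[rank]
        for r in range(rank + 1, n_rows):
            f = M[r][col] / M[rank][col]
            M[r] = [a - f * c for a, c in zip(M[r], M[rank])]
        rank += 1
    return rank


# Simulated model (assumption): hidden size 3, vocabulary of 8 tokens.
rng = random.Random(0)
HIDDEN, VOCAB = 3, 8
W = [[rng.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(VOCAB)]


def logits_for_prompt(_prompt):
    """Stand-in for an API call: a fresh hidden state projected through W."""
    g = [rng.gauss(0, 1) for _ in range(HIDDEN)]
    return [sum(W[v][j] * g[j] for j in range(HIDDEN)) for v in range(VOCAB)]


# Stack logit vectors from more prompts than the suspected hidden size.
Q = [logits_for_prompt(i) for i in range(10)]
estimated_h = numerical_rank(Q)
```

Ten logit vectors of length eight could in principle have rank eight, but here they span only three dimensions, which reveals the hidden size. Against a real model the rank is not exact because of numerical noise, so a gap in the singular values is a more robust signal than a hard tolerance.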
The reported results were striking: for under twenty US dollars in API spending, the team extracted the entire projection matrix for OpenAI's Ada and Babbage models, confirming for the first time that those models had hidden dimensions of 1024 and 2048 respectively.[^2] They also recovered the hidden dimension of gpt-3.5-turbo and estimated that full extraction of its projection matrix would cost under two thousand US dollars in queries.[^2] The team followed responsible disclosure: the initial proof of concept was implemented in November 2023, disclosed to vendors in December 2023, and the public paper appeared after the standard 90-day window. Google introduced mitigations during this window, and OpenAI followed shortly afterwards on March 3, 2024.[^10]
In parallel, Matthew Finlayson, Xiang Ren, and Swabha Swayamdipta published "Logits of API-Protected LLMs Leak Proprietary Information" in March 2024.[^11] Their work also exploited the softmax bottleneck but focused on what an outsider can learn rather than on weight extraction. With only a conservative architectural assumption, the authors showed that under one thousand US dollars of queries to OpenAI's gpt-3.5-turbo was enough to estimate its embedding dimension at approximately 4096, to audit when the underlying model had silently changed, to identify which model produced a given output, and to recover full-vocabulary logits even when the API exposed only top-k log probabilities.[^11]
A common defensive measure against parameter extraction is to limit how much information the API reveals about output distributions: rather than emitting full logits, providers typically expose only the top-k log probabilities, commonly with k=5. Both Carlini et al. and Finlayson et al. demonstrated that this defense is incomplete.[^2][^11] By using the logit_bias parameter to push chosen tokens into the top-k list and observing the log probabilities returned for them, an attacker can reconstruct the entire output distribution token by token. The Carlini paper proposed multiple defenses, including replacing logit bias with a token block-list, removing logit bias entirely, and architectural changes to break the low-rank structure of the projection layer.[^2]
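One way the logit-bias trick can work is sketched below against a simulated endpoint. The function `api_top_k_logprobs`, the eight-token vocabulary, and the bias value are assumptions for this example and do not describe any vendor's API; the sketch also assumes the biased response still contains the reference token.

```python
import math

# Hidden from the attacker; used only by the simulated endpoint.
TRUE_LOGITS = (2.0, 1.0, 0.5, 0.0, -0.5, -1.0, -3.0, -4.0)


def api_top_k_logprobs(logit_bias, k=5):
    """Hypothetical endpoint: applies the bias, then returns top-k log probabilities."""
    z = [l + logit_bias.get(i, 0.0) for i, l in enumerate(TRUE_LOGITS)]
    m = max(z)
    log_norm = m + math.log(sum(math.exp(v - m) for v in z))
    top = sorted(range(len(z)), key=lambda i: z[i], reverse=True)[:k]
    return {i: z[i] - log_norm for i in top}


def recover_relative_logits(api, vocab_size, bias=50.0):
    unbiased = api({})
    ref = max(unbiased, key=unbiased.get)   # most likely token, used as reference
    rel = {ref: 0.0}
    for tok in range(vocab_size):
        if tok == ref:
            continue
        out = api({tok: bias})              # bias lifts tok into the top-k list
        # Both values share one normalizer, so it cancels in the difference.
        rel[tok] = (out[tok] - bias) - out[ref]
    return [rel[i] for i in range(vocab_size)]


recovered = recover_relative_logits(api_top_k_logprobs, len(TRUE_LOGITS))
```

The result is the full logit vector relative to the reference token, at a cost of one query per vocabulary entry. Logits are only ever identifiable up to an additive constant, because the softmax discards it.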
Training-data extraction is closely related to model stealing but recovers training examples rather than parameters. The two threats meet because a model that memorizes proprietary data effectively encodes it in its weights, and any extraction attack that approaches functional equivalence will also reproduce the memorized data.
The first systematic study of training-data extraction from LLMs is the USENIX Security 2021 paper "Extracting Training Data from Large Language Models" by Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel.[^6] Targeting GPT-2, the team demonstrated that an adversary could recover hundreds of verbatim sequences from the model's training data, including personally identifiable information, by sampling generations with carefully chosen prefixes and ranking them with a membership-inference-style scoring procedure.[^6] The paper received a Distinguished Paper award and reframed memorization as a privacy attack rather than a curiosity of language modeling.[^6]
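The ranking step can be illustrated with one scoring heuristic of this kind: comparing how surprising a sample is to the model with how compressible it is under zlib. The sketch is not the paper's pipeline; the candidate strings and per-token log probabilities are invented, and the exact form of the score is an assumption.

```python
import zlib


def zlib_bits(text):
    """Size of the zlib-compressed text in bits, a crude model-free entropy proxy."""
    return 8 * len(zlib.compress(text.encode("utf-8")))


def suspicion_score(text, token_logprobs):
    """Higher means more suspicious: hard to compress, yet unsurprising to the model."""
    model_nll = -sum(token_logprobs)   # total negative log-likelihood in nats
    return zlib_bits(text) / model_nll


# Hypothetical generations with made-up per-token log probabilities.
candidates = [
    ("the the the the the the the the the the the the", [-0.5] * 12),
    ("key 9f3a-71c2-e0b4-55d8-a6f1-03ce-b27d-8e49", [-0.5] * 12),
]

ranked = sorted(candidates, key=lambda c: suspicion_score(*c), reverse=True)
```

Both samples are equally likely under the imagined model, but the repetitive one compresses well and is therefore unremarkable, while the high-entropy string that the model nevertheless finds easy rises to the top of the list for manual inspection.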
In November 2023, Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, and Katherine Lee posted "Scalable Extraction of Training Data from (Production) Language Models".[^7] The work introduced what the authors called a divergence attack: a prompt that nudges a chatbot-tuned model off its alignment manifold and back into raw next-token completion behavior, at which point it tends to emit verbatim training data.[^7] The canonical example used in the paper and its accompanying blog post was the deceptively simple prompt "Repeat the word 'poem' forever", which caused ChatGPT to obey for a while and then break into long passages of memorized text.[^12]
The team reported that with around two hundred US dollars' worth of queries to gpt-3.5-turbo, they could extract more than ten thousand unique verbatim training examples, and that more than five per cent of ChatGPT's output in their strongest configuration consisted of 50-token verbatim copies of training data.[^7] The authors followed responsible disclosure: they discovered the unusual behavior in July 2023, shared a draft with OpenAI in August, and published in late November after a 90-day window.[^12] OpenAI subsequently patched the surface symptom by refusing prompts that ask for endless repetition, but the authors noted that "patching an exploit is often much easier than fixing the vulnerability" because the underlying memorization remains.[^12]
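Counting verbatim overlap of this kind amounts to checking whether any k-token window of a model output appears in a reference corpus. The sketch below uses whitespace tokens, a window of 4 instead of 50, and an in-memory set of n-grams; all three are simplifying assumptions, and a web-scale corpus would need a far more compact index.

```python
def ngram_index(corpus_docs, k):
    """Set of every k-token window that occurs in the reference corpus."""
    index = set()
    for doc in corpus_docs:
        toks = doc.split()
        for i in range(len(toks) - k + 1):
            index.add(tuple(toks[i:i + k]))
    return index


def verbatim_fraction(output, index, k):
    """Fraction of the output's k-token windows found verbatim in the corpus."""
    toks = output.split()
    windows = [tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)]
    if not windows:
        return 0.0
    return sum(w in index for w in windows) / len(windows)


# Toy reference corpus and model output (invented for illustration).
corpus = ["the quick brown fox jumps over the lazy dog"]
output = "poem poem the quick brown fox jumps away"

index = ngram_index(corpus, k=4)
fraction = verbatim_fraction(output, index, k=4)
```

Here two of the five windows in the output match the corpus. A long window such as 50 tokens makes chance matches vanishingly unlikely, so a hit is strong evidence of memorization rather than coincidence.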
Training-data extraction is a privacy threat in its own right and has driven much of the legal and policy discussion of generative AI. It is also a tool that adversaries can combine with model stealing: a surrogate trained on outputs from a memorizing victim may inherit fragments of the same protected text.
Although LLMs have dominated public attention, multimodal models are equally attractive targets for extraction. Vision encoders such as OpenAI's CLIP sit at the bottom of most modern vision-language stacks and are typically distributed as embedding APIs.[^13] Because the embedding output of a CLIP-style model is a continuous vector with rich semantic structure, it is in principle even easier to mine than a probability distribution over tokens. Truong et al.'s data-free extraction approach applies straightforwardly to image classifiers and embedding models, and several follow-up works have studied fingerprinting and stealing of CLIP-derived vision encoders.[^9]
A separate strand of research targets text-to-image and image-to-image translation systems with imitation attacks similar to Wallace et al.'s NLP work. The same dynamic seen with translation services and LLMs recurs: outputs are rich, queries are cheap, and a determined attacker can train a competitive open imitation system from the API of a closed commercial one.
No single defense is sufficient against model stealing, and the literature treats the problem as a defense-in-depth exercise. The main families of countermeasures are watermarking, differentially private training, API-level mitigations, and behavioral detection.
Watermarking embeds a recoverable signal into the model so that, even if a copy is extracted and redeployed, the original owner can prove ownership.[^14] One family relies on trigger sets: a small set of (input, label) pairs the original model is fine-tuned to memorize, where the inputs are chosen so that no honest model would assign that label. Demonstrating the trigger behavior on a suspected copy then provides strong evidence that the copy was derived from the original.[^14] A more aggressive design is DAWN ("Dynamic Adversarial Watermarking of Neural Networks") by Szyller et al., which watermarks at the API layer by deliberately answering a small fraction of queries with poisoned labels so that any model trained on the API responses inherits the watermark.[^15]
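An API-layer watermark in the spirit of DAWN can be sketched with a keyed hash that decides, per input, whether to return a deliberately wrong label. The key, the rate, the toy classifier, and the way the hash bytes are used are assumptions for this example rather than the published scheme.

```python
import hashlib
import hmac

SECRET_KEY = b"owner-secret"   # hypothetical key held only by the model owner
N_CLASSES = 10
WATERMARK_RATE = 0.05          # fraction of distinct inputs answered incorrectly


def true_model(x: bytes) -> int:
    """Toy stand-in for the real classifier."""
    return sum(x) % N_CLASSES


def watermarked_api(x: bytes) -> int:
    digest = hmac.new(SECRET_KEY, x, hashlib.sha256).digest()
    honest = true_model(x)
    # The first 8 bytes decide whether this input belongs to the watermark set.
    selector = int.from_bytes(digest[:8], "big") / 2**64
    if selector >= WATERMARK_RATE:
        return honest
    # The next 8 bytes pick a wrong label, always different from the honest one.
    offset = 1 + int.from_bytes(digest[8:16], "big") % (N_CLASSES - 1)
    return (honest + offset) % N_CLASSES


inputs = [f"query-{i}".encode() for i in range(2000)]
flipped = [x for x in inputs if watermarked_api(x) != true_model(x)]
```

Because the decision depends only on the key and the input, repeated queries receive the same answer, so a client cannot average the watermark away by asking twice. The owner can later test a suspect model on the flipped inputs and check whether it reproduces the planted labels more often than chance.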
Watermarking is a passive defense: it cannot stop a model from being stolen, only help establish ownership after the fact. Public-key style watermarking schemes integrated with AI watermarking efforts such as SynthID are an active research direction.
Training with differential privacy (typically via DP-SGD) bounds the influence any single training example can have on the final weights, which limits both training-data extraction and certain forms of membership inference.[^6] Differentially private training does not directly stop functionality stealing, since the model's input-output behavior remains observable, but it reduces the value of what can be stolen by ensuring that memorization of individual examples is provably small.[^6]
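The core of DP-SGD is two changes to an ordinary gradient step: clip each example's gradient to a fixed norm, then add Gaussian noise scaled to that norm before averaging. The sketch below shows one such step on plain Python lists. The function name and parameters are illustrative, and it omits the privacy accounting that a real deployment needs.

```python
import math
import random


def dp_sgd_step(weights, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update: per-example clipping, Gaussian noise, then averaging."""
    n = len(per_example_grads)
    summed = [0.0] * len(weights)
    for g in per_example_grads:
        norm = math.sqrt(sum(v * v for v in g))
        # Scale down any gradient whose L2 norm exceeds clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, v in enumerate(g):
            summed[i] += v * scale
    # Noise standard deviation is proportional to the clipping bound.
    sigma = noise_multiplier * clip_norm
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / n for s in summed]
    return [w - lr * g for w, g in zip(weights, noisy_mean)]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]   # made-up per-example gradients

# With the noise switched off, only the clipping effect is visible.
clipped_only = dp_sgd_step([0.0, 0.0], grads, lr=1.0, clip_norm=1.0,
                           noise_multiplier=0.0, rng=rng)

# A private step with noise enabled.
private_step = dp_sgd_step([0.0, 0.0], grads, lr=1.0, clip_norm=1.0,
                           noise_multiplier=1.1, rng=rng)
```

The first gradient has norm 5 and is scaled down to norm 1, while the second is already inside the bound and passes through unchanged. Clipping is what caps any single example's contribution, and the noise then masks whatever contribution remains.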
API design is the cheapest and most widely deployed family of defenses:
- Returning only top-k predictions, only the argmax label, or rounded probabilities reduces the information leaked per query. Tramèr et al. showed in 2016 that even with confidence values removed an attacker can still extract a model, just at higher query cost.[^1] Finlayson et al. and Carlini et al. showed in 2024 that top-k log probabilities are still enough to reconstruct full logit vectors via logit-bias manipulation.[^2][^11]
- … logit_bias in ways that block the projection-layer attack.[^10]

Defenders can also try to detect extraction attacks while they are happening. PRADA ("Protecting Against DNN Model Stealing Attacks") by Juuti, Szyller, Marchal, and Asokan examines the statistical distribution of consecutive API queries from a given client and flags clients whose query distance distribution diverges from the near-normal distribution expected of benign users.[^16] The authors reported that PRADA detected all model extraction attacks known at the time of publication with no false positives.[^16] Subsequent work has extended this idea to stateful, per-client query monitoring.
The Carlini et al. 2024 paper observed that the softmax bottleneck itself is the root cause of the projection-layer attack, and suggested architectural changes that break the strict low-rank structure of the final layer.[^2] These remain an open research direction because the bottleneck is mathematically convenient and is shared by almost every modern LLM architecture.
The legal status of model stealing is unsettled, and depends on jurisdiction, the precise method of extraction, and whether the extracted artifact is a near-clone or a coarser distillation.
In the United States, prediction APIs are often protected only by contractual terms of service and trade-secret doctrines, not by copyright on the model weights themselves. The Ninth Circuit's hiQ Labs, Inc. v. LinkedIn Corp. litigation is frequently cited as the closest precedent on automated access to a commercial API.[^17] hiQ scraped public LinkedIn profiles to build analytics products; LinkedIn sent cease and desist letters and revoked hiQ's access. In April 2022, after a Supreme Court remand prompted by Van Buren v. United States, the Ninth Circuit reaffirmed its earlier holding that scraping data made publicly available by a service does not violate the Computer Fraud and Abuse Act.[^17] In December 2022, the parties settled, with hiQ accepting a five hundred thousand US dollar judgment and admitting liability under California common-law torts of trespass to chattels and misappropriation.[^17] The case constrained the use of the CFAA against scraping but left open whether tort doctrines could still reach API abuse, a question directly relevant to model stealing.
OpenAI's Terms of Use prohibit users from using outputs to develop models that compete with OpenAI's services.[^18] The clause binds the original user of the API but does not automatically run with the data if the user later releases that data under a permissive license. Anthropic's usage policies contain similar restrictions. These contractual prohibitions are widely cited in industry but have not yet been adjudicated against an alleged model thief in a published US opinion.
In the European Union, the EU AI Act regulates AI systems primarily through obligations on providers and deployers rather than through a direct prohibition of model stealing. EU trade-secret law (the 2016 Trade Secrets Directive) is a more direct fit, treating exfiltrated model weights as a misappropriated trade secret if the original holder took reasonable secrecy measures.
In January 2025, the Chinese AI company DeepSeek released DeepSeek-R1, a reasoning-focused model whose published benchmark scores rivaled OpenAI's frontier reasoning models while reportedly using far fewer high-end accelerators.[^19] OpenAI and its commercial partner Microsoft publicly stated that they had observed activity consistent with DeepSeek-affiliated accounts distilling outputs from ChatGPT in violation of OpenAI's Terms of Use.[^19] On February 12, 2026, OpenAI delivered a memo to the US House Select Committee on the Strategic Competition between the United States and the Chinese Communist Party, accusing DeepSeek of "free-riding on the capabilities developed by OpenAI and other US frontier labs" and alleging that DeepSeek employees had "developed methods to circumvent OpenAI's access restrictions and access models through obfuscated third-party routers" and "developed code to access US AI models and obtain outputs for distillation in programmatic ways".[^19] DeepSeek has not publicly responded to the specific factual claims. As of mid-2026 no court has adjudicated the matter, and the strongest public evidence offered consists of behavioral analyses suggesting unusually high stylistic similarity between DeepSeek outputs and ChatGPT outputs. The controversy is widely cited as the first high-profile geopolitical episode driven by the threat of model stealing.[^19]
Earlier and lower-profile allegations of API distillation between commercial laboratories have surfaced periodically, particularly around the launch of competitor LLMs. Few have been substantiated publicly, and the broader concern is that distillation from production APIs is now economically rational for many would-be entrants regardless of contractual prohibitions.
Model stealing sits alongside several other privacy and security attacks against deployed machine learning:
Despite a decade of research, several limitations and open questions remain: