Model extraction attack
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,438 words
A model extraction attack is a class of machine learning security attacks in which an adversary, restricted to black-box query access to a target model (typically through a paid prediction API), uses the responses to construct a local "stolen" replica that approximates the target's behaviour, recover specific hyperparameters, or, in the strongest variants, extract individual weight matrices.[^1][^2] The canonical formulation was introduced by Florian Tramèr, Fan Zhang, Ari Juels, Michael K. Reiter and Thomas Ristenpart in their 2016 USENIX Security paper "Stealing Machine Learning Models via Prediction APIs", which demonstrated practical extraction of logistic regression, neural network and decision tree models deployed on BigML and Amazon Machine Learning.[^1] The threat regained prominence in 2024 when a Google DeepMind team led by Nicholas Carlini recovered the entire output-projection matrix of OpenAI's ada and babbage models, and the exact hidden dimension of gpt-3.5-turbo, by issuing top-logit queries to public APIs.[^2] Model extraction is conceptually distinct from membership inference (which targets training data rather than model parameters) and from training-data extraction (which recovers memorised inputs); it is, however, closely entangled with Knowledge Distillation, from which most modern functional-approximation variants directly inherit their training objectives.[^3]
Model extraction is defined with respect to a specific access regime. The defender holds a target model f_θ with parameters θ and exposes a prediction interface; the adversary issues queries x and observes responses f(x). The richness of the response determines what can be recovered.[^1]
The 2016 Tramèr et al. paper formalised three response granularities that remain the working taxonomy:[^1]
| Access level | Information returned | Typical extraction outcome |
|---|---|---|
| Label only | Argmax class | Functional approximation via active learning, with the highest query cost. |
| Probability / confidence | Full posterior p(y \| x) over classes | Tight functional clones; equation-solving against parametric models. |
| Logits (full or top-k) | Pre-softmax scores, partial in modern APIs | Recovery of linear projections and hidden dimensions, as in Carlini et al. 2024.[^2] |
White-box extraction is excluded by definition: an attacker who already possesses θ has nothing left to steal. Some intermediate scenarios are referred to in the literature as grey-box, in which the architecture family (e.g. that a target is a Transformer) is known or strongly suspected, but the weights are not.[^4]
Adversary goals are typically partitioned into three objectives, following the 2020 Jagielski et al. analysis "High Accuracy and High Fidelity Extraction of Neural Networks": accuracy, fidelity and functional equivalence.[^4]
Model extraction sits at the intersection of two earlier research strands. The first is membership inference and model inversion against ML APIs, studied from 2014 onwards. The second is knowledge distillation, introduced by Geoffrey Hinton, Oriol Vinyals and Jeff Dean in 2015 as a benign training technique for compressing models by training a student to match a teacher's softened softmax outputs.[^3] Distillation became the prototypical training procedure for the "stolen" copies produced by extraction attacks; the only methodological difference is that the party training the student does not control the teacher and acts without its owner's consent.
Tramèr et al.'s "Stealing Machine Learning Models via Prediction APIs" was presented at the 25th USENIX Security Symposium in Austin, Texas, in August 2016 (arXiv:1609.02943, pages 601-618 of the proceedings).[^1][^5] The paper introduced the term model extraction attack in its modern sense and demonstrated several concrete techniques:[^1]
The authors evaluated the attacks against BigML and Amazon Machine Learning and reported near-perfect fidelity for the targeted model families.[^1] The paper also discussed defences (rounding confidences, omitting confidence scores) and showed that, when confidence values are simply truncated, equation-solving still succeeds for many model families.[^1]
Binghui Wang and Neil Zhenqiang Gong's "Stealing Hyperparameters in Machine Learning" (IEEE Symposium on Security and Privacy 2018, arXiv:1802.05351) extended the threat model to recovery of confidential hyperparameters, including regularisation constants and kernel parameters, for ridge regression, logistic regression, SVMs and neural networks.[^6] The attack was demonstrated against Amazon Machine Learning, illustrating that even when weights are protected, leakage of the loss-function or regulariser can have substantial commercial value because such choices encode proprietary modelling decisions.[^6]
Tribhuvanesh Orekondy, Bernt Schiele and Mario Fritz presented "Knockoff Nets: Stealing Functionality of Black-Box Models" at CVPR 2019.[^7] Their method queries the victim with random images drawn from an unrelated public dataset, then trains a high-capacity convolutional clone on the resulting (image, prediction) pairs. The attack achieved competitive accuracy on the victim's task without any in-distribution data, demonstrating that prediction APIs can leak functionality even when the query distribution is unrelated to the training distribution.[^7]
Two 2020 papers raised the bar substantially. "High Accuracy and High Fidelity Extraction of Neural Networks" (Matthew Jagielski, Nicholas Carlini, David Berthelot, Alex Kurakin and Nicolas Papernot, USENIX Security 2020, arXiv:1909.01838) formalised the accuracy / fidelity / functional-equivalence hierarchy and demonstrated the first practical functionally-equivalent extraction of a two-layer ReLU network, treating extraction as a system of piecewise-linear equations.[^4] In parallel, "Cryptanalytic Extraction of Neural Network Models" (Nicholas Carlini, Matthew Jagielski and Ilya Mironov, CRYPTO 2020) used differential analysis of ReLU activations to recover weights up to floating-point precision with roughly 100x fewer queries than the prior state of the art, extracting a 100,000-parameter MNIST network in under an hour.[^8]
The 2024 paper "Stealing Part of a Production Language Model" (Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A. Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Itay Yona, Eric Wallace, David Rolnick, Florian Tramèr; arXiv:2403.06634; submitted 11 March 2024) extended the cryptanalytic line into the era of frontier LLMs.[^2][^9] It was selected as an outstanding paper at the 41st International Conference on Machine Learning (ICML 2024).[^10] Concurrently, large-scale jailbreak distillation controversies, including OpenAI's allegations against DeepSeek and Anthropic's reports of coordinated extraction attempts on Anthropic Claude, moved the topic from academic curiosity to commercial and regulatory front pages.[^11][^12]
In its simplest form, a functional-approximation extraction attack proceeds in four steps:[^1][^7]
1. The adversary chooses a query set Q = {x_1, ..., x_n}, drawn either from the victim's likely input distribution or from a generic surrogate corpus.
2. For each x_i, the adversary records the victim's response y_i = f_θ(x_i). The form of y_i depends on the API tier (label, probability vector, top-k logits).
3. The adversary trains a substitute model g_φ on the pairs (x_i, y_i) using a distillation-style loss such as the knowledge-distillation cross-entropy on soft labels.[^3]

When the victim is a parametric model whose predictions are a deterministic function of a small number of parameters, step 3 can be replaced by a closed-form solution. Tramèr et al. observed that for d-dimensional logistic regression, exactly d + 1 queries returning full probability outputs suffice to solve for the weight vector and bias, modulo numerical conditioning.[^1]
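The closed-form case for logistic regression can be illustrated with a minimal sketch. It assumes a synthetic victim that returns unrounded probabilities; the helper names `make_victim` and `extract_logistic_regression` are hypothetical and do not correspond to any real service or to the authors' code.

```python
import numpy as np

def make_victim(w, b):
    """Hypothetical black-box API returning P(y = 1 | x) for a logistic regression."""
    def predict_proba(X):
        return 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return predict_proba

def extract_logistic_regression(predict_proba, d, rng):
    # d + 1 random queries give d + 1 linear equations in the d + 1 unknowns (w, b).
    X = rng.normal(size=(d + 1, d))
    p = predict_proba(X)
    # Inverting the sigmoid turns each response into w.x + b = log(p / (1 - p)).
    logits = np.log(p) - np.log1p(-p)
    A = np.hstack([X, np.ones((d + 1, 1))])
    theta = np.linalg.solve(A, logits)
    return theta[:-1], theta[-1]

rng = np.random.default_rng(0)
d = 5
w_true, b_true = rng.normal(size=d), 0.3
victim = make_victim(w_true, b_true)
w_hat, b_hat = extract_logistic_regression(victim, d, rng)
print(np.max(np.abs(w_hat - w_true)), abs(b_hat - b_true))
```

With rounded or truncated probabilities the linear system is only approximately satisfied, so an attacker would presumably need more than d + 1 queries and a least-squares fit instead of an exact solve.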
Wang and Gong's 2018 method exploits a stationarity condition. At an optimum of a regularised loss L(θ) = L_data(θ) + λ * R(θ), the gradient vanishes, giving ∇L_data(θ) = -λ ∇R(θ). The adversary observes the trained parameters (or a sufficiently faithful clone), evaluates the two gradients on a labelled dataset, and solves the resulting linear system for λ.[^6] The technique generalises to several regularisers and to kernel hyperparameters when the kernel matrix can be reconstructed.[^6]
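For ridge regression the stationarity condition can be checked numerically. The sketch below assumes the squared-error loss ||Xθ - y||² with regulariser ||θ||² and an attacker who knows both the training data and the learned θ; the function name `estimate_lambda` and the synthetic data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam_true = 200, 8, 0.75
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Victim side: ridge regression minimising ||X theta - y||^2 + lam * ||theta||^2.
theta = np.linalg.solve(X.T @ X + lam_true * np.eye(d), X.T @ y)

def estimate_lambda(X, y, theta):
    grad_data = 2.0 * X.T @ (X @ theta - y)   # gradient of the data term
    grad_reg = 2.0 * theta                    # gradient of the regulariser
    # Least-squares solution of the overdetermined system grad_data = -lam * grad_reg.
    return -(grad_reg @ grad_data) / (grad_reg @ grad_reg)

lam_hat = estimate_lambda(X, y, theta)
print(lam_hat)
```

The system has d equations and one unknown, so the least-squares projection absorbs small errors when θ is only approximately optimal or comes from a clone.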
The 2024 attack against production language models exploits a structural property of Transformer decoders: the final layer is a linear projection W ∈ ℝ^{V × h} from an h-dimensional hidden state to a V-dimensional logit vector over the vocabulary, with V typically in the tens of thousands and h orders of magnitude smaller.[^2][^9] Because rank(W) ≤ h, every logit vector emitted by the model lies in an at-most-h-dimensional subspace of ℝ^V.
The attack collects a large set of logit vectors by issuing diverse prompts to the API and stacks them column-wise into a matrix L ∈ ℝ^{V × N}, with N larger than h. A full softmax response is not required: the authors used the OpenAI logprobs and logit_bias parameters to recover sufficient information per query. They then performed a singular value decomposition of L. The number of non-trivial singular values reveals h (the model's hidden dimension), and the left singular vectors span the column space of W, recovering the projection matrix up to an orthogonal transformation.[^2][^9]
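The rank argument can be reproduced on synthetic data. The sketch below assumes noise-free logits generated as W·H from random matrices and uses a plain numerical-rank threshold to count singular values; the sizes and the threshold rule are illustrative choices, not the procedure reported in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
V, h, N = 500, 32, 200             # vocabulary size, hidden width, number of queries (N > h)

W = rng.normal(size=(V, h))        # secret output projection
H = rng.normal(size=(h, N))        # final hidden states, never observed by the attacker
L = W @ H                          # observed logit vectors, stacked column-wise

U, s, _ = np.linalg.svd(L, full_matrices=False)

# Count singular values that stand clear of floating-point noise.
tol = s[0] * max(L.shape) * np.finfo(L.dtype).eps
h_hat = int(np.sum(s > tol))

# The leading left singular vectors form an orthonormal basis for the column space of W.
W_basis = U[:, :h_hat]
residual = W - W_basis @ (W_basis.T @ W)
print(h_hat, np.linalg.norm(residual) / np.linalg.norm(W))
```

Logits reconstructed from a real API are noisy, so in practice the cut-off would have to be chosen from a visible gap in the singular-value spectrum instead of a machine-precision threshold.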
Reported headline results:[^2][^9]
| Target | Reported hidden dim | Reported cost | Status |
|---|---|---|---|
| OpenAI ada | 1024 | under $20 USD | Full output-projection matrix extracted. |
| OpenAI babbage | 2048 | under $20 USD | Full output-projection matrix extracted. |
| OpenAI gpt-3.5-turbo | Recovered (not publicly disclosed) | Estimated under $2,000 for full matrix | Hidden dim confirmed with OpenAI; numeric value withheld at OpenAI's request. |
The researchers stressed that the attack does not reveal the entire model: only the final linear projection and the hidden dimensionality are recovered, and the recovery is up to symmetry (right multiplication of W by an h × h orthogonal matrix).[^2][^9] Even so, the recovered information leaks competitively sensitive architectural details and enables downstream attacks such as more efficient adversarial-example transfer and finer fingerprinting of API-served models.[^9]
The Jagielski et al. 2020 and Carlini-Jagielski-Mironov CRYPTO 2020 attacks rely on a different structural property: a network built from ReLU activations (e.g. a ReLU MLP) computes a piecewise-affine function, and the discontinuities in its gradient (the "kinks") reveal individual neuron boundaries.[^4][^8] By probing finely along carefully chosen lines in the input space, the adversary can detect each kink, recover each neuron's hyperplane, and solve for weight vectors up to a small set of symmetries. Subsequent work extended these techniques to deeper networks and to hard-label (label-only) settings, sacrificing query efficiency for reduced API access.[^4][^8]
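Kink detection along a line can be sketched with a simple bisection on finite-difference slopes. This is an illustrative heuristic under stated assumptions (a one-hidden-layer ReLU victim with scalar output, exact floating-point responses, hypothetical names `find_kinks` and `victim`); the published attacks use more query-efficient search procedures, and this version can miss kinks whose slope changes cancel within an interval.

```python
import numpy as np

def find_kinks(g, lo, hi, width_tol=1e-6, eps=1e-8, slope_tol=1e-4):
    """Locate slope changes of the scalar function g on [lo, hi] by bisection."""
    slope_left = (g(lo + eps) - g(lo)) / eps
    slope_right = (g(hi) - g(hi - eps)) / eps
    if abs(slope_left - slope_right) < slope_tol:
        return []                       # both ends look like the same affine piece
    if hi - lo < width_tol:
        return [0.5 * (lo + hi)]        # interval is small enough: report the kink
    mid = 0.5 * (lo + hi)
    return (find_kinks(g, lo, mid, width_tol, eps, slope_tol)
            + find_kinks(g, mid, hi, width_tol, eps, slope_tol))

rng = np.random.default_rng(3)
d, k = 6, 4
W, b, a = rng.normal(size=(k, d)), rng.normal(size=k), rng.normal(size=k)

def victim(x):
    # One hidden ReLU layer with k neurons and a scalar output.
    return float(a @ np.maximum(W @ x + b, 0.0))

x0, v = rng.normal(size=d), rng.normal(size=d)
found = find_kinks(lambda t: victim(x0 + t * v), -3.0, 3.0)

# Ground truth for comparison: neuron i switches where W[i] @ (x0 + t * v) + b[i] = 0.
t_true = -(W @ x0 + b) / (W @ v)
expected = np.sort(t_true[(t_true > -3.0) & (t_true < 3.0)])
print(np.sort(found), expected)
```

Each detected point x0 + t·v lies on one neuron's hyperplane; collecting such points along several lines is what lets the attacker solve for that neuron's weight vector up to scale.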
A distinct, more pragmatic variant emerged with the open-weights ecosystem of 2023 and 2024: rather than reconstruct the victim's parameters, the adversary trains an open-weight base model (typically a LLaMA, Llama 2 or Llama 3 checkpoint) to imitate a closed model's outputs via supervised fine-tuning on harvested completions.[^13] The original demonstration was Stanford's Alpaca project (March 2023), which fine-tuned a 7B LLaMA on 52,000 instruction-response pairs generated by querying OpenAI's text-davinci-003; the resulting model approached ChatGPT quality on instruction-following benchmarks at a reported training cost of roughly $600.[^13] Alpaca was voluntarily withdrawn after concerns about safety and OpenAI terms of service, but the methodology was adopted at scale by Vicuna and later by commercial Chinese labs.[^13]
OpenAI's standard API terms forbid using outputs to train competing models; Anthropic, Google and other API providers impose comparable restrictions.[^11][^14] In early 2025, OpenAI and Microsoft publicly alleged that DeepSeek had used distillation against OpenAI APIs in building its DeepSeek-R1 reasoning model, and in February 2026 Anthropic reported what it described as industrial-scale distillation activity against ChatGPT and Claude.[^11][^12] The distinction between this activity and ordinary benign use is fundamentally one of intent and scale, not technique; the legal status remains unsettled.[^14][^15]
A 2024 follow-up line of work, including "Can't Hide Behind the API: Stealing Black-Box Commercial Embedding Models" (Wallace et al., arXiv:2406.09355), extends Carlini-style recovery to commercial embedding APIs such as those offered by OpenAI and Amazon Bedrock.[^16] Because embedding APIs return a fixed-dimensional vector per input, the structural assumptions are different from generation APIs but the linear-algebra recovery toolkit is closely related.[^16]
Beyond text models, Knockoff Nets and its descendants targeted commercial image classifiers; Wang and Liu's "Stealing GANs" (2018-2021) extended the framework to generative adversarial networks; and "Hard-Label Cryptanalytic Extraction of Neural Network Models" (ASIACRYPT 2024) sharpened the cryptanalytic line for the label-only access regime.[^17][^8]
The 2024 attack was conducted under a coordinated disclosure agreement. According to a companion blog post by the Carlini team and follow-up reporting, the researchers notified OpenAI and Google in late 2023; Google deployed mitigations first, and OpenAI followed with API changes on or around 3 March 2024.[^9][^18]
OpenAI's mitigation centred on two changes to the chat-completion and completions endpoints:[^9][^18]
- Blocking the combined use of logprobs and logit_bias in a single request for the models known to be affected. Because the attack relied on iteratively biasing logits while observing top-k log-probabilities to reconstruct full logit vectors, blocking this combination reduced the per-query information yield substantially and pushed the cost of completing the attack on then-current models out of reach for the budgets demonstrated.[^9][^18]
- Tightening the limits on logit_bias so that fewer tokens can be re-weighted in any single request and the magnitude of permitted bias is constrained.[^18]

OpenAI did not publicly disclose the recovered hidden dimension of gpt-3.5-turbo, and the researchers honoured that request in their published paper.[^2][^9] The attack, as published, therefore demonstrates a vulnerability class rather than a deployable exploit against current-generation production APIs.[^9]
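The logit-reconstruction step that the combined-parameter restriction targets can be sketched against a simulated endpoint. Everything here is an assumption made for illustration: `make_api` is a hypothetical stand-in that applies a bias and returns the top-k log-probabilities, not OpenAI's actual interface, and the bias value is chosen only to exceed the spread of the synthetic logits.

```python
import numpy as np

def make_api(logits):
    """Hypothetical endpoint: applies logit_bias, returns the top_k log-probabilities."""
    def query(logit_bias, top_k):
        z = logits.copy()
        for tok, bias in logit_bias.items():
            z[tok] += bias
        logprobs = z - (z.max() + np.log(np.sum(np.exp(z - z.max()))))
        top = np.argsort(logprobs)[::-1][:top_k]
        return {int(t): float(logprobs[t]) for t in top}
    return query

def reconstruct_logits(query, vocab_size, top_k=5, bias=50.0):
    # The unbiased top token serves as a reference that stays visible in every response.
    ref = next(iter(query({}, 1)))
    rel = np.zeros(vocab_size)          # logits relative to the reference token
    others = [t for t in range(vocab_size) if t != ref]
    for i in range(0, len(others), top_k - 1):
        batch = others[i:i + top_k - 1]
        # A large bias lifts the batch into the top-k alongside the reference token.
        out = query({t: bias for t in batch}, top_k)
        for t in batch:
            # The softmax normaliser cancels in the difference of two log-probabilities.
            rel[t] = out[t] - out[ref] - bias
    return ref, rel

rng = np.random.default_rng(4)
true_logits = 3.0 * rng.normal(size=40)
query = make_api(true_logits)
ref, rel = reconstruct_logits(query, vocab_size=40)
print(np.max(np.abs(rel - (true_logits - true_logits[ref]))))
```

Only logit differences are recoverable this way, since the softmax is invariant to a constant shift. Without the ability to bias tokens and read log-probabilities in the same request, each query reveals at most the unbiased top-k entries.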
Model extraction has consequences for several stakeholder groups:
- Model extraction targets the parameters θ, while training-data extraction targets the training set D. Membership inference (does record r belong to D?) is also distinct.[^4]

Defences against extraction fall into four broad categories. None is a complete answer, and almost all impose utility costs on legitimate users.
The simplest defence is to expose only argmax labels, withholding probabilities and logits. This eliminates the most powerful equation-solving and SVD-based attacks but degrades the utility of the API for legitimate use cases such as calibrated decision-making and downstream supervised fine-tuning.[^1] The Carlini-era OpenAI patch is a targeted variant of this defence, limiting only the specific combinations of logprobs and logit_bias that enable logit reconstruction.[^9][^18]
Orekondy, Schiele and Fritz's "Prediction Poisoning" (ICLR 2020, arXiv:1906.10908) introduced a utility-constrained defence in which the API actively perturbs the softmax output along directions that maximally distort the gradient of a hypothetical attacker's training loss while preserving the argmax label.[^19] The defence amplified an attacker's clone error rate by up to 85x for the evaluated benchmarks, with minor impact on benign accuracy.[^19] Earlier proposals using simple confidence rounding or noise injection were largely defeated by Tramèr et al.'s equation-solving attacks.[^1]
Mika Juuti, Sebastian Szyller, Samuel Marchal and N. Asokan's "PRADA: Protecting Against DNN Model Stealing Attacks" (EuroS&P 2019, arXiv:1805.02628) monitors the statistical distribution of consecutive queries from each client and raises an alarm when the distribution deviates from a benign reference.[^20] Extraction attacks tend to issue diverse, uncorrelated queries to maximise information gain, producing detectable signatures.[^20] In 2026, Anthropic publicly described an internal pipeline along similar lines, combining behavioural fingerprinting, infrastructure correlation across many accounts, and statistical tests against the expected power-law distribution of organic prompts.[^12]
API-level rate limiting and per-account budgets are a cruder operationalisation of the same idea; they raise extraction costs without preventing the attack.[^21]
A complementary defensive line equips the model with verifiable provenance signals. "Entangled Watermarks as a Defense against Model Extraction" (Hengrui Jia, Christopher A. Choquette-Choo, Varun Chandrasekaran and Nicolas Papernot, USENIX Security 2021, arXiv:2002.12200) trains a victim model so that watermark behaviour is entangled with primary-task behaviour, forcing any stolen substitute to either reproduce the watermark (revealing its provenance) or sacrifice utility on the primary task.[^22] Related work on canary inputs and embedded external features pursues similar goals: a defender can probe a suspect commercial deployment with secret trigger inputs and statistically distinguish a stolen substitute from an independently trained one.[^22]
For LLMs specifically, AI watermarking of outputs offers a partial defence against Knowledge Distillation-style extraction: if every output token is drawn from a green-list biased distribution governed by a secret key, downstream students that imitate the outputs may inherit a statistically detectable watermark.[^21] Anthropic's 2026 disclosures and earlier OpenAI proposals discussed but did not deploy production output watermarking at scale, citing utility and adversarial-removal concerns.[^12]
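The statistical test behind green-list detection can be sketched as follows. The hashing scheme, the key string and the function names are assumptions chosen for illustration, and the example tests the watermarked text directly; whether a distilled student's outputs retain enough of the bias to cross the threshold is the empirical question the paragraph above raises.

```python
import hashlib
import math
import random

def is_green(key, prev_token, token):
    """Keyed pseudo-random split of the vocabulary, reseeded by the previous token."""
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    return digest[0] < 128              # about half of all tokens are green

def green_z_score(tokens, key, gamma=0.5):
    """z-score of the green-token count against the no-watermark null hypothesis."""
    pairs = list(zip(tokens[:-1], tokens[1:]))
    hits = sum(is_green(key, p, t) for p, t in pairs)
    n = len(pairs)
    return (hits - gamma * n) / math.sqrt(n * gamma * (1.0 - gamma))

def sample_watermarked(n, vocab, key, rng):
    # Toy generator with a hard watermark: red-list candidates are always rejected.
    tokens = [rng.randrange(vocab)]
    while len(tokens) < n:
        candidate = rng.randrange(vocab)
        if is_green(key, tokens[-1], candidate):
            tokens.append(candidate)
    return tokens

rng = random.Random(0)
marked = sample_watermarked(400, 1000, "secret", rng)
plain = [rng.randrange(1000) for _ in range(400)]
z_marked = green_z_score(marked, "secret")
z_plain = green_z_score(plain, "secret")
print(z_marked, z_plain)
```

A defender holding the key would flag a suspect model when the z-score of its sampled outputs is far above what chance allows; a soft watermark that only biases sampling towards green tokens would yield a smaller but still growing score with text length.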
A more fundamental line, informed by differential privacy and information-theoretic accounting, calibrates the API's response so that the total Shannon information about θ released per query is bounded.[^4] In the limit, this collapses to argmax responses, but intermediate trade-offs are possible: a noisy logit response with calibrated noise can preserve calibration utility for legitimate clients while bounding the attacker's per-query progress towards θ.
In the United States, the primary legal hook against extraction is contract: API providers' terms of service typically prohibit (i) automated programmatic extraction beyond rate limits, (ii) reverse engineering, and (iii) using outputs to train competing models.[^14][^15] Breach of these clauses is a breach of contract claim against an identifiable counter-party; violations have been the basis for account terminations against suspected DeepSeek-affiliated developers in 2024 and 2025, and for civil litigation threats.[^11][^15] OpenAI's account-level enforcement against jailbreak-distillation accounts in 2024 and Anthropic's bulk termination of approximately 24,000 accounts in 2026 are operational expressions of the same contractual basis.[^12]
Where contractual claims are unavailable (third-party adversaries, foreign jurisdictions), providers have argued that model weights and key hyperparameters are protectable trade secrets under the US Defend Trade Secrets Act and equivalent regimes. The doctrine is well-suited to weights, which are non-public, valuable, and subject to reasonable secrecy measures, but largely untested for the functional capabilities leaked by clone models.[^15]
Whether model weights themselves are copyrightable subject matter, and whether circumvention of a paid API constitutes a Digital Millennium Copyright Act anti-circumvention violation, are open questions in US law as of 2026.[^15] Some commentators have argued that fine-tuning a competing model on harvested outputs implicates the derivative-works right; others argue that learning from outputs is no more copyright-implicating than learning from publicly available text.[^15]
The EU AI Act, in force in stages from 2024 onwards, addresses model extraction primarily through general-purpose AI obligations and through the interaction between trade-secret protection and disclosure duties.[^23] The Act requires providers of general-purpose AI to publish a summary of training data while explicitly preserving the protection of trade secrets and confidential business information, creating a structural tension that extraction attacks exacerbate by making it harder to keep architectural details such as hidden dimension confidential.[^23] Database-rights jurisprudence (separate from copyright) provides a possible additional layer of protection against bulk extraction and re-utilisation of the model's "output database", but its applicability to ML model outputs is unsettled in EU law.[^15]
Several limitations are common to the published extraction literature:
- The headline costs (under $20 for ada and babbage, ≈$2,000 estimated for gpt-3.5-turbo) are unusually favourable. Frontier-scale models with larger hidden dimensions, stricter API rate limits, and (post-March 2024) tighter logprobs and logit_bias policies push the corresponding cost into a regime where extraction is no longer trivially cheap.[^2][^9][^18]

Several attack classes are routinely confused with model extraction in popular coverage. The key distinctions:
| Attack | Targets | Relation to extraction |
|---|---|---|
| Model extraction | Parameters / functionality of f_θ | The topic of this article. |
| Membership inference | Whether a specific record belongs to training set D | A privacy attack; orthogonal to extraction, although both exploit overfitting signals.[^4] |
| Training-data extraction | Recovery of memorised training inputs from f_θ | Related but targets D, not θ. |
| Model inversion | Reconstructing typical inputs of a class | A privacy attack about input space, not parameter space. |
| Adversarial examples | Inputs that fool f_θ | Often supercharged by prior extraction, but conceptually distinct. |
| Prompt injection | Manipulating LLM behaviour via crafted prompts | An access-control attack on an agentic system; not extraction. |
| Data poisoning | Corrupting training data | Affects training, not inference-time extraction. |
| Jailbreak | Eliciting policy-violating outputs | Distinct goal, but jailbreaks can be a stepping stone to distillation. |
Model extraction is sometimes referred to in industry literature as model theft or model stealing; aiwiki maintains a Model stealing article that treats commercial and operational aspects in more detail, while the present article concentrates on the technical and academic taxonomy.