# Model extraction attack

> Source: https://aiwiki.ai/wiki/model_extraction_attack
> Updated: 2026-07-16
> Categories: AI Safety, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

A **model extraction attack** is a class of machine learning security attacks in which an adversary, restricted to black-box query access to a target model (typically through a paid prediction API), uses the responses to construct a local "stolen" replica that approximates the target's behaviour, recover specific hyperparameters, or, in the strongest variants, extract individual weight matrices.[^1][^2] The canonical formulation was introduced by Florian Tramèr, Fan Zhang, Ari Juels, Michael K. Reiter and Thomas Ristenpart in their 2016 USENIX Security paper "Stealing Machine Learning Models via Prediction APIs", which demonstrated practical extraction of logistic regression, neural network and decision tree models deployed on BigML and Amazon Machine Learning.[^1] The threat regained prominence in 2024 when a Google DeepMind team led by Nicholas Carlini recovered the entire output-projection matrix of OpenAI's `ada` and `babbage` models, and the exact hidden dimension of `gpt-3.5-turbo`, by issuing top-logit queries to public APIs.[^2] Model extraction is conceptually distinct from membership inference (which targets training data rather than model parameters) and from training-data extraction (which recovers memorised inputs); it is, however, closely entangled with [Knowledge Distillation](/wiki/knowledge_distillation), from which most modern functional-approximation variants directly inherit their training objectives.[^3]

## Threat model

Model extraction is defined with respect to a specific access regime. The defender holds a target model `f_θ` with parameters `θ` and exposes a prediction interface; the adversary issues queries `x` and observes responses `f(x)`. The richness of the response determines what can be recovered.[^1]

The 2016 Tramèr et al. paper formalised three response granularities that remain the working taxonomy:[^1]

| Access level | Information returned | Typical extraction outcome |
| --- | --- | --- |
| Label only | Argmax class | Functional approximation via active learning, with the highest query cost. |
| Probability / confidence | Full posterior $$p(y \mid x)$$ over classes | Tight functional clones; equation-solving against parametric models. |
| Logits (full or top-k) | Pre-softmax scores, partial in modern APIs | Recovery of linear projections and hidden dimensions, as in Carlini et al. 2024.[^2] |

White-box extraction is excluded by definition: an attacker who already possesses `θ` has nothing left to steal. Some intermediate scenarios are referred to in the literature as **grey-box**, in which the architecture family (e.g. that a target is a [Transformer](/wiki/transformer)) is known or strongly suspected, but the weights are not.[^4]

Adversary goals are typically partitioned into three objectives, following the 2020 Jagielski et al. analysis "High Accuracy and High Fidelity Extraction of Neural Networks":[^4]

1. **Accuracy.** Build a substitute model that performs well on the underlying task, e.g. matches the victim's classification accuracy on a held-out test set.[^4]
2. **Fidelity.** Build a substitute whose predictions agree with the victim's predictions on arbitrary inputs, including out-of-distribution and adversarial ones. Fidelity is strictly stronger than accuracy.[^4]
3. **Functional equivalence.** Recover a model whose mapping is identical to the victim's on the entire input space. Jagielski et al. proved that learning-based strategies cannot generally achieve functional equivalence, and that direct cryptanalytic extraction is required.[^4]
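
The first two objectives reduce to simple agreement rates. A minimal sketch, assuming hard-label predictions are available as arrays (the function names and the toy data are illustrative and do not come from Jagielski et al.):

```python
import numpy as np

def accuracy(student_pred, true_labels):
    """Fraction of inputs on which the substitute matches the ground truth."""
    return float(np.mean(np.asarray(student_pred) == np.asarray(true_labels)))

def fidelity(student_pred, victim_pred):
    """Fraction of inputs on which the substitute matches the victim."""
    return float(np.mean(np.asarray(student_pred) == np.asarray(victim_pred)))

# Toy data: the victim mislabels two of six inputs.
true_labels = np.array([0, 1, 1, 0, 2, 2])
victim_pred = np.array([0, 1, 0, 0, 2, 1])
student_pred = np.array([0, 1, 1, 0, 2, 2])

student_accuracy = accuracy(student_pred, true_labels)
student_fidelity = fidelity(student_pred, victim_pred)
print(student_accuracy, student_fidelity)
```

In this toy case the substitute is perfectly accurate yet agrees with the victim on only four of six inputs, which is why the two objectives are measured separately.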

## History and key milestones

### Pre-history (2009-2015)

Model extraction sits at the intersection of two earlier research strands. The first is membership inference and model inversion against ML APIs, studied from 2014 onwards. The second is [Knowledge Distillation](/wiki/knowledge_distillation), introduced by Geoffrey Hinton, Oriol Vinyals and Jeff Dean in 2015 as a benign technique for compressing models by training a student to match a teacher's temperature-softened [Softmax](/wiki/softmax) outputs, which are computed from the teacher's [Logits](/wiki/logits).[^3] Distillation became the prototypical training procedure for the "stolen" copies produced by extraction attacks; the only substantive difference is that the party training the student does not control the teacher and acts without its owner's consent.

### Tramèr et al. 2016: the canonical paper

Tramèr et al.'s "Stealing Machine Learning Models via Prediction APIs" was presented at the 25th USENIX Security Symposium in Austin, Texas, in August 2016 (arXiv:1609.02943, pages 601-618 of the proceedings).[^1][^5] The paper introduced the term *model extraction attack* in its modern sense and demonstrated several concrete techniques:[^1]

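- **Equation-solving attacks** against binary logistic regression, multiclass (softmax) logistic regression and multi-layer perceptrons: each confidence-score response yields one equation in the unknown parameters. For binary logistic regression the equations become linear in the parameters after a logit transform and can be solved in closed form; for the multiclass and neural-network cases the system is non-linear and is solved numerically.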
- **Path-finding attacks** against decision trees served by BigML, exploiting the platform's exposure of node-level confidence values to traverse and reconstruct the underlying tree.
- **Retraining attacks** for non-linear models, in which the adversary trains a substitute on labels obtained from the victim.

The authors evaluated the attacks against BigML and Amazon Machine Learning and reported near-perfect fidelity for the targeted model families.[^1] The paper also discussed defences (rounding confidences, omitting confidence scores) and showed that, when confidence values are simply truncated, equation-solving still succeeds for many model families.[^1]

### Hyperparameter stealing (2018)

Binghui Wang and Neil Zhenqiang Gong's "Stealing Hyperparameters in Machine Learning" (IEEE Symposium on Security and Privacy 2018, arXiv:1802.05351) extended the threat model to recovery of confidential hyperparameters, including regularisation constants and kernel parameters, for ridge regression, logistic regression, SVMs and neural networks.[^6] The attack was demonstrated against Amazon Machine Learning, illustrating that even when weights are protected, leakage of the loss-function or regulariser can have substantial commercial value because such choices encode proprietary modelling decisions.[^6]

### Functional-clone era (2019)

Tribhuvanesh Orekondy, Bernt Schiele and Mario Fritz presented "Knockoff Nets: Stealing Functionality of Black-Box Models" at CVPR 2019.[^7] Their method queries the victim with random images drawn from an unrelated public dataset, then trains a high-capacity convolutional clone on the resulting (image, prediction) pairs. The attack achieved competitive accuracy on the victim's task without any in-distribution data, demonstrating that prediction APIs can leak functionality even when the query distribution is unrelated to the training distribution.[^7]

### Cryptanalytic extraction (2020)

Two 2020 papers raised the bar substantially. "High Accuracy and High Fidelity Extraction of Neural Networks" (Matthew Jagielski, Nicholas Carlini, David Berthelot, Alex Kurakin and Nicolas Papernot, USENIX Security 2020, arXiv:1909.01838) formalised the accuracy / fidelity / functional-equivalence hierarchy and demonstrated the first practical functionally-equivalent extraction of a two-layer ReLU network, treating extraction as a system of piecewise-linear equations.[^4] In parallel, "Cryptanalytic Extraction of Neural Network Models" (Carlini, Jagielski and Ilya Mironov, CRYPTO 2020) used differential analysis of ReLU activations to recover weights up to floating-point precision with roughly 100x fewer queries than the prior state of the art, extracting a 100,000-parameter MNIST network in under an hour.[^8]

### LLM-era attacks (2024-2026)

The 2024 paper "Stealing Part of a Production Language Model" (Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A. Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Itay Yona, Eric Wallace, David Rolnick, Florian Tramèr; arXiv:2403.06634; submitted 11 March 2024) extended the cryptanalytic line into the era of frontier LLMs.[^2][^9] It was selected as an outstanding paper at the 41st International Conference on Machine Learning ([ICML](/wiki/icml) 2024).[^10] Concurrently, large-scale **jailbreak distillation** controversies, including OpenAI's allegations against [DeepSeek](/wiki/deepseek) and [Anthropic](/wiki/anthropic)'s reports of coordinated extraction attempts on Claude, moved the topic from academic curiosity to commercial and regulatory front pages.[^11][^12]

## Technical details

### Functional approximation

In its simplest form, a functional-approximation extraction attack proceeds in four steps:[^1][^7]

1. The adversary assembles a pool of query inputs $$Q = \{x_1, \ldots, x_n\}$$, drawn either from the victim's likely input distribution or from a generic surrogate corpus.
2. For each `x_i`, the adversary records the victim's response $$y_i = f_\theta(x_i)$$. The form of `y_i` depends on the API tier (label, probability vector, top-k [Logits](/wiki/logits)).
3. The adversary trains a student model `g_φ` on the pairs `(x_i, y_i)` using a distillation-style loss such as the [Knowledge Distillation](/wiki/knowledge_distillation) cross-entropy on soft labels.[^3]
4. The student is evaluated for accuracy and fidelity against the victim, often by issuing additional comparison queries.[^4]
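
The four steps can be sketched end to end on a toy problem. This is a minimal illustration, assuming the victim is a small softmax-regression model behind a hypothetical `victim_api` function that returns full probability vectors; the dimensions, learning rate and query distribution are arbitrary choices, not values from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical victim: parameters are hidden behind victim_api.
d, k = 10, 3
W_victim = rng.normal(size=(d, k))
b_victim = rng.normal(size=k)

def victim_api(x):
    return softmax(x @ W_victim + b_victim)

# Step 1: assemble a query pool from a surrogate distribution.
X = rng.normal(size=(2000, d))
# Step 2: record the victim's soft-label responses.
Y = victim_api(X)
# Step 3: train a student with cross-entropy on the soft labels.
W, b = np.zeros((d, k)), np.zeros(k)
learning_rate = 1.0
for _ in range(500):
    P = softmax(X @ W + b)
    G = (P - Y) / len(X)          # gradient of the loss w.r.t. the logits
    W -= learning_rate * (X.T @ G)
    b -= learning_rate * G.sum(axis=0)
# Step 4: measure label agreement (fidelity) on fresh comparison queries.
X_test = rng.normal(size=(1000, d))
student_labels = (X_test @ W + b).argmax(axis=1)
victim_labels = victim_api(X_test).argmax(axis=1)
clone_fidelity = float(np.mean(student_labels == victim_labels))
print(f"label agreement on fresh queries: {clone_fidelity:.3f}")
```

Real attacks differ mainly in scale: the student is a deep network, the query pool is natural data, and the query budget is constrained by API pricing.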

When the victim is a parametric model whose predictions are a deterministic function of a small number of parameters, step 3 can be replaced by closed-form solution. Tramèr et al. observed that for `d`-dimensional logistic regression, exactly $$d + 1$$ queries returning full probability outputs suffice to solve for the weight vector and bias, modulo numerical conditioning.[^1]
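
For binary logistic regression the closed-form route follows from the identity `log(p / (1 - p)) = w·x + b`: each query contributes one linear equation in the `d + 1` unknowns. A minimal sketch, assuming a hypothetical `victim_api` that returns the unrounded positive-class probability:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical victim: a d-dimensional binary logistic regression.
d = 5
w_true = rng.normal(size=d)
b_true = 0.3

def victim_api(x):
    return 1.0 / (1.0 + np.exp(-(x @ w_true + b_true)))

# d + 1 queries give d + 1 linear equations in (w, b).
X = rng.normal(size=(d + 1, d))
p = victim_api(X)
A = np.hstack([X, np.ones((d + 1, 1))])   # last column multiplies the bias
log_odds = np.log(p / (1.0 - p))          # inverts the sigmoid
theta = np.linalg.solve(A, log_odds)
w_hat, b_hat = theta[:-1], theta[-1]

print(np.max(np.abs(w_hat - w_true)), abs(b_hat - b_true))
```

The "modulo numerical conditioning" caveat shows up here as the conditioning of `A` and the loss of precision in `1 - p` when probabilities saturate; rounding the returned confidences degrades the solution for the same reason.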

### Hyperparameter recovery

Wang and Gong's 2018 method exploits a stationarity condition. At an optimum of a regularised loss $$L(\theta) = L_{\text{data}}(\theta) + \lambda R(\theta)$$, the gradient vanishes, giving $$\nabla L_{\text{data}}(\theta) = -\lambda \nabla R(\theta)$$. The adversary observes the trained parameters (or a sufficiently faithful clone), evaluates the two gradients on a labelled dataset, and solves the resulting linear system for `λ`.[^6] The technique generalises to several regularisers and to kernel hyperparameters when the kernel matrix can be reconstructed.[^6]
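
For ridge regression the stationarity condition can be checked directly. The sketch below assumes the objective `||Xθ - y||² + λ||θ||²` and that the adversary already holds the trained `θ` and a labelled dataset; it estimates `λ` by least squares over the `d` gradient components. The data and the value of `lam_true` are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic training set and the defender's secret regularisation constant.
n, d, lam_true = 200, 8, 3.5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Defender's training step: closed-form ridge solution.
theta = np.linalg.solve(X.T @ X + lam_true * np.eye(d), X.T @ y)

# Adversary's step: evaluate both gradients at theta.
grad_data = 2.0 * (X.T @ (X @ theta - y))   # gradient of ||X theta - y||^2
grad_reg = 2.0 * theta                      # gradient of ||theta||^2

# Solve grad_data = -lambda * grad_reg in the least-squares sense.
lam_hat = -(grad_reg @ grad_data) / (grad_reg @ grad_reg)
print(lam_hat)
```

The system is overdetermined (`d` equations, one unknown), so an approximate `θ`, for example from a clone, still yields a usable estimate, with error that grows with the clone's parameter error.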

### The Carlini et al. 2024 logit attack

The 2024 attack against production language models exploits a structural property of [Transformer](/wiki/transformer) decoders: the final layer is a linear projection $$W \in \mathbb{R}^{V \times h}$$ from an `h`-dimensional hidden state to a `V`-dimensional logit vector over the vocabulary, with `V` typically in the tens of thousands or more and `h` considerably smaller.[^2][^9] Because $$\mathrm{rank}(W) \le h$$, every logit vector emitted by the model lies in an at-most-`h`-dimensional subspace of $$\mathbb{R}^V$$.

The attack collects a large set of logit vectors by issuing diverse prompts to the API and stacks them column-wise into a matrix $$L \in \mathbb{R}^{V \times N}$$ with `N` larger than `h`. A [Full Softmax](/wiki/full_softmax) response is not required: the authors used the OpenAI `logprobs` and `logit_bias` parameters to recover sufficient information per query. They then performed a singular value decomposition of `L`. The number of non-negligible singular values reveals `h` (the model's hidden dimension), and the leading `h` left singular vectors span the column space of `W`, which determines the projection matrix up to an unknown invertible `h × h` linear transformation.[^2][^9]
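
The rank argument is easy to reproduce on synthetic data. This sketch assumes complete logit vectors are already available (the real attack has to reconstruct them from restricted API outputs) and uses a simple relative threshold on the singular values; the sizes `V`, `h` and `N` are toy values.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy "model": hidden size h is the secret the attacker wants to learn.
V, h, N = 500, 32, 200
W = rng.normal(size=(V, h))      # output projection (unknown to the attacker)
H = rng.normal(size=(h, N))      # final hidden states for N different prompts
L = W @ H                        # observed logit vectors, one per column

# Attacker's side: only L is used from here on.
U, s, _ = np.linalg.svd(L, full_matrices=False)
h_hat = int(np.sum(s > s[0] * 1e-8))   # count non-negligible singular values
U_h = U[:, :h_hat]                      # orthonormal basis for the column space

# Sanity check with the secret W: it lies in the recovered subspace.
residual = W - U_h @ (U_h.T @ W)
print(h_hat, float(np.abs(residual).max()))
```

The check confirms that `W` lies in the span of the recovered basis, so `W = U_h @ A` for some `h × h` matrix `A` that the singular value decomposition alone does not determine.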

Reported headline results:[^2][^9]

| Target | Reported hidden dim | Reported cost | Status |
| --- | --- | --- | --- |
| OpenAI `ada` | 1024 | under \$20 USD | Full output-projection matrix extracted. |
| OpenAI `babbage` | 2048 | under \$20 USD | Full output-projection matrix extracted. |
| OpenAI `gpt-3.5-turbo` | Recovered (not publicly disclosed) | Estimated under \$2,000 for full matrix | Hidden dim confirmed with OpenAI; numeric value withheld at OpenAI's request. |

The researchers stressed that the attack does not reveal the entire model: only the final linear projection and the hidden dimensionality are recovered, and the projection is recovered only up to symmetry (right multiplication of `W` by an unknown invertible `h × h` matrix).[^2][^9] Even so, the recovered information leaks competitively sensitive architectural details and enables downstream attacks such as more efficient adversarial-example transfer and finer fingerprinting of API-served models.[^9]

### Cryptanalytic and hard-label extraction

The Jagielski et al. 2020 and Carlini-Jagielski-Mironov CRYPTO 2020 attacks rely on a different structural property: for piecewise-linear networks (e.g. ReLU MLPs), the function is piecewise affine, and the discontinuities in the gradient (the "kinks") reveal individual neuron boundaries.[^4][^8] By probing finely along carefully chosen lines in the input space, the adversary can detect each kink, recover each neuron's hyperplane, and solve for weight vectors up to a small set of symmetries. Subsequent work extended these techniques to deeper networks and to hard-label (label-only) settings, sacrificing query efficiency for reduced API access.[^4][^8]
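
Kink detection along a line can be illustrated with a bisection that tests each half-interval for linearity. This is a simplified sketch, not the procedure from either paper: it assumes a one-hidden-layer ReLU network with hand-picked weights, a hypothetical `victim_api` returning the real-valued output, and a search interval known to contain exactly one kink.

```python
import numpy as np

# Hypothetical victim: f(x) = a . relu(W x + b), with hidden parameters.
W = np.array([[1.0, 0.3], [1.0, -0.7], [1.0, 0.2]])
b = np.array([1.0, -0.5, -2.0])
a = np.array([1.0, -2.0, 0.5])

def victim_api(x):
    return float(a @ np.maximum(W @ x + b, 0.0))

# Probe along the line x(t) = x0 + t * v. With these weights the three
# neurons switch on at t = -1.0, 0.5 and 2.0 respectively.
x0, v = np.zeros(2), np.array([1.0, 0.0])

def g(t):
    return victim_api(x0 + t * v)

def is_linear(lo, hi, tol=1e-9):
    """A segment is kink-free if its midpoint lies on the chord."""
    mid = 0.5 * (lo + hi)
    return abs(g(mid) - 0.5 * (g(lo) + g(hi))) < tol

def find_kink(lo, hi, iterations=50):
    """Bisection, assuming [lo, hi] contains exactly one kink."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if is_linear(lo, mid):
            lo = mid      # left half is affine, so the kink is to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

t_kink = find_kink(0.0, 1.5)
print(t_kink)
```

The recovered point `x0 + t_kink * v` lies on the hyperplane of the second neuron; repeating the search along several lines yields enough points to fit that hyperplane, which is the step that precedes solving for the weights.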

## Variants

### Jailbreak distillation as soft model extraction

A distinct, more pragmatic variant emerged with the open-weights ecosystem of 2023 and 2024: rather than reconstruct the victim's parameters, the adversary trains an open-weight base model (typically a [LLaMA](/wiki/llama), [Llama 2](/wiki/llama_2) or [Llama 3](/wiki/llama_3) checkpoint) to imitate a closed model's outputs via [Supervised fine-tuning](/wiki/supervised_fine-tuning) on harvested completions.[^13] The original demonstration was Stanford's Alpaca project (March 2023), which fine-tuned a 7B LLaMA on 52,000 instruction-response pairs generated by querying OpenAI's `text-davinci-003`; the authors reported that the resulting model behaved qualitatively similarly to `text-davinci-003` in a preliminary instruction-following evaluation, at a reported cost of roughly \$600.[^13] Alpaca was voluntarily withdrawn after concerns about safety and OpenAI terms of service, but the methodology was adopted at scale by [Vicuna (language model)](/wiki/vicuna) and later by commercial Chinese labs.[^13]

OpenAI's standard API terms forbid using outputs to train competing models; Anthropic, Google and other API providers impose comparable restrictions.[^11][^14] In early 2025, OpenAI and Microsoft publicly alleged that [DeepSeek](/wiki/deepseek) had used distillation against OpenAI APIs in building its [DeepSeek-R1](/wiki/deepseek_r1) reasoning model, and in February 2026 [Anthropic](/wiki/anthropic) reported what it described as industrial-scale distillation activity against Claude.[^11][^12] The distinction between this activity and ordinary benign use is fundamentally one of *intent* and *scale*, not technique; the legal status remains unsettled.[^14][^15]

### Embedding-model extraction

A 2024 follow-up line of work, including "Can't Hide Behind the API: Stealing Black-Box Commercial Embedding Models" (Wallace et al., arXiv:2406.09355), extends Carlini-style recovery to commercial embedding APIs such as those offered by [OpenAI API](/wiki/openai_api) and [Amazon Bedrock](/wiki/amazon_bedrock).[^16] Because [Embeddings](/wiki/embeddings) APIs return a fixed-dimensional vector per input, the structural assumptions are different from generation APIs but the linear-algebra recovery toolkit is closely related.[^16]

### Vision and structured-prediction variants

Beyond text models, Knockoff Nets and its descendants targeted commercial image classifiers; Wang and Liu's "Stealing GANs" (2018-2021) extended the framework to generative adversarial networks; and "Hard-Label Cryptanalytic Extraction of Neural Network Models" (ASIACRYPT 2024) sharpened the cryptanalytic line for the label-only access regime.[^17][^8]

## OpenAI's mitigations and disclosure timeline

The 2024 attack was conducted under a coordinated disclosure agreement. According to a companion blog post by the Carlini team and follow-up reporting, the researchers notified OpenAI and Google in late 2023; Google deployed mitigations first, and OpenAI followed with API changes on or around 3 March 2024.[^9][^18]

OpenAI's mitigation centred on two changes to the chat-completion and completions endpoints:[^9][^18]

1. **Removing the ability to combine `logprobs` and `logit_bias` in a single request** for the models known to be affected. Because the attack relied on iteratively biasing logits while observing top-k log-probabilities to reconstruct full logit vectors, blocking this combination reduced the per-query information yield substantially and pushed the cost of completing the attack on then-current models out of reach for the budgets demonstrated.[^9][^18]
2. **Tightening the documented limits on `logit_bias`** so that fewer tokens can be re-weighted in any single request and the magnitude of permitted bias is constrained.[^18]
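
Why the combination matters can be seen in a simulation. The sketch below assumes a hypothetical endpoint, `simulated_api`, that applies a caller-supplied logit bias and returns only the top `K` log-probabilities; it is not the OpenAI API, and the vocabulary size, `K` and bias magnitude are arbitrary. Biasing `K - 1` tokens into the top `K` alongside a fixed reference token exposes their logits relative to that reference.

```python
import numpy as np

rng = np.random.default_rng(4)

V, K, BIAS = 50, 5, 100.0
true_logits = rng.normal(size=V) * 3.0   # hidden from the caller

def simulated_api(logit_bias):
    """Hypothetical endpoint: returns {token: logprob} for the top-K tokens."""
    z = true_logits.copy()
    for token, bias in logit_bias.items():
        z[token] += bias
    log_norm = z.max() + np.log(np.exp(z - z.max()).sum())
    logprobs = z - log_norm
    top = np.argsort(logprobs)[::-1][:K]
    return {int(t): float(logprobs[t]) for t in top}

# Reference token: the unbiased top-1 token, which stays in the top K
# whenever at most K - 1 other tokens are pushed above it.
baseline = simulated_api({})
ref = max(baseline, key=baseline.get)

recovered = np.zeros(V)   # logits relative to the reference token
others = [t for t in range(V) if t != ref]
num_queries = 1
for start in range(0, len(others), K - 1):
    batch = others[start:start + K - 1]
    response = simulated_api({t: BIAS for t in batch})
    num_queries += 1
    for t in batch:
        # The normaliser cancels in the difference of log-probabilities.
        recovered[t] = response[t] - BIAS - response[ref]

print(num_queries, float(np.abs(recovered - (true_logits - true_logits[ref])).max()))
```

The full logit vector is recovered, up to an additive constant, in roughly `V / (K - 1)` queries. Disallowing the bias and the log-probabilities in the same request removes the difference that the reconstruction relies on.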

OpenAI asked that the recovered hidden dimension of `gpt-3.5-turbo` not be published, and the researchers honoured that request in their paper.[^2][^9] The attack, as published, therefore demonstrates a vulnerability class rather than a deployable exploit against current-generation production APIs.[^9]

## Applications and significance

Model extraction has consequences for several stakeholder groups:

- **Model providers** lose proprietary information that may have cost millions of dollars to develop, ranging from full functionality (Knockoff Nets) to hyperparameters (Wang and Gong) to architectural details such as hidden dimension (Carlini et al. 2024).[^1][^6][^2]
- **Downstream security.** A high-fidelity substitute makes it dramatically easier to craft black-box adversarial examples that transfer to the victim model, sharpening attacks against safety filters and content classifiers.[^4][^7]
- **Privacy and copyright.** Extraction can serve as an intermediate step in **training-data extraction** (recovering memorised inputs from the model). The two attacks are conceptually distinct: extraction targets `θ`, while training-data extraction targets the training set `D`. Membership inference (does record `r` belong to `D`?) is also distinct.[^4]
- **Competitive dynamics.** When a frontier model can be approximated for a few thousand dollars in API queries plus a modest fine-tuning budget, the economic moat of capital-intensive pretraining narrows substantially. This is the regulatory concern raised in OpenAI's February 2025 memorandum to the US House Select Committee on the Chinese Communist Party and in Anthropic's 2026 reports.[^11][^12]

## Defences

Defences against extraction fall into five broad categories. None is a complete answer, and almost all impose utility costs on legitimate users.

### Restricting API responses

The simplest defence is to expose only argmax labels, withholding probabilities and logits. This eliminates the most powerful equation-solving and SVD-based attacks but degrades the utility of the API for legitimate use cases such as calibrated decision-making and downstream [Supervised fine-tuning](/wiki/supervised_fine-tuning).[^1] The Carlini-era OpenAI patch is a targeted variant of this defence, limiting only the specific combinations of `logprobs` and `logit_bias` that enable logit reconstruction.[^9][^18]

### Output perturbation and prediction poisoning

Orekondy, Schiele and Fritz's "Prediction Poisoning" (ICLR 2020, arXiv:1906.10908) introduced a utility-constrained defence in which the API actively perturbs the softmax output along directions that maximally distort the gradient of a hypothetical attacker's training loss while preserving the argmax label.[^19] The defence amplified an attacker's clone error rate by up to 85x for the evaluated benchmarks, with minor impact on benign accuracy.[^19] Earlier proposals using simple confidence rounding or noise injection were largely defeated by Tramèr et al.'s equation-solving attacks.[^1]
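
The constraint structure of such a defence (stay on the probability simplex, keep the argmax, respect a distortion budget) can be shown with a much simpler stand-in. The sketch below moves the posterior towards a random distribution rather than along the gradient-maximising direction used in the paper, so it illustrates the utility constraints only; `perturb_posterior` and `epsilon` are illustrative names.

```python
import numpy as np

def perturb_posterior(p, epsilon, rng):
    """Return a perturbed posterior q with the same argmax and ||q - p||_1 <= epsilon.

    Simplified stand-in: the direction is random, not chosen to maximise
    the attacker's gradient distortion.
    """
    p = np.asarray(p, dtype=float)
    target = rng.dirichlet(np.ones(p.size))        # random point on the simplex
    l1 = np.abs(target - p).sum()
    alpha = min(1.0, epsilon / l1) if l1 > 0 else 0.0
    q = (1.0 - alpha) * p + alpha * target         # convex mix stays on the simplex
    while np.argmax(q) != np.argmax(p):
        alpha *= 0.5                               # back off until the label survives
        if alpha < 1e-12:
            return p.copy()
        q = (1.0 - alpha) * p + alpha * target
    return q

rng = np.random.default_rng(5)
p = np.array([0.6, 0.3, 0.1])
q = perturb_posterior(p, epsilon=0.2, rng=rng)
print(q, float(np.abs(q - p).sum()))
```

Because the top label is unchanged, clients that consume only the predicted class see no difference, while clients (and attackers) that consume the probabilities receive a distorted signal bounded by `epsilon` in L1 distance.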

### Query-pattern monitoring (PRADA and successors)

Mika Juuti, Sebastian Szyller, Samuel Marchal and N. Asokan's "PRADA: Protecting Against DNN Model Stealing Attacks" (EuroS&P 2019, arXiv:1805.02628) monitors the statistical distribution of consecutive queries from each client and raises an alarm when the distribution deviates from a benign reference.[^20] Extraction attacks tend to issue diverse, uncorrelated queries to maximise information gain, producing detectable signatures.[^20] In 2026, [Anthropic](/wiki/anthropic) publicly described an internal pipeline along similar lines, combining behavioural fingerprinting, infrastructure correlation across many accounts, and statistical tests against the expected power-law distribution of organic prompts.[^12]

API-level rate limiting and per-account budgets are a cruder operationalisation of the same idea; they raise extraction costs without preventing the attack.[^21]

### Watermarking and embedded features

A complementary defensive line equips the model with verifiable provenance signals. "Entangled Watermarks as a Defense against Model Extraction" (Hengrui Jia, Christopher A. Choquette-Choo, Varun Chandrasekaran and Nicolas Papernot, USENIX Security 2021, arXiv:2002.12200) trains a victim model so that watermark behaviour is *entangled* with primary-task behaviour, forcing any stolen substitute to either reproduce the watermark (revealing its provenance) or sacrifice utility on the primary task.[^22] Related work on canary inputs and embedded external features pursues similar goals: a defender can probe a suspect commercial deployment with secret trigger inputs and statistically distinguish a stolen substitute from an independently trained one.[^22]

For LLMs specifically, [AI watermarking](/wiki/ai_watermarking) of outputs offers a partial defence against [Knowledge Distillation](/wiki/knowledge_distillation)-style extraction: if every output token is drawn from a green-list biased distribution governed by a secret key, downstream students that imitate the outputs may inherit a statistically detectable watermark.[^21] Anthropic's 2026 disclosures and earlier OpenAI proposals discussed but did not deploy production output watermarking at scale, citing utility and adversarial-removal concerns.[^12]
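
The detection side of a green-list watermark is a one-sided count test. The sketch below is a toy version under stated assumptions: tokens are integers, the green list for each position is derived from a keyed hash of the previous token, and the "watermarked" text is produced by a generator that only ever emits green tokens (real schemes bias the sampling distribution instead). All names and parameters are illustrative.

```python
import hashlib
import math
import random

GAMMA = 0.5  # assumed fraction of the vocabulary that is green at each step

def is_green(key, prev_token, token, gamma=GAMMA):
    """Keyed pseudo-random partition of the vocabulary, seeded by the previous token."""
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma

def z_score(tokens, key, gamma=GAMMA):
    """Standardised excess of green tokens over the gamma * n expected by chance."""
    n = len(tokens) - 1
    hits = sum(is_green(key, a, b, gamma) for a, b in zip(tokens, tokens[1:]))
    return (hits - gamma * n) / math.sqrt(n * gamma * (1.0 - gamma))

def toy_watermarked_text(n_tokens, vocab_size, key, seed=0):
    """Toy generator: rejection-sample random tokens until one is green."""
    rnd = random.Random(seed)
    tokens = [rnd.randrange(vocab_size)]
    while len(tokens) < n_tokens + 1:
        candidate = rnd.randrange(vocab_size)
        if is_green(key, tokens[-1], candidate):
            tokens.append(candidate)
    return tokens

text = toy_watermarked_text(100, 50_000, key="secret")
z_marked = z_score(text, key="secret")
z_wrong_key = z_score(text, key="other")
print(z_marked, z_wrong_key)
```

With every one of the 100 transitions green, the score under the correct key is 10 standard deviations above chance. A distilled student would be tested the same way on its own generations, where any inherited signal is expected to be weaker.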

### Architectural and information-theoretic defences

A more fundamental line, informed by [Differential privacy](/wiki/differential_privacy) and information-theoretic accounting, calibrates the API's response so that the total Shannon information about `θ` released per query is bounded.[^4] In the limit, this collapses to argmax responses, but intermediate trade-offs are possible: a noisy logit response with calibrated noise can preserve calibration utility for legitimate clients while bounding the attacker's per-query progress towards `θ`.

## Legal and policy dimensions

### Terms of service and contract law

In the United States, the primary legal hook against extraction is contract: API providers' terms of service typically prohibit (i) automated programmatic extraction beyond rate limits, (ii) reverse engineering, and (iii) using outputs to train competing models.[^14][^15] Breach of these clauses gives rise to a breach-of-contract claim against an identifiable counter-party; violations have been the basis for account terminations against suspected DeepSeek-affiliated developers in 2024 and 2025, and for civil litigation threats.[^11][^15] OpenAI's account-level enforcement against jailbreak-distillation accounts in 2024 and Anthropic's bulk termination of approximately 24,000 accounts in 2026 are operational expressions of the same contractual basis.[^12]

### Trade-secret law

Where contractual claims are unavailable (third-party adversaries, foreign jurisdictions), providers have argued that model weights and key hyperparameters are protectable trade secrets under the US Defend Trade Secrets Act and equivalent regimes. The doctrine is well-suited to weights, which are non-public, valuable, and subject to reasonable secrecy measures, but largely untested for the *functional capabilities* leaked by clone models.[^15]

### Copyright and DMCA

Whether model weights themselves are copyrightable subject matter, and whether circumvention of a paid API constitutes a Digital Millennium Copyright Act anti-circumvention violation, are open questions in US law as of 2026.[^15] Some commentators have argued that fine-tuning a competing model on harvested outputs implicates the derivative-works right; others argue that learning from outputs is no more copyright-implicating than learning from publicly available text.[^15]

### EU AI Act

The [EU AI Act](/wiki/eu_ai_act), in force in stages from 2024 onwards, addresses model extraction primarily through general-purpose AI obligations and through the interaction between trade-secret protection and disclosure duties.[^23] The Act requires providers of general-purpose AI to publish a summary of training data while explicitly preserving the protection of trade secrets and confidential business information, creating a structural tension that extraction attacks exacerbate by making it harder to keep architectural details such as hidden dimension confidential.[^23] Database-rights jurisprudence (separate from copyright) provides a possible additional layer of protection against bulk extraction and re-utilisation of the model's "output database", but its applicability to ML model outputs is unsettled in EU law.[^15]

## Limitations and open problems

Several limitations are common to the published extraction literature:

- **Cost in real money.** The Carlini et al. 2024 figures (under \$20 each for `ada` and `babbage`, under an estimated \$2,000 for `gpt-3.5-turbo`) are unusually favourable. Frontier-scale models with larger hidden dimensions, stricter API rate limits, and (post-March 2024) tighter `logprobs` and `logit_bias` policies push the corresponding cost into a regime where extraction is no longer trivially cheap.[^2][^9][^18]
- **Functional equivalence is provably hard.** Jagielski et al. showed that high-fidelity extraction is sometimes attainable but functional equivalence in general is not, except via cryptanalytic methods restricted to small piecewise-linear networks.[^4][^8]
- **Defences trade utility.** Every defence with non-trivial extraction-resistance also reduces the information content of the API for legitimate users. Calibrated probability outputs are valuable in many downstream applications, and removing them imposes real costs on legitimate API consumers.[^19]
- **Attribution is hard.** Distillation-style soft extraction (Alpaca-style) leaves no cryptographic forensic trace by default. Watermarking and entangled watermarks help, but adversaries can fine-tune on small amounts of additional data to attenuate watermark signal at a measurable but tolerable utility cost.[^21][^22]
- **Evaluation gap.** Many published defences (PRADA, Prediction Poisoning) were developed against attacks on image classifiers and small networks; their evaluation against today's frontier LLM APIs is incomplete.[^20][^19]

## Distinction from related attacks

Several attack classes are routinely confused with model extraction in popular coverage. The key distinctions:

| Attack | Targets | Relation to extraction |
| --- | --- | --- |
| Model extraction | Parameters / functionality of `f_θ` | The topic of this article. |
| Membership inference | Whether a specific record belongs to training set `D` | A privacy attack; orthogonal to extraction, although both exploit overfitting signals.[^4] |
| Training-data extraction | Recovery of memorised training inputs from `f_θ` | Related but targets `D`, not `θ`. |
| Model inversion | Reconstructing typical inputs of a class | A privacy attack about input space, not parameter space. |
| Adversarial examples | Inputs that fool `f_θ` | Often supercharged by prior extraction, but conceptually distinct. |
| [Prompt injection](/wiki/prompt_injection) | Manipulating LLM behaviour via crafted prompts | An access-control attack on an agentic system; not extraction. |
| [Data poisoning](/wiki/data_poisoning) | Corrupting training data | Affects training, not inference-time extraction. |
| [Jailbreak (artificial intelligence)](/wiki/jailbreak) | Eliciting policy-violating outputs | Distinct goal, but jailbreaks can be a stepping stone to distillation. |

Model extraction is sometimes referred to in industry literature as **model theft** or **model stealing**; aiwiki maintains a [Model stealing](/wiki/model_stealing) article that treats commercial and operational aspects in more detail, while the present article concentrates on the technical and academic taxonomy.

## See also

- [Model stealing](/wiki/model_stealing)
- [Knowledge Distillation](/wiki/knowledge_distillation)
- [Data poisoning](/wiki/data_poisoning)
- [Prompt injection](/wiki/prompt_injection)
- [Differential privacy](/wiki/differential_privacy)
- [AI watermarking](/wiki/ai_watermarking)
- [Jailbreak (artificial intelligence)](/wiki/jailbreak)
- [DeepSeek-R1](/wiki/deepseek_r1)
- [Logits](/wiki/logits)
- [Softmax](/wiki/softmax)
- [Transformer](/wiki/transformer) (architecture targeted by Carlini et al. 2024)
- [EU AI Act](/wiki/eu_ai_act)
- [OpenAI API](/wiki/openai_api)
- [Anthropic](/wiki/anthropic)
- [Google DeepMind](/wiki/google_deepmind)
- [GPT-3.5](/wiki/gpt-3.5)
- [GPT-2](/wiki/gpt-2)
- [Vicuna (language model)](/wiki/vicuna)
- [LLaMA](/wiki/llama)
- [Supervised fine-tuning](/wiki/supervised_fine-tuning)

## References

[^1]: Florian Tramèr, Fan Zhang, Ari Juels, Michael K. Reiter, Thomas Ristenpart, "Stealing Machine Learning Models via Prediction APIs", USENIX Security Symposium, 2016-08-10. https://arxiv.org/abs/1609.02943. Accessed 2026-05-20.
[^2]: Nicholas Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A. Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Itay Yona, Eric Wallace, David Rolnick, Florian Tramèr, "Stealing Part of a Production Language Model", arXiv preprint, 2024-03-11. https://arxiv.org/abs/2403.06634. Accessed 2026-05-20.
[^3]: Geoffrey Hinton, Oriol Vinyals, Jeff Dean, "Distilling the Knowledge in a Neural Network", NeurIPS Deep Learning Workshop, 2015-03-09. https://arxiv.org/abs/1503.02531. Accessed 2026-05-20.
[^4]: Matthew Jagielski, Nicholas Carlini, David Berthelot, Alex Kurakin, Nicolas Papernot, "High Accuracy and High Fidelity Extraction of Neural Networks", USENIX Security Symposium, 2020-08-12. https://arxiv.org/abs/1909.01838. Accessed 2026-05-20.
[^5]: USENIX Association, "Stealing Machine Learning Models via Prediction APIs - 25th USENIX Security Symposium", USENIX, 2016-08-10. https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/tramer. Accessed 2026-05-20.
[^6]: Binghui Wang, Neil Zhenqiang Gong, "Stealing Hyperparameters in Machine Learning", IEEE Symposium on Security and Privacy, 2018-02-14. https://arxiv.org/abs/1802.05351. Accessed 2026-05-20.
[^7]: Tribhuvanesh Orekondy, Bernt Schiele, Mario Fritz, "Knockoff Nets: Stealing Functionality of Black-Box Models", CVPR, 2019-06-16. https://openaccess.thecvf.com/content_CVPR_2019/html/Orekondy_Knockoff_Nets_Stealing_Functionality_of_Black-Box_Models_CVPR_2019_paper.html. Accessed 2026-05-20.
[^8]: Nicholas Carlini, Matthew Jagielski, Ilya Mironov, "Cryptanalytic Extraction of Neural Network Models", CRYPTO, 2020-08-17. https://arxiv.org/abs/2003.04884. Accessed 2026-05-20.
[^9]: Nicholas Carlini et al., "Stealing Part of a Production Language Model (companion blog post)", not-just-memorization.github.io, 2024-03-11. https://not-just-memorization.github.io/partial-model-stealing.html. Accessed 2026-05-20.
[^10]: International Conference on Machine Learning, "ICML 2024 Poster: Stealing part of a production language model", ICML, 2024-07-25. https://icml.cc/virtual/2024/poster/33922. Accessed 2026-05-20.
[^11]: Berkeley Law The Network, "The Innovation Dilemma: AI Distillation in OpenAI v. DeepSeek", UC Berkeley Law, 2025-03-30. https://sites.law.berkeley.edu/thenetwork/2025/03/30/the-innovation-dilemma-ai-distillation-in-openai-v-deepseek/. Accessed 2026-05-20.
[^12]: CNBC, "Anthropic accuses DeepSeek, Moonshot and MiniMax of distillation attacks on Claude", CNBC, 2026-02-24. https://www.cnbc.com/2026/02/24/anthropic-openai-china-firms-distillation-deepseek.html. Accessed 2026-05-20.
[^13]: Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, Tatsunori B. Hashimoto, "Stanford Alpaca: An Instruction-following LLaMA model", Stanford CRFM, 2023-03-13. https://crfm.stanford.edu/2023/03/13/alpaca.html. Accessed 2026-05-20.
[^14]: OpenAI, "Terms of Use", OpenAI, 2024-12-11. https://openai.com/policies/row-terms-of-use/. Accessed 2026-05-20.
[^15]: Patent PC, "DMCA and AI Models: Navigating Legal Challenges in the U.S.", Patent PC blog, 2025-05-20. https://patentpc.com/blog/dmca-and-ai-models-navigating-legal-challenges-in-the-u-s. Accessed 2026-05-20.
[^16]: Yiyang Chen, Daniel Paleka, Eric Wallace et al., "Can't Hide Behind the API: Stealing Black-Box Commercial Embedding Models", arXiv preprint, 2024-06-13. https://arxiv.org/abs/2406.09355. Accessed 2026-05-20.
[^17]: Yiyong Chen, Hengrui Jia, Christopher A. Choquette-Choo et al., "Hard-Label Cryptanalytic Extraction of Neural Network Models", ASIACRYPT, 2024-12-09. https://arxiv.org/abs/2409.11646. Accessed 2026-05-20.
[^18]: Synced Review, "First Model-Stealing Attack Reveals Secrets of Black-Box Production Language Models", Synced, 2024-03-27. https://syncedreview.com/2024/03/27/first-model-stealing-attack-reveals-secrets-of-black-box-production-language-models/. Accessed 2026-05-20.
[^19]: Tribhuvanesh Orekondy, Bernt Schiele, Mario Fritz, "Prediction Poisoning: Towards Defenses Against DNN Model Stealing Attacks", ICLR, 2020-04-26. https://arxiv.org/abs/1906.10908. Accessed 2026-05-20.
[^20]: Mika Juuti, Sebastian Szyller, Samuel Marchal, N. Asokan, "PRADA: Protecting Against DNN Model Stealing Attacks", IEEE EuroS&P, 2019-06-17. https://arxiv.org/abs/1805.02628. Accessed 2026-05-20.
[^21]: Snyk, "Understanding AI Model Theft: Risks and Mitigation of the LLM Threat Landscape", Snyk, 2024-09-26. https://snyk.io/articles/ai-model-theft/. Accessed 2026-05-20.
[^22]: Hengrui Jia, Christopher A. Choquette-Choo, Varun Chandrasekaran, Nicolas Papernot, "Entangled Watermarks as a Defense against Model Extraction", USENIX Security Symposium, 2021-08-11. https://arxiv.org/abs/2002.12200. Accessed 2026-05-20.
[^23]: IP STARS, "The EU's AI Act is in force: how does it deal with the protection of intellectual property rights?", IP STARS, 2024-08-12. https://www.ipstars.com/NewsAndAnalysis/The-EUs-AI-Act-is-in-force-how-does-it-deal-with-the-protection-of-intellectual/Index/10124. Accessed 2026-05-20.