Membership Inference Attack

Updated Jul 23, 2026

A Membership Inference Attack (MIA) is a privacy attack against a trained machine learning model in which an adversary, given a candidate data record and access to the model, tries to determine whether that record was part of the model's training set. The attack was formalised in the 2017 paper "Membership Inference Attacks Against Machine Learning Models" by Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov, which demonstrated that black-box classifiers from commercial machine-learning-as-a-service providers leak per-record training-set membership through their prediction confidences.^[1] In the decade since, MIAs have become the canonical empirical instrument for measuring how much a model memorises individual training points, the de facto evaluation for machine learning privacy defences, and a load-bearing component in arguments about copyright infringement and the GDPR right to erasure. The most influential refinement of the technique is the per-example Likelihood Ratio Attack (LiRA) of Carlini, Chien, Nasr, Song, Terzis, and Tramèr (IEEE Symposium on Security and Privacy, 2022), which reframed MIA evaluation around true-positive rate at low false-positive rate and produced attacks roughly an order of magnitude stronger than prior baselines at those operating points.^[2]

Background

The privacy concern that membership inference formalises predates deep learning. Classical statistics and genome-wide association literature recognised in the late 2000s that aggregate statistics published over a sensitive dataset (for example, summary allele frequencies) could leak whether a particular individual contributed to the aggregate. The contribution of Shokri, Stronati, Song, and Shmatikov was to operationalise that observation against arbitrary supervised classifiers as a concrete, reproducible attack pipeline. Their paper, posted to arXiv on 18 October 2016 and presented at the 38th IEEE Symposium on Security and Privacy in San Jose on 22 to 24 May 2017, defines the now-standard threat model: an attacker who knows the schema of the training distribution, can query the target model on inputs of their choice, observes the model's prediction (a vector of class probabilities in the black-box case), and must decide for a queried record whether it was a training member.^[1]

Yeom, Giacomelli, Fredrikson, and Jha (CSF 2018) provided an early theoretical bridge between MIA success and a model's degree of overfitting, showing that an attacker that thresholds the per-example training loss already approaches optimal performance in the limit of strong overfitting and that the membership advantage metric admits a clean interpretation under differential privacy.^[3] In 2019 Nasr, Shokri, and Houmansadr generalised the threat model to white-box and federated settings, demonstrating that exposed gradients and intermediate activations in federated learning enable substantially more powerful attacks than black-box prediction access alone.^[4] Sablayrolles, Douze, Ollivier, Schmid, and Jégou (ICML 2019) derived Bayes-optimal strategies under explicit assumptions on parameter distributions, concluded that the optimal score only depends on the loss, and argued that, under their assumptions, white-box access provides no asymptotic advantage over black-box loss queries.^[5]

The next inflection was a critique of how attacks had been evaluated. Carlini and collaborators argued in 2021 to 2022 that average-case accuracy and area-under-the-ROC obscure whether an attack can confidently flag any individual training record; they proposed using the true-positive rate at a very low false-positive rate (typically TPR at FPR equal to 0.1 percent or 0.001 percent) as the right yardstick.^[2] Their per-example LiRA attack, which fits Gaussians to a target sample's logit-scaled confidences across many shadow models that did and did not see the sample, became the standard reference attack against vision models and a recurring building block for downstream privacy auditing.

Threat Model

A membership inference attack is specified by three properties: what the adversary knows about the data distribution, what the adversary can do with the target model, and how the adversary's success is measured.

Adversary knowledge. The strongest standard assumption is that the attacker has samples drawn from the same distribution as the training set (though disjoint from it), enough to train shadow models with realistic statistics.^[1] Weaker variants assume only partial knowledge of the input schema, or that the attacker can synthesise candidate inputs (for example via hill-climbing on the target model's confidence).^[1] In the limit, the attacker has access to the actual member or non-member candidate and only seeks a binary decision.

Model access. In the black-box setting, the attacker queries the model and observes its output: a probability vector, top-k scores, or just an argmax label. The original Shokri et al. attack assumes the full probability vector; "Label-Only Membership Inference Attacks" by Choquette-Choo, Tramèr, Carlini, and Papernot (ICML 2021) showed that even hard-label-only access (no confidences) suffices: probing the robustness of the predicted label to small input perturbations reveals membership about as effectively as confidence-based attacks, undermining defences that obfuscate output scores.^[6] In the white-box setting, the attacker additionally observes parameters, gradients, or intermediate activations; Nasr, Shokri, and Houmansadr showed these channels can substantially raise success rates against deep models, especially in federated learning where gradients leak per round.^[4]

Evaluation metric. Early work reported balanced accuracy and AUC. Carlini, Chien, Nasr, Song, Terzis, and Tramèr argued that these averages can hide an attack that is no better than random on the vast majority of examples while being almost certain on a thin tail; the practically relevant question is whether the attacker can name a member with high precision, so the right metric is TPR at low FPR.^[2] The MIA community has since broadly adopted this convention, particularly for auditing differential privacy guarantees.
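The low-FPR metric can be computed directly from attack scores. The following is a minimal sketch, assuming higher scores mean "more likely a member"; the function name `tpr_at_fpr` and the toy scores are illustrative and not taken from any cited paper.

```python
def tpr_at_fpr(member_scores, nonmember_scores, target_fpr):
    """TPR of the most permissive threshold whose FPR stays <= target_fpr.

    Assumption: a higher score means the attack believes "member" more strongly.
    """
    # Non-member scores, strictest (largest) first.
    negatives = sorted(nonmember_scores, reverse=True)
    # How many non-members we may flag incorrectly at this FPR budget.
    allowed_fp = int(target_fpr * len(negatives))
    if allowed_fp < len(negatives):
        # Flagging strictly above this value hits at most `allowed_fp` non-members.
        threshold = negatives[allowed_fp]
    else:
        threshold = float("-inf")
    true_pos = sum(score > threshold for score in member_scores)
    return true_pos / len(member_scores)


# Toy scores: one member is flagged with confidence, the rest overlap non-members.
members = [0.95, 0.6, 0.5, 0.3]
nonmembers = [0.7, 0.4, 0.35, 0.3, 0.25, 0.2, 0.15, 0.1, 0.05, 0.01]
print(tpr_at_fpr(members, nonmembers, 0.0))  # zero false positives allowed
print(tpr_at_fpr(members, nonmembers, 0.1))  # one false positive in ten allowed
```

With only ten non-members the smallest non-zero FPR that can be measured is 10 percent; reporting TPR at 0.1 percent FPR needs at least a thousand non-member scores, and far more for a stable estimate.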

Classic Shadow-Model Attack

The original Shokri et al. methodology has three components: shadow models, attack data, and a meta-classifier.^[1]

Shadow models. The adversary trains many models with the same architecture (or via the same machine-learning-as-a-service pipeline) as the target, each on a disjoint subset of attacker-controlled data. Because the adversary chose the shadow training sets, the membership label of every record relative to every shadow model is known.

Attack-training dataset. For each shadow model and each record the attacker holds, the adversary queries the shadow model with the record and stores the resulting example (probability vector, true class) paired with the membership label "in" or "out".

Attack model. A per-class binary classifier (typically a small neural network) is trained on the resulting dataset. At inference time, the attacker queries the target model on the candidate record, retrieves its probability vector, and feeds that into the corresponding per-class attack classifier to decide membership.

The intuition is that overfit models exhibit subtly different confidence distributions on members versus non-members, and that shadow models drawn from the same procedure inherit the same idiosyncrasies, so a classifier trained on shadow-derived features transfers to the target.
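The three components can be sketched end to end. This is a simplified illustration rather than the pipeline from the 2017 paper: `train_fn` and `predict_fn` are hypothetical callables supplied by the attacker, the toy "model" simply memorises its training set, and a per-class confidence threshold stands in for the neural attack classifier.

```python
import random
from collections import defaultdict


def build_attack_dataset(records, train_fn, predict_fn, n_shadows, seed=0):
    """Train shadows on disjoint chunks; return (prob_vector, label, is_member) rows."""
    rng = random.Random(seed)
    pool = records[:]
    rng.shuffle(pool)
    chunk = len(pool) // n_shadows
    rows = []
    for i in range(n_shadows):
        part = pool[i * chunk:(i + 1) * chunk]
        half = len(part) // 2
        members, nonmembers = part[:half], part[half:]
        shadow = train_fn(members)  # membership is known by construction
        rows += [(predict_fn(shadow, x), y, 1) for x, y in members]
        rows += [(predict_fn(shadow, x), y, 0) for x, y in nonmembers]
    return rows


def fit_per_class_thresholds(rows):
    """Stand-in attack model: one true-class confidence threshold per class."""
    conf = defaultdict(lambda: {0: [], 1: []})
    for probs, y, is_member in rows:
        conf[y][is_member].append(probs[y])
    thresholds = {}
    for y, groups in conf.items():
        if groups[0] and groups[1]:
            mean_in = sum(groups[1]) / len(groups[1])
            mean_out = sum(groups[0]) / len(groups[0])
            thresholds[y] = (mean_in + mean_out) / 2
    return thresholds


def infer_membership(thresholds, probs, y):
    """Query-time decision on the target model's probability vector."""
    return probs[y] > thresholds[y]


# Toy two-class "model" that memorises its training set (an extreme overfit).
def toy_train(examples):
    return {x: y for x, y in examples}


def toy_predict(model, x):
    if x in model:
        probs = [0.05, 0.05]
        probs[model[x]] = 0.95
        return probs
    return [0.5, 0.5]


records = [((i,), i % 2) for i in range(40)]
rows = build_attack_dataset(records, toy_train, toy_predict, n_shadows=4)
thresholds = fit_per_class_thresholds(rows)
print(infer_membership(thresholds, [0.95, 0.05], 0))  # confident output
print(infer_membership(thresholds, [0.5, 0.5], 1))    # uncertain output
```

A real attack would replace the threshold with a classifier over the full probability vector and would use shadow models that imitate the target's training procedure.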

The 2017 paper evaluated the attack against CIFAR-10 and CIFAR-100 image classifiers, the Purchase-100 shopping dataset (197,324 records with 600 binary features), the Texas hospital discharge dataset (67,330 inpatient stays with 6,170 features), a Foursquare location dataset, MNIST, and the UCI Adult census dataset. The most striking numbers came from off-the-shelf commercial services: against a Google Prediction API model trained on a 10,000-record Purchase-100 task, the median per-class attack accuracy was reported as 94 percent; against Amazon ML the attack reached 74 percent at default settings and 91 percent under a configuration that increased overfitting.^[1] Conversely, well-regularised models on tasks with little memorisation (MNIST, UCI Adult) were near-random. The paper also surveyed three ways to synthesise shadow training data when the attacker does not have real samples (hill-climbing the target, marginal-distribution sampling, and feature flipping of real records), and evaluated naive defences (top-k masking, output rounding, temperature scaling) which it found largely ineffective.^[1]

Loss-Based and Confidence-Based Attacks

A much simpler family of attacks does not train a meta-classifier at all. Yeom, Giacomelli, Fredrikson, and Jha (CSF 2018) proposed the LOSS attack: compute the per-example training loss under the target model and classify the example as a member if the loss is below a threshold (often the average training loss).^[3] The intuition is direct: a memorised training point has lower loss than an unseen point drawn from the same distribution. Despite its simplicity, LOSS is a strong baseline whenever the model overfits.
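A minimal sketch of the thresholding rule, assuming the attacker can read the target's probability vector and knows (or estimates) the average training loss; the helper names are illustrative.

```python
import math


def cross_entropy(probs, label):
    """Per-example cross-entropy loss from a probability vector."""
    return -math.log(max(probs[label], 1e-12))


def average_loss(prob_vectors, labels):
    """Mean loss over a set of examples, e.g. the known training loss."""
    losses = [cross_entropy(p, y) for p, y in zip(prob_vectors, labels)]
    return sum(losses) / len(losses)


def loss_attack(target_probs, label, threshold):
    """Predict 'member' when the target's loss on the example is below threshold."""
    return cross_entropy(target_probs, label) < threshold


threshold = 0.3  # assumed stand-in for the average training loss
print(loss_attack([0.9, 0.1], 0, threshold))  # loss about 0.105, flagged as member
print(loss_attack([0.6, 0.4], 0, threshold))  # loss about 0.511, flagged as non-member
```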

Subsequent calibrated variants account for the fact that some inputs are intrinsically easy or hard. Sablayrolles et al. compute a per-example threshold using the expected loss under shadow models that did not see the example.^[5] Watson, Guo, Cormode, and Sablayrolles (ICLR 2022) made the case for difficulty calibration explicitly, normalising the loss by an estimate of how hard the example is on a population of unrelated models, and showed that calibration drastically reduces the false-positive rate with essentially no loss in true-positive rate.^[7] These calibrated attacks were the immediate predecessors of LiRA and motivated its parametric likelihood-ratio framing.
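One way to picture calibration is as a subtraction: the example's loss under the target model is compared with its typical loss under reference models that never saw it. The sketch below assumes those reference losses are already available; the function name and the numbers are illustrative.

```python
def calibrated_score(target_loss, reference_losses):
    """Higher means 'more likely a member'.

    Subtracting the target loss from the mean reference loss removes the
    part of the signal that is explained by the example simply being easy.
    """
    mean_reference = sum(reference_losses) / len(reference_losses)
    return mean_reference - target_loss


# An easy non-member: low loss everywhere, so the calibrated score is near zero.
print(calibrated_score(0.05, [0.04, 0.06]))
# A hard member: moderate target loss, but far lower than on reference models.
print(calibrated_score(0.40, [1.5, 1.7]))
```

An uncalibrated threshold of 0.3 on the raw loss would flag the first example (a false positive) and miss the second (a false negative); the calibrated score ranks them the other way round.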

LiRA: Likelihood Ratio Attack

In "Membership Inference Attacks From First Principles" (IEEE S&P 2022; arXiv 2112.03570, posted 7 December 2021) Carlini, Chien, Nasr, Song, Terzis, and Tramèr derive a Neyman-Pearson-style likelihood ratio for membership.^[2]

For a candidate example x, the attacker trains many shadow models on random subsets of an auxiliary dataset; roughly half of these subsets contain x ("IN" models) and the other half do not ("OUT" models). The attacker records the model's confidence on the correct class, transforms it via the logit map φ(p) = log(p / (1 − p)), and fits a Gaussian to the IN values and another Gaussian to the OUT values. At inference time the attacker queries the target model on x, transforms its confidence, and computes a likelihood ratio of the value under the IN versus OUT Gaussians; values for which the IN density dominates are predicted as members.
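The per-example test can be sketched as follows, assuming the shadow-model confidences on x have already been collected. The function names and the clamping constant `eps` are illustrative choices, and the score is returned as a log-likelihood ratio so that positive values favour membership.

```python
import math
from statistics import mean, stdev


def logit_scale(p, eps=1e-6):
    """Logit map log(p / (1 - p)), clamped to avoid division by zero."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def gaussian_logpdf(x, mu, sigma):
    """Log-density of a normal distribution, computed directly for stability."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)


def lira_online_score(target_conf, in_confs, out_confs):
    """Log-likelihood ratio of the target's confidence under IN vs OUT Gaussians."""
    in_vals = [logit_scale(c) for c in in_confs]
    out_vals = [logit_scale(c) for c in out_confs]
    mu_in, sd_in = mean(in_vals), max(stdev(in_vals), 1e-6)
    mu_out, sd_out = mean(out_vals), max(stdev(out_vals), 1e-6)
    observed = logit_scale(target_conf)
    return gaussian_logpdf(observed, mu_in, sd_in) - gaussian_logpdf(observed, mu_out, sd_out)


# Hypothetical shadow-model confidences on one candidate example x.
in_confs = [0.99, 0.98, 0.995, 0.97]   # shadows trained with x
out_confs = [0.6, 0.7, 0.55, 0.65]     # shadows trained without x
print(lira_online_score(0.985, in_confs, out_confs))  # positive: looks like a member
print(lira_online_score(0.62, in_confs, out_confs))   # negative: looks like a non-member
```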

The paper distinguishes an online variant, in which fresh shadow models are trained for the specific x being queried (most expensive, strongest), and an offline variant, in which a single pool of shadow models is amortised across queries by performing a one-sided test against just the OUT distribution.^[2] On CIFAR-10 the online LiRA reaches a TPR of 8.4 percent at an FPR of 0.1 percent, and at the much stricter FPR of 0.001 percent it still attains 2.2 percent TPR, a roughly tenfold improvement over the previous best attack in that regime. On CIFAR-100 the same operating point yields 27.6 percent TPR, on ImageNet 8.7 percent, and on WikiText-103 1.4 percent.^[2] Crucially, the authors report that the benefits of additional shadow models plateau around 64 to 256 models on standard benchmarks, making the attack practical for academic privacy auditing.
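The offline variant's one-sided test can be sketched with only the OUT statistics: the score is the probability that an OUT model would produce a logit-scaled confidence no larger than the one observed. This is an illustrative reading of the one-sided test, with assumed function names and toy confidences.

```python
import math
from statistics import NormalDist, mean, stdev


def logit_scale(p, eps=1e-6):
    """Logit map log(p / (1 - p)), clamped to avoid division by zero."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def lira_offline_score(target_conf, out_confs):
    """One-sided score in [0, 1]: CDF of the OUT Gaussian at the observed value.

    Values close to 1 mean the target is more confident than almost any
    model that did not train on the example, which suggests membership.
    """
    out_vals = [logit_scale(c) for c in out_confs]
    out_dist = NormalDist(mean(out_vals), max(stdev(out_vals), 1e-6))
    return out_dist.cdf(logit_scale(target_conf))


out_confs = [0.6, 0.7, 0.55, 0.65]  # hypothetical OUT shadow confidences on x
print(lira_offline_score(0.985, out_confs))  # close to 1
print(lira_offline_score(0.62, out_confs))   # close to 0.5
```

Because no IN models are needed, the same pool of shadow models can be reused for every candidate, which is where the amortisation comes from.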

LiRA also surfaced an important conceptual point: per-example vulnerability is highly heterogeneous. A small subset of training points (often outliers or duplicates) are nearly guaranteed to be flagged, while most points remain difficult. Average-case metrics smear this distribution together and understate the worst-case privacy harm. After 2022 the privacy literature largely converged on TPR at low FPR (or equivalent log-scaled ROC curves) as the standard reporting convention.^[2]^[7]

Variants and Extensions

Variant	Year	Access	Key idea
Shokri et al. shadow attack	2017	Black-box (probabilities)	Per-class neural meta-classifier over shadow output vectors^[1]
Yeom et al. LOSS	2018	Black-box (loss)	Threshold the per-example training loss^[3]
LOGAN	2019	White or black-box on generators	Use a GAN discriminator to score over-fit generated samples^[8]
Nasr, Shokri, Houmansadr	2019	White-box and federated	Exploit gradients and activations^[4]
Sablayrolles et al.	2019	Black-box	Bayes-optimal threshold; per-example calibration^[5]
Label-only (Choquette-Choo et al.)	2021	Hard-label only	Probe robustness of the predicted label under perturbations^[6]
Difficulty calibration (Watson et al.)	2022	Black-box	Normalise loss by per-example difficulty^[7]
LiRA (Carlini et al.)	2022	Black-box	Per-example Gaussian likelihood ratio over shadow logits^[2]
MLM attack (Mireshghallah et al.)	2022	Black or grey-box on masked LMs	Reference-model log-likelihood ratio against a base masked language model^[9]
Min-K Percent Prob	2023	Black-box on LLMs	Average log-prob of the k percent lowest-probability tokens^[10]

The LOGAN attack of Hayes, Melis, Danezis, and De Cristofaro (PoPETs 2019) extended membership inference to GAN generators by training a discriminator that learns to recognise overfit training samples, with experiments on Labeled Faces in the Wild, CIFAR-10, and Diabetic Retinopathy images.^[8] Carlini, Hayes, Nasr, Jagielski, Sehwag, Tramèr, Balle, Ippolito, and Wallace (USENIX Security 2023) later adapted LiRA to diffusion models, showing that Stable Diffusion can be coaxed into regenerating individual training images and that membership scores correlate strongly with this regurgitation, with direct copyright implications.^[11]

MIA Against Large Language Models

Membership inference against large language models is harder than against vision classifiers, for several intertwined reasons that have made it a focus of intense recent study.

Single-epoch training. Modern pre-training pipelines for models like GPT-2, Pythia, LLaMA, and frontier proprietary models pass each token roughly once. Without repeated exposure, per-example overfitting is minimal and the loss signal MIA exploits is weak. Kandpal, Wallace, and Raffel (ICML 2022) showed that the rate at which a language model regurgitates a training sequence is superlinear in the number of times that sequence appears in the corpus, so the bulk of memorisation is driven by duplicated documents; conversely, deduplicated data is dramatically more robust to extraction.^[12]

Members and non-members are hard to define. Practical evaluations require a held-out non-member set drawn from the same distribution and the same time period as members. Distribution shift between supposedly-IID members and non-members can be picked up by any classifier and reported as a successful MIA. Duan, Suri, Mireshghallah, Min, Shi, Zettlemoyer, Tsvetkov, Choi, Evans, and Hajishirzi conducted a large-scale evaluation of standard MIAs across a suite of Pythia-style models from 160 million to 12 billion parameters trained on The Pile, releasing the MIMIR benchmark, and reported that MIAs barely outperform random guessing on most domains and most model scales once distribution shift is controlled; apparent successes on earlier benchmarks were often explained by temporal or stylistic gaps between members and non-members rather than genuine memorisation.^[13]

Special-case successes. When members and non-members are genuinely close in distribution but differ in training-set inclusion, MIA against language models becomes feasible again. Mireshghallah, Goyal, Uniyal, Berg-Kirkpatrick, and Shokri (EMNLP 2022) developed a likelihood-ratio attack against masked language models fine-tuned on clinical notes, raising AUC from 0.66 (loss baseline) to 0.90 and improving TPR at 1 percent FPR by roughly 51 times over prior baselines.^[9] On the pre-training side, Shi, Ajith, Xia, Huang, Liu, Blevins, Chen, and Zettlemoyer (ICLR 2024) proposed Min-K Percent Prob, a reference-free score that averages the log-probability of the k percent lowest-probability tokens in the candidate text, and a companion benchmark called WikiMIA that uses the Wikipedia edit timestamp to construct members (events covered before a model's training cutoff) and non-members (events after the cutoff). They reported that Min-K Percent Prob improves AUC over Yeom-style baselines on WikiMIA across LLaMA 1 and 2, OPT, GPT-Neo, and Pythia families, and applied it to argue that GPT-3 had likely been trained on copyrighted books from the Books3 corpus.^[10]
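The Min-K Percent Prob score itself is a few lines once per-token log-probabilities are available. The sketch below assumes they have already been obtained from the language model under test; the function name, the default `k`, and the example values are illustrative.

```python
def min_k_percent_prob(token_logprobs, k=0.2):
    """Average log-probability of the k fraction of least likely tokens.

    Higher (less negative) scores suggest the text was seen in training,
    because even its most surprising tokens are assigned decent probability.
    """
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


# Hypothetical per-token log-probabilities for a ten-token candidate text.
logprobs = [-0.1, -0.2, -5.0, -0.3, -4.0, -0.1, -0.2, -0.1, -0.3, -0.2]
print(min_k_percent_prob(logprobs, k=0.2))  # mean of the two lowest values
```

The membership decision then compares this score against a threshold chosen on texts of known status.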

Training-data extraction is the strictly harder cousin of MIA: rather than confirming membership, the attacker is asked to reconstruct verbatim training text. Carlini, Tramèr, Wallace, Jagielski, Herbert-Voss, Lee, Roberts, Brown, Song, Erlingsson, Oprea, and Raffel (USENIX Security 2021) extracted hundreds of unique training examples from GPT-2 by generating large volumes of text and ranking them with membership-inference-style scores; the extracted material included names, phone numbers, email addresses, and identifying URLs, demonstrating that MIA techniques double as the verification step in extraction pipelines and that larger models are more vulnerable.^[14]

Defences

No defence is perfect, but several reduce MIA success substantially.

Differential privacy. Training with differential privacy, most commonly the DP-SGD algorithm of Abadi, Chu, Goodfellow, McMahan, Mironov, Talwar, and Zhang (ACM CCS 2016), is the only deployed defence that comes with formal worst-case bounds on membership advantage.^[15] DP-SGD clips per-example gradients to a fixed norm and adds calibrated Gaussian noise at every step; tighter privacy parameters reduce attack power but also degrade utility, often substantially on overparameterised models. The PATE framework, introduced by Papernot, Abadi, Erlingsson, Goodfellow, and Talwar and later scaled up by Papernot, Song, Mironov, Raghunathan, Talwar, and Erlingsson, trains an ensemble of teacher models on disjoint shards and releases only the noisy plurality of their votes, providing a different route to a DP-trained public student model.^[16] Empirically, attacks like LiRA become a calibrated audit of these guarantees: the achieved TPR at fixed FPR translates into a lower bound on epsilon (Jagielski, Ullman, and Oprea, NeurIPS 2020).^[17]
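The clip-then-noise step of DP-SGD can be sketched on plain Python lists. This illustrates only the mechanics of a single update under assumed parameter names (`clip_norm`, `noise_multiplier`); it does no privacy accounting and is not a substitute for a vetted DP library.

```python
import math
import random


def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update: clip each example's gradient, sum, add noise, average."""
    dim = len(params)
    summed = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(v * v for v in grad))
        # Scale the gradient down so its L2 norm is at most clip_norm.
        scale = min(1.0, clip_norm / max(norm, 1e-12))
        for i in range(dim):
            summed[i] += grad[i] * scale
    # Gaussian noise with standard deviation proportional to the clipping norm.
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    batch = len(per_example_grads)
    return [p - lr * g / batch for p, g in zip(params, noisy)]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # the first gradient has norm 5 and gets clipped
print(dp_sgd_step([0.0, 0.0], grads, lr=1.0, clip_norm=1.0, noise_multiplier=1.0, rng=rng))
```

Clipping bounds how much any single record can move the parameters, and the noise hides what remains of that contribution, which is why the procedure limits membership leakage.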

Data deduplication. Because memorisation is dominated by duplicate sequences, removing exact or near-duplicate documents from the training corpus reduces MIA success and verbatim regurgitation. Kandpal, Wallace, and Raffel report that aggressive deduplication of corpora like C4 cuts emitted training text roughly tenfold and weakens MIAs significantly.^[12]
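As a rough illustration of the simplest case, exact duplicates can be dropped by hashing a normalised form of each document. This sketch handles only document-level exact matches after case and whitespace normalisation, which is an assumption made here for brevity; near-duplicate and sequence-level deduplication require heavier machinery.

```python
import hashlib


def deduplicate(documents):
    """Keep the first occurrence of each document, comparing normalised text."""
    seen = set()
    kept = []
    for doc in documents:
        # Lowercase and collapse whitespace so trivial variants hash identically.
        normalised = " ".join(doc.lower().split())
        key = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


corpus = ["Hello  world", "hello world", "A different document"]
print(deduplicate(corpus))
```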

Calibration and regularisation. Regularisation techniques, including L2 regularisation, dropout, and label smoothing, reduce the train-test loss gap and therefore weaken loss-based MIAs, but Choquette-Choo et al. showed that they do not defeat label-only attacks unless L2 regularisation is so strong that it hurts utility.^[6] Output post-processing (top-k masking, softmax temperature, score rounding) was already shown in 2017 to be inadequate against shadow-model attacks and remains so.^[1]

Machine unlearning. When a regulator or data subject requests deletion, machine-unlearning methods aim to remove a specific record's influence from the trained parameters without full retraining. Evaluations of these methods almost universally use MIAs (LiRA in particular) to measure whether the unlearned model is statistically indistinguishable from a model that never saw the record. This pattern, in which MIA serves as the empirical proxy for the right to be forgotten, is now standard in GDPR-motivated research.^[18]

System-level defences. Per-user query quotas, rate limiting, and output watermarking are deployed by commercial APIs but offer no formal privacy guarantees.

Applications

Membership inference has crystallised into a general-purpose privacy and provenance probe.

Auditing differential privacy. The epsilon guaranteed by the analysis of DP-SGD is an upper bound on privacy loss and is often loose. Privacy auditing inserts adversarial canary records into the training set, runs an MIA against the released model, and uses the achieved TPR at fixed FPR to derive an empirical lower bound on the realised epsilon. Jagielski, Ullman, and Oprea were early proponents of this approach; subsequent work has scaled it to one-run audits and to large language model training, providing a sanity check on whether reported privacy parameters are correctly implemented.^[17]
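The arithmetic behind the lower bound follows from the definition of (epsilon, delta)-differential privacy, which forces TPR <= exp(epsilon) * FPR + delta for any membership test, and symmetrically for the error rates. A minimal sketch using point estimates only; a statistically valid audit would replace `tpr` and `fpr` with confidence-interval bounds, and the function name is illustrative.

```python
import math


def empirical_epsilon_lower_bound(tpr, fpr, delta=0.0):
    """Largest epsilon lower bound implied by an attack's (TPR, FPR) pair.

    Uses TPR <= e^eps * FPR + delta and (1 - FPR) <= e^eps * (1 - TPR) + delta.
    Point estimates only: sampling error is ignored in this sketch.
    """
    candidates = [0.0]
    if fpr > 0 and tpr - delta > 0:
        candidates.append(math.log((tpr - delta) / fpr))
    if tpr < 1 and 1 - delta - fpr > 0:
        candidates.append(math.log((1 - delta - fpr) / (1 - tpr)))
    return max(candidates)


# An attack with 8.4 percent TPR at 0.1 percent FPR implies epsilon of at least ln(84).
print(empirical_epsilon_lower_bound(0.084, 0.001))
# An attack no better than chance implies nothing beyond epsilon >= 0.
print(empirical_epsilon_lower_bound(0.5, 0.5))
```

If the bound obtained this way exceeds the epsilon claimed for a training run, the implementation or its accounting is wrong.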

Unlearning evaluation. As noted above, the MIA-based "forgetting score" has become the de facto evaluation for machine unlearning, with benchmarks tracking how indistinguishable an unlearned model's responses to forgotten examples are from a fully retrained baseline.^[18]

Copyright and provenance. When a plaintiff alleges that a model's owner trained on copyrighted material, MIA-style probes can serve as evidence. In New York Times v. OpenAI (S.D.N.Y., filed December 2023), the Times alleged that GPT-4 had ingested its articles and supplied examples in which the model produced near-verbatim continuations of paywalled stories.^[19] In Authors Guild v. OpenAI and the Anthropic book-piracy litigation that culminated in a 1.5 billion-dollar settlement announced in September 2025, plaintiffs likewise pointed to memorisation and extraction as proof of training-set inclusion.^[20] Carlini et al.'s diffusion-model extraction work has been cited similarly in disputes over Stable Diffusion and other image generators.^[11]

Regulatory compliance. The GDPR Article 17 right to erasure does not distinguish between database rows and model parameters; regulators have increasingly suggested that personal data embedded in a model's weights remains personal data, making MIA-based audits relevant to compliance reporting. Similar arguments arise under the California Consumer Privacy Act, although the legal status of trained parameters under CCPA is unsettled.

Security research. MIAs serve as a yardstick for proposed privacy-preserving training schemes: a defence that does not reduce LiRA TPR at low FPR is unlikely to be considered effective.

Limitations

Membership inference has well-known limits as both an attack and a measurement tool.

Distribution shift confounds. Many published MIA results against large language models have been re-examined and attributed to subtle distribution differences between the supposed members and non-members rather than genuine memorisation. MIMIR was constructed in part to control for this, and Duan et al. found that once temporal and topical drift are removed, off-the-shelf attacks barely beat random guessing on Pythia models trained on The Pile across scales from 160 million to 12 billion parameters.^[13]

Heterogeneous vulnerability. LiRA established that the average TPR understates the worst-case privacy risk: a small subset of training points, typically outliers or duplicates, is far more vulnerable than the median. Mean metrics hide this, which is why the field has moved to ROC curves on a log scale and to canary-based audits that focus on the most exposed points.^[2]^[17]

Computational cost. Per-example LiRA-style attacks require training tens to hundreds of shadow models. For models trained at the scale of full ImageNet or The Pile, this cost is large; cheaper amortised variants exist but trade away statistical power.^[2]

Evaluations can mislead. Aerni, Zhang, and Tramèr in "Evaluations of Machine Learning Privacy Defenses are Misleading" (arXiv 2404.17399) argued that many defences claim victory because they were evaluated against weak attacks; under LiRA-class attacks at low FPR many proposed defences offer little additional protection over plain training with conventional regularisation.^[21]

LLM extraction is not membership. Confirming that a sequence was in the training set does not imply it can be extracted, and vice versa. The two problems share machinery (probability-based scores, shadow comparisons) but answer different questions, and confusing them has produced inflated claims on both sides.^[14]

Comparison to Adjacent Privacy Attacks

Attack	Goal	Typical signal
Membership inference	Was record x in the training set?	Confidence or loss gap between members and non-members
Attribute inference	What is the value of a hidden feature of x?	Conditional predictions when other features are fixed
Model inversion	Reconstruct a representative input for a class or person	Gradient-based inversion of the prediction
Training-data extraction	Reproduce a verbatim training example	Generative sampling plus membership-style ranking
Model stealing	Replicate the target model's parameters or behaviour	Query-response distillation

Membership inference is the lowest-rung privacy attack, in the sense that successful extraction or attribute inference implies membership but not vice versa. It is also the easiest to evaluate rigorously, which is why it has become the universal proxy.

References

1. Reza Shokri, Marco Stronati, Congzheng Song, Vitaly Shmatikov, "Membership Inference Attacks Against Machine Learning Models", arXiv preprint and 2017 IEEE Symposium on Security and Privacy (San Jose, 22-24 May 2017), 2016-10-18 (v1) / 2017-03-31 (v2). https://arxiv.org/abs/1610.05820. Accessed 2026-05-20.
2. Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, Florian Tramèr, "Membership Inference Attacks From First Principles", arXiv preprint and 2022 IEEE Symposium on Security and Privacy, 2021-12-07. https://arxiv.org/abs/2112.03570. Accessed 2026-05-20.
3. Samuel Yeom, Irene Giacomelli, Matt Fredrikson, Somesh Jha, "Privacy Risk in Machine Learning: Analyzing the Connection to Overfitting", arXiv preprint and IEEE Computer Security Foundations Symposium (CSF) 2018, 2017-09-05. https://arxiv.org/abs/1709.01604. Accessed 2026-05-20.
4. Milad Nasr, Reza Shokri, Amir Houmansadr, "Comprehensive Privacy Analysis of Deep Learning: Passive and Active White-box Inference Attacks against Centralized and Federated Learning", 2019 IEEE Symposium on Security and Privacy. https://www.princeton.edu/~pmittal/publications/liwei-dls19.pdf. Accessed 2026-05-20.
5. Alexandre Sablayrolles, Matthijs Douze, Yann Ollivier, Cordelia Schmid, Hervé Jégou, "White-box vs Black-box: Bayes Optimal Strategies for Membership Inference", Proceedings of the 36th International Conference on Machine Learning (ICML), 2019-08-26. https://arxiv.org/abs/1908.11229. Accessed 2026-05-20.
6. Christopher A. Choquette-Choo, Florian Tramèr, Nicholas Carlini, Nicolas Papernot, "Label-Only Membership Inference Attacks", Proceedings of the 38th International Conference on Machine Learning (PMLR 139:1964-1974), 2021-07-01. https://arxiv.org/abs/2007.14321. Accessed 2026-05-20.
7. Lauren Watson, Chuan Guo, Graham Cormode, Alex Sablayrolles, "On the Importance of Difficulty Calibration in Membership Inference Attacks", International Conference on Learning Representations (ICLR), 2022-04-11. https://arxiv.org/abs/2111.08440. Accessed 2026-05-20.
8. Jamie Hayes, Luca Melis, George Danezis, Emiliano De Cristofaro, "LOGAN: Membership Inference Attacks Against Generative Models", Proceedings on Privacy Enhancing Technologies (PoPETs) Vol. 2019, Issue 1, 2017-05-22 (arXiv). https://arxiv.org/abs/1705.07663. Accessed 2026-05-20.
9. Fatemehsadat Mireshghallah, Kartik Goyal, Archit Uniyal, Taylor Berg-Kirkpatrick, Reza Shokri, "Quantifying Privacy Risks of Masked Language Models Using Membership Inference Attacks", Proceedings of EMNLP 2022, 2022-03-08. https://arxiv.org/abs/2203.03929. Accessed 2026-05-20.
10. Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, Luke Zettlemoyer, "Detecting Pretraining Data from Large Language Models", International Conference on Learning Representations (ICLR), 2023-10-25. https://arxiv.org/abs/2310.16789. Accessed 2026-05-20.
11. Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne Ippolito, Eric Wallace, "Extracting Training Data from Diffusion Models", 32nd USENIX Security Symposium (USENIX Security 23), 2023-01-30. https://arxiv.org/abs/2301.13188. Accessed 2026-05-20.
12. Nikhil Kandpal, Eric Wallace, Colin Raffel, "Deduplicating Training Data Mitigates Privacy Risks in Language Models", Proceedings of the 39th International Conference on Machine Learning (ICML), 2022-02-14. https://arxiv.org/abs/2202.06539. Accessed 2026-05-20.
13. Michael Duan, Anshuman Suri, Niloofar Mireshghallah, Sewon Min, Weijia Shi, Luke Zettlemoyer, Yulia Tsvetkov, Yejin Choi, David Evans, Hannaneh Hajishirzi, "Do Membership Inference Attacks Work on Large Language Models?", Conference on Language Modeling (COLM), 2024-02-12. https://arxiv.org/abs/2402.07841. Accessed 2026-05-20.
14. Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, Colin Raffel, "Extracting Training Data from Large Language Models", 30th USENIX Security Symposium, 2020-12-14. https://arxiv.org/abs/2012.07805. Accessed 2026-05-20.
15. Martín Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, Li Zhang, "Deep Learning with Differential Privacy", Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS), Vienna, 24-28 October 2016, pages 308-318, 2016-07-01. https://arxiv.org/abs/1607.00133. Accessed 2026-05-20.
16. Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, Úlfar Erlingsson, "Scalable Private Learning with PATE", International Conference on Learning Representations (ICLR), 2018-02-24. https://arxiv.org/abs/1802.08908. Accessed 2026-05-20.
17. Matthew Jagielski, Jonathan Ullman, Alina Oprea, "Auditing Differentially Private Machine Learning: How Private is Private SGD?", Advances in Neural Information Processing Systems (NeurIPS) 33, 2020-06-13. https://arxiv.org/abs/2006.07709. Accessed 2026-05-20.
18. Lucas Bourtoule, Varun Chandrasekaran, Christopher A. Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, Nicolas Papernot, "Machine Unlearning", 2021 IEEE Symposium on Security and Privacy, 2019-12-09. https://arxiv.org/abs/1912.03817. Accessed 2026-05-20.
19. The New York Times Company, "The New York Times Company v. Microsoft Corporation, OpenAI, Inc., et al.", complaint filed in the United States District Court for the Southern District of New York, 2023-12-27. https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf. Accessed 2026-05-20.
20. Bobby Allyn, "Anthropic to pay authors $1.5B to settle lawsuit over pirated chatbot training material", NPR, 2025-09-05. https://www.npr.org/2025/09/05/g-s1-87367/anthropic-authors-settlement-pirated-chatbot-training-material. Accessed 2026-05-20.
21. Michael Aerni, Jie Zhang, Florian Tramèr, "Evaluations of Machine Learning Privacy Defenses are Misleading", arXiv preprint, 2024-04-26. https://arxiv.org/abs/2404.17399. Accessed 2026-05-20.
