Membership Inference Attack
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,444 words
A Membership Inference Attack (MIA) is a privacy attack against a trained machine learning model in which an adversary, given a candidate data record and access to the model, tries to determine whether that record was part of the model's training set. The attack was formalised in the 2017 paper "Membership Inference Attacks Against Machine Learning Models" by Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov, which demonstrated that black-box classifiers from commercial machine-learning-as-a-service providers leak per-record training-set membership through their prediction confidences.[^1] In the decade since, MIAs have become the canonical empirical instrument for measuring how much a model memorises individual training points, the de facto evaluation for machine learning privacy defences, and a load-bearing component in arguments about copyright infringement and the GDPR right to erasure. The most influential refinement of the technique is the per-example Likelihood Ratio Attack (LiRA) of Carlini, Chien, Nasr, Song, Terzis, and Tramèr (IEEE Symposium on Security and Privacy, 2022), which reframed MIA evaluation around true-positive rate at low false-positive rate and produced attacks roughly an order of magnitude stronger than prior baselines at those operating points.[^2]
The privacy concern that membership inference formalises predates deep learning. Classical statistics and genome-wide association literature recognised in the late 2000s that aggregate statistics published over a sensitive dataset (for example, summary allele frequencies) could leak whether a particular individual contributed to the aggregate. The contribution of Shokri, Stronati, Song, and Shmatikov was to operationalise that observation against arbitrary supervised classifiers as a concrete, reproducible attack pipeline. Their paper, posted to arXiv on 18 October 2016 and presented at the 38th IEEE Symposium on Security and Privacy in San Jose on 22 to 24 May 2017, defines the now-standard threat model: an attacker who knows the schema of the training distribution, can query the target model on inputs of their choice, observes the model's prediction (a vector of class probabilities in the black-box case), and must decide for a queried record whether it was a training member.[^1]
Yeom, Giacomelli, Fredrikson, and Jha (CSF 2018) provided an early theoretical bridge between MIA success and a model's degree of overfitting, showing that an attacker that thresholds the per-example training loss already approaches optimal performance in the limit of strong overfitting and that the membership advantage metric admits a clean interpretation under differential privacy.[^3] In 2019 Nasr, Shokri, and Houmansadr generalised the threat model to white-box and federated settings, demonstrating that exposed gradients and intermediate activations in federated learning enable substantially more powerful attacks than black-box prediction access alone.[^4] Sablayrolles, Douze, Ollivier, Schmid, and Jégou (ICML 2019) derived Bayes-optimal strategies under explicit assumptions on parameter distributions, concluded that the optimal score only depends on the loss, and argued that, under their assumptions, white-box access provides no asymptotic advantage over black-box loss queries.[^5]
The next inflection was a critique of how attacks had been evaluated. Carlini and collaborators argued in 2021 to 2022 that average-case accuracy and area-under-the-ROC obscure whether an attack can confidently flag any individual training record; they proposed using the true-positive rate at a very low false-positive rate (typically TPR at FPR equal to 0.1 percent or 0.001 percent) as the right yardstick.[^2] Their per-example LiRA attack, which fits Gaussians to a target sample's logit-scaled confidences across many shadow models that did and did not see the sample, became the standard reference attack against vision models and a recurring building block for downstream privacy auditing.
A membership inference attack is specified by three properties: what the adversary knows about the data distribution, what the adversary can do with the target model, and how the adversary's success is measured.
Adversary knowledge. The strongest standard assumption is that the attacker has samples drawn from the same distribution as the training set (though disjoint from it), enough to train shadow models with realistic statistics.[^1] Weaker variants assume only partial knowledge of the input schema, or that the attacker can synthesise candidate inputs (for example via hill-climbing on the target model's confidence).[^1] In the limit, the attacker has access to the actual member or non-member candidate and only seeks a binary decision.
Model access. In the black-box setting, the attacker queries the model and observes its output: a probability vector, top-k scores, or just an argmax label. The original Shokri et al. attack assumes the full probability vector; "Label-Only Membership Inference Attacks" by Choquette-Choo, Tramèr, Carlini, and Papernot (ICML 2021) showed that even hard-label-only access (no confidences) suffices: probing the robustness of the predicted label to small input perturbations reveals membership about as effectively as confidence-based attacks, undermining defences that obfuscate output scores.[^6] In the white-box setting, the attacker additionally observes parameters, gradients, or intermediate activations; Nasr, Shokri, and Houmansadr showed these channels can substantially raise success rates against deep models, especially in federated learning where gradients leak per round.[^4]
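The label-only idea can be illustrated with a small sketch. It assumes a hypothetical `predict_label` callable that returns only a hard label, and it uses Gaussian input noise as a simplified stand-in for the augmentation and boundary-distance probes of the published attack; the function names, noise scale, and threshold are illustrative choices, not values from the paper.

```python
import random


def label_robustness(predict_label, x, label, noise_std=0.1, trials=100, seed=0):
    """Fraction of noisy copies of x that keep the given hard label."""
    rng = random.Random(seed)
    kept = 0
    for _ in range(trials):
        # Perturb every feature independently with Gaussian noise.
        noisy = [v + rng.gauss(0.0, noise_std) for v in x]
        kept += predict_label(noisy) == label
    return kept / trials


def label_only_attack(predict_label, x, label, threshold=0.9, **kwargs):
    """Guess 'member' when the correct label survives most perturbations."""
    # A misclassified point is treated as a non-member outright.
    if predict_label(x) != label:
        return False
    return label_robustness(predict_label, x, label, **kwargs) >= threshold
```

The design choice is that points the model has fit tightly tend to sit further from the decision boundary, so their label is more stable under perturbation; the threshold would in practice be tuned on shadow models or held-out data.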
Evaluation metric. Early work reported balanced accuracy and AUC. Carlini, Chien, Nasr, Song, Terzis, and Tramèr argued that these averages can hide an attack that is no better than random on the vast majority of examples while being almost certain on a thin tail; the practically relevant question is whether the attacker can name a member with high precision, so the right metric is TPR at low FPR.[^2] The MIA community has since broadly adopted this convention, particularly for auditing differential privacy guarantees.
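As a rough illustration of the metric, the sketch below computes the true-positive rate at a fixed false-positive budget from two lists of attack scores. The function name and the convention that a higher score means "more likely a member" are assumptions made for this example.

```python
def tpr_at_fpr(member_scores, nonmember_scores, target_fpr):
    """Best TPR achievable while flagging at most target_fpr of non-members.

    Higher score means "more likely a member".
    """
    # Number of non-members the attacker is allowed to flag wrongly.
    allowed_fp = int(target_fpr * len(nonmember_scores))
    ranked = sorted(nonmember_scores, reverse=True)
    # Flag only scores strictly above the (allowed_fp + 1)-th highest
    # non-member score, so at most allowed_fp non-members are flagged.
    threshold = ranked[allowed_fp] if allowed_fp < len(ranked) else float("-inf")
    true_positives = sum(1 for s in member_scores if s > threshold)
    return true_positives / len(member_scores)
```

Estimating a TPR at an FPR of 0.001 percent this way needs on the order of 100,000 or more non-member scores, since smaller evaluation sets round the false-positive budget down to zero.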
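The original Shokri et al. methodology has three components: shadow models, attack data, and a meta-classifier.[^1] Shadow models are trained by the attacker to imitate the target, on data whose membership the attacker controls. The attack data consists of the shadow models' prediction vectors, each labelled "in" or "out" according to whether the queried record was in that shadow model's training set. The meta-classifier (one per output class in the original paper) is trained on this attack data to map a prediction vector to a membership decision.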
The intuition is that overfit models exhibit subtly different confidence distributions on members versus non-members, and that shadow models drawn from the same procedure inherit the same idiosyncrasies, so a classifier trained on shadow-derived features transfers to the target.
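A deliberately simplified sketch of the shadow-model pipeline follows. Each shadow model is assumed to be available as a hypothetical `predict_proba` callable together with the inputs it was and was not trained on, and the neural meta-classifier of the original paper is swapped for a one-feature threshold on the maximum confidence, purely to keep the example short.

```python
def build_attack_dataset(shadows):
    """shadows: list of (predict_proba, member_inputs, nonmember_inputs)."""
    rows = []
    for predict_proba, members, nonmembers in shadows:
        # Label each shadow output vector by known membership: 1 = in, 0 = out.
        rows += [(predict_proba(x), 1) for x in members]
        rows += [(predict_proba(x), 0) for x in nonmembers]
    return rows


def fit_confidence_stump(rows):
    """Stand-in meta-classifier: best threshold on the maximum confidence."""
    best_threshold, best_correct = 0.0, -1
    for threshold in sorted({max(vec) for vec, _ in rows}):
        correct = sum((max(vec) >= threshold) == bool(label) for vec, label in rows)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def attack(target_predict_proba, x, threshold):
    """Apply the shadow-trained rule to the target model's output on x."""
    return max(target_predict_proba(x)) >= threshold
```

The transfer step is the last function: the rule is learned entirely from shadow outputs and then applied unchanged to the target, which only works to the extent that the shadow models behave like the target.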
The 2017 paper evaluated the attack against CIFAR-10 and CIFAR-100 image classifiers, the Purchase-100 shopping dataset (197,324 records with 600 binary features), the Texas hospital discharge dataset (67,330 inpatient stays with 6,170 features), a Foursquare location dataset, MNIST, and the UCI Adult census dataset. The most striking numbers came from off-the-shelf commercial services: against a Google Prediction API model trained on a 10,000-record Purchase-100 task, the median per-class attack accuracy was reported as 94 percent; against Amazon ML the attack reached 74 percent at default settings and 91 percent under a configuration that increased overfitting.[^1] Conversely, well-regularised models on tasks with little memorisation (MNIST, UCI Adult) were near-random. The paper also surveyed three ways to synthesise shadow training data when the attacker does not have real samples (hill-climbing the target, marginal-distribution sampling, and feature flipping of real records), and evaluated naive defences (top-k masking, output rounding, temperature scaling) which it found largely ineffective.[^1]
A much simpler family of attacks does not train a meta-classifier at all. Yeom, Giacomelli, Fredrikson, and Jha (CSF 2018) proposed the LOSS attack: compute the per-example training loss under the target model and classify the example as a member if the loss is below a threshold (often the average training loss).[^3] The intuition is direct: a memorised training point has lower loss than an unseen point drawn from the same distribution. Despite its simplicity, LOSS is a strong baseline whenever the model overfits.
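A minimal sketch of a loss-threshold attack, assuming the attacker knows the candidate's true label, can obtain the target's probability vector for it, and knows (or can estimate) the average training loss; the function names are illustrative.

```python
import math


def cross_entropy(probs, label):
    """Per-example cross-entropy loss for a probability vector."""
    # Clamp to avoid log(0) when the model assigns zero probability.
    return -math.log(max(probs[label], 1e-12))


def average_training_loss(train_outputs):
    """train_outputs: list of (probability_vector, true_label) pairs."""
    return sum(cross_entropy(p, y) for p, y in train_outputs) / len(train_outputs)


def loss_attack(probs, label, threshold):
    """Guess 'member' when the candidate's loss is below the threshold."""
    return cross_entropy(probs, label) < threshold
```

For example, with `threshold = average_training_loss(train_outputs)`, a candidate scored at 0.95 on its true class is flagged whenever the average training loss exceeds about 0.051.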
Subsequent calibrated variants account for the fact that some inputs are intrinsically easy or hard. Sablayrolles et al. compute a per-example threshold using the expected loss under shadow models that did not see the example.[^5] Watson, Guo, Cormode, and Sablayrolles (ICLR 2022) made the case for difficulty calibration explicitly, normalising the loss by an estimate of how hard the example is on a population of unrelated models, and showed that calibration drastically reduces the false-positive rate with essentially no loss in true-positive rate.[^7] These calibrated attacks were the immediate predecessors of LiRA and motivated its parametric likelihood-ratio framing.
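One simple way to express difficulty calibration is to subtract the average loss that reference models (which never saw the example) assign to it. The sketch below assumes those reference losses are already available; the default threshold of zero is an arbitrary illustrative choice.

```python
from statistics import mean


def calibrated_score(target_loss, reference_losses):
    """Target loss minus the example's typical loss on unrelated models."""
    return target_loss - mean(reference_losses)


def calibrated_attack(target_loss, reference_losses, threshold=0.0):
    """Guess 'member' when the target does much better than the references."""
    return calibrated_score(target_loss, reference_losses) < threshold
```

An intrinsically easy non-member (low loss everywhere) gets a score near zero and is no longer flagged, while a hard member whose loss drops only on the target gets a strongly negative score; that is the mechanism by which calibration removes false positives.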
In "Membership Inference Attacks From First Principles" (IEEE S&P 2022; arXiv 2112.03570, posted 7 December 2021) Carlini, Chien, Nasr, Song, Terzis, and Tramèr derive a Neyman-Pearson-style likelihood ratio for membership.[^2]
For a candidate example x, the attacker trains many shadow models on random subsets of an auxiliary dataset; roughly half of these subsets contain x ("IN" models) and the other half do not ("OUT" models). The attacker records the model's confidence on the correct class, transforms it via the logit map $\phi(p) = \log\left(\frac{p}{1-p}\right)$, and fits a Gaussian to the IN values and another Gaussian to the OUT values. At inference time the attacker queries the target model on x, transforms its confidence, and computes a likelihood ratio of the value under the IN versus OUT Gaussians; values for which the IN density dominates are predicted as members.
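The per-example test can be sketched as follows, assuming the shadow models' correct-class confidences on x have already been collected into two lists (at least two values each). The function names and the clamping constants are illustrative assumptions rather than details from the paper.

```python
import math
from statistics import mean, stdev


def logit(p, eps=1e-6):
    """Logit scaling of a confidence, clamped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def gaussian_logpdf(x, mu, sigma):
    """Log density of a normal distribution N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def lira_score(target_conf, in_confs, out_confs):
    """Log likelihood ratio of IN versus OUT for one candidate example."""
    in_vals = [logit(p) for p in in_confs]
    out_vals = [logit(p) for p in out_confs]
    # Fit one Gaussian per population; floor sigma to avoid division by zero.
    mu_in, sd_in = mean(in_vals), max(stdev(in_vals), 1e-6)
    mu_out, sd_out = mean(out_vals), max(stdev(out_vals), 1e-6)
    z = logit(target_conf)
    return gaussian_logpdf(z, mu_in, sd_in) - gaussian_logpdf(z, mu_out, sd_out)
```

A positive score means the IN density dominates; in an evaluation the scores for all candidates would be thresholded at whatever value yields the desired false-positive rate.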
The paper distinguishes an online variant, in which fresh shadow models are trained for the specific x being queried (most expensive, strongest), and an offline variant, in which a single pool of shadow models is amortised across queries by performing a one-sided test against just the OUT distribution.[^2] On CIFAR-10 the online LiRA reaches a TPR of 8.4 percent at an FPR of 0.1 percent, and at the much stricter FPR of 0.001 percent it still attains 2.2 percent TPR, a roughly tenfold improvement over the previous best attack in that regime. At the 0.1 percent FPR operating point, the reported TPR is 27.6 percent on CIFAR-100, 8.7 percent on ImageNet, and 1.4 percent on WikiText-103.[^2] Crucially, the authors report that the benefits of additional shadow models plateau around 64 to 256 models on standard benchmarks, making the attack practical for academic privacy auditing.
LiRA also surfaced an important conceptual point: per-example vulnerability is highly heterogeneous. A small subset of training points (often outliers or duplicates) are nearly guaranteed to be flagged, while most points remain difficult. Average-case metrics smear this distribution together and understate the worst-case privacy harm. After 2022 the privacy literature largely converged on TPR at low FPR (or equivalent log-scaled ROC curves) as the standard reporting convention.[^2][^7]
| Variant | Year | Access | Key idea |
|---|---|---|---|
| Shokri et al. shadow attack | 2017 | Black-box (probabilities) | Per-class neural meta-classifier over shadow output vectors[^1] |
| Yeom et al. LOSS | 2018 | Black-box (loss) | Threshold the per-example training loss[^3] |
| LOGAN | 2019 | White or black-box on generators | Use a GAN discriminator to score over-fit generated samples[^8] |
| Nasr, Shokri, Houmansadr | 2019 | White-box and federated | Exploit gradients and activations[^4] |
| Sablayrolles et al. | 2019 | Black-box | Bayes-optimal threshold; per-example calibration[^5] |
| Label-only (Choquette-Choo et al.) | 2021 | Hard-label only | Probe robustness of the predicted label under perturbations[^6] |
| Difficulty calibration (Watson et al.) | 2022 | Black-box | Normalise loss by per-example difficulty[^7] |
| LiRA (Carlini et al.) | 2022 | Black-box | Per-example Gaussian likelihood ratio over shadow logits[^2] |
| MLM attack (Mireshghallah et al.) | 2022 | Black or grey-box on masked LMs | Reference-model log-likelihood ratio against a base masked language model[^9] |
| Min-K Percent Prob | 2023 | Black-box on LLMs | Average log-prob of the k percent lowest-probability tokens[^10] |
The LOGAN attack of Hayes, Melis, Danezis, and De Cristofaro (PoPETs 2019) extended membership inference to GAN generators by training a discriminator that learns to recognise overfit training samples, with experiments on Labeled Faces in the Wild, CIFAR-10, and Diabetic Retinopathy images.[^8] Carlini, Hayes, Nasr, Jagielski, Sehwag, Tramèr, Balle, Ippolito, and Wallace (USENIX Security 2023) later adapted LiRA to diffusion models, showing that Stable Diffusion can be coaxed into regenerating individual training images and that membership scores correlate strongly with this regurgitation, with direct copyright implications.[^11]
Membership inference against large language models is harder than against vision classifiers, for several intertwined reasons that have made it a focus of intense recent study.
Single-epoch training. Modern pre-training pipelines for models like GPT-2, Pythia, LLaMA, and frontier proprietary models pass each token roughly once. Without repeated exposure, per-example overfitting is minimal and the loss signal MIA exploits is weak. Kandpal, Wallace, and Raffel (ICML 2022) showed that the rate at which a language model regurgitates a training sequence is superlinear in the number of times that sequence appears in the corpus, so the bulk of memorisation is driven by duplicated documents; conversely, deduplicated data is dramatically more robust to extraction.[^12]
Members and non-members are hard to define. Practical evaluations require a held-out non-member set drawn from the same distribution and the same time period as members. Distribution shift between supposedly-IID members and non-members can be picked up by any classifier and reported as a successful MIA. Duan, Suri, Mireshghallah, Min, Shi, Zettlemoyer, Tsvetkov, Choi, Evans, and Hajishirzi conducted a large-scale evaluation of standard MIAs across a suite of Pythia-style models from 160 million to 12 billion parameters trained on The Pile, releasing the MIMIR benchmark, and reported that MIAs barely outperform random guessing on most domains and most model scales once distribution shift is controlled; apparent successes on earlier benchmarks were often explained by temporal or stylistic gaps between members and non-members rather than genuine memorisation.[^13]
Special-case successes. When members and non-members are genuinely close in distribution but differ in training-set inclusion, MIA against language models becomes feasible again. Mireshghallah, Goyal, Uniyal, Berg-Kirkpatrick, and Shokri (EMNLP 2022) developed a likelihood-ratio attack against masked language models fine-tuned on clinical notes, raising AUC from 0.66 (loss baseline) to 0.90 and improving TPR at 1 percent FPR by roughly 51 times over prior baselines.[^9] On the pre-training side, Shi, Ajith, Xia, Huang, Liu, Blevins, Chen, and Zettlemoyer (ICLR 2024) proposed Min-K Percent Prob, a reference-free score that averages the log-probability of the k percent lowest-probability tokens in the candidate text, and a companion benchmark called WikiMIA that uses the Wikipedia edit timestamp to construct members (events covered before a model's training cutoff) and non-members (events after the cutoff). They reported that Min-K Percent Prob improves AUC over Yeom-style baselines on WikiMIA across LLaMA 1 and 2, OPT, GPT-Neo, and Pythia families, and applied it to argue that GPT-3 had likely been trained on copyrighted books from the Books3 corpus.[^10]
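The Min-K Percent Prob score described above reduces to a few lines once per-token log-probabilities are available. The sketch assumes they have already been obtained from the model under test; the function name and the default of k = 20 percent are illustrative.

```python
def min_k_percent_prob(token_logprobs, k=0.2):
    """Average log-probability of the k fraction of least likely tokens."""
    # Always use at least one token, even for very short texts.
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n
```

A higher (less negative) score suggests the text contains few surprising tokens for the model, which is taken as evidence of membership; the decision threshold has to be calibrated on data with known membership.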
Training-data extraction is the strictly harder cousin of MIA: rather than confirming membership, the attacker is asked to reconstruct verbatim training text. Carlini, Tramèr, Wallace, Jagielski, Herbert-Voss, Lee, Roberts, Brown, Song, Erlingsson, Oprea, and Raffel (USENIX Security 2021) extracted hundreds of unique training examples from GPT-2 by generating large volumes of text and ranking them with membership-inference-style scores; the extracted material included names, phone numbers, email addresses, and identifying URLs, demonstrating that MIA techniques double as the verification step in extraction pipelines and that larger models are more vulnerable.[^14]
No defence is perfect, but several reduce MIA success substantially.
Differential privacy. Training with differential privacy, most commonly the DP-SGD algorithm of Abadi, Chu, Goodfellow, McMahan, Mironov, Talwar, and Zhang (ACM CCS 2016), is the only deployed defence that comes with formal worst-case bounds on membership advantage.[^15] DP-SGD clips per-example gradients to a fixed norm and adds calibrated Gaussian noise at every step; tighter privacy parameters reduce attack power but also degrade utility, often substantially on overparameterised models. The PATE framework of Papernot, Abadi, Erlingsson, Goodfellow, and Talwar trains an ensemble of teacher models on disjoint shards and releases only the noisy plurality of their votes, providing a different route to a DP-trained public student model.[^16] Empirically, attacks like LiRA become a calibrated audit of these guarantees: the achieved TPR at fixed FPR translates into a lower bound on epsilon (Jagielski, Ullman, and Oprea, NeurIPS 2020).[^17]
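The clip-then-noise step can be sketched in plain Python as below. It assumes per-example gradients are already available as lists of floats and performs no privacy accounting, so it shows only the mechanics of one update; the parameter names are illustrative.

```python
import math
import random


def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng=random):
    """One DP-SGD-style update: clip each gradient, sum, add noise, average."""
    clipped_sum = [0.0] * len(params)
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale the gradient down so its L2 norm is at most clip_norm.
        scale = min(1.0, clip_norm / (norm + 1e-12))
        for i, g in enumerate(grad):
            clipped_sum[i] += g * scale
    # Gaussian noise with standard deviation proportional to the clipping norm.
    sigma = noise_multiplier * clip_norm
    batch = len(per_example_grads)
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch for s in clipped_sum]
    return [p - lr * g for p, g in zip(params, noisy_mean)]
```

Clipping bounds how much any single record can move the parameters, and the noise masks what remains; translating `noise_multiplier`, the sampling rate, and the number of steps into a concrete epsilon requires a separate privacy accountant that is not shown here.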
Data deduplication. Because memorisation is dominated by duplicate sequences, removing exact or near-duplicate documents from the training corpus reduces MIA success and verbatim regurgitation. Kandpal, Wallace, and Raffel report that aggressive deduplication of corpora like C4 cuts emitted training text roughly tenfold and weakens MIAs significantly.[^12]
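Exact deduplication after light normalisation can be sketched as follows. This is only the simplest case: near-duplicate detection on real corpora needs techniques such as MinHash or suffix arrays, which are not shown, and the normalisation rule here (collapse whitespace, lowercase) is an illustrative assumption.

```python
import hashlib
import re


def normalise(text):
    """Collapse whitespace and lowercase so trivial variants hash equally."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(documents):
    """Keep the first occurrence of each normalised document."""
    seen, kept = set(), []
    for doc in documents:
        digest = hashlib.sha256(normalise(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept
```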
Calibration and regularisation. Regularisation techniques, including L2 regularisation, dropout, and label smoothing, reduce the train-test loss gap and therefore weaken loss-based MIAs, but Choquette-Choo et al. showed that they do not defeat label-only attacks unless L2 regularisation is so strong that it hurts utility.[^6] Output post-processing (top-k masking, softmax temperature, score rounding) was already shown in 2017 to be inadequate against shadow-model attacks and remains so.[^1]
Machine unlearning. When a regulator or data subject requests deletion, machine-unlearning methods aim to remove a specific record's influence from the trained parameters without full retraining. Evaluations of these methods almost universally use MIAs (LiRA in particular) to measure whether the unlearned model is statistically indistinguishable from a model that never saw the record. This pattern, in which MIA serves as the empirical proxy for the right to be forgotten, is now standard in GDPR-motivated research.[^18]
System-level defences. Per-user query quotas, rate limiting, and output watermarking are deployed by commercial APIs but offer no formal privacy guarantees.
Membership inference has crystallised into a general-purpose privacy and provenance probe.
Auditing differential privacy. Theoretical epsilon bounds for DP-SGD are loose. Privacy auditing inserts adversarial canary records into the training set, runs an MIA against the released model, and uses the achieved TPR at fixed FPR to derive an empirical lower bound on the realised epsilon. Jagielski, Ullman, and Oprea were early proponents of this approach; subsequent work has scaled it to one-run audits and to large language model training, providing a sanity check on whether reported privacy parameters are correctly implemented.[^17]
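The conversion from attack performance to an epsilon lower bound rests on the fact that any (epsilon, delta)-differentially private mechanism forces every membership test to satisfy TPR ≤ e^epsilon · FPR + δ. The sketch below inverts that inequality for point estimates of TPR and FPR; the function name is illustrative, and published audits typically replace the raw rates with statistical confidence bounds before applying it.

```python
import math


def empirical_epsilon_lower_bound(tpr, fpr, delta=0.0):
    """Lower bound on epsilon implied by TPR <= exp(eps) * FPR + delta."""
    if fpr <= 0.0:
        raise ValueError("FPR must be positive for a finite point estimate")
    if tpr <= delta:
        # The attack is too weak to certify any positive epsilon.
        return 0.0
    return max(0.0, math.log((tpr - delta) / fpr))
```

For instance, an attack that achieves a TPR of 10 percent at an FPR of 0.1 percent against canaries implies an epsilon of at least ln(100), about 4.6, ignoring sampling error.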
Unlearning evaluation. As noted above, the MIA-based "forgetting score" has become the de facto evaluation for machine unlearning, with benchmarks tracking how indistinguishable an unlearned model's responses to forgotten examples are from a fully retrained baseline.[^18]
Copyright and provenance. When a plaintiff alleges that a model's owner trained on copyrighted material, MIA-style probes can serve as evidence. In New York Times v. OpenAI (S.D.N.Y., filed December 2023), the Times alleged that GPT-4 had ingested its articles and supplied examples in which the model produced near-verbatim continuations of paywalled stories.[^19] In Authors Guild v. OpenAI and in the Anthropic book-piracy litigation, which culminated in a 1.5-billion-dollar settlement announced in September 2025, plaintiffs likewise pointed to memorisation and extraction as proof of training-set inclusion.[^20] Carlini et al.'s diffusion-model extraction work has been cited similarly in disputes over Stable Diffusion and other image generators.[^11]
Regulatory compliance. The GDPR Article 17 right to erasure does not distinguish between database rows and model parameters; regulators have increasingly suggested that personal data embedded in a model's weights remains personal data, making MIA-based audits relevant to compliance reporting. Similar arguments arise under the California Consumer Privacy Act, although the legal status of trained parameters under CCPA is unsettled.
Security research. MIAs serve as a yardstick for proposed privacy-preserving training schemes: a defence that does not reduce LiRA TPR at low FPR is unlikely to be considered effective.
Membership inference has well-known limits as both an attack and a measurement tool.
Distribution shift confounds. Many published MIA results against large language models have been re-examined and attributed to subtle distribution differences between the supposed members and non-members rather than genuine memorisation. MIMIR was constructed in part to control for this, and Duan et al. found that once temporal and topical drift are removed, off-the-shelf attacks barely beat random guessing on Pythia models trained on The Pile across scales from 160 million to 12 billion parameters.[^13]
Heterogeneous vulnerability. LiRA established that the average TPR understates the worst-case privacy risk: a small subset of training points, typically outliers or duplicates, is far more vulnerable than the median. Mean metrics hide this, which is why the field has moved to ROC curves on a log scale and to canary-based audits that focus on the most exposed points.[^2][^17]
Computational cost. Per-example LiRA-style attacks require training tens to hundreds of shadow models. For deep learning systems on full ImageNet or The Pile scale, this cost is large; cheaper amortised variants exist but trade off statistical power.[^2]
Evaluations can mislead. Aerni, Zhang, and Tramèr in "Evaluations of Machine Learning Privacy Defenses are Misleading" (arXiv 2404.17399) argued that many defences claim victory because they were evaluated against weak attacks; under LiRA-class attacks at low FPR many proposed defences offer little additional protection over plain training with conventional regularisation.[^21]
LLM extraction is not membership. Confirming that a sequence was in the training set does not imply it can be extracted, and vice versa. The two problems share machinery (probability-based scores, shadow comparisons) but answer different questions, and confusing them has produced inflated claims on both sides.[^14]
| Attack | Goal | Typical signal |
|---|---|---|
| Membership inference | Was record x in the training set? | Confidence or loss gap between members and non-members |
| Attribute inference | What is the value of a hidden feature of x? | Conditional predictions when other features are fixed |
| Model inversion | Reconstruct a representative input for a class or person | Gradient-based inversion of the prediction |
| Training-data extraction | Reproduce a verbatim training example | Generative sampling plus membership-style ranking |
| Model stealing | Replicate the target model's parameters or behaviour | Query-response distillation |
Membership inference is the lowest-rung privacy attack, in the sense that successful extraction or attribute inference implies membership but not vice versa. It is also the easiest to evaluate rigorously, which is why it has become the universal proxy.