Linear Probes
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,911 words
A linear probe is a small linear classifier (or linear regressor) trained on the frozen internal activations of a neural network in order to test whether a particular concept, property, or label is linearly decodable from those activations. The probe's parameters are optimized while the host network's weights are held fixed, so any predictive accuracy that the probe achieves is attributed to information already present in the representation rather than to learning new computation in the underlying model.[^1] Linear probing was introduced as a general diagnostic for deep neural networks by Alain and Bengio in 2016 and has since become one of the most widely used techniques in the analysis of representations inside vision models, language models, and multimodal systems.[^1][^2][^3] Probes have been used to detect syntactic structure, factual knowledge, sentiment, truthfulness, and refusal directions in modern transformer-based models, and they remain a foundational primitive in both interpretability and mechanistic interpretability.[^4][^5][^6][^7][^8]
Because a linear probe has very limited capacity, high probe accuracy is often taken as evidence that the underlying representation has organized information about the target concept along a linear (or affine) direction. The validity of that inference, however, has been challenged: a probe can succeed for reasons that do not imply the host model itself uses the concept downstream, so a literature of selectivity tests, control tasks, and causal interventions has grown up around the basic recipe.[^9][^10][^11]
In the simplest case, let f denote a fixed neural network with L layers, and let h_ℓ(x) ∈ ℝ^d denote the activation produced at layer ℓ when input x is fed into the network. A linear probe is a function g(z) = Wz + b (often followed by a softmax for classification) whose parameters W and b are fit on a labeled dataset {(x_i, y_i)} using the loss L(g(h_ℓ(x_i)), y_i). The host network f is frozen throughout this fit; only the probe parameters are learned. Probe accuracy on a held-out set is interpreted as a measure of how well the property encoded by y is linearly decodable from layer ℓ.[^1][^9]
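The recipe above can be illustrated with a minimal sketch, assuming a binary label and using synthetic vectors as a stand-in for the frozen activations h_ℓ(x). The helper names (`train_linear_probe`, `probe_accuracy`, `fake_activation`) and the hyperparameters are illustrative assumptions, not taken from any cited paper; in practice the activations would be read out of the frozen host network rather than sampled.

```python
import math
import random

def train_linear_probe(acts, labels, lr=0.5, epochs=200):
    """Fit a binary logistic probe g(z) = w.z + b by full-batch gradient descent.

    Only w and b are learned; the activations are treated as fixed inputs.
    """
    d, n = len(acts[0]), len(acts)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for z, y in zip(acts, labels):
            logit = sum(wi * zi for wi, zi in zip(w, z)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            err = p - y  # gradient of the cross-entropy loss w.r.t. the logit
            for j in range(d):
                grad_w[j] += err * z[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_accuracy(w, b, acts, labels):
    """Fraction of examples whose predicted class (logit > 0) matches the label."""
    correct = 0
    for z, y in zip(acts, labels):
        logit = sum(wi * zi for wi, zi in zip(w, z)) + b
        correct += int((logit > 0) == (y == 1))
    return correct / len(acts)

# Hypothetical stand-in for frozen activations: 4-d vectors in which the
# label is encoded along the first coordinate, plus Gaussian noise.
rng = random.Random(0)

def fake_activation(y):
    signal = (2.0 if y else -2.0) + rng.gauss(0, 1)
    return [signal] + [rng.gauss(0, 1) for _ in range(3)]

labels = [i % 2 for i in range(400)]
acts = [fake_activation(y) for y in labels]

# Fit on one split and report accuracy on a held-out split.
w, b = train_linear_probe(acts[:300], labels[:300])
held_out_acc = probe_accuracy(w, b, acts[300:], labels[300:])
```

Because the synthetic label lives along a single coordinate, the learned weight vector concentrates on that coordinate, which is the toy analogue of a concept being linearly decodable from a layer.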
The motivation for restricting the probe to a linear map is twofold. First, modern deep networks have intermediate states of high dimensionality, and a sufficiently expressive nonlinear classifier could in principle extract a concept that is buried inside the activations only after complex transformations; if a one-layer linear function already suffices, the concept is in some operational sense "present" in the representation. Second, many downstream computations inside networks, particularly reads from the residual stream of a transformer and the dot products that produce attention scores, are themselves linear in the activations they read, so linear decodability is closer to the kind of access the rest of the network has.[^4][^12] Alain and Bengio framed their original proposal as a way to "monitor the features at every layer of a model and measure how suitable they are for classification," explicitly because such monitoring could help diagnose architectural problems and reveal the role of each layer.[^1]
Linear probing should be distinguished from fine-tuning: in fine-tuning, the parameters of the host model are updated, and any resulting gains in accuracy can come from new computation rather than from information already encoded. In probing, by contrast, only the readout is learned. Linear probing should also be distinguished from full diagnostic classifiers, which may use multilayer perceptrons or other nonlinear models; the linearity constraint is what makes probe results most directly interpretable.[^9]
The idea of training a simple classifier on top of fixed features predates deep learning by decades; representation learning has long been evaluated by training a linear classifier on top of representations produced by an unsupervised model. Within the modern deep learning era, an explicit "probe" framing was advanced in 2016 by Guillaume Alain and Yoshua Bengio in the arXiv paper Understanding intermediate layers using linear classifier probes.[^1] Working with the Inception v3 and ResNet-50 image classifiers, they trained an independent linear classifier on the activations of each intermediate layer and observed that the linear separability of class features increased monotonically with depth, producing a now widely reproduced diagnostic plot of "probe accuracy vs. layer."[^1]
In parallel, NLP researchers adopted the same recipe to ask whether learned vector representations of words and sentences encoded specific linguistic properties. Ettinger, Elgohary, and Resnik introduced classifier-based probes for semantic composition in 2016, and a sequence of follow-ups by Adi et al., Conneau et al., and Belinkov and colleagues used trained classifiers to probe sentence embeddings and the hidden states of recurrent translation systems for morphological, syntactic, and semantic properties.[^13][^3]
The 2018 to 2019 wave of contextual representations, including ELMo and BERT, triggered a surge of probing studies. John Hewitt and Christopher Manning's NAACL 2019 paper A Structural Probe for Finding Syntax in Word Representations introduced a structural probe that learns a single linear transformation under which squared L2 distances between transformed word embeddings approximate distances in a dependency parse tree, and a separate transformation under which squared L2 norms approximate tree depth.[^4] Their result, that such transformations exist for ELMo and BERT but not for non-contextual baselines, was widely interpreted as evidence that entire syntax trees are implicitly embedded in the geometry of contextual representations.[^4]
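The two quantities a structural probe compares can be sketched as follows: the squared L2 distance between two word vectors after a linear map B, the squared L2 norm of a single mapped vector, and the average absolute gap between probe distances and tree distances within one sentence. The function names and the three-word toy sentence are assumptions made for illustration; fitting B by gradient descent is omitted.

```python
def matvec(B, v):
    """Multiply matrix B (list of rows) by vector v."""
    return [sum(b_ij * v_j for b_ij, v_j in zip(row, v)) for row in B]

def probe_sq_distance(B, h_i, h_j):
    """Squared L2 distance between two word vectors after the linear map B."""
    diff = [a - b for a, b in zip(h_i, h_j)]
    return sum(x * x for x in matvec(B, diff))

def probe_sq_norm(B, h_i):
    """Squared L2 norm of a word vector after the linear map B (depth probe)."""
    return sum(x * x for x in matvec(B, h_i))

def distance_probe_loss(B, vectors, tree_dist):
    """Mean absolute gap between tree distances and probe distances over word pairs."""
    n = len(vectors)
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += abs(tree_dist[i][j] - probe_sq_distance(B, vectors[i], vectors[j]))
    return total / (n * n)

# Toy sentence of three words whose parse tree is a chain: w0 - w1 - w2.
tree_dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
vectors = [[0.0, 0.0], [2.0, 0.0], [2.0, 2.0]]  # hypothetical word vectors

good_B = [[0.5, 0.0], [0.0, 0.5]]  # rescaling under which distances match the tree
bad_B = [[0.5, 0.0]]               # projection that collapses w1 and w2

good_loss = distance_probe_loss(good_B, vectors, tree_dist)
bad_loss = distance_probe_loss(bad_B, vectors, tree_dist)
depths = [probe_sq_norm(good_B, h) for h in vectors]
```

In this toy case the squared norms under `good_B` also equal the depths 0, 1, 2 of the chain rooted at the first word, mirroring the separate depth transformation described above.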
Tenney, Das, and Pavlick's ACL 2019 paper BERT Rediscovers the Classical NLP Pipeline combined probes with a scalar mixing trick to assign weights to each layer for each task, then showed that the layers carrying information about POS tagging, parsing, named entity recognition, semantic role labeling, and coreference were arranged in approximately the order of the traditional NLP pipeline.[^5] A companion ICLR 2019 paper by Tenney and colleagues, What do you learn from context?, introduced "edge probing," in which a probe is asked to label edges of a structured object (such as a parse tree or a coreference chain) given the contextual vectors at the endpoints, and applied this design to CoVe, ELMo, OpenAI GPT, and BERT.[^14]
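A minimal sketch of scalar mixing and of the layer "center of mass" statistic, assuming one learned logit per layer that is normalized with a softmax. The function names and the example logits are hypothetical; they are chosen only to show how a task whose weights peak at a later layer receives a higher center of mass.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scalar_mix(layer_acts, mix_logits, gamma=1.0):
    """Weighted sum of per-layer vectors, with softmax-normalized layer weights."""
    s = softmax(mix_logits)
    dim = len(layer_acts[0])
    return [gamma * sum(s[l] * layer_acts[l][j] for l in range(len(s)))
            for j in range(dim)]

def center_of_mass(mix_logits):
    """Expected layer index under the softmax-normalized mixing weights."""
    s = softmax(mix_logits)
    return sum(layer * weight for layer, weight in enumerate(s))

# Hypothetical learned logits over four layers for two tasks.
pos_logits = [2.0, 1.0, 0.0, -1.0]    # weights concentrated on early layers
coref_logits = [-1.0, 0.0, 1.0, 2.0]  # weights concentrated on late layers

pos_center = center_of_mass(pos_logits)
coref_center = center_of_mass(coref_logits)
mixed = scalar_mix([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])  # equal weights
```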
By 2019, probing had become so common that a methodological critique was overdue. John Hewitt and Percy Liang's EMNLP 2019 paper Designing and Interpreting Probes with Control Tasks asked the central question: when a probe succeeds, did the representation already encode the linguistic property, or did the probe itself learn the task by memorizing word-type-to-label mappings?[^9] They proposed control tasks, in which each word type is assigned a random label drawn from a fixed distribution. Because a control task has no linguistic structure, a probe that nevertheless achieves high accuracy on it must be doing so by memorizing word identity. The authors then defined the selectivity of a probe as the difference between its accuracy on a real linguistic task and its accuracy on the matched control task. A good probe should have high linguistic accuracy and low control accuracy, that is, high selectivity.[^9] Empirically they found that small, well-regularized linear probes were more selective than larger MLP probes that achieved higher raw accuracy but were also better at the control task, sharpening the tradeoff between probe expressivity and the meaningfulness of the inference.[^9]
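The control-task construction and the selectivity score can be sketched in a few lines. This sketch assumes labels are drawn uniformly at random per word type, which is a simplification of drawing them from a fixed distribution; the function names and the accuracy values in the example are made up for illustration.

```python
import random

def make_control_labels(tokens, num_classes, seed=0):
    """Assign each word TYPE one random label and reuse it for every occurrence."""
    rng = random.Random(seed)
    type_to_label = {}
    labels = []
    for tok in tokens:
        if tok not in type_to_label:
            # Simplifying assumption: uniform sampling over the label set.
            type_to_label[tok] = rng.randrange(num_classes)
        labels.append(type_to_label[tok])
    return labels, type_to_label

def selectivity(task_accuracy, control_accuracy):
    """Accuracy on the real task minus accuracy on the matched control task."""
    return task_accuracy - control_accuracy

tokens = ["the", "cat", "saw", "the", "dog"]
control_labels, mapping = make_control_labels(tokens, num_classes=5)

# Illustrative (made-up) accuracies for a single probe on both tasks.
example_selectivity = selectivity(task_accuracy=0.9, control_accuracy=0.6)
```

Since the control labels depend only on word identity, a probe can score well on them only by memorizing the mapping, which is what the selectivity difference is meant to expose.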
The Hewitt and Liang paper has since framed best practice for probing: report accuracy on both the target task and a matched random-label control, and compare a linear probe with a small MLP, treating large discrepancies as a warning that the probe rather than the model is doing the work.[^9] A 2019 survey by Belinkov and Glass, Analysis Methods in Neural Language Processing, codified probing as one of the central techniques of NLP analysis, alongside attention visualization, behavioral testing, and erasure-based methods, and explicitly called out the probe-vs-classifier ambiguity as an open methodological issue.[^3][^15]
From 2022 onward, linear probes returned to prominence as a tool for analyzing large language models. Burns, Ye, Klein, and Steinhardt's Discovering Latent Knowledge in Language Models Without Supervision, posted to arXiv in December 2022, introduced Contrast-Consistent Search (CCS), an unsupervised method that finds a linear direction in activation space such that contradictory statements get opposite labels, and showed that this direction predicts question-answer truth at roughly 4 percent above the average zero-shot accuracy across six models and ten datasets.[^6] The same year, Meng, Bau, Andonian, and Belinkov's Locating and Editing Factual Associations in GPT (the ROME paper) combined causal tracing with linear interventions on middle-layer feed-forward modules to localize and rewrite specific factual associations in autoregressive transformers.[^7]
Samuel Marks and Max Tegmark's 2023 paper The Geometry of Truth extended the truth-direction story by showing that simple difference-in-means linear probes generalize across datasets of factual statements and that surgical interventions along the probe's direction can flip a model's behavior, providing causal as well as correlational support for a "truth direction" in large language model activations.[^16] In April 2023, Amos Azaria and Tom Mitchell posted The Internal State of an LLM Knows When It's Lying, which trained supervised classifiers on hidden-layer activations to predict statement truthfulness, reporting accuracies between 71 and 83 percent depending on the base model.[^17] Andy Arditi and colleagues released a widely cited 2024 paper, Refusal in Language Models Is Mediated by a Single Direction, showing that across thirteen open-source chat models up to 72 billion parameters, refusal behavior is mediated by a one-dimensional subspace, a single refusal direction, in activation space.[^8] In 2025, Anthropic researchers Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey extended this line to multi-trait monitoring with Persona Vectors, a framework that derives linear directions for traits such as evil, sycophancy, and hallucination propensity and uses them for both monitoring and steering.[^18]
A typical linear probing pipeline collects activations from a frozen network f on a labeled dataset, then fits a linear readout. In practice, three implementation choices dominate the result:
Beyond raw accuracy, the modern probe literature reports several diagnostics. Selectivity (accuracy on the target task minus accuracy on a matched control task) is the Hewitt and Liang standard.[^9] Generalization across distributions (does a probe trained on one truth dataset transfer to another?) is a second test, used heavily in the truth-direction work.[^16] Comparison across probe families (linear vs. nonlinear) provides a third axis: if a small linear probe already matches the accuracy of a much larger nonlinear probe, the linear hypothesis is more credible.[^9][^4]
The choice between a linear and nonlinear probe is sometimes called the probing tax tradeoff: a more expressive probe can extract concepts that are present but tangled (raising accuracy) at the cost of attributing less of that accuracy to the host model's representation.[^9] Hewitt and Liang's experiments suggested that careful regularization can shift this tradeoff (although dropout, notably, was not effective for MLPs in their setting), but that the linear probe remains the cleanest default.[^9]
A separate strand of methodology argues that purely correlational probes are not enough: even a highly selective linear probe shows only that a property is decodable, not that the model uses it. Causal probes therefore intervene in the host network during the forward pass and measure changes in behavior. Three common patterns are:
Park, Choe, and Veitch's 2023 paper The Linear Representation Hypothesis and the Geometry of Large Language Models formalized two notions of linear representation, one in output space and one in input space, and proved that they correspond to linear probing and to model steering respectively, providing a theoretical bridge between probing and causal intervention.[^19]
Probing was first used at scale to test whether contextual word representations encoded discrete linguistic structure. Hewitt and Manning's structural probe found that for ELMo and BERT there exist linear transformations under which pairwise squared L2 distances between transformed word embeddings approximate parse tree distances and squared norms approximate tree depths; non-contextual baselines such as plain word embeddings failed this test by a wide margin.[^4] Tenney, Das, and Pavlick's BERT Rediscovers the Classical NLP Pipeline reported that the center of mass for the probes corresponding to part-of-speech tagging, parsing, named entity recognition, semantic role labeling, and coreference resolution shifted to progressively higher layers, recapitulating the order in which a traditional NLP pipeline runs.[^5] Their accompanying edge probing study, What do you learn from context?, gave a more granular picture, finding that contextual encoders gave large gains over non-contextual baselines on syntactic edge tasks but smaller gains on semantic tasks, suggesting that the linear separability of semantic relations was less complete.[^14]
Linear probes for sentiment in NLP have been used since the early days of recurrent network analysis, and the result that an essentially linear "sentiment neuron" emerges in language models was popularized by an OpenAI study of byte-level recurrent networks. Probing for factual knowledge is more delicate: the ROME paper's causal tracing combined linear interventions with patching to localize the recall of subject-relation-object facts in GPT-style transformer models to middle-layer feed-forward modules acting on the last token of the subject name, and used this localization to perform Rank-One Model Editing, that is, a rank-one modification of the relevant feed-forward weight matrix.[^7] The probe-style intuition behind ROME, that linear directions in the residual stream carry compact subject representations, has since been used to understand and modify knowledge in large language models.
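The idea of a rank-one weight edit can be illustrated with a deliberately simplified sketch. It treats a feed-forward weight matrix W as a key-to-value map and adds an outer-product correction so that a chosen key k maps to a new value v*. This is not the ROME update itself, which additionally weights the key by an estimated key covariance; the sketch assumes that covariance is the identity, and all names and numbers are hypothetical.

```python
def matvec(W, k):
    """Multiply matrix W (list of rows) by vector k."""
    return [sum(w_ij * k_j for w_ij, k_j in zip(row, k)) for row in W]

def rank_one_edit(W, k, v_star):
    """Return W + (v_star - W k) k^T / (k^T k), so that the edited matrix maps k to v_star.

    Simplifying assumption: identity key covariance (plain least-squares update).
    """
    residual = [target - current for target, current in zip(v_star, matvec(W, k))]
    k_norm_sq = sum(x * x for x in k)
    return [[W[i][j] + residual[i] * k[j] / k_norm_sq for j in range(len(k))]
            for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]  # toy "feed-forward" weight matrix
k = [1.0, 2.0]                # hypothetical key for the subject
v_star = [0.0, 5.0]           # hypothetical new value to associate with k

W_edited = rank_one_edit(W, k, v_star)
new_value = matvec(W_edited, k)
orthogonal_key_value = matvec(W_edited, [2.0, -1.0])  # key orthogonal to k
```

Keys orthogonal to k are mapped exactly as before, which is the sense in which such an edit is local to one association.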
Truth-direction studies are among the most discussed applications of linear probes in current interpretability. Burns and colleagues' CCS searches for a linear direction that satisfies the logical constraint that a statement and its negation receive opposite labels, without any supervision; across six models including the GPT family and various BERT derivatives and across ten question-answering datasets, the method outperformed zero-shot accuracy by roughly four percent on average and reduced prompt sensitivity.[^6] The Geometry of Truth showed that simple difference-in-means probes find a direction that generalizes across datasets of factual statements and that interventions along this direction causally affect outputs, supporting a linear-feature interpretation of truth representation.[^16] Azaria and Mitchell's The Internal State of an LLM Knows When It's Lying trained a supervised classifier on hidden activations and reported 71 to 83 percent accuracy at distinguishing true from false statements, outperforming probability-based proxies that confound truth with sentence length and word frequency.[^17] These works are often invoked in discussions of eliciting latent knowledge, the alignment subproblem of accessing what a model "knows" beyond what it says.
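The CCS objective mentioned above combines a consistency term, which asks that the probabilities assigned to a statement and its negation sum to one, with a confidence term that penalizes the degenerate answer of 0.5 for both. A minimal sketch of that loss follows; the sigmoid probe and the function names are assumptions for illustration, and the activation normalization and optimization steps are omitted.

```python
import math

def probe_prob(theta, bias, phi):
    """Sigmoid of a linear readout on a (normalized) activation vector phi."""
    logit = sum(t * x for t, x in zip(theta, phi)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def ccs_loss(p_pos, p_neg):
    """Consistency plus confidence loss for one contrast pair."""
    consistency = (p_pos - (1.0 - p_neg)) ** 2  # p(x+) should equal 1 - p(x-)
    confidence = min(p_pos, p_neg) ** 2         # discourages p(x+) = p(x-) = 0.5
    return consistency + confidence

def mean_ccs_loss(prob_pairs):
    """Average the per-pair loss over a batch of (p_pos, p_neg) pairs."""
    return sum(ccs_loss(p, q) for p, q in prob_pairs) / len(prob_pairs)

confident_consistent = ccs_loss(1.0, 0.0)  # opposite, confident labels
degenerate = ccs_loss(0.5, 0.5)            # consistent but uninformative
contradictory = ccs_loss(1.0, 1.0)         # both statements called true
```

No labels appear anywhere in the loss, which is why the method is described as unsupervised; the sign of the recovered direction still has to be resolved separately.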
Modern interpretability for chat models has used linear probes to find directions corresponding to safety-relevant behaviors. Arditi and colleagues' 2024 paper identified a single direction in activation space such that removing the direction prevented refusal across a wide range of harmful prompts and adding it triggered refusal on harmless ones, across thirteen open-source chat models up to 72 billion parameters.[^8] The same direction enabled a white-box jailbreak that, the authors argued, disabled refusal with minimal effect on other capabilities; their methodology has been independently reproduced. The 2025 Anthropic Persona Vectors paper generalized the recipe: given a natural-language description of a trait, the team automated the construction of contrast prompts that elicit and suppress the trait, computed a difference-in-means activation, and used the resulting vector both to monitor trait expression during deployment and to steer it.[^18] Persona vectors were demonstrated for traits including evil, sycophancy, and hallucination propensity, and the open-source toolkit included a steering and monitoring pipeline.[^18]
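A minimal sketch of the difference-in-means recipe and of removing a direction from an activation, assuming two small sets of activation vectors collected on contrasting prompts. The vectors and function names are hypothetical; in practice the activations would be gathered at a chosen layer and token position of the model.

```python
import math

def mean_vector(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

def difference_in_means(pos_acts, neg_acts):
    """Mean activation on trait-eliciting prompts minus mean on contrasting prompts."""
    mu_pos, mu_neg = mean_vector(pos_acts), mean_vector(neg_acts)
    return [a - b for a, b in zip(mu_pos, mu_neg)]

def unit(v):
    """Rescale v to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def ablate_direction(h, r_hat):
    """Remove the component of h along the unit direction r_hat: h - (h . r_hat) r_hat."""
    proj = sum(a * b for a, b in zip(h, r_hat))
    return [a - proj * b for a, b in zip(h, r_hat)]

# Hypothetical activations on two contrasting prompt sets.
pos_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neg_acts = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]

direction = difference_in_means(pos_acts, neg_acts)
r_hat = unit(direction)
ablated = ablate_direction([5.0, 1.0, -1.0], r_hat)
```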
Linear probes are deployed across a wide range of analytical and engineering settings.
| Application | Representative Studies | Concept Probed |
|---|---|---|
| Syntactic structure | Hewitt and Manning 2019; Tenney et al. 2019 | Parse tree distance, POS, NER, coreference[^4][^5] |
| Semantic composition | Ettinger et al. 2016; Tenney et al. ICLR 2019 | Semantic roles, sentence-level features[^13][^14] |
| Factual knowledge | Meng et al. 2022 | Subject-relation-object recall[^7] |
| Truthfulness | Burns et al. 2022; Marks and Tegmark 2023; Azaria and Mitchell 2023 | Latent truth direction[^6][^16][^17] |
| Refusal behavior | Arditi et al. 2024 | Single refusal direction[^8] |
| Persona/character traits | Chen, Arditi, Sleight, Evans, Lindsey 2025 | Evil, sycophancy, hallucination[^18] |
| Vision representation quality | Radford et al. 2021 (CLIP) | Class linear separability[^20] |
| Linear separability through depth | Alain and Bengio 2016 | Class label[^1] |
A second common use, beyond interpretation, is representation evaluation. The linear probe accuracy of a pretrained vision encoder on ImageNet or another benchmark, computed on frozen features, is a standard summary of the quality of self-supervised and contrastive pretraining, popularized in the CLIP paper.[^20] The zero-shot vs. linear probe comparison is a defining benchmark in the CLIP evaluation suite, where linear probing on frozen CLIP features outperformed Noisy Student EfficientNet-L2 on a majority of evaluated transfer datasets.[^20] In speech and audio, similar evaluations are standard.
A third class of uses is targeted intervention. Once a probe direction is learned, it can be added to or subtracted from activations during inference to steer behavior, an approach known as activation steering when the direction is used at runtime. The refusal direction work, persona vectors, and the Geometry of Truth interventions all use the probe-derived direction not only as a measurement but as an actuator. This dual role makes the linear probe one of the most cost-effective tools in modern interpretability.[^8][^18][^16]
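The dual role of a direction as measurement and actuator can be sketched as follows: the same unit vector is used to read a scalar score from an activation and to shift the activation by a chosen amount. The names `monitor_score` and `steer` and the coefficient `alpha` are illustrative assumptions, not an API from any of the cited toolkits.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def monitor_score(h, v_hat):
    """Measurement: projection of activation h onto the unit direction v_hat."""
    return dot(h, v_hat)

def steer(h, v_hat, alpha):
    """Actuation: shift activation h by alpha along v_hat (negative alpha suppresses)."""
    return [hi + alpha * vi for hi, vi in zip(h, v_hat)]

h = [0.5, 1.0, -2.0]     # hypothetical activation at one layer and token
v_hat = [0.0, 1.0, 0.0]  # hypothetical unit-norm probe direction

before = monitor_score(h, v_hat)
steered = steer(h, v_hat, alpha=3.0)
after = monitor_score(steered, v_hat)
```

For a unit-norm direction the monitored score moves by exactly `alpha`, while components orthogonal to the direction are left unchanged.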
Linear probes have several well-documented limitations.
The most discussed limitation is that probe accuracy measures decodability rather than downstream use. A probe might recover a concept from a representation that the model itself never reads; the property is present in the activations but ignored by the subsequent computation. This complements the Hewitt and Liang concern that a probe's accuracy may reflect what the probe itself has learned rather than what the representation encodes.[^9] Causal interventions of the kind used by ROME, Geometry of Truth, and the refusal direction paper are intended to close this gap, but each shows only that some downstream computation depends on the probed direction; it does not show that the same direction is the model's "preferred" way to represent the concept.[^7][^16][^8][^11]
Selectivity studies have shown that a more expressive probe can succeed where a linear probe fails, but at the cost of weakening the inference. The selectivity gap (linguistic accuracy minus control task accuracy) tends to shrink as the probe grows more expressive.[^9] Choosing a probe family is therefore a methodological decision with substantive consequences; many recent papers report multiple probe sizes and prefer the linear probe by default.
Probes are sensitive to dataset choice. CCS, while unsupervised, is sensitive to the choice of contrast prompts and the underlying data distribution; Geometry of Truth documents how truth probes can generalize less well across logical transformations or task types than initial results suggested, motivating more rigorous benchmarks.[^16][^6] In vision, the linear probe accuracy of a representation can vary considerably with hyperparameters of the probe optimizer, especially the L2 regularization strength.[^20]
Linear probes typically recover a single direction (or low-dimensional subspace), but multiple directions may correspond to the same concept, and the probe may pick out an arbitrary representative of a high-dimensional family. Difference-in-means and contrastive losses can produce different solutions, and the linear-representation hypothesis literature has explored how to identify causal directions versus spurious ones, often using contrastive interventions.[^19][^16]
A more recent line of interpretability argues that the unit of interest should not be a single direction (which is what a probe gives you) but a decomposition of the activation space into many directions, each corresponding to an interpretable feature. Sparse autoencoders (SAEs) are trained on activations to produce a high-dimensional but sparse code in which each dimension is intended to correspond to a single human-interpretable feature; they are often described as performing dictionary learning on activations.[^21] Linear probes and SAEs have complementary strengths:
| Property | Linear Probe | Sparse Autoencoder |
|---|---|---|
| Goal | Detect one concept of interest[^1] | Decompose activations into many features[^21] |
| Supervision | Supervised (concept labels) or unsupervised contrastive (CCS) | Unsupervised reconstruction with sparsity penalty |
| Output | One linear direction | A large dictionary of directions |
| Coverage | Only the probed concept | Many features, including unknown ones |
| Causal interpretation | Direction can be intervened on[^8][^16] | Individual features can be activated or ablated[^21] |
| Failure modes | Decodability without use; control task confound[^9] | Reconstruction-interpretability tradeoff; missing features[^21] |
Empirical comparisons have produced mixed verdicts. Investigations of OthelloGPT have found that supervised linear probes can recover many more board-state features than current sparse autoencoders, while in other settings sparse autoencoders surface concepts that no one had a label for and that no probe was set up to find.[^21] The two methods are best understood as complementary: probes are precise tools when you already know what you are looking for, while sparse autoencoders are exploratory tools for finding directions you did not anticipate.[^21]
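For comparison with the single-direction probe, the forward pass and training loss of a sparse autoencoder can be sketched as below. This follows one common formulation (a ReLU encoder applied after subtracting the decoder bias, a linear decoder, and an L1 penalty on the code); the exact architecture varies between papers, and the weights here are hand-picked toy values rather than trained ones.

```python
def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode x into a non-negative feature code f, then reconstruct x from f."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    f = [max(0.0, sum(w * c for w, c in zip(row, centered)) + b)
         for row, b in zip(W_enc, b_enc)]
    x_hat = [sum(w * fk for w, fk in zip(row, f)) + b
             for row, b in zip(W_dec, b_dec)]
    return f, x_hat

def sae_loss(x, x_hat, f, l1_coeff):
    """Squared reconstruction error plus an L1 sparsity penalty on the code."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(v) for v in f)
    return recon + l1_coeff * sparsity

# Toy dictionary with four features (+e1, -e1, +e2, -e2) for 2-d activations.
W_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
W_dec = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]  # each column is a feature direction
b_dec = [0.0, 0.0]

x = [1.0, 0.0]
f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, x_hat, f, l1_coeff=0.1)
```

Each column of the decoder plays the role that a single probe direction plays in the supervised setting, which is the sense in which the dictionary contains many directions at once.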
A separate line of work formalizes the assumption underlying both approaches: the linear representation hypothesis states that high-level concepts are stored as linear directions in activation space, and Park, Choe, and Veitch's 2023 paper showed how to make this precise using counterfactuals, connecting linear probing to linear steering through a particular causal inner product.[^19]
| Method | Probe Family | Causal Test | Typical Use |
|---|---|---|---|
| Linear probe | Linear classifier on frozen activations | None inherently | Detect concept presence at a layer[^1][^9] |
| MLP probe | Small nonlinear classifier | None inherently | Detect concepts whose linear separability is partial[^9] |
| Structural probe | Linear map then geometric loss | None inherently | Recover tree structures from geometry[^4] |
| Edge probe | Span-based classifier (small MLP over pooled span vectors) | None inherently | Test relations between token spans[^14] |
| CCS | Linear classifier with logical-consistency loss | None inherently | Unsupervised truth discovery[^6] |
| Difference-in-means | Class-conditional mean difference | Direction intervention | Robust direction discovery; truth, refusal, persona[^16][^8][^18] |
| Causal tracing | Activation patching | Yes by construction | Localize computations[^7] |
| Causal scrubbing | Resampling under a hypothesis | Yes by construction | Stress-test interpretability hypotheses[^11] |
| Sparse autoencoder | Dictionary learning on activations | Per-feature intervention | Many-direction decomposition[^21] |
| Activation steering | Add or subtract a direction at inference | Yes by construction | Runtime behavior control[^8][^18] |
Linear probing sits near the top of the cost-benefit frontier: it is computationally cheap, easy to implement, and produces a single interpretable artifact. When combined with a causal intervention along the recovered direction, it crosses from a correlational diagnostic into a tool that can both measure and modify behavior. The probe direction is then often described as a feature direction, a steering vector, or, in safety-relevant contexts, a refusal direction or a persona axis.[^8][^18]
Linear probes are one of the foundational techniques of empirical interpretability. They are responsible for a substantial fraction of the field's positive results: the discovery of layer-wise linguistic pipelines in BERT,[^5] the demonstration that syntax trees are linearly embedded in contextual representations such as those of ELMo and BERT,[^4] the localization of factual knowledge in mid-layer feed-forward modules,[^7] the identification of a low-dimensional truth direction,[^6][^16][^17] the discovery of a single direction mediating refusal in chat models,[^8] and the construction of persona vectors for monitoring and steering traits.[^18] These results have shaped how researchers think about representations in deep networks and have informed several alignment research agendas, including eliciting latent knowledge and representation engineering.
At the same time, the literature has matured around a sharper methodology: control tasks and selectivity from Hewitt and Liang,[^9] causal interventions from ROME and the refusal direction work,[^7][^8] and causal scrubbing from Redwood Research,[^11] together with the formal probing/steering correspondence of Park, Choe, and Veitch.[^19] These methodological advances have made linear probing more rigorous without changing its essential simplicity: a small linear classifier trained on frozen activations remains the workhorse of representation analysis.