Linear Probes

Updated Jun 28, 2026

A linear probe is a small linear classifier (or linear regressor) trained on the frozen internal activations of a neural network to test whether a particular concept, property, or label is linearly decodable from those activations. Because the host network's weights are held fixed and only the probe's parameters are optimized, any accuracy the probe achieves is attributed to information already present in the representation rather than to new computation learned in the underlying model.^[1] Linear probing was introduced as a general diagnostic for deep neural networks by Guillaume Alain and Yoshua Bengio in 2016, and it has since become one of the most widely used techniques for analyzing the representations inside vision models, language models, and multimodal systems.^[1]^[2]^[3] Probes have been used to detect syntactic structure, factual knowledge, sentiment, truthfulness, and refusal directions in modern transformer-based models, and they remain a foundational primitive in both interpretability and mechanistic interpretability.^[4]^[5]^[6]^[7]^[8]

Because a linear probe has very limited capacity, high probe accuracy is often taken as evidence that the underlying representation has organized information about the target concept along a linear (or affine) direction. The validity of that inference, however, has been challenged: a probe can succeed for reasons that do not imply the host model itself uses the concept downstream, so a literature of selectivity tests, control tasks, and causal interventions has grown up around the basic recipe.^[9]^[10]^[11]

What is a linear probe?

In the simplest case, let f denote a fixed neural network with L layers, and let h_ℓ(x) ∈ ℝ^d denote the activation produced at layer ℓ when input x is fed into the network. A linear probe is a function g(z) = Wz + b (often followed by a softmax for classification) whose parameters W and b are fit on a labeled dataset {(x_i, y_i)} by minimizing the summed loss Σ_i ℒ(g(h_ℓ(x_i)), y_i), where ℒ is typically cross-entropy for classification or squared error for regression. The host network f is frozen throughout this fit; only the probe parameters are learned. Probe accuracy on a held-out set is interpreted as a measure of how well the property encoded by y is linearly decodable from layer ℓ.^[1]^[9]
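The fit described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than any paper's reference implementation: the function names, the plain gradient-descent optimizer, the L2 penalty, and the synthetic stand-in for activations are all assumptions made for the example. In a real study, `acts` would hold activations h_ℓ(x_i) precomputed from the frozen host model.

```python
import numpy as np

def fit_linear_probe(acts, labels, num_classes, lr=0.1, steps=300, l2=1e-3):
    """Fit g(z) = Wz + b with softmax cross-entropy on frozen activations.

    acts:   (n, d) array of activations, precomputed from the frozen host model
    labels: (n,) integer class labels
    Only W and b are updated; the host model is never touched.
    """
    n, d = acts.shape
    W = np.zeros((num_classes, d))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = acts @ W.T + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad_logits = (probs - onehot) / n           # gradient of mean cross-entropy
        W -= lr * (grad_logits.T @ acts + l2 * W)
        b -= lr * grad_logits.sum(axis=0)
    return W, b

def probe_accuracy(W, b, acts, labels):
    return float(np.mean(np.argmax(acts @ W.T + b, axis=1) == labels))

# Synthetic stand-in for activations (assumption): two classes separated
# along one fixed direction, plus isotropic Gaussian noise.
rng = np.random.default_rng(0)
d = 16
direction = rng.normal(size=d)
direction = 3.0 * direction / np.linalg.norm(direction)
labels = rng.integers(0, 2, size=400)
acts = rng.normal(size=(400, d)) + np.outer(2 * labels - 1, direction)

train, test = slice(0, 300), slice(300, 400)
W, b = fit_linear_probe(acts[train], labels[train], num_classes=2)
held_out_acc = probe_accuracy(W, b, acts[test], labels[test])
print(f"held-out probe accuracy: {held_out_acc:.3f}")
```

Accuracy is reported on a held-out split because, as in the definition above, it is held-out accuracy that is read as a measure of linear decodability.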

Alain and Bengio defined the technique compactly in their 2016 paper: "We use linear classifiers, which we refer to as 'probes', trained entirely independently of the model itself."^[1] A 2022 survey by Yonatan Belinkov gives a one-sentence summary of the whole framework: "The basic idea is simple: a classifier is trained to predict some linguistic property from a model's representations."^[15]

The motivation for restricting the probe to a linear map is twofold. First, modern deep networks have intermediate states of high dimensionality, and a sufficiently expressive nonlinear classifier could in principle extract a concept that is buried inside the activations only after complex transformations; if a one-layer linear function already suffices, the concept is in some operational sense "present" in the representation. Second, much of the downstream computation inside many networks reads activations through linear maps: in transformers, each component reads from and writes to the residual stream through linear projections, and attention scores are dot products of linearly projected queries and keys. Linear decodability is therefore closer to the kind of access the rest of the network has.^[4]^[12] Alain and Bengio framed their original proposal as a way to "monitor the features at every layer of a model and measure how suitable they are for classification," explicitly because such monitoring could help diagnose architectural problems and reveal the role of each layer.^[1]

How is linear probing different from fine-tuning?

Linear probing should be distinguished from fine-tuning: in fine-tuning, the parameters of the host model are updated, and any resulting gains in accuracy can come from new computation rather than from information already encoded. In probing, by contrast, only the readout is learned. Linear probing should also be distinguished from full diagnostic classifiers, which may use multilayer perceptrons or other nonlinear models; the linearity constraint is what makes probe results most directly interpretable.^[9]

When was linear probing introduced?

Early roots in feature analysis

The idea of training a simple classifier on top of fixed features predates deep learning by decades; representation learning has long been evaluated by training a linear classifier on top of representations produced by an unsupervised model. Within the modern deep learning era, an explicit "probe" framing was advanced in 2016 by Guillaume Alain and Yoshua Bengio in the arXiv paper Understanding intermediate layers using linear classifier probes.^[1] Working with the Inception v3 and ResNet-50 image classifiers, they trained an independent linear classifier on the activations of each intermediate layer and observed that the linear separability of class features increased monotonically with depth, producing a now widely reproduced diagnostic plot of "probe accuracy vs. layer."^[1]

In parallel, NLP researchers adopted the same recipe to ask whether learned vector representations of words and sentences encoded specific linguistic properties. Ettinger, Elgohary, and Resnik introduced classifier-based probes for semantic composition in 2016, and a sequence of follow-ups by Adi et al., Conneau et al., and Belinkov and colleagues used trained classifiers to probe sentence embeddings and the hidden states of recurrent translation systems for morphological, syntactic, and semantic properties.^[13]^[3]

Probes for contextual word representations

The 2018 to 2019 wave of contextual representations, including ELMo and BERT, triggered a surge of probing studies. John Hewitt and Christopher Manning's NAACL 2019 paper A Structural Probe for Finding Syntax in Word Representations introduced a structural probe that learns a single linear transformation under which squared L2 distances between transformed word embeddings approximate distances in a dependency parse tree, and a separate transformation under which squared L2 norms approximate tree depth.^[4] Their result, that such transformations exist for ELMo and BERT but not for non-contextual baselines, was widely interpreted as evidence that entire syntax trees are implicitly embedded in the geometry of contextual representations. As the authors put it, "the probe identifies a linear transformation under which squared L2 distance encodes the distance between words in the parse tree."^[4]
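The quantity the structural probe compares against parse-tree distance can be written down directly. The sketch below is an illustration, not the authors' code: the function names and the toy three-word sentence are hypothetical, and the absolute-difference loss normalized by the squared sentence length is an assumption about the exact objective. Only the matrix `B` would be learned; the embeddings `H` stay fixed.

```python
import numpy as np

def probe_squared_distances(H, B):
    """Squared L2 distances between all word pairs after the linear map B.

    H: (n_words, d) contextual embeddings for one sentence (frozen)
    B: (k, d) probe matrix, the only learned parameter
    """
    T = H @ B.T                           # (n_words, k) transformed embeddings
    diff = T[:, None, :] - T[None, :, :]  # all pairwise differences
    return (diff ** 2).sum(axis=-1)       # (n_words, n_words)

def structural_probe_loss(H, B, tree_dist):
    """Average absolute gap between tree distances and probe squared distances."""
    n = H.shape[0]
    return float(np.abs(tree_dist - probe_squared_distances(H, B)).sum() / n ** 2)

# Toy sentence whose parse tree is a chain w0 - w1 - w2 (hypothetical data).
tree_dist = np.array([[0.0, 1.0, 2.0],
                      [1.0, 0.0, 1.0],
                      [2.0, 1.0, 0.0]])
# Embeddings chosen so that squared distances already equal tree distances.
H = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [1.0, 1.0]])
B = np.eye(2)
loss = structural_probe_loss(H, B, tree_dist)
```

In this toy case the identity map already reproduces the tree distances exactly, so the loss is zero; for real embeddings, `B` would be optimized to make the loss small across a treebank.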

Tenney, Das, and Pavlick's ACL 2019 paper BERT Rediscovers the Classical NLP Pipeline combined probes with a scalar mixing trick to assign weights to each layer for each task, then showed that the layers carrying information about POS tagging, parsing, named entity recognition, semantic role labeling, and coreference were arranged in approximately the order of the traditional NLP pipeline.^[5] A companion ICLR 2019 paper by Tenney and colleagues, What do you learn from context?, introduced "edge probing," in which a probe is asked to label edges of a structured object (such as a parse tree or a coreference chain) given the contextual vectors at the endpoints, and applied this design to CoVe, ELMo, OpenAI GPT, and BERT.^[14]

The selectivity critique

By 2019, probing had become so common that a methodological critique was overdue. John Hewitt and Percy Liang's EMNLP 2019 paper Designing and Interpreting Probes with Control Tasks asked the central question: when a probe succeeds, did the representation already encode the linguistic property, or did the probe itself learn the task by memorizing word-type-to-label mappings?^[9] They proposed control tasks, in which each word type is assigned a random label drawn from a fixed distribution. Because a control task has no linguistic structure, a probe that nevertheless achieves high accuracy on it must be doing so by memorizing word identity. The authors then defined the selectivity of a probe as the difference between its accuracy on a real linguistic task and its accuracy on the matched control task. A good probe should have high linguistic accuracy and low control accuracy, that is, high selectivity.^[9] Empirically they found that small, well-regularized linear probes were more selective than larger MLP probes that achieved higher raw accuracy but were also better at the control task, sharpening the tradeoff between probe expressivity and the meaningfulness of the inference.^[9]
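The construction of a control task and the selectivity score can be sketched as follows. The helper names, the label distribution, and the accuracy figures in the usage lines are hypothetical values chosen for illustration, not results from the paper.

```python
import numpy as np

def make_control_labels(word_types, label_probs, seed=0):
    """Assign each distinct word type one random label, reused for every occurrence.

    word_types:  sequence of tokens (their type identity is what matters)
    label_probs: fixed distribution over label ids to sample from
    """
    rng = np.random.default_rng(seed)
    type_to_label = {}
    for w in word_types:
        if w not in type_to_label:
            type_to_label[w] = int(rng.choice(len(label_probs), p=label_probs))
    return [type_to_label[w] for w in word_types]

def selectivity(task_accuracy, control_accuracy):
    """Accuracy on the real task minus accuracy on the matched control task."""
    return task_accuracy - control_accuracy

tokens = ["the", "cat", "saw", "the", "dog", "the"]
control = make_control_labels(tokens, label_probs=[0.5, 0.3, 0.2])

# Hypothetical accuracies for two probes on the same representation.
linear_sel = selectivity(task_accuracy=0.95, control_accuracy=0.70)
mlp_sel = selectivity(task_accuracy=0.97, control_accuracy=0.90)
```

Because the control labels depend only on word identity, a probe can score well on them only by memorizing the type-to-label mapping, which is why a high control accuracy lowers selectivity.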

The Hewitt and Liang paper has since framed best practice for probing: report accuracy on both the target task and a matched random-label control, and compare a linear probe with a small MLP, treating large discrepancies as a warning that the probe rather than the model is doing the work.^[9] Belinkov and Glass's survey Analysis Methods in Neural Language Processing, posted to arXiv in December 2018, treated probing as one of the central techniques of NLP analysis, alongside attention visualization, behavioral testing, and erasure-based methods, and Belinkov's 2022 follow-up on probing classifiers called out the ambiguity between what the representation encodes and what the probe itself learns as an open methodological issue.^[3]^[15]

Probes in modern large language models

From 2022 onward, linear probes returned to prominence as a tool for analyzing large language models. Burns, Ye, Klein, and Steinhardt's Discovering Latent Knowledge in Language Models Without Supervision, posted to arXiv in December 2022, introduced Contrast-Consistent Search (CCS), an unsupervised method that finds a linear direction in activation space such that contradictory statements get opposite labels; across 6 models and 10 question-answering datasets, the method outperforms zero-shot accuracy by 4 percent on average.^[6] The same year, Meng, Bau, Andonian, and Belinkov's Locating and Editing Factual Associations in GPT (the ROME paper) combined causal tracing with linear interventions on middle-layer feed-forward modules to localize and rewrite specific factual associations in autoregressive transformers.^[7]

Samuel Marks and Max Tegmark's 2023 paper The Geometry of Truth extended the truth-direction story by showing that simple difference-in-means linear probes generalize across datasets of factual statements and that surgical interventions along the probe's direction can flip a model's behavior, providing causal as well as correlational support for a "truth direction" in large language model activations.^[16] Earlier that year, in April 2023, Amos Azaria and Tom Mitchell posted The Internal State of an LLM Knows When It's Lying, which trained supervised classifiers on hidden-layer activations to predict statement truthfulness, reporting accuracies between 71 and 83 percent depending on the base model.^[17] Andy Arditi and colleagues released a widely cited 2024 paper, Refusal in Language Models Is Mediated by a Single Direction, showing that across 13 open-source chat models up to 72 billion parameters, refusal behavior is mediated by a one-dimensional subspace of activation space, the refusal direction.^[8] In 2025, Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey extended this line to multi-trait monitoring with Persona Vectors, a framework that derives linear directions for traits such as evil, sycophancy, and hallucination propensity and uses them for both monitoring and steering.^[18]

How does linear probing work?

Training pipeline

A typical linear probing pipeline collects activations from a frozen network f on a labeled dataset, then fits a linear readout. In practice, three implementation choices dominate the result:

Pooling. For a transformer, the activation at a particular token position must be selected. Common choices include the activation at the final token (for autoregressive models), at the [CLS] token (for BERT-style encoders), or a mean over tokens.^[5]^[14]
Layer choice. Activations may be taken from any residual-stream layer, the output of an attention block, the output of a feed-forward block, or even individual neurons. Probing every layer and plotting accuracy versus depth is a standard diagnostic and was the central technique of Alain and Bengio's original paper.^[1]
Probe family. Linear probes are usually fit by logistic regression (cross-entropy) or by ridge regression for continuous targets. Some studies use "difference-in-means" probes that simply take the difference of class-conditional mean activations, a closed-form linear classifier that the Geometry of Truth paper found to generalize as well as fitted probes for truth directions.^[16]
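Two of these choices, pooling and the difference-in-means probe family, are simple enough to sketch directly. The function names, the midpoint threshold used to turn the direction into a classifier, and the synthetic data are assumptions made for illustration; published difference-in-means probes may set the bias differently.

```python
import numpy as np

def pool(token_acts, how="last"):
    """Reduce (seq_len, d) per-token activations at one layer to a single vector."""
    if how == "last":    # final token, common for autoregressive models
        return token_acts[-1]
    if how == "first":   # e.g. a [CLS]-style token at position 0
        return token_acts[0]
    if how == "mean":    # average over all token positions
        return token_acts.mean(axis=0)
    raise ValueError(f"unknown pooling: {how}")

def diff_in_means_probe(acts, labels):
    """Closed-form linear probe: direction = mean(class 1) - mean(class 0)."""
    mu1 = acts[labels == 1].mean(axis=0)
    mu0 = acts[labels == 0].mean(axis=0)
    w = mu1 - mu0
    b = -(w @ (mu0 + mu1)) / 2.0  # assumption: threshold midway between class means
    return w, b

def predict(w, b, acts):
    return (acts @ w + b > 0).astype(int)

# Synthetic pooled activations (assumption): classes offset along the all-ones direction.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)
acts = rng.normal(size=(200, 8)) + np.outer(2 * labels - 1, np.ones(8))
w, b = diff_in_means_probe(acts, labels)
train_acc = float(np.mean(predict(w, b, acts) == labels))
```

No optimizer is involved in the difference-in-means probe, which makes it cheap to recompute at every layer when plotting accuracy against depth.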

How is a linear probe evaluated? Accuracy, selectivity, and probe complexity

Beyond raw accuracy, the modern probe literature reports several diagnostics. Selectivity (accuracy on the target task minus accuracy on a matched control task) is the Hewitt and Liang standard.^[9] Generalization across distributions (does a probe trained on one truth dataset transfer to another?) is a second test, used heavily in the truth-direction work.^[16] Comparison across probe families (linear vs. nonlinear) provides a third axis: if a small linear probe already matches the accuracy of a much larger nonlinear probe, the linear hypothesis is more credible.^[9]^[4]

The choice between a linear and nonlinear probe is sometimes called the probing tax tradeoff: a more expressive probe can extract concepts that are present but tangled (raising accuracy) at the cost of attributing less of that accuracy to the host model's representation.^[9] Hewitt and Liang's experiments suggested that careful control of probe complexity can shift this tradeoff, although dropout, a common regularizer, was not effective at improving the selectivity of MLP probes in their setting; the linear probe remains the cleanest default.^[9]

Causal probes

A separate strand of methodology argues that purely correlational probes are not enough: even a highly selective linear probe shows only that a property is decodable, not that the model uses it. Causal probes therefore intervene in the host network during the forward pass and measure changes in behavior. Three common patterns are:

Activation patching. Replace the activation at some location with one drawn from a different input and measure how the output changes. Patching was central to the ROME paper's causal tracing, which used it to identify mid-layer feed-forward modules as the locus of factual recall.^[7]
Direction intervention. Project activations onto a probe direction and either set the projection to zero (ablation), scale it, or shift it; observe the resulting change in behavior. The Geometry of Truth and Refusal in Language Models Is Mediated by a Single Direction both rely heavily on this style of intervention, showing that a single linear direction is sufficient to control truthfulness reporting and refusal behavior, respectively.^[16]^[8]
Causal scrubbing. A more demanding evaluation introduced in 2022 by Lawrence Chan and colleagues at Redwood Research. An interpretability hypothesis is treated as a claim about which activations can be resampled from other inputs without changing model behavior, and the hypothesis is tested by performing those resamplings and measuring the change in behavior.^[11] Probing claims can be reformulated as causal scrubbing hypotheses, in which case the probe direction has to survive a much stricter test than achieving high decoding accuracy.^[11]
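The first two patterns reduce to small array operations on the activations at the chosen location. The sketch below uses hypothetical function names and random data, and it covers only the edit itself: in a real experiment the modified activations would be fed back through the remainder of the frozen network and the change in output measured.

```python
import numpy as np

def ablate_direction(acts, direction):
    """Zero out the component of each activation along `direction`."""
    r = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ r, r)

def add_direction(acts, direction, alpha):
    """Shift each activation by alpha units along the normalized direction."""
    r = direction / np.linalg.norm(direction)
    return acts + alpha * r

def patch_activation(clean_acts, corrupted_acts, position):
    """Activation patching: copy the clean run's activation at one position
    into a copy of the corrupted run's activations."""
    patched = corrupted_acts.copy()
    patched[position] = clean_acts[position]
    return patched

rng = np.random.default_rng(0)
acts = rng.normal(size=(5, 4))              # (positions, d), stand-in activations
direction = np.array([1.0, 2.0, 0.0, -1.0])  # stand-in probe direction
unit = direction / np.linalg.norm(direction)

ablated = ablate_direction(acts, direction)   # projection onto `unit` is now zero
steered = add_direction(acts, direction, alpha=3.0)
```

Ablation and addition differ only in what happens to the projection onto the direction: ablation sets it to zero, while addition shifts it by a chosen amount.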

The 2023 paper The Linear Representation Hypothesis and the Geometry of Large Language Models by Park, Choe, and Veitch formalized two notions of linear representation, one in output space and one in input space, and proved that they correspond to linear probing and to model steering respectively, providing a theoretical bridge between probing and causal intervention.^[19]

What can linear probes detect? Empirical results

Syntax and the classical NLP pipeline

Probing was first used at scale to test whether contextual word representations encoded discrete linguistic structure. Hewitt and Manning's structural probe found that for ELMo and BERT there exist linear transformations under which pairwise squared L2 distances between transformed word embeddings approximate parse tree distances and squared norms approximate tree depths; non-contextual baselines such as plain word embeddings failed this test by a wide margin.^[4] Tenney, Das, and Pavlick's BERT Rediscovers the Classical NLP Pipeline reported that the center of mass for the probes corresponding to part-of-speech tagging, parsing, named entity recognition, semantic role labeling, and coreference resolution shifted to progressively higher layers, recapitulating the order in which a traditional NLP pipeline runs.^[5] Their accompanying edge probing study, What do you learn from context?, gave a more granular picture, finding that contextual encoders gave large gains over non-contextual baselines on syntactic edge tasks but smaller gains on semantic tasks, suggesting that the linear separability of semantic relations was less complete.^[14]

Sentiment, factual knowledge, and ROME

Linear probes for sentiment in NLP have been used since the early days of recurrent network analysis, and the result that an essentially linear "sentiment neuron" emerges in language models was popularized by an OpenAI study of byte-level recurrent networks. Probing for factual knowledge is more delicate: the ROME paper's causal tracing combined linear interventions with patching to localize the recall of subject-relation-object facts in GPT-style transformer models to middle-layer feed-forward modules acting on the last token of the subject name, and used this localization to perform Rank-One Model Editing, that is, a rank-one modification of the relevant feed-forward weight matrix.^[7] The probe-style intuition behind ROME, that linear directions in the residual stream carry compact subject representations, has since been used to understand and modify knowledge in large language models.

Truth, lies, and CCS

Truth-direction studies are among the most discussed applications of linear probes in current interpretability. Burns and colleagues' CCS searches for a linear direction that satisfies the logical constraint that a statement and its negation receive opposite labels, without any supervision; across six models including the GPT family and various BERT derivatives and across ten question-answering datasets, the method outperformed zero-shot accuracy by roughly four percent on average and reduced prompt sensitivity.^[6] The Geometry of Truth showed that simple difference-in-means probes find a direction that generalizes across datasets of factual statements and that interventions along this direction causally affect outputs, supporting a linear-feature interpretation of truth representation.^[16] Azaria and Mitchell's The Internal State of an LLM Knows When It's Lying trained a supervised classifier on hidden activations and reported 71 to 83 percent accuracy at distinguishing true from false statements, outperforming probability-based proxies that confound truth with sentence length and word frequency.^[17] These works are often invoked in discussions of eliciting latent knowledge, the alignment subproblem of accessing what a model "knows" beyond what it says.
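The logical constraint used by CCS can be expressed as a loss on the probe's outputs for a statement and its negation. The sketch below is an assumption-laden illustration: the function names are hypothetical, the loss is written as a consistency term plus a confidence term (which rules out the degenerate solution of answering 0.5 everywhere), and preprocessing steps such as normalizing the two sets of activations are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ccs_loss(w, b, acts_pos, acts_neg):
    """Unsupervised loss for a linear truth probe on contrast pairs.

    acts_pos: (n, d) activations for each statement
    acts_neg: (n, d) activations for the matching negated statement
    """
    p_pos = sigmoid(acts_pos @ w + b)
    p_neg = sigmoid(acts_neg @ w + b)
    consistency = (p_pos - (1.0 - p_neg)) ** 2   # p(x+) should equal 1 - p(x-)
    confidence = np.minimum(p_pos, p_neg) ** 2   # discourage p(x+) = p(x-) = 0.5
    return float(np.mean(consistency + confidence))

# An uninformative probe pays only the confidence penalty.
flat = ccs_loss(np.zeros(1), 0.0, np.array([[10.0]]), np.array([[-10.0]]))

# A probe that separates the pair along the direction incurs almost no loss.
sharp = ccs_loss(np.ones(1), 0.0, np.array([[10.0]]), np.array([[-10.0]]))
```

No truth labels appear anywhere in the loss, which is what makes the method unsupervised; the sign of the recovered direction (which end means "true") has to be resolved separately.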

Refusal and persona vectors

Modern interpretability for chat models has used linear probes to find directions corresponding to safety-relevant behaviors. Arditi and colleagues' 2024 paper identified a single direction in activation space such that removing the direction prevented refusal across a wide range of harmful prompts and adding it triggered refusal on harmless ones, across 13 open-source chat models up to 72 billion parameters.^[8] The same direction enabled a white-box jailbreak that, the authors argued, disabled refusal with minimal effect on other capabilities; their methodology has been independently reproduced. The 2025 Anthropic Persona Vectors paper generalized the recipe: given a natural-language description of a trait, the team automated the construction of contrast prompts that elicit and suppress the trait, computed a difference-in-means activation, and used the resulting vector both to monitor trait expression during deployment and to steer it.^[18] Persona vectors were demonstrated for traits including evil, sycophancy, and hallucination propensity, and the open-source toolkit included a steering and monitoring pipeline.^[18]

What is linear probing used for?

Linear probes are deployed across a wide range of analytical and engineering settings.

Application	Representative Studies	Concept Probed
Syntactic structure	Hewitt and Manning 2019; Tenney et al. 2019	Parse tree distance, POS, NER, coreference^[4]^[5]
Semantic composition	Ettinger et al. 2016; Tenney et al. ICLR 2019	Semantic roles, sentence-level features^[13]^[14]
Factual knowledge	Meng et al. 2022	Subject-relation-object recall^[7]
Truthfulness	Burns et al. 2022; Marks and Tegmark 2023; Azaria and Mitchell 2023	Latent truth direction^[6]^[16]^[17]
Refusal behavior	Arditi et al. 2024	Single refusal direction^[8]
Persona/character traits	Chen, Arditi, Sleight, Evans, Lindsey 2025	Evil, sycophancy, hallucination^[18]
Vision representation quality	Radford et al. 2021 (CLIP)	Class linear separability^[20]
Linear separability through depth	Alain and Bengio 2016	Class label^[1]

A second common use, beyond interpretation, is representation evaluation. The linear probe accuracy of a pretrained vision encoder on ImageNet or another benchmark, computed on frozen features, is a standard summary of the quality of self-supervised and contrastive pretraining, popularized in the CLIP paper.^[20] The zero-shot vs. linear probe comparison is a defining benchmark in the CLIP evaluation suite, where linear probing on the best CLIP model's frozen features outperformed a Noisy Student EfficientNet-L2 baseline on 21 of 27 evaluated transfer datasets.^[20] In speech and audio, similar frozen-feature evaluations are standard.

A third class of uses is targeted intervention. Once a probe direction is learned, it can be added to or subtracted from activations during inference to steer behavior, an approach known as activation steering. The refusal direction work, persona vectors, and the Geometry of Truth interventions all use the probe-derived direction not only as a measurement but as an actuator. This dual role makes the linear probe one of the most cost-effective tools in modern interpretability.^[8]^[18]^[16]

What are the limitations of linear probing?

Linear probes have several well-documented limitations.

Decodability does not imply use

The most discussed limitation is that probe accuracy measures decodability rather than downstream use. A probe might recover a concept from a representation that the model itself never reads; the property is present in the activations but ignored by the subsequent computation. This concern is related to, but distinct from, the Hewitt and Liang worry that the probe itself may be learning the task.^[9] Causal interventions of the kind used by ROME, Geometry of Truth, and the refusal direction paper are intended to close this gap, but each only shows that some downstream computation depends on the probed direction; it does not show that the same direction is the model's "preferred" way to represent the concept.^[7]^[16]^[8]^[11]

Probe complexity confounds attribution

Selectivity studies have shown that a more expressive probe can succeed where a linear probe fails, but at the cost of weakening the inference. The selectivity gap (linguistic accuracy minus control task accuracy) tends to shrink as the probe grows more expressive.^[9] Choosing a probe family is therefore a methodological decision with substantive consequences; many recent papers report multiple probe sizes and prefer the linear probe by default.

Distributional fragility

Probes are sensitive to dataset choice. CCS, while unsupervised, is sensitive to the choice of contrast prompts and the underlying data distribution; Geometry of Truth documents how truth probes can generalize less well across logical transformations or task types than initial results suggested, motivating more rigorous benchmarks.^[16]^[6] In vision, the linear probe accuracy of a representation can vary considerably with hyperparameters of the probe optimizer, especially the L2 regularization strength.^[20]

Identifiability of directions

Linear probes typically recover a single direction (or low-dimensional subspace), but multiple directions may correspond to the same concept, and the probe may pick out an arbitrary representative of a high-dimensional family. Difference-in-means and contrastive losses can produce different solutions, and the linear-representation hypothesis literature has explored how to identify causal directions versus spurious ones, often using contrastive interventions.^[19]^[16]

How do linear probes compare to sparse autoencoders?

A more recent line of interpretability argues that the unit of interest should not be a single direction (which is what a probe gives you) but a decomposition of the activation space into many directions, each corresponding to an interpretable feature. Sparse autoencoders (SAEs) are trained on activations to produce a high-dimensional but sparse code in which each dimension is intended to correspond to a single human-interpretable feature; they are often described as performing dictionary learning on activations.^[21] Linear probes and SAEs have complementary strengths:

Property	Linear Probe	Sparse Autoencoder
Goal	Detect one concept of interest^[1]	Decompose activations into many features^[21]
Supervision	Supervised (concept labels) or unsupervised contrastive (CCS)	Unsupervised reconstruction with sparsity penalty
Output	One linear direction	A large dictionary of directions
Coverage	Only the probed concept	Many features, including unknown ones
Causal interpretation	Direction can be intervened on^[8]^[16]	Individual features can be activated or ablated^[21]
Failure modes	Decodability without use; control task confound^[9]	Reconstruction-interpretability tradeoff; missing features^[21]

Empirical comparisons have produced mixed verdicts. Investigations of OthelloGPT have found that supervised linear probes can recover many more board-state features than current sparse autoencoders, while in other settings sparse autoencoders surface concepts that no one had a label for and that no probe was set up to find.^[21] The two methods are best understood as complementary: probes are precise tools when you already know what you are looking for, while sparse autoencoders are exploratory tools for finding directions you did not anticipate.^[21]
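The contrast in outputs, one direction versus a dictionary of directions, is visible in a minimal sparse autoencoder forward pass. The architecture below (a single ReLU encoder layer, a linear decoder, and an L1 penalty on the codes) is one common formulation assumed for illustration, with hypothetical names, random untrained weights, and an arbitrary penalty coefficient; specific SAE variants differ in details.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative feature codes, then reconstruct them."""
    f = np.maximum(0.0, x @ W_enc + b_enc)  # (n, m) feature activations
    x_hat = f @ W_dec + b_dec               # (n, d) reconstruction
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff=1e-3):
    """Reconstruction error plus a sparsity penalty on the feature codes."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return float(recon + l1_coeff * sparsity)

rng = np.random.default_rng(0)
n, d, m = 32, 8, 64                 # m > d: an overcomplete dictionary
x = rng.normal(size=(n, d))         # stand-in for model activations
W_enc = rng.normal(size=(d, m)) * 0.1
b_enc = np.zeros(m)
W_dec = rng.normal(size=(m, d)) * 0.1
b_dec = np.zeros(d)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat)
```

Each row of `W_dec` plays the role that the single weight vector plays in a linear probe: it is a candidate direction in activation space, but it is learned without concept labels.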

A separate line of work formalizes the assumption underlying both approaches: the linear representation hypothesis states that high-level concepts are stored as linear directions in activation space, and Park, Choe, and Veitch's 2023 paper showed how to make this precise using counterfactuals, connecting linear probing to linear steering through a particular causal inner product.^[19]

Method	Probe Family	Causal Test	Typical Use
Linear probe	Linear classifier on frozen activations	None inherently	Detect concept presence at a layer^[1]^[9]
MLP probe	Small nonlinear classifier	None inherently	Detect concepts whose linear separability is partial^[9]
Structural probe	Linear map then geometric loss	None inherently	Recover tree structures from geometry^[4]
Edge probe	Span-based classifier over pooled contextual vectors	None inherently	Test relations between token spans^[14]
CCS	Linear classifier with logical-consistency loss	None inherently	Unsupervised truth discovery^[6]
Difference-in-means	Class-conditional mean difference	Direction intervention	Robust direction discovery; truth, refusal, persona^[16]^[8]^[18]
Causal tracing	Activation patching	Yes by construction	Localize computations^[7]
Causal scrubbing	Resampling under a hypothesis	Yes by construction	Stress-test interpretability hypotheses^[11]
Sparse autoencoder	Dictionary learning on activations	Per-feature intervention	Many-direction decomposition^[21]
Activation steering	Add or subtract a direction at inference	Yes by construction	Behavior control without retraining^[8]^[18]

Linear probing sits near the top of the cost-benefit frontier: it is computationally cheap, easy to implement, and produces a single interpretable artifact. When combined with a causal intervention along the recovered direction, it crosses from a correlational diagnostic into a tool that can both measure and modify behavior. The probe direction is then often described as a feature direction, a steering vector, or, in safety-relevant contexts, a refusal direction or a persona axis.^[8]^[18]

Why are linear probes significant?

Linear probes are one of the foundational techniques of empirical interpretability. They underlie or inform a substantial fraction of the field's positive results: the discovery of layer-wise linguistic pipelines in BERT,^[5] the demonstration that syntax trees are linearly embedded in the contextual representations of ELMo and BERT,^[4] the localization of factual knowledge in mid-layer feed-forward modules,^[7] the identification of a low-dimensional truth direction,^[6]^[16]^[17] the discovery of a single direction mediating refusal in chat models,^[8] and the construction of persona vectors for monitoring and steering traits.^[18] These results have shaped how researchers think about representations in deep networks and have informed several alignment research agendas, including eliciting latent knowledge and representation engineering.

At the same time, the literature has matured around a sharper methodology: control tasks and selectivity from Hewitt and Liang,^[9] causal interventions from ROME and the refusal direction work,^[7]^[8] and causal scrubbing from Redwood Research,^[11] together with the formal probing/steering correspondence of Park, Choe, and Veitch.^[19] These methodological advances have made linear probing more rigorous without changing its essential simplicity: a small linear classifier trained on frozen activations remains the workhorse of representation analysis.

References

1. Guillaume Alain and Yoshua Bengio, "Understanding intermediate layers using linear classifier probes", arXiv, 2016-10-05 (v1; latest v4 2018-11-22). https://arxiv.org/abs/1610.01644. Accessed 2026-05-20.
2. Guillaume Alain and Yoshua Bengio, "Understanding intermediate layers using linear classifier probes (v4)", arXiv, 2018-11-22. https://arxiv.org/abs/1610.01644v4. Accessed 2026-05-20.
3. Yonatan Belinkov and James Glass, "Analysis Methods in Neural Language Processing: A Survey", arXiv, 2018-12-21. https://arxiv.org/abs/1812.08951. Accessed 2026-05-20.
4. John Hewitt and Christopher D. Manning, "A Structural Probe for Finding Syntax in Word Representations", NAACL 2019, ACL Anthology, 2019-06. https://aclanthology.org/N19-1419/. Accessed 2026-05-20.
5. Ian Tenney, Dipanjan Das, and Ellie Pavlick, "BERT Rediscovers the Classical NLP Pipeline", ACL 2019 / arXiv, 2019-05-15. https://arxiv.org/abs/1905.05950. Accessed 2026-05-20.
6. Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt, "Discovering Latent Knowledge in Language Models Without Supervision", arXiv, 2022-12-07. https://arxiv.org/abs/2212.03827. Accessed 2026-05-20.
7. Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov, "Locating and Editing Factual Associations in GPT", arXiv, 2022-02-10. https://arxiv.org/abs/2202.05262. Accessed 2026-05-20.
8. Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda, "Refusal in Language Models Is Mediated by a Single Direction", arXiv, 2024-06-17. https://arxiv.org/abs/2406.11717. Accessed 2026-05-20.
9. John Hewitt and Percy Liang, "Designing and Interpreting Probes with Control Tasks", EMNLP 2019 / arXiv, 2019-09-08. https://arxiv.org/abs/1909.03368. Accessed 2026-05-20.
10. John Hewitt and Percy Liang, "Designing and Interpreting Probes with Control Tasks", ACL Anthology D19-1275, 2019-11. https://aclanthology.org/D19-1275/. Accessed 2026-05-20.
11. Lawrence Chan, Adrià Garriga-Alonso, Nicholas Goldowsky-Dill, Ryan Greenblatt, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, and Nate Thomas, "Causal Scrubbing: A Method for Rigorously Testing Interpretability Hypotheses", Redwood Research / AI Alignment Forum, 2022-12-03. https://www.alignmentforum.org/s/h95ayYYwMebGEYN5y/p/JvZhhzycHu2Yd57RN. Accessed 2026-05-20.
12. Nelson Elhage, Neel Nanda, Catherine Olsson, et al., "A Mathematical Framework for Transformer Circuits", Anthropic / Transformer Circuits Thread, 2021-12-22. https://transformer-circuits.pub/2021/framework/index.html. Accessed 2026-05-20.
13. Allyson Ettinger, Ahmed Elgohary, and Philip Resnik, "Probing for semantic evidence of composition by means of simple classification tasks", Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, ACL 2016, 2016-08. https://aclanthology.org/W16-2524/. Accessed 2026-05-20.
14. Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick, "What do you learn from context? Probing for sentence structure in contextualized word representations", ICLR 2019 / arXiv, 2019-05-15. https://arxiv.org/abs/1905.06316. Accessed 2026-05-20.
15. Yonatan Belinkov, "Probing Classifiers: Promises, Shortcomings, and Advances", Computational Linguistics 48(1), MIT Press / arXiv 2102.12452, 2022-03. https://direct.mit.edu/coli/article/48/1/207/107571/Probing-Classifiers-Promises-Shortcomings-and. Accessed 2026-05-20.
16. Samuel Marks and Max Tegmark, "The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets", arXiv, 2023-10-10. https://arxiv.org/abs/2310.06824. Accessed 2026-05-20.
17. Amos Azaria and Tom Mitchell, "The Internal State of an LLM Knows When It's Lying", arXiv, 2023-04-26. https://arxiv.org/abs/2304.13734. Accessed 2026-05-20.
18. Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey, "Persona Vectors: Monitoring and Controlling Character Traits in Language Models", arXiv, 2025-07-29. https://arxiv.org/abs/2507.21509. Accessed 2026-05-20.
19. Kiho Park, Yo Joong Choe, and Victor Veitch, "The Linear Representation Hypothesis and the Geometry of Large Language Models", arXiv, 2023-11-07. https://arxiv.org/abs/2311.03658. Accessed 2026-05-20.
20. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever, "Learning Transferable Visual Models From Natural Language Supervision (CLIP)", PMLR / arXiv, 2021-02-26. https://arxiv.org/abs/2103.00020. Accessed 2026-05-20.
21. Adam Karvonen, "Sparse Autoencoders find only 9/180 board state features in OthelloGPT", LessWrong / AI Alignment Forum, 2024-03-05. https://www.lesswrong.com/posts/BduCMgmjJnCtc7jKc/research-report-sparse-autoencoders-find-only-9-180-board. Accessed 2026-05-20.
