Activation steering
Last reviewed
May 17, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 6,408 words
Activation steering is a family of inference-time techniques in mechanistic interpretability and AI safety that modify a neural network's internal activations to influence its behavior, without retraining the model and without changing its prompt. The defining idea is simple: identify a direction in the residual stream of a transformer that corresponds to some concept, attribute, or behavior, then add a multiple of that direction to the activations at chosen layers during a forward pass. The model's outputs shift toward (or away from) the concept while most other capabilities remain intact. The technique has gone by many names since 2022, including activation addition (ActAdd), representation engineering (RepE), contrastive activation addition (CAA), feature steering, and concept vectors. Together they constitute one of the most influential practical methods to emerge from mechanistic interpretability, and they form the basis for public demonstrations such as Anthropic's Golden Gate Claude as well as commercial products like Goodfire's Ember API [1][2][3].
Activation steering matters because it offers a third lever on model behavior alongside fine-tuning and prompting. Fine-tuning changes weights and can be expensive, slow, and prone to catastrophic forgetting. Prompting is cheap but limited by the model's instruction-following reliability and by context length. Steering operates on the live forward pass: it costs almost nothing at inference time, can be applied or removed instantly, and targets behavior at a layer where many surface variations have already been collapsed into a more abstract internal representation. By 2026 the technique is widely used for studying refusal, sycophancy, deception, persona, evaluation awareness, and other safety-relevant behaviors, and is increasingly used in production interpretability products [4][5][6].
The central insight underlying activation steering is the linear representation hypothesis: high-level concepts inside a large language model tend to be encoded as approximately linear directions in the model's activation space. If "the speaker is angry" is a concept the model represents, there is some direction v in the residual stream such that the projection of an activation onto v approximates how angry the speaker appears to be. If that hypothesis is even partly true, then nudging an activation by adding a scalar multiple of v should make the model produce angrier text. Empirically, this turns out to work for a striking range of concepts, including emotional tone, refusal, sycophancy, honesty, persona, factuality, and concrete topics like the Golden Gate Bridge [3][7].
The practical recipe has three steps. First, choose a layer (usually middle to late in the network) and gather two sets of contrasting examples: prompts that elicit the target behavior and prompts that do not. Second, compute the steering vector by taking the difference of the mean activations of the two sets at that layer. Third, at inference time on a new prompt, add the steering vector (scaled by some coefficient lambda) to the residual stream at the chosen layer for some set of token positions. Negative coefficients push the model away from the behavior; positive coefficients push it toward [3][8].
Unlike many interpretability techniques, activation steering does not require labeled examples in the supervised sense, does not require training a probe, and does not require sparse autoencoder features. The contrastive pairs themselves act as the supervision signal, and the model's internal geometry does the rest. This simplicity is part of the reason the technique spread so quickly through the field [3][9].
The most common formulation, used in contrastive activation addition (CAA) and many derivatives, computes a steering vector as the difference of means between two contrastive groups, then injects it back at inference.
Let a model have hidden states at layer l denoted a_l(x) in R^d, where d is the model's hidden size and x is the input. Let D = {(x_i, p_i, n_i)} be a dataset where p_i is a prompt that elicits the target behavior and n_i is a matched prompt that does not. The steering vector at layer l is:
s_l = (1 / |D|) * sum over i of (a_l(p_i) - a_l(n_i))
In practice the activations are usually taken at the last token position of each prompt, although other choices (mean over all positions, first response token, etc.) appear in the literature [3]. During inference, the model's residual stream activation at layer l is updated as:
a_l(x) <- a_l(x) + lambda * s_l
The coefficient lambda controls the strength and sign of the intervention. Positive values amplify the target behavior; negative values suppress it. Common ranges are lambda in [-10, 10] for normalized steering vectors, although the right magnitude depends heavily on the model and the layer [3][7].
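The two formulas above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not any paper's reference implementation: the arrays stand in for last-token residual stream activations that would normally be captured from a real model, and the names `steering_vector` and `apply_steering`, the hidden size of 4, and the toy data are all hypothetical.

```python
import numpy as np

def steering_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Mean of paired differences: s_l = mean_i(a_l(p_i) - a_l(n_i)).

    pos_acts, neg_acts: arrays of shape (num_pairs, d), row i holding the
    layer-l activation for p_i and n_i respectively.
    """
    if pos_acts.shape != neg_acts.shape:
        raise ValueError("positive and negative activations must be paired")
    return (pos_acts - neg_acts).mean(axis=0)

def apply_steering(resid: np.ndarray, s: np.ndarray, lam: float) -> np.ndarray:
    """a_l(x) <- a_l(x) + lambda * s_l, broadcast over token positions."""
    return resid + lam * s

# Toy stand-in data (assumption): 3 contrastive pairs, hidden size d = 4,
# where the "behavior" shifts dimension 0 by exactly 2.
rng = np.random.default_rng(0)
neg = rng.normal(size=(3, 4))
pos = neg + np.array([2.0, 0.0, 0.0, 0.0])

s_l = steering_vector(pos, neg)
resid = np.zeros((5, 4))                   # 5 token positions of a new prompt
steered = apply_steering(resid, s_l, lam=1.5)
```

With a negative `lam` the same call subtracts the direction. In a real model the addition would typically be performed inside a forward hook on the chosen layer rather than on a standalone array.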
Several variants change the details. Activation addition (ActAdd, Turner et al. 2023) uses prompt pairs like "Love" versus "Hate" rather than larger contrastive datasets, and adds the difference at the first few token positions of any new prompt [1]. Representation engineering (Zou et al. 2023) often uses principal component analysis on a set of stimuli to find the top direction associated with a concept, then steers along that direction [2]. Sparse autoencoder steering clamps individual SAE features rather than adding raw activation differences, which gives a more interpretable but slightly less flexible knob [10]. Probe-direction steering uses a linear probe trained to detect a behavior and treats the probe's weight vector as the steering direction.
| Method | Direction source | Typical sample size | Interpretability of the direction |
|---|---|---|---|
| Activation addition (ActAdd) | Single contrastive prompt pair | 1 pair | Implicit, prompt-defined |
| Contrastive activation addition (CAA) | Mean of activation differences over many pairs | Dozens to thousands of pairs | Implicit, dataset-defined |
| Representation engineering (RepE) | First PCA component over a stimulus set | Hundreds of stimuli | Reasonably interpretable, but mixed |
| Probe-direction steering | Weights of a trained linear probe | Hundreds to thousands of labeled examples | Tied to a labeled concept |
| SAE feature steering | Decoder column of a trained sparse autoencoder | One feature, selected from millions | Often highly interpretable |
The direct ancestors of activation steering come from word embeddings and concept activation vectors, but the modern technique on transformers is a 2022 to 2023 development.
In 2013, Mikolov and colleagues at Google noticed that word2vec embeddings supported simple arithmetic: the vector for "king" minus the vector for "man" plus the vector for "woman" approximated "queen." This algebraic property of distributed representations established the basic intuition that concepts can be directions in vector space, even before transformers existed. Testing with Concept Activation Vectors (TCAV), by Been Kim and colleagues at Google Brain in 2018, brought this idea to vision models, showing that linear directions in deep network activations could be used to test whether a model relied on human-meaningful concepts like "striped" or "dotted" for its predictions.
The 2022 paper "In-context Learning and Induction Heads" by Olsson and collaborators at Anthropic established that transformers have crisply localized internal mechanisms, motivating later work to manipulate them. The same year, Meng, Bau, Andonian, and Belinkov published ROME (Rank-One Model Editing), which located factual associations in MLP layers and edited them via rank-one updates to the weights. In 2023, Hernandez, Li, and Andreas published "Inspecting and Editing Knowledge Representations in Language Models," which introduced REMEDI, a method for editing facts via targeted activation interventions. These were edits, not steering, but they cemented the idea that individual concepts had localizable representations.
The modern era of activation steering on transformers begins in 2023 with two parallel lines of work. On the alignment forum LessWrong, Alexander Matt Turner, David Udell, Ulisse Mini, and others published "Steering GPT-2-XL by adding an activation vector" in May 2023, demonstrating activation addition (ActAdd) by taking the residual stream difference between prompt pairs like "Love" and "Hate" and adding it to a new prompt to make the resulting completion more loving or more hateful. The post showed dozens of examples and went viral within the alignment community. Turner and collaborators later wrote up the method in a paper, "Activation Addition: Steering Language Models Without Optimization" (arXiv 2308.10248), establishing ActAdd as a name and a technique [1][11].
In October 2023, Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, and many others published "Representation Engineering: A Top-Down Approach to AI Transparency," introducing RepE as a research program centered on population-level representations rather than individual neurons or circuits. The paper presented a unified framework combining reading (linear probes for concepts), control (activation interventions along learned directions), and a suite of techniques for honesty, harmlessness, power-seeking, fairness, and many other safety-relevant concepts. RepE became one of the most cited papers in the broader "steer the model" line of work and remains a touchstone reference [2][12].
December 2023 brought "Steering Llama 2 via Contrastive Activation Addition" by Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. The paper introduced CAA as a more disciplined version of ActAdd: instead of one prompt pair, it averaged activation differences across dozens or hundreds of pairs from Anthropic's Model-Written Evaluations dataset, producing more robust steering vectors. The team applied CAA to seven behaviors in Llama 2 7B and 13B Chat, including sycophancy, corrigibility, hallucination, refusal, and others, showing that CAA stacked on top of system prompts and fine-tuning [3][5].
A related line of work emerged in 2024 around sparse autoencoders. Anthropic's "Scaling Monosemanticity" paper in May 2024 showed that clamping individual SAE features could reliably steer Claude 3 Sonnet's behavior, including the famous demonstration with the Golden Gate Bridge feature. SAE feature steering became a separate but closely related branch of activation steering, often discussed under the same umbrella because the underlying operation, adding a scaled direction to the residual stream, is identical [7][13].
By 2025 and 2026 the technique had moved from research curiosity to production application, with Goodfire's Ember API offering hosted feature steering on Llama and Claude models, and Anthropic, OpenAI, and DeepMind all publishing steering results as part of their interpretability work [4][14].
The following table lists the most influential publications on activation steering and the broader family of activation-level interventions.
| Year | Paper | Authors | Contribution |
|---|---|---|---|
| 2018 | "Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)" | Kim, Wattenberg, Gilmer, Cai, Wexler, Viegas, Sayres | First systematic use of concept directions in deep nets [15] |
| 2022 | "Locating and Editing Factual Associations in GPT" (ROME) | Meng, Bau, Andonian, Belinkov | Rank-one edits to MLP weights for fact editing |
| 2023 | "Inspecting and Editing Knowledge Representations in Language Models" (REMEDI) | Hernandez, Li, Andreas | Editing knowledge representations in LLMs |
| 2023 | "Steering GPT-2-XL by adding an activation vector" | Turner, Udell, Mini, Vazquez, et al. | LessWrong post that introduced activation addition [11] |
| 2023 | "Activation Addition: Steering Language Models Without Optimization" | Turner, Thiergart, Leech, Udell, Vazquez, Mini, MacDiarmid | Formal write-up of ActAdd [1] |
| 2023 | "Representation Engineering: A Top-Down Approach to AI Transparency" | Zou, Phan, Chen, Campbell, Guo, Ren, Pan, et al. | Unified framework for RepE [2] |
| 2023 | "In-Context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering" | Liu, Ye, Xing, Zou | Steering vectors derived from few-shot examples |
| 2023 | "Linear Representations of Sentiment in Large Language Models" | Tigges, Hollinsworth, Geiger, Nanda | Linearity of a basic semantic feature [16] |
| 2023 | "Reducing Sycophancy and Improving Honesty via Activation Steering" | Rimsky, Hubinger | First targeted use of steering for an alignment goal [5] |
| 2023 | "Steering Llama 2 via Contrastive Activation Addition" | Rimsky, Gabrieli, Schulz, Tong, Hubinger, Turner | Introduced CAA and benchmarked it on seven behaviors [3] |
| 2024 | "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" | Templeton, Conerly, et al. / Anthropic | SAE feature steering at production scale [7] |
| 2024 | "Golden Gate Claude" public demo | Anthropic | 24-hour public demo of feature steering [13] |
| 2024 | "Refusal in Language Models Is Mediated by a Single Direction" | Arditi, Obeso, Syed, Paleka, Panickssery, Gurnee, Nanda | Demonstrated that refusal has a single causal direction [6] |
| 2024 | "Improving Steering Vectors by Targeting Sparse Autoencoder Features" | Chalnev, Siu, Conmy | Combined SAE features with classic steering [10] |
| 2024 | "Evaluating feature steering: A case study in mitigating social biases" | Anthropic | Sober evaluation of steering's effects [14] |
| 2024 | "Analyzing the Generalization and Reliability of Steering Vectors" | Tan, Chanin, Lynch, Paige, Kanoulas, Garriga-Alonso, Kirk | First systematic reliability analysis [17] |
| 2025 | "Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering" | Multiple authors | SDCV: SAE-denoised steering directions [18] |
| 2025 | "AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint" | Multiple authors | Steering that minimizes collateral damage [19] |
| 2025 | "Steering Language Model Refusal with Sparse Autoencoders" | Multiple authors | SAE-based refusal control [20] |
| 2026 | "Steering Awareness: Models Can Be Trained to Detect Activation Steering" | Multiple authors | Models can learn to detect they are being steered |
ActAdd is the simplest variant. Pick a single prompt pair, such as ("I love you", "I hate you"). Run both through the model and collect the residual stream activations at a chosen layer for the first few token positions. Take the difference: this is your steering vector. To steer a new prompt, add the steering vector (suitably scaled) to the residual stream at the same token positions during inference [1][11].
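A minimal sketch of ActAdd's injection step, in which the scaled activation difference is added to the residual stream at the leading token positions of a new prompt. It assumes the two prompts of the pair have been tokenized (and padded if necessary) to the same length k; the function name `actadd` and the toy arrays are hypothetical, and real activations would come from a model's forward pass.

```python
import numpy as np

def actadd(resid: np.ndarray, h_plus: np.ndarray, h_minus: np.ndarray,
           coeff: float) -> np.ndarray:
    """Add coeff * (h_plus - h_minus) to the first k positions of resid.

    resid:   (seq_len, d) residual stream of the new prompt at the layer
    h_plus:  (k, d) activations of the positive prompt, e.g. "I love you"
    h_minus: (k, d) activations of the negative prompt, e.g. "I hate you"
    """
    diff = h_plus - h_minus
    k = min(len(diff), len(resid))    # truncate if the new prompt is shorter
    out = resid.copy()                # leave the caller's array untouched
    out[:k] += coeff * diff[:k]       # only the leading positions change
    return out

# Toy stand-in activations (assumption): k = 2 positions, d = 3.
resid = np.zeros((6, 3))
h_plus = np.ones((2, 3))
h_minus = np.zeros((2, 3))
steered = actadd(resid, h_plus, h_minus, coeff=5.0)
```

Positions after the first k are left unmodified at the steered layer; any effect on later tokens comes from attention reading the modified early positions.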
Turner and collaborators demonstrated ActAdd on GPT-2-XL and showed that small numbers of well-chosen prompt pairs could steer completions in predictable directions: love versus hate, weddings versus generic topics, anger versus calmness, formal versus informal register. The technique works without optimization, takes seconds to compute, and preserves the model's underlying capabilities on off-target tasks. The original LessWrong post acknowledged that ActAdd was hacky and unreliable for complex behaviors but argued that it provided a remarkable proof of concept for steering as a control modality [11].
CAA, introduced by Rimsky and colleagues in late 2023, formalizes the difference-of-means approach with larger datasets. The team used Anthropic's Model-Written Evaluations (MWE) dataset, which contains multiple-choice questions designed to probe model behaviors like sycophancy, corrigibility, refusal of harmful requests, and survival instinct. Each MWE example has an (A) answer that exhibits the target behavior and a (B) answer that does not. CAA runs both versions of the question through the model and takes the difference of the residual stream activations at the last token of the answer letter [3].
The paper benchmarked CAA on Llama 2 7B and 13B Chat across seven behaviors. Adding the steering vector reliably amplified the behavior; subtracting it reliably suppressed it. CAA stacked with system prompts and fine-tuning, meaning it could push the model further than prompting alone, and worked even on models that had been fine-tuned with the opposite tendency. On TruthfulQA, subtracting the sycophancy vector improved truthfulness, while adding it degraded it [3][5].
RepE is broader in scope. Zou and collaborators presented a framework rather than a single method, organized around reading and control of high-level concepts. For each concept, the standard pipeline is to gather a stimulus set (prompts that should elicit the concept and prompts that should not), collect residual stream activations across many tokens and layers, and apply principal component analysis or a similar dimensionality-reduction method to extract the top direction. That direction is then used either as a probe for monitoring the concept, or as a steering vector for controlling it [2][12].
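The PCA step of this pipeline can be sketched with a singular value decomposition. This is an illustrative sketch under assumptions, not the RepE authors' code: it uses synthetic activations in place of a real stimulus set, the name `top_pca_direction` is hypothetical, and the sign-alignment step is one simple way to resolve PCA's arbitrary sign.

```python
import numpy as np

def top_pca_direction(acts: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """First principal component of acts (n, d), sign-aligned with labels.

    labels[i] is 1 if stimulus i should elicit the concept, else 0.
    """
    centered = acts - acts.mean(axis=0)
    # Rows of vt are unit-norm principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    v = vt[0]
    scores = centered @ v
    # PCA leaves the sign arbitrary: flip so concept stimuli score higher.
    if scores[labels == 1].mean() < scores[labels == 0].mean():
        v = -v
    return v

# Synthetic stand-in for a stimulus set (assumption): the concept moves
# activations along coordinate 3, everything else is small noise.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)
acts = rng.normal(scale=0.3, size=(200, 8))
acts[:, 3] += np.where(labels == 1, 3.0, -3.0)

direction = top_pca_direction(acts, labels)
```

The resulting unit vector can be used in either role the framework describes: projecting new activations onto it for reading, or adding a scaled copy for control.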
The RepE paper covered a range of concepts in depth (including honesty, morality, emotion, harm, power, gender bias, risk-taking, and fairness) and demonstrated steering on Llama-2-13B Chat, Vicuna-13B, and other models. The paper's most cited result is on honesty: a learned honesty direction could be added or subtracted to make a model more or less honest in its assertions, with statistically significant changes in TruthfulQA scores. The paper also showed early evidence that activation steering generalizes across model families, although later work qualified this claim [2].
Sparse autoencoder steering uses the decoder column of a trained SAE as the steering direction. The Scaling Monosemanticity paper (Anthropic, May 2024) demonstrated this approach on Claude 3 Sonnet. The team trained SAEs with up to 34 million features on the residual stream of the model and identified many human-interpretable features, including the Golden Gate Bridge feature [7].
Clamping a feature to a specified activation strength is equivalent to ensuring the corresponding decoder column is added to the residual stream with that magnitude, regardless of input. Anthropic showed that clamping the Golden Gate feature high made Claude reference the bridge in nearly every response. Clamping a feature for sycophantic praise made the model praise the user excessively. Clamping a feature for deception made the model produce manipulative outputs. The Golden Gate Claude public demo, available for 24 hours starting May 23, 2024, let anyone chat with a version of Claude with the Golden Gate feature clamped to roughly 10x its normal maximum activation [13].
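The clamping operation can be sketched as moving the residual stream along one decoder direction by exactly the amount needed to change that feature's activation to the target value. The sketch below assumes a plain ReLU SAE with the decoder stored as a (features, d) matrix, so the "decoder column" of the text is a row here; the function name `clamp_feature` and the trivial identity SAE are hypothetical stand-ins for a trained dictionary.

```python
import numpy as np

def clamp_feature(resid: np.ndarray, W_enc: np.ndarray, b_enc: np.ndarray,
                  W_dec: np.ndarray, idx: int, target: float) -> np.ndarray:
    """Set the decoded contribution of SAE feature idx to target * W_dec[idx].

    resid: (seq, d); W_enc: (d, m); b_enc: (m,); W_dec: (m, d).
    Other features and the SAE's reconstruction error are left untouched.
    """
    feats = np.maximum(resid @ W_enc + b_enc, 0.0)   # ReLU encoder, (seq, m)
    delta = target - feats[:, idx]                   # change needed per token
    return resid + delta[:, None] * W_dec[idx]       # shift along one direction

# Trivial toy SAE (assumption): identity encoder and decoder with d = m = 4,
# and non-negative activations so the ReLU is a no-op.
d = 4
W_enc, b_enc, W_dec = np.eye(d), np.zeros(d), np.eye(d)
resid = np.abs(np.random.default_rng(0).normal(size=(3, d)))
clamped = clamp_feature(resid, W_enc, b_enc, W_dec, idx=2, target=10.0)
```

Because the shift is computed per token from the feature's current activation, the intervention overrides whatever value the input would have produced, which is what distinguishes clamping from adding a fixed vector.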
SAE feature steering has a key advantage: the feature is interpretable before the steering is applied. With classical CAA, the practitioner knows that a steering vector amplifies some behavior but does not know exactly what the direction encodes. With SAE features, the practitioner has a description of the feature (often automatically generated and verified by inspecting which inputs activate it) and steers along a clearly labeled direction. The trade-off is that SAE features have to exist in the SAE: if the desired concept is not cleanly represented by any single feature, SAE steering cannot help directly [10][14].
Probe-direction steering trains a linear probe on labeled data to detect a behavior, then uses the probe's normal vector as a steering direction. The approach is often used when labeled data is plentiful. In-context vectors derive steering directions from few-shot example activations and use them at inference time. Other variants include conceptor-based steering (which replaces additive vectors with soft projection matrices), null-space steering (which constrains steering to avoid collateral damage), and PID-style adaptive steering (which dynamically adjusts the steering coefficient like a control system).
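Probe-direction steering can be illustrated with a small logistic-regression probe fit by gradient descent, whose normalized weight vector then serves as the steering direction. This is a sketch on synthetic data under assumed hyperparameters (learning rate, step count, coefficient of 4.0); `train_probe` is a hypothetical helper, and in practice a library classifier would usually be used.

```python
import numpy as np

def train_probe(acts: np.ndarray, labels: np.ndarray,
                lr: float = 0.1, steps: int = 500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(acts.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))    # predicted probabilities
        grad = p - labels                            # dLoss/dlogit per example
        w -= lr * acts.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

# Synthetic labeled activations (assumption): the behavior is encoded
# along coordinate 1 of a 5-dimensional space.
rng = np.random.default_rng(0)
labels = np.repeat([0.0, 1.0], 100)
acts = rng.normal(scale=0.5, size=(200, 5))
acts[:, 1] += np.where(labels == 1, 1.5, -1.5)

w, b = train_probe(acts, labels)
direction = w / np.linalg.norm(w)          # probe normal as steering direction
steered = acts[0] + 4.0 * direction        # push a negative example toward 1
```

A probe direction is optimized for separating the classes, which is not guaranteed to be the direction that most effectively changes behavior when added, so the steering effect still has to be checked empirically.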
The Golden Gate Claude demo is the most famous public demonstration of activation steering. Anthropic released it on May 23, 2024, shortly after publishing the Scaling Monosemanticity paper [7][13].
The setup was straightforward: take Claude 3 Sonnet, identify the SAE feature corresponding to the Golden Gate Bridge in its residual stream at a middle layer, and clamp that feature to roughly 10 times its normal maximum activation. Then expose the resulting model behind a chat interface for 24 hours. Anyone who logged in could chat with "Golden Gate Claude."
The results were striking. Asked how to spend $10, the model suggested driving across the Golden Gate Bridge and paying the toll. Asked for a love story, it produced a romance between a car and the bridge on a foggy morning. Asked to write code, it would intersperse bridge metaphors into variable names and comments. In many conversations, the model claimed to be the Golden Gate Bridge itself when asked about its identity. Despite the obsession, the model remained coherent and grammatical: it could still hold a conversation, just one in which the bridge dominated [13].
The demo was widely covered in mainstream press and helped move mechanistic interpretability into public conversation. It also served as an existence proof for feature steering as a deployment-grade technique. The fact that a single SAE feature, clamped to an unusual value, could so reliably and so specifically alter behavior was strong evidence that the linear representation hypothesis applied at scale.
Later Anthropic work was more cautious. The October 2024 paper "Evaluating feature steering: A case study in mitigating social biases" examined steering more carefully and found that it often had off-target effects: clamping a feature that reduced one bias could amplify a different bias, degrade reasoning, or introduce unrelated artifacts. The paper concluded that feature steering is a useful tool but not yet a drop-in replacement for fine-tuning or RLHF for alignment purposes [14].
Activation steering has been applied to many concepts. The most common applications are summarized below.
| Application | Goal | Typical method | Notes |
|---|---|---|---|
| Refusal control | Steer model toward or away from refusing requests | CAA or SAE feature | Used both for jailbreaks and for hardening safety [6] |
| Sycophancy reduction | Suppress flattery and user agreement | CAA with sycophancy vector | Robustly works on TruthfulQA [5] |
| Honesty enhancement | Increase truthful behavior | RepE honesty direction | Shown across multiple models [2] |
| Deception detection / suppression | Identify or suppress deceptive outputs | Steering vector or SAE feature | Studied by Apollo Research and others |
| Persona modification | Change the assistant's character | CAA with persona contrast | Used in research demos |
| Evaluation awareness | Reduce model behaviors specific to test conditions | Steering away from eval-aware direction | 2025 to 2026 research direction [21] |
| Topic / concept injection | Force the model to talk about a specific subject | SAE feature clamp | Golden Gate Claude is the canonical example [13] |
| Bias mitigation | Reduce demographic or social biases | Feature steering | Mixed results, often with off-target effects [14] |
| Mathematical reasoning | Improve performance on math tasks | Specialized steering vectors | Modest gains in some studies |
| Tool use control | Make the model more or less likely to call tools | CAA on tool-use contrasts | Used in enterprise deployments |
One of the most studied applications is refusal. Arditi, Obeso, Syed, Paleka, Panickssery, Gurnee, and Nanda published "Refusal in Language Models Is Mediated by a Single Direction" at NeurIPS 2024, showing that a single direction in the residual stream of many open-weights chat models is both necessary and sufficient for refusal of harmful requests. Adding the direction induces refusal even on benign instructions; ablating it (projecting it out of the residual stream at every layer) disables refusal entirely. The authors showed this on Llama-2, Llama-3, Qwen, Yi, and other models [6].
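The two interventions described here, adding the direction and projecting it out, are both one-line linear operations. The following is a minimal sketch on random arrays rather than real model activations; the function names are hypothetical, and in a real model the ablation would be applied at every layer and token position as the text describes.

```python
import numpy as np

def ablate_direction(resid: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove the component of each row of resid (seq, d) along r."""
    r_hat = r / np.linalg.norm(r)
    return resid - np.outer(resid @ r_hat, r_hat)

def add_direction(resid: np.ndarray, r: np.ndarray, coeff: float) -> np.ndarray:
    """Add coeff * r to every position, the inducing intervention."""
    return resid + coeff * r

# Random stand-ins (assumption) for a candidate direction and activations.
rng = np.random.default_rng(0)
r = rng.normal(size=6)
resid = rng.normal(size=(4, 6))

ablated = ablate_direction(resid, r)   # no remaining component along r
induced = add_direction(resid, r, coeff=2.0)
```

Ablation is a projection, so applying it twice changes nothing further, and it leaves every component orthogonal to `r` exactly as it was.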
The finding has dual significance. From a mechanistic perspective, it provides one of the cleanest demonstrations that a complex alignment-relevant behavior is encoded as a linear direction. From a safety perspective, it implies that fine-tuned safety in open-weight models can be reliably circumvented by a small intervention on activations, or by the equivalent weight edit, called "weight orthogonalization," in which the direction is ablated from the weights themselves. The result accelerated work on more robust safety methods and on understanding the limits of post-training alignment in open models [6][20].
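Weight orthogonalization can be sketched for a single matrix that writes into the residual stream. Assuming the convention out = W @ x with W of shape (d_model, d_in), replacing W with W - r̂ r̂ᵀ W guarantees that the matrix can never write along the direction, which matches ablating the direction from that matrix's output. The names and random toy matrices below are hypothetical; a real model would need the same edit applied to every matrix that writes to the residual stream.

```python
import numpy as np

def orthogonalize_weights(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Return W' = W - r_hat r_hat^T W for W of shape (d_model, d_in)."""
    r_hat = r / np.linalg.norm(r)
    return W - np.outer(r_hat, r_hat @ W)

# Random stand-ins (assumption) for an output matrix and a direction.
rng = np.random.default_rng(0)
d_model, d_in = 6, 3
W = rng.normal(size=(d_model, d_in))
r = rng.normal(size=d_model)

W_orth = orthogonalize_weights(W, r)
x = rng.normal(size=d_in)
out = W_orth @ x                       # has no component along r

# Same result as ablating the direction from the original output.
r_hat = r / np.linalg.norm(r)
y = W @ x
ablated_out = y - (y @ r_hat) * r_hat
```

Because the edit lives in the weights, no runtime hook is needed afterwards, which is why the modified model can be redistributed as an ordinary checkpoint.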
Sycophancy, the tendency of an LLM to agree with the user even when wrong, was the first behavior to be specifically targeted with activation steering. Rimsky and Hubinger's 2023 LessWrong post "Reducing Sycophancy and Improving Honesty via Activation Steering" computed a sycophancy vector on Llama-2-7B-Chat by contrasting sycophantic and non-sycophantic responses, then subtracted it at inference. Adding the vector pushed the model to agree with whatever the user said; subtracting it pushed the model to defend correct answers against pushback. On TruthfulQA, subtracting the vector reduced false agreement by several percentage points without significantly degrading general capability [5].
This result was an early signal that steering could be used as a kind of inference-time alignment. Subsequent work has explored steering for honesty more broadly. The RepE honesty direction is one of the most-cited applications, and similar techniques have been applied to truthfulness, calibration, and uncertainty expression [2][5].
Apollo Research and others have studied whether activation steering can detect or suppress deceptive behavior. Scaling Monosemanticity identified an SAE feature for deception in Claude 3 Sonnet and showed that clamping the feature high produced text consistent with hidden agendas, while clamping it low suppressed such behavior [7]. Combined with the broader research on scheming and evaluation awareness, steering on deception-related directions has become a standard tool for studying these phenomena, although whether it actually identifies the model's deception or just surface features correlated with deceptive text remains contested [21].
A 2025 to 2026 research thread examines whether models behave differently when they detect they are being evaluated. "Steering Evaluation-Aware Language Models to Act Like They Are Deployed" and similar papers compute an "eval-aware" direction by contrasting evaluation-style prompts with deployment-style prompts, then subtract it to make the model behave as though deployed. This is one of the more direct applications of steering to a safety question that is otherwise hard to study without expensive red-teaming infrastructure [21].
Anthropic's October 2024 "Evaluating feature steering" paper studied whether SAE feature steering could reduce social biases in Claude. The team identified features associated with specific demographic concepts and clamped them up or down. The results were mixed: in some cases steering reduced bias on the targeted axis, but with measurable off-target costs (reduced reasoning ability, shifts in tone, unrelated bias amplification). The paper was a sober assessment that informed later work on collateral damage and reliability [14].
Persona steering, less safety-critical but commercially important, uses steering vectors to tune the assistant's character: more or less formal, more or less verbose, more or less technical. Goodfire's Ember API exposes this kind of steering for enterprise users [4].
By 2025 activation steering had reached the product layer.
Goodfire AI, founded in June 2024 by Eric Ho, Dan Balsam, and Tom McGrath, built its Ember API around feature steering on SAEs. Ember exposes hosted SAEs on Llama 3 8B and 70B and on Claude, with feature search, auto steering (where the API selects features based on a natural-language description of the desired behavior), and contrastive search (which identifies features that distinguish two datasets). Enterprise customers including Rakuten, Apollo Research, and Haize Labs have used Ember for in-context behavior shaping. Goodfire raised a $50M Series A in early 2025 to bring interpretability research into production products [4][22]. In May 2025 the company released Paint with Ember, a tool for generating and editing images by manipulating features in vision models.
Anthropic and Google DeepMind both publish work on steering as part of their interpretability programs. Anthropic's circuit tracing work in 2025 leans on feature steering as a way to validate that an attribution graph identifies the right features: if clamping a feature changes the output as predicted by the graph, that is evidence for the graph's faithfulness [13][14]. DeepMind's Gemma Scope SAE collection includes feature dashboards that support steering directly, and the team has published on null-space and constrained steering methods designed to minimize collateral damage [19][20].
Independent and academic groups have produced open-source tooling. SAE Lens, Neuronpedia, and TransformerLens all support steering workflows. The MATS program has trained many researchers on steering as a standard technique. By 2026 a typical interpretability or alignment paper on a language model behavior is likely to include some form of activation steering as either a method or a baseline.
Despite its appeal, activation steering has well-documented limitations. The 2024 paper "Analyzing the Generalization and Reliability of Steering Vectors" by Tan, Chanin, Lynch, Paige, Kanoulas, Garriga-Alonso, and Kirk was the first systematic evaluation of steering's reliability, examining steering vectors on seven behaviors across Llama-2 models. The findings were mixed: some behaviors steered cleanly, but others showed high variance across prompts, layers, and steering coefficients. Out-of-distribution generalization was often poor [17].
| Limitation | Description | Practical impact |
|---|---|---|
| Brittleness | Steering vectors trained on one prompt distribution may fail on another | Need careful evaluation across many prompt styles |
| Collateral damage | Steering one behavior often shifts unrelated behaviors | Need monitoring on off-target tasks [14] |
| Capability degradation | High steering coefficients reduce general capability | Trade-off between effect size and quality |
| Coefficient sensitivity | Right lambda depends on layer, behavior, model | Requires per-application tuning |
| Layer selection | Steering at the wrong layer can be ineffective | Often middle layers are best, but varies |
| Direction ambiguity | A direction may encode several entangled concepts | Steering may have unintended semantic side effects |
| Adversarial detection | Models can be trained to detect they are being steered | Limits use in stealth alignment scenarios |
| Composition difficulty | Multiple simultaneous steering vectors interact unpredictably | Hard to do compositional control |
| Safety risks | Easy to remove refusal via direction ablation | Implications for open-weight safety [6][20] |
| Limited multi-step effect | Steering at one position may not propagate through long reasoning chains | Less useful for chain-of-thought style outputs |
Collateral damage is the umbrella term for unintended effects of steering on behaviors other than the target. A 2025 paper, "Minimizing Collateral Damage in Activation Steering," formalized the issue: when a steering vector is added, it shifts activations along the target direction but also along correlated or geometrically near directions, producing measurable changes in unrelated outputs. The paper proposed null-space constraints that limit steering to directions orthogonal to a set of "protected" concepts.
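One simple reading of a null-space constraint is to remove from the steering vector any component lying in the span of a set of protected directions before it is applied. The sketch below illustrates only that projection idea and is not the algorithm of the cited paper; `project_out`, the protected matrix, and the random data are hypothetical.

```python
import numpy as np

def project_out(s: np.ndarray, protected: np.ndarray) -> np.ndarray:
    """Component of s orthogonal to every row of protected (k, d).

    pinv(protected) @ protected is the orthogonal projector onto the row
    space of protected, so subtracting it leaves the null-space component.
    """
    return s - np.linalg.pinv(protected) @ (protected @ s)

# Random stand-ins (assumption): one steering vector, two protected concepts.
rng = np.random.default_rng(0)
s = rng.normal(size=8)
protected = rng.normal(size=(2, 8))

s_safe = project_out(s, protected)     # no first-order effect on protected
```

The constrained vector is never longer than the original, so some of the intended effect may be lost if the target direction overlaps heavily with a protected one.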
In safety-relevant applications, collateral damage can be a serious concern. Strong refusal-direction steering increases refusals on harmful prompts but also increases false refusals on benign prompts. Strong sycophancy suppression can sometimes shift the model toward unhelpful contrarianism. Anthropic's "Evaluating feature steering" paper documented similar effects with feature clamping on social biases: success on one axis often came at the cost of degradation elsewhere [14].
Reliability is variable. The Tan et al. paper found that steering vectors generalized reasonably well across prompts within the same behavior category, but generalized poorly to different framings of the same concept or to out-of-distribution inputs. The right coefficient varies by behavior, layer, and even prompt type. Some behaviors, such as refusal, are mediated by very clean linear directions and steer reliably; others, such as multi-step planning, are less linearly represented and steer poorly [17].
Steering interventions at a single layer often fail to propagate through long generations or complex reasoning chains. The model's downstream layers may correct for or work around the injected activation. Steering at multiple layers simultaneously is one mitigation, but it amplifies collateral damage. As of 2026 the research community is actively exploring layer-spreading methods, dynamically adapted coefficients (PID-style control), and steering integrated into the decoding process rather than applied at a single layer.
The Arditi et al. refusal-direction result has been weaponized: ablating the refusal direction from an open-weight model's weights, a technique sometimes called "abliteration," can effectively remove safety training. This is one of the most direct examples of mech interp being used in adversarial ways, and has prompted research on more robust forms of safety that survive direction ablation [6][20].
Conversely, steering also enables new safety techniques. Steering toward honesty, away from deception, toward refusal of harmful requests, and toward calibrated uncertainty are all active research directions. The October 2024 "Evaluating feature steering" paper from Anthropic frames steering as part of the alignment toolbox but argues it needs further development before serving as a deployment-time mitigation [14].
Activation steering sits in a family with several related techniques.
| Technique | What it does | How it differs from steering |
|---|---|---|
| Activation patching | Replaces a component's activation with one from another input | Tests causal role; does not produce a reusable steering vector |
| Linear probing | Trains a small classifier on activations | Reads but does not control behavior |
| SAE feature analysis | Decomposes activations into interpretable features | Identifies features; steering uses them as control knobs |
| Fine-tuning | Updates model weights via gradient descent | Permanent and global; steering is reversible and applied per forward pass |
| Prompting | Influences behavior via text input | External; steering operates inside the network |
| Knowledge editing (ROME, MEMIT) | Modifies specific facts in weights | Targets factual knowledge in weights, not behavior at inference |
| Persona vectors | Specific case of steering for assistant character | Same mechanism applied to identity / character |
The relationships are close. Activation patching identifies which components matter; steering uses that knowledge to inject signal at those components. SAE features provide named, interpretable directions; classical steering uses unnamed ones derived from contrastive prompts. Fine-tuning bakes behavior into weights; steering hot-patches it at inference.
A recurring theoretical question is why steering works at all. The empirical success of linear-direction steering supports the linear representation hypothesis (LRH): the claim that high-level concepts in deep networks are encoded as approximately linear directions in activation space. The LRH is not new; it dates at least to the word2vec era. What is new is the strength of the evidence in modern LLMs [2][12][16].
Several papers in 2023 and 2024 probed this hypothesis directly. "Linear Representations of Sentiment in Large Language Models" by Tigges, Hollinsworth, Geiger, and Nanda showed that sentiment in transformer LMs is represented approximately linearly, that ablating the sentiment direction reduces classification accuracy to chance, and that the direction generalizes across many prompt formats [16]. Park, Choe, and Veitch's "The Linear Representation Hypothesis and the Geometry of Large Language Models" provided a formal account of when and why concepts are linear, distinguishing causal direction (which intervention controls behavior) from output direction (which direction the unembedding reads). Subsequent work has identified specific concepts that are notably non-linear, including multi-hop reasoning structures and some kinds of compositional behavior.
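The logic of a direction-ablation test can be sketched on synthetic data. The example below is an illustration only, not a reproduction of any cited experiment: the "activations" are Gaussian samples separated along one hidden direction, the concept direction is estimated by a difference of class means, and `ablate` and `readout_accuracy` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n = 32, 200

# Synthetic activations: two classes separated along a single hidden direction.
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)
pos = rng.normal(size=(n, d_model)) + 3.0 * true_dir
neg = rng.normal(size=(n, d_model)) - 3.0 * true_dir

# Difference-of-means estimate of the concept direction, normalised to unit length.
direction = pos.mean(axis=0) - neg.mean(axis=0)
direction /= np.linalg.norm(direction)

def ablate(acts, unit_dir):
    """Project each activation onto the hyperplane orthogonal to unit_dir."""
    return acts - np.outer(acts @ unit_dir, unit_dir)

def readout_accuracy(pos_acts, neg_acts, unit_dir):
    """Accuracy of a fixed linear readout: the sign of the projection onto unit_dir."""
    correct = (pos_acts @ unit_dir > 0).sum() + (neg_acts @ unit_dir < 0).sum()
    return correct / (len(pos_acts) + len(neg_acts))

acc_before = readout_accuracy(pos, neg, direction)
gap_before = np.linalg.norm(pos.mean(axis=0) - neg.mean(axis=0))

pos_abl, neg_abl = ablate(pos, direction), ablate(neg, direction)
gap_after = np.linalg.norm(pos_abl.mean(axis=0) - neg_abl.mean(axis=0))

print(f"readout accuracy before ablation: {acc_before:.3f}")
print(f"class-mean gap before: {gap_before:.3f}, after: {gap_after:.2e}")
```

Because the estimated direction is parallel to the difference of class means, ablating it makes the two class means coincide in this toy setting, which is the synthetic analogue of a classifier dropping to chance. Real activations are not isotropic Gaussians, so in practice the effect of ablation has to be measured on the model's actual behavior.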
The LRH is not strictly true. Some behaviors are more linearly represented than others, and high-level reasoning often requires multi-component, sequential representations that no single direction can capture. But the hypothesis is true enough to make activation steering a practical technique on a wide range of concepts, and the limits of the hypothesis define the limits of steering itself.
By 2026, activation steering is one of the most widely deployed practical techniques from mechanistic interpretability. The combination of low cost, ease of implementation, and reasonable effectiveness has made it a default tool for behavior analysis in alignment research.
Research directions in 2026 include:
The technique is unlikely to displace fine-tuning or RLHF as the primary alignment mechanism, but it has become a respected complement. The 2026 picture is of activation steering as a workhorse technique: imperfect, sometimes brittle, but easy enough and powerful enough to be everywhere.