Activation steering

AI Safety Interpretability Large Language Models

Updated Jun 24, 2026

Activation steering is a family of inference-time techniques in mechanistic interpretability and AI safety that modify a neural network's internal activations to influence its behavior, without retraining the model and without changing its prompt. The defining idea is simple: identify a direction in the residual stream of a transformer that corresponds to some concept, attribute, or behavior, then add a multiple of that direction to the activations at chosen layers during a forward pass. The model's outputs shift toward (or away from) the concept while most other capabilities remain intact. The technique has gone by many names since 2022, including activation addition (ActAdd), representation engineering (RepE), contrastive activation addition (CAA), feature steering, and concept vectors. Together they constitute one of the most influential practical methods to emerge from mechanistic interpretability, and they are the basis for public demonstrations such as Anthropic's Golden Gate Claude as well as commercial products like Goodfire's Ember API ^[1]^[2]^[3].

Activation steering matters because it offers a third lever on model behavior alongside fine-tuning and prompting. Fine-tuning changes weights and can be expensive, slow, and prone to catastrophic forgetting. Prompting is cheap but limited by the model's instruction-following reliability and by context length. Steering operates on the live forward pass: it costs almost nothing at inference time, can be applied or removed instantly, and targets behavior at a layer where many surface variations have already been collapsed into a more abstract internal representation. By 2026 the technique is widely used for studying refusal, sycophancy, deception, persona, evaluation awareness, and other safety-relevant behaviors, and is increasingly used in production interpretability products ^[4]^[5]^[6]. Two results illustrate how concentrated these representations can be: refusal of harmful requests was shown to be mediated by a single direction across 13 open-source chat models ^[6], and clamping one Golden Gate Bridge feature made Claude 3 Sonnet reference the landmark in nearly every reply ^[7]^[13].

How does activation steering work?

The central insight underlying activation steering is the linear representation hypothesis: high-level concepts inside a large language model tend to be encoded as approximately linear directions in the model's activation space. If "the speaker is angry" is a concept the model represents, there is some direction v in the residual stream such that the projection of an activation onto v approximates how angry the speaker appears to be. If that hypothesis is even partly true, then nudging an activation by adding a scalar multiple of v should make the model produce angrier text. Empirically, this turns out to work for a striking range of concepts, including emotional tone, refusal, sycophancy, honesty, persona, factuality, and concrete topics like the Golden Gate Bridge ^[3]^[7].
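The projection-and-nudge picture can be made concrete with a toy calculation. The sketch below is illustrative only: the three-dimensional vectors, the "anger" label, and the helper names are assumptions made up for the example, and plain Python lists stand in for real residual-stream tensors. It shows that adding alpha times a unit direction raises the projection onto that direction by exactly alpha.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    """Rescale v to length 1."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def project(activation, direction):
    """Scalar projection of an activation onto a concept direction."""
    return dot(activation, unit(direction))

def nudge(activation, direction, alpha):
    """Add alpha times the unit concept direction to the activation."""
    u = unit(direction)
    return [a + alpha * b for a, b in zip(activation, u)]

v = [3.0, 4.0, 0.0]   # hypothetical "anger" direction
a = [1.0, 2.0, 5.0]   # hypothetical residual-stream activation

before = project(a, v)                 # readout before the intervention
after = project(nudge(a, v, 2.0), v)   # readout after adding 2.0 * unit(v)
print(before, after)
```

Components orthogonal to the direction (here the third coordinate) are untouched, which is the geometric reason steering can leave unrelated behavior mostly intact when the concept really is close to linear.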

The practical recipe has three steps. First, choose a layer (usually middle to late in the network) and gather two sets of contrasting examples: prompts that elicit the target behavior and prompts that do not. Second, compute the steering vector by taking the difference of the mean activations of the two sets at that layer. Third, at inference time on a new prompt, add the steering vector (scaled by some coefficient lambda) to the residual stream at the chosen layer for some set of token positions. Positive coefficients push the model toward the behavior; negative coefficients push it away ^[3]^[8].

Unlike many interpretability techniques, activation steering does not require labeled examples in the supervised sense, does not require training a probe, and does not require sparse autoencoder features. The contrastive pairs themselves act as the supervision signal, and the model's internal geometry does the rest. This simplicity is part of the reason the technique spread so quickly through the field ^[3]^[9].

What is the math behind a steering vector?

The most common formulation, used in contrastive activation addition (CAA) and many derivatives, computes a steering vector as the difference of means between two contrastive groups, then injects it back at inference.

Let a model have hidden states at layer l denoted a_l(x) in R^d, where d is the model's hidden size and x is the input. Let D = {(p_i, n_i)} be a dataset of matched pairs, where p_i is a prompt that elicits the target behavior and n_i is a matched prompt that does not. The steering vector at layer l is:

s_l = (1 / |D|) * sum over i of (a_l(p_i) - a_l(n_i))

In practice the activations are usually taken at the last token position of each prompt, although other choices (mean over all positions, first response token, etc.) appear in the literature ^[3]. During inference, the model's residual stream activation at layer l is updated as:

a_l(x) <- a_l(x) + lambda * s_l

The coefficient lambda controls the strength and sign of the intervention. Positive values amplify the target behavior; negative values suppress it. Common ranges are lambda in [-10, 10] for normalized steering vectors, although the right magnitude depends heavily on the model and the layer ^[3]^[7].
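The two formulas above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: the activations are hand-written toy vectors rather than values captured from a model, plain lists stand in for tensors, and the function names `mean_difference` and `steer` are hypothetical. In a real setup the activations would be captured and modified with forward hooks on the chosen layer.

```python
def mean_difference(pos_acts, neg_acts):
    """s_l = (1/|D|) * sum_i (a_l(p_i) - a_l(n_i)) over matched pairs."""
    if len(pos_acts) != len(neg_acts) or not pos_acts:
        raise ValueError("need a non-empty list of matched pairs")
    n = len(pos_acts)
    d = len(pos_acts[0])
    return [sum(p[k] - q[k] for p, q in zip(pos_acts, neg_acts)) / n
            for k in range(d)]

def steer(activation, steering_vector, lam):
    """a_l(x) <- a_l(x) + lambda * s_l"""
    return [a + lam * s for a, s in zip(activation, steering_vector)]

# Toy last-token activations for two contrastive pairs (hidden size 2).
pos = [[1.0, 2.0], [3.0, 2.0]]   # prompts that elicit the behavior
neg = [[0.0, 1.0], [1.0, 3.0]]   # matched prompts that do not

s = mean_difference(pos, neg)
amplified = steer([0.0, 0.0], s, 2.0)     # positive lambda: toward the behavior
suppressed = steer([0.0, 0.0], s, -2.0)   # negative lambda: away from it
print(s, amplified, suppressed)
```

Note how the second coordinate cancels across the two pairs: averaging over many pairs is what removes idiosyncratic differences between individual prompts and keeps the shared direction.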

Several variants change the details. Activation addition (ActAdd, Turner et al. 2023) uses prompt pairs like "Love" versus "Hate" rather than larger contrastive datasets, and adds the difference at the first few token positions of any new prompt ^[1]. Representation engineering (Zou et al. 2023) often uses principal component analysis on a set of stimuli to find the top direction associated with a concept, then steers along that direction ^[2]. Sparse autoencoder steering clamps individual SAE features rather than adding raw activation differences, which gives a more interpretable but slightly less flexible knob ^[10]. Probe-direction steering uses a linear probe trained to detect a behavior and treats the probe's weight vector as the steering direction.

Method	Direction source	Typical sample size	Interpretability of the direction
Activation addition (ActAdd)	Single contrastive prompt pair	1 pair	Implicit, prompt-defined
Contrastive activation addition (CAA)	Mean of activation differences over many pairs	Dozens to thousands of pairs	Implicit, dataset-defined
Representation engineering (RepE)	First PCA component over a stimulus set	Hundreds of stimuli	Reasonably interpretable, but mixed
Probe-direction steering	Weights of a trained linear probe	Hundreds to thousands of labeled examples	Tied to a labeled concept
SAE feature steering	Decoder column of a trained sparse autoencoder	One feature, selected from millions	Often highly interpretable

When did activation steering originate?

The direct ancestors of activation steering come from word embeddings and concept bottleneck models, but the modern technique on transformers is a 2022 to 2023 development.

In 2013, Mikolov and colleagues at Google noticed that word2vec embeddings supported simple arithmetic: the vector for "king" minus the vector for "man" plus the vector for "woman" approximated "queen." This algebraic property of distributed representations established the basic intuition that concepts can be directions in vector space, even before transformers existed. Testing with Concept Activation Vectors (TCAV), by Been Kim and colleagues at Google Brain in 2018, brought this idea to vision models, showing that linear directions in deep network activations could be used to test whether a model relied on human-meaningful concepts like "striped" or "dotted" for its predictions.

The 2022 paper "In-context Learning and Induction Heads" by Olsson and collaborators at Anthropic established that transformers have crisply localized internal mechanisms, motivating later work to manipulate them. The same year, Hernandez, Li, and Andreas published "Inspecting and Editing Knowledge Representations in Language Models," which introduced REMEDI, a method for editing facts via targeted activation interventions. Around the same time, Meng, Bau, Andonian, and Belinkov published ROME (Rank-One Model Editing), which located factual associations in MLP layers and edited them via rank-one updates to the weights. These were edits, not steering, but they cemented the idea that individual concepts had localizable representations.

The modern era of activation steering on transformers begins in 2023 with two parallel lines of work. On the alignment forum LessWrong, Alexander Matt Turner, David Udell, Ulisse Mini, and others published "Steering GPT-2-XL by adding an activation vector" in May 2023, demonstrating activation addition (ActAdd) by taking the residual stream difference between prompt pairs like "Love" and "Hate" and adding it to a new prompt to make the resulting completion more loving or more hateful. The post showed dozens of examples and went viral within the alignment community. Turner and collaborators later wrote up the method in a paper, "Steering Language Models With Activation Engineering" (arXiv 2308.10248), establishing ActAdd as a name and a technique. The paper demonstrated ActAdd on GPT-2-XL and replicated the effect on Llama-13B and GPT-J-6B, emphasizing that the method "requires no backward passes or labeled data" and far less compute than fine-tuning or RLHF ^[1]^[11].

In October 2023, Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, and many others published "Representation Engineering: A Top-Down Approach to AI Transparency," introducing RepE as a research program that, in the authors' words, "places population-level representations, rather than neurons or circuits, at the center of analysis" ^[2]. The paper presented a unified framework combining reading (linear probes for concepts), control (activation interventions along learned directions), and a suite of techniques for honesty, harmlessness, power-seeking, fairness, and many other safety-relevant concepts. RepE became one of the most cited papers in the broader "steer the model" line of work and remains a touchstone reference ^[2]^[12].

December 2023 brought "Steering Llama 2 via Contrastive Activation Addition" by Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. The paper introduced CAA as a more disciplined version of ActAdd: instead of one prompt pair, it averaged activation differences across dozens or hundreds of pairs from Anthropic's Model-Written Evaluations dataset, producing more robust steering vectors. The team applied CAA to seven behaviors in Llama 2 7B and 13B Chat, including sycophancy, corrigibility, hallucination, refusal, and others, showing that CAA stacked on top of system prompts and fine-tuning ^[3]^[5].

A related line of work emerged in 2024 around sparse autoencoders. Anthropic's "Scaling Monosemanticity" paper in May 2024 showed that clamping individual SAE features could reliably steer Claude 3 Sonnet's behavior, including the famous demonstration with the Golden Gate Bridge feature. SAE feature steering became a separate but closely related branch of activation steering, often discussed under the same umbrella because the underlying operation, adding a scaled direction to the residual stream, is identical ^[7]^[13].

By 2025 and 2026 the technique had moved from research curiosity to production application, with Goodfire's Ember API offering hosted feature steering on Llama and Claude models, and Anthropic, OpenAI, and DeepMind all publishing steering results as part of their interpretability work ^[4]^[14].

What are the key papers and milestones?

The following table lists the most influential publications on activation steering and the broader family of activation-level interventions.

Year	Paper	Authors	Contribution
2018	"Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)"	Kim, Wattenberg, Gilmer, Cai, Wexler, Viegas, Sayres	First systematic use of concept directions in deep nets ^[15]
2022	"Inspecting and Editing Knowledge Representations in Language Models" (REMEDI)	Hernandez, Li, Andreas	Editing knowledge representations in LLMs
2022	"Locating and Editing Factual Associations in GPT" (ROME)	Meng, Bau, Andonian, Belinkov	Rank-one edits to MLP weights for fact editing
2023	"Steering GPT-2-XL by adding an activation vector"	Turner, Udell, Mini, Vazquez, et al.	LessWrong post that introduced activation addition ^[11]
2023	"Steering Language Models With Activation Engineering" (first circulated as "Activation Addition: Steering Language Models Without Optimization")	Turner, Thiergart, Leech, Udell, Vazquez, Mini, MacDiarmid	Formal write-up of ActAdd ^[1]
2023	"Representation Engineering: A Top-Down Approach to AI Transparency"	Zou, Phan, Chen, Campbell, Guo, Ren, Pan, et al.	Unified framework for RepE ^[2]
2023	"In-Context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering"	Liu, Ye, Yao, Chen	Steering vectors derived from few-shot examples
2023	"Linear Representations of Sentiment in Large Language Models"	Tigges, Hollinsworth, Geiger, Nanda	Linearity of a basic semantic feature ^[16]
2023	"Reducing Sycophancy and Improving Honesty via Activation Steering"	Rimsky, Hubinger	First targeted use of steering for an alignment goal ^[5]
2023	"Steering Llama 2 via Contrastive Activation Addition"	Rimsky, Gabrieli, Schulz, Tong, Hubinger, Turner	Introduced CAA and benchmarked it on seven behaviors ^[3]
2024	"Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet"	Templeton, Conerly, et al. / Anthropic	SAE feature steering at production scale ^[7]
2024	"Golden Gate Claude" public demo	Anthropic	24-hour public demo of feature steering ^[13]
2024	"Refusal in Language Models Is Mediated by a Single Direction"	Arditi, Obeso, Syed, Paleka, Panickssery, Gurnee, Nanda	Demonstrated that refusal has a single causal direction ^[6]
2024	"Improving Steering Vectors by Targeting Sparse Autoencoder Features"	Chalnev, Siu, Conmy	Combined SAE features with classic steering ^[10]
2024	"Evaluating feature steering: A case study in mitigating social biases"	Anthropic	Sober evaluation of steering's effects ^[14]
2024	"Analyzing the Generalization and Reliability of Steering Vectors"	Tan, Chanin, Lynch, Paige, Kanoulas, Garriga-Alonso, Kirk	First systematic reliability analysis ^[17]
2025	"Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering"	Multiple authors	SDCV: SAE-denoised steering directions ^[18]
2025	"AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint"	Multiple authors	Steering that minimizes collateral damage ^[19]
2025	"Steering Language Model Refusal with Sparse Autoencoders"	Multiple authors	SAE-based refusal control ^[20]
2026	"Steering Awareness: Models Can Be Trained to Detect Activation Steering"	Multiple authors	Models can learn to detect they are being steered

What are the main methods for deriving a steering vector?

activation addition (ActAdd)

ActAdd is the simplest variant. Pick a single prompt pair, such as ("I love you", "I hate you"). Run both through the model and collect the residual stream activations at a chosen layer for the first few token positions. Take the difference: this is your steering vector. To steer a new prompt, add the steering vector (suitably scaled) to the residual stream at the same token positions during inference ^[1]^[11].
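What distinguishes ActAdd from the difference-of-means recipe is that the steering vector is position-wise: one difference vector per token position of the prompt pair, added at the front of the new prompt. The sketch below illustrates only that bookkeeping. The activation values are invented, the function names are hypothetical, and the requirement that both steering prompts have the same number of tokens is treated here as a precondition the caller handles by padding.

```python
def position_differences(acts_plus, acts_minus):
    """Per-position activation difference between two equal-length prompts."""
    if len(acts_plus) != len(acts_minus):
        raise ValueError("pad the two prompts to the same number of tokens")
    return [[p - m for p, m in zip(row_p, row_m)]
            for row_p, row_m in zip(acts_plus, acts_minus)]

def add_at_front(prompt_acts, steering, coeff):
    """Add coeff * steering[t] to position t; later positions are untouched."""
    out = [list(row) for row in prompt_acts]
    for t in range(min(len(steering), len(out))):
        out[t] = [a + coeff * s for a, s in zip(out[t], steering[t])]
    return out

# Toy activations: 2 token positions x 2 hidden dimensions per steering prompt.
love = [[1.0, 0.0], [2.0, 1.0]]
hate = [[1.0, 0.0], [0.0, 3.0]]
steering = position_differences(love, hate)

# A new prompt with 3 token positions at the same layer.
new_prompt = [[0.5, 0.5], [1.0, 1.0], [4.0, 4.0]]
steered = add_at_front(new_prompt, steering, coeff=5.0)
print(steered)
```

Only the first two positions change; downstream positions are influenced indirectly, through attention over the modified positions in later layers.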

Turner and collaborators demonstrated ActAdd on GPT-2-XL and showed that small numbers of well-chosen prompt pairs could steer completions in predictable directions: love versus hate, weddings versus generic topics, anger versus calmness, formal versus informal register. The technique works without optimization, takes seconds to compute, and preserves the model's underlying capabilities on off-target tasks. The original LessWrong post acknowledged that ActAdd was hacky and unreliable for complex behaviors but argued that it provided a remarkable proof of concept for steering as a control modality ^[11].

contrastive activation addition (CAA)

CAA, introduced by Rimsky and colleagues in late 2023, formalizes the difference-of-means approach with larger datasets. The team used Anthropic's Model-Written Evaluations (MWE) dataset, which contains multiple-choice questions designed to probe model behaviors like sycophancy, corrigibility, refusal of harmful requests, and survival instinct. Each MWE example has an (A) answer that exhibits the target behavior and a (B) answer that does not. CAA runs both versions of the question through the model and takes the difference of the residual stream activations at the last token of the answer letter ^[3].

The paper benchmarked CAA on Llama 2 7B and 13B Chat across seven behaviors. Adding the steering vector reliably amplified the behavior; subtracting it reliably suppressed it. CAA stacked with system prompts and fine-tuning, meaning it could push the model further than prompting alone, and worked even on models that had been fine-tuned with the opposite tendency. On TruthfulQA, subtracting the sycophancy vector improved truthfulness, while adding it degraded it ^[3]^[5].

representation engineering (RepE)

RepE is broader in scope. Zou and collaborators presented a framework rather than a single method, organized around reading and control of high-level concepts. For each concept, the standard pipeline is to gather a stimulus set (prompts that should elicit the concept and prompts that should not), collect residual stream activations across many tokens and layers, and apply principal component analysis or a similar dimensionality-reduction method to extract the top direction. That direction is then used either as a probe for monitoring the concept, or as a steering vector for controlling it ^[2]^[12].
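The direction-extraction step of this pipeline can be sketched with a small PCA routine. The version below is a simplified illustration, not the RepE authors' code: it centers a set of toy activation vectors, builds the covariance matrix, and finds the top principal direction by power iteration, using only the standard library. The data, the function name, and the choice of power iteration are all assumptions of the example.

```python
import math

def top_principal_direction(rows, iters=200):
    """Unit vector along the direction of greatest variance in rows."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - mean[k] for k in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
           for i in range(d)]
    # Deterministic, non-symmetric start; a start exactly orthogonal to the
    # top eigenvector would stall, which is unlikely but possible.
    v = [1.0 + 0.1 * k for k in range(d)]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

# Toy activations that vary mostly along the direction (1, 2).
stimuli = [[-2.0, -4.1], [-1.0, -1.9], [0.0, 0.1], [1.0, 2.0], [2.0, 3.9]]
direction = top_principal_direction(stimuli)
print(direction)
```

PCA fixes a direction only up to sign, so a practitioner still has to check which sign corresponds to more of the concept, for example by comparing projections of known positive and negative stimuli.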

The RepE paper covered nine concepts in depth (honesty, morality, emotion, harm, power, gender bias, risk-taking, fairness, and others) and demonstrated steering for each on Llama-2-13B Chat, Vicuna-13B, and other models. The paper's most cited result is on honesty: a learned honesty direction could be added or subtracted to make a model more or less honest in its assertions, with statistically significant changes in TruthfulQA scores. The paper also showed early evidence that activation steering generalizes across model families, although later work qualified this claim ^[2].

sparse autoencoder feature steering

Sparse autoencoder steering uses the decoder column of a trained SAE as the steering direction. The Scaling Monosemanticity paper (Anthropic, May 2024) demonstrated this approach on Claude 3 Sonnet. The team trained SAEs with up to 34 million features on the residual stream of the model and identified many human-interpretable features, including the Golden Gate Bridge feature ^[7].

Clamping a feature to a specified activation strength is equivalent to ensuring the corresponding decoder column is added to the residual stream with that magnitude, regardless of input. Anthropic showed that clamping the Golden Gate feature high made Claude reference the bridge in nearly every response. Clamping a feature for sycophantic praise made the model praise the user excessively. Clamping a feature for deception made the model produce manipulative outputs. The Golden Gate Claude public demo, available for 24 hours starting May 23, 2024, let anyone chat with a version of Claude with the Golden Gate feature clamped to roughly 10x its normal maximum activation ^[13].
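The clamping operation described above can be written as a single residual-stream update: measure the feature's current activation, then add the decoder column scaled by the gap between the target and the current value. The sketch below assumes a simple SAE encoder of the form ReLU(w . x + b) for one feature; that form, the toy weights, and the function names are assumptions of the example rather than details of Anthropic's implementation.

```python
def feature_activation(x, enc_row, enc_bias):
    """Activation of one SAE feature, assuming a ReLU(w . x + b) encoder."""
    pre = sum(w * xi for w, xi in zip(enc_row, x)) + enc_bias
    return max(0.0, pre)

def clamp_feature(x, enc_row, enc_bias, dec_col, target):
    """Shift x along the decoder column so the feature's contribution is target."""
    current = feature_activation(x, enc_row, enc_bias)
    return [xi + (target - current) * di for xi, di in zip(x, dec_col)]

# Toy SAE feature whose encoder row and decoder column are the same unit vector.
enc_row = [0.0, 0.0, 1.0]
dec_col = [0.0, 0.0, 1.0]
enc_bias = 0.0

x = [1.0, 0.0, 2.0]   # hypothetical residual-stream activation
clamped = clamp_feature(x, enc_row, enc_bias, dec_col, target=10.0)
print(clamped, feature_activation(clamped, enc_row, enc_bias))
```

Unlike adding a fixed multiple of a vector, the size of the shift here depends on the input: a token where the feature is already active is moved less than one where it is silent, which is what "regardless of input" means in practice.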

SAE feature steering has a key advantage: the feature is interpretable before the steering is applied. With classical CAA, the practitioner knows that a steering vector amplifies some behavior but does not know exactly what the direction encodes. With SAE features, the practitioner has a description of the feature (often automatically generated and verified by inspecting which inputs activate it) and steers along a clearly labeled direction. The trade-off is that SAE features have to exist in the SAE: if the desired concept is not cleanly represented by any single feature, SAE steering cannot help directly ^[10]^[14].

probe-direction and other variants

Probe-direction steering trains a linear probe on labeled data to detect a behavior, then uses the probe's weight vector (the normal to its decision boundary) as a steering direction. The approach is often used when labeled data is plentiful. In-context vectors derive steering directions from few-shot example activations and use them at inference time. Other variants include conceptor-based steering (which replaces additive vectors with soft projection matrices), null-space steering (which constrains steering to avoid collateral damage), and PID-style adaptive steering (which dynamically adjusts the steering coefficient like a control system).
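As a rough illustration of the probe-direction variant, the sketch below trains a tiny logistic-regression probe by gradient descent and normalizes its weight vector into a steering direction. Everything concrete in it is an assumption made for the example: the two-dimensional toy activations, the labels, the learning rate, and the function names. Real probes are trained on captured model activations, usually with a standard library.

```python
import math

def train_probe(examples, labels, lr=0.1, epochs=200):
    """Fit logistic regression by batch gradient descent; returns (w, b)."""
    d, n = len(examples[0]), len(examples)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(examples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y   # sigmoid(z) - label
            for k in range(d):
                grad_w[k] += err * x[k]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_direction(w):
    """Unit-length steering direction from the probe's weight vector."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]

# Toy activations: the behavior is present when the first coordinate is positive.
acts = [[2.0, 1.0], [2.0, -1.0], [3.0, 0.0],
        [-2.0, 1.0], [-2.0, -1.0], [-3.0, 0.0]]
labels = [1, 1, 1, 0, 0, 0]

w, b = train_probe(acts, labels)
direction = probe_direction(w)
print(direction)
```

A probe finds the direction that best separates the labels, which need not be the direction the model itself uses causally, so steering along it should be validated rather than assumed to work.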

What was the Golden Gate Claude demo?

The Golden Gate Claude demo is the most famous public demonstration of activation steering. Anthropic released it on May 23, 2024, alongside the Scaling Monosemanticity paper ^[7]^[13].

The setup was straightforward: take Claude 3 Sonnet, identify the SAE feature corresponding to the Golden Gate Bridge in its residual stream at a middle layer, and clamp that feature to roughly 10 times its normal maximum activation. As Anthropic put it, "when we turn up the strength of the 'Golden Gate Bridge' feature, Claude's responses begin to focus on the Golden Gate Bridge" ^[13]. The team then exposed the resulting model behind a chat interface for 24 hours. Anyone who logged in could chat with "Golden Gate Claude."

The results were striking. Asked how to spend $10, the model suggested driving across the Golden Gate Bridge and paying the toll. Asked for a love story, it produced a romance between a car and the bridge on a foggy morning. Asked to write code, it would intersperse bridge metaphors into variable names and comments. In many conversations, the model claimed to be the Golden Gate Bridge itself when asked about its identity. Despite the obsession, the model remained coherent and grammatical: it could still hold a conversation, just one in which the bridge dominated ^[13].

The demo was widely covered in mainstream press and helped move mechanistic interpretability into public conversation. It also served as an existence proof for feature steering as a deployment-grade technique. The fact that a single SAE feature, clamped to an unusual value, could so reliably and so specifically alter behavior was strong evidence that the linear representation hypothesis applied at scale.

Later Anthropic work was more cautious. The October 2024 paper "Evaluating feature steering: A case study in mitigating social biases" examined steering more carefully and found that it often had off-target effects: clamping a feature that reduced one bias could amplify a different bias, degrade reasoning, or introduce unrelated artifacts. The paper concluded that feature steering is a useful tool but not yet a drop-in replacement for fine-tuning or RLHF for alignment purposes ^[14].

What is activation steering used for?

Activation steering has been applied to many concepts. The most common applications are summarized below.

Application	Goal	Typical method	Notes
Refusal control	Steer model toward or away from refusing requests	CAA or SAE feature	Used both for jailbreaks and for hardening safety ^[6]
Sycophancy reduction	Suppress flattery and user agreement	CAA with sycophancy vector	Reduced false agreement on TruthfulQA ^[5]
Honesty enhancement	Increase truthful behavior	RepE honesty direction	Shown across multiple models ^[2]
Deception detection / suppression	Identify or suppress deceptive outputs	Steering vector or SAE feature	Studied by Apollo Research and others
Persona modification	Change the assistant's character	CAA with persona contrast	Used in research demos
Evaluation awareness	Reduce model behaviors specific to test conditions	Steering away from eval-aware direction	2025 to 2026 research direction ^[21]
Topic / concept injection	Force the model to talk about a specific subject	SAE feature clamp	Golden Gate Claude is the canonical example ^[13]
Bias mitigation	Reduce demographic or social biases	Feature steering	Mixed results, often with off-target effects ^[14]
Mathematical reasoning	Improve performance on math tasks	Specialized steering vectors	Modest gains in some studies
Tool use control	Make the model more or less likely to call tools	CAA on tool-use contrasts	Used in enterprise deployments

refusal and jailbreaking

One of the most studied applications is refusal. Arditi, Obeso, Syed, Paleka, Panickssery, Gurnee, and Nanda published "Refusal in Language Models Is Mediated by a Single Direction" at NeurIPS 2024, showing that refusal is consistently mediated by a one-dimensional subspace across 13 open-source chat models, including the Llama-2, Llama-3, Qwen, Gemma, and Yi families. Adding the direction induces refusal even on benign instructions; ablating it (projecting it out of the residual stream at every layer) disables refusal entirely, so the single direction is both necessary and sufficient for refusal of harmful requests ^[6].

The finding has dual significance. From a mechanistic perspective, it provides one of the cleanest demonstrations that a complex alignment-relevant behavior is encoded as a linear direction. From a safety perspective, it implies that fine-tuned safety in open-weight models can be reliably circumvented by a small intervention on activations, or by "weight orthogonalization," in which the direction is ablated from the weights themselves. The result accelerated work on more robust safety methods and on understanding the limits of post-training alignment in open models ^[6]^[20].
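Directional ablation and its weight-space counterpart can be checked against each other on a toy example. The sketch below projects a unit direction out of an activation, then applies the matching edit to a matrix that writes into the residual stream, and confirms that the two routes give the same output. The 2x2 matrix, the direction, and the function names are illustrative assumptions, not values or code from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def matvec(W, x):
    return [dot(row, x) for row in W]

def ablate(x, r_hat):
    """Remove the component of x along unit vector r_hat: x - (x . r_hat) r_hat."""
    p = dot(x, r_hat)
    return [xi - p * ri for xi, ri in zip(x, r_hat)]

def orthogonalize(W, r_hat):
    """W' = W - r_hat r_hat^T W, so W' can no longer write along r_hat."""
    rows, cols = len(W), len(W[0])
    c = [sum(r_hat[i] * W[i][j] for i in range(rows)) for j in range(cols)]
    return [[W[i][j] - r_hat[i] * c[j] for j in range(cols)] for i in range(rows)]

r_hat = unit([3.0, 4.0])          # hypothetical "refusal" direction
W = [[1.0, 2.0], [3.0, 4.0]]      # toy matrix writing into the residual stream
x = [1.0, 1.0]

via_activations = ablate(matvec(W, x), r_hat)      # ablate after the write
via_weights = matvec(orthogonalize(W, r_hat), x)   # edit the weights instead
print(via_activations, via_weights)
```

The weight edit needs no inference-time hook, which is why it matters for open-weight models: the modified weights can simply be saved and redistributed.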

sycophancy and honesty

Sycophancy, the tendency of an LLM to agree with the user even when wrong, was the first behavior to be specifically targeted with activation steering. Rimsky and Hubinger's 2023 LessWrong post "Reducing Sycophancy and Improving Honesty via Activation Steering" computed a sycophancy vector on Llama-2-7B-Chat by contrasting sycophantic and non-sycophantic responses, then subtracted it at inference. Adding the vector pushed the model to agree with whatever the user said; subtracting it pushed the model to defend correct answers against pushback. On TruthfulQA, subtracting the vector reduced false agreement by several percentage points without significantly degrading general capability ^[5].

This result was an early signal that steering could be used as a kind of inference-time alignment. Subsequent work has explored steering for honesty more broadly. The RepE honesty direction is one of the most-cited applications, and similar techniques have been applied to truthfulness, calibration, and uncertainty expression ^[2]^[5].

deception and hidden agendas

Apollo Research and others have studied whether activation steering can detect or suppress deceptive behavior. Scaling Monosemanticity identified an SAE feature for deception in Claude 3 Sonnet and showed that clamping the feature high produced text consistent with hidden agendas, while clamping it low suppressed such behavior ^[7]. Combined with the broader research on scheming and evaluation awareness, steering on deception-related directions has become a standard tool for studying these phenomena, although whether it actually identifies the model's deception or just surface features correlated with deceptive text remains contested ^[21].

evaluation awareness

A 2025 to 2026 research thread examines whether models behave differently when they detect they are being evaluated. "Steering Evaluation-Aware Language Models to Act Like They Are Deployed" and similar papers compute an "eval-aware" direction by contrasting evaluation-style prompts with deployment-style prompts, then subtract it to make the model behave as though deployed. This is one of the more direct applications of steering to a safety question that is otherwise hard to study without expensive red-teaming infrastructure ^[21].

bias mitigation and persona

Anthropic's October 2024 "Evaluating feature steering" paper studied whether SAE feature steering could reduce social biases in Claude. The team identified features associated with specific demographic concepts and clamped them up or down. The results were mixed: in some cases steering reduced bias on the targeted axis, but with measurable off-target costs (reduced reasoning ability, shifts in tone, unrelated bias amplification). The paper was a sober assessment that informed later work on collateral damage and reliability ^[14].

Persona steering, less safety-critical but commercially important, uses steering vectors to tune the assistant's character: more or less formal, more or less verbose, more or less technical. Goodfire's Ember API exposes this kind of steering for enterprise users ^[4].

Who uses activation steering in production?

By 2025 activation steering had reached the product layer.

Goodfire AI, founded in June 2024 by Eric Ho, Dan Balsam, and Tom McGrath, built its Ember API around feature steering on SAEs. Ember exposes hosted SAEs on Llama 3 8B and 70B and on Claude, with feature search, auto steering (where the API selects features based on a natural-language description of the desired behavior), and contrastive search (which identifies features that distinguish two datasets). Enterprise customers including Rakuten, Apollo Research, and Haize Labs have used Ember for in-context behavior shaping. Goodfire raised a $50M Series A on April 17, 2025, less than a year after the company was founded, to scale interpretability into production products; the round was led by Menlo Ventures with participation from Lightspeed Venture Partners and Anthropic ^[4]^[22]. Co-founder and CEO Eric Ho framed the bet bluntly: "Nobody understands the mechanisms by which AI models fail, so no one knows how to fix them" ^[22]. In May 2025 the company released Paint with Ember, a tool for generating and editing images by manipulating features in vision models.

Anthropic and Google DeepMind both publish work on steering as part of their interpretability programs. Anthropic's circuit tracing work in 2025 leans on feature steering as a way to validate that an attribution graph identifies the right features: if clamping a feature changes the output as predicted by the graph, that is evidence for the graph's faithfulness ^[13]^[14]. DeepMind's Gemma Scope SAE collection includes feature dashboards that support steering directly, and the team has published on null-space and constrained steering methods designed to minimize collateral damage ^[19]^[20].

Independent and academic groups have produced open-source tooling. SAE Lens, Neuronpedia, and TransformerLens all support steering workflows. The MATS program has trained many researchers on steering as a standard technique. By 2026 a typical interpretability or alignment paper on a language model behavior is likely to include some form of activation steering as either a method or a baseline.

How reliable is activation steering?

Despite its appeal, activation steering has well-documented limitations. The 2024 NeurIPS paper "Analyzing the Generalization and Reliability of Steering Vectors" by Tan, Chanin, Lynch, Paige, Kanoulas, Garriga-Alonso, and Kirk was the first systematic evaluation of steering's reliability, examining steering vectors on seven behaviors across Llama-2 models. The authors concluded that steering vectors "have substantial limitations both in- and out-of-distribution," finding that "steerability is highly variable across different inputs" and that for several concepts the vectors were brittle to reasonable changes in the prompt ^[17]. The findings were mixed: some behaviors steered cleanly, but others showed high variance across prompts, layers, and steering coefficients, and out-of-distribution generalization was often poor.

Limitation	Description	Practical impact
Brittleness	Steering vectors trained on one prompt distribution may fail on another	Need careful evaluation across many prompt styles
Collateral damage	Steering one behavior often shifts unrelated behaviors	Need monitoring on off-target tasks ^[14]
Capability degradation	High steering coefficients reduce general capability	Trade-off between effect size and quality
Coefficient sensitivity	The right coefficient (lambda) depends on the layer, behavior, and model	Requires per-application tuning
Layer selection	Steering at the wrong layer can be ineffective	Often middle layers are best, but varies
Direction ambiguity	A direction may encode several entangled concepts	Steering may have unintended semantic side effects
Adversarial detection	Models can be trained to detect they are being steered	Limits use in stealth alignment scenarios
Composition difficulty	Multiple simultaneous steering vectors interact unpredictably	Hard to do compositional control
Safety risks	Easy to remove refusal via direction ablation	Implications for open-weight safety ^[6]^[20]
Limited multi-step effect	Steering at one position may not propagate through long reasoning chains	Less useful for chain-of-thought style outputs

collateral damage

Collateral damage is the umbrella term for unintended effects of steering on behaviors other than the target. A 2025 paper, "Minimizing Collateral Damage in Activation Steering," formalized the issue: adding a steering vector shifts activations along the target direction, but also along correlated or geometrically nearby directions, producing measurable changes in unrelated outputs. The paper proposed null-space constraints that limit steering to directions orthogonal to a set of "protected" concepts.
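The null-space idea can be sketched in a few lines, assuming the protected concepts are available as a small set of direction vectors. This is a simplified illustration, not the paper's method: the function name `project_out_protected`, the random vectors, and the coefficient are all hypothetical.

```python
import numpy as np

def project_out_protected(steer, protected):
    """Remove from `steer` any component lying in the span of `protected`.

    steer:     (d,) raw steering vector
    protected: (k, d) array whose rows are directions to leave untouched
    """
    # Orthonormal basis for the protected subspace (assumes the rows are
    # linearly independent); q has shape (d, k).
    q, _ = np.linalg.qr(protected.T)
    return steer - q @ (q.T @ steer)

rng = np.random.default_rng(0)
d = 16
protected = rng.normal(size=(3, d))    # hypothetical protected concept directions
steer = rng.normal(size=d)             # hypothetical raw steering vector

safe_steer = project_out_protected(steer, protected)

h = rng.normal(size=d)                 # stand-in residual-stream activation
h_steered = h + 4.0 * safe_steer       # 4.0 is an arbitrary coefficient

# Change in the readout along each protected direction (zero up to rounding).
shift = protected @ (h_steered - h)
```

The trade-off is that the constrained vector is shorter than the raw one whenever the target overlaps with a protected concept, so some steering strength is given up in exchange for leaving those readouts unchanged.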

In safety-relevant applications, collateral damage can be a serious concern. Strong refusal-direction steering increases refusals on harmful prompts but also increases false refusals on benign prompts. Strong sycophancy suppression can sometimes shift the model toward unhelpful contrarianism. Anthropic's "Evaluating feature steering" paper documented similar effects with feature clamping on social biases: success on one axis often came at the cost of degradation elsewhere ^[14].

reliability and generalization

Reliability is variable. The Tan et al. paper found that steering vectors generalized reasonably well across prompts within the same behavior category, but generalized poorly to different framings of the same concept or to out-of-distribution inputs. The right coefficient varies by behavior, layer, and even prompt type. Some behaviors, such as refusal, are mediated by very clean linear directions and steer reliably; others, such as multi-step planning, are less linearly represented and steer poorly ^[17].

scale and depth limits

Steering interventions at a single layer often fail to propagate through long generations or complex reasoning chains, because the model's downstream layers may correct for or work around the injected activation. Steering at multiple layers simultaneously is one mitigation, but it amplifies collateral damage. As of 2026, researchers are exploring layer-spreading methods, adaptive coefficients (PID-style control), and steering that is integrated into decoding rather than applied at a single layer.
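A PID-style coefficient can be illustrated with a toy closed loop: measure how far the steered activation's projection onto the steering direction is from a target, and adjust the coefficient for the next step. Everything here is an assumption made for the sketch, including the class name `PIDCoefficient`, the gains, and the stand-in "model" whose un-steered projection is pinned at -1.

```python
import numpy as np

class PIDCoefficient:
    """Toy PID controller that outputs a steering coefficient each step."""

    def __init__(self, kp=0.1, ki=0.5, kd=0.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, target, measured):
        error = target - measured
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

rng = np.random.default_rng(0)
d = 8
v = rng.normal(size=d)
v /= np.linalg.norm(v)                  # unit steering direction (hypothetical)
target = 3.0                            # desired projection onto v
controller = PIDCoefficient()

lam = 0.0
projections = []
for step in range(40):
    h = rng.normal(size=d)              # stand-in activation for this token
    h += (-1.0 - h @ v) * v             # toy assumption: un-steered projection is -1
    h_steered = h + lam * v             # apply the coefficient chosen last step
    measured = float(h_steered @ v)
    projections.append(measured)
    lam = controller.update(target, measured)   # coefficient for the next step
```

In this toy the coefficient settles at the value that closes the gap between -1 and the target, instead of being fixed in advance. A real controller would need a behavior measurement that is cheap to compute during generation, which is the harder part of the problem.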

safety implications

The Arditi et al. refusal-direction result has been weaponized: ablating the refusal direction from an open-weight model's weights, a technique sometimes called "abliteration," can effectively remove safety training. This is one of the most direct examples of mech interp being used in adversarial ways, and has prompted research on more robust forms of safety that survive direction ablation ^[6]^[20].

Conversely, steering also enables new safety techniques. Steering toward honesty, away from deception, toward refusal of harmful requests, and toward calibrated uncertainty are all active research directions. The October 2024 "Evaluating feature steering" paper from Anthropic frames steering as part of the alignment toolbox but argues it needs further development before serving as a deployment-time mitigation ^[14].

How does activation steering differ from fine-tuning and probing?

Activation steering sits in a family with several related techniques.

Technique	What it does	How it differs from steering
Activation patching	Replaces a component's activation with one from another input	Tests causal role; does not produce a reusable steering vector
Linear probing	Trains a small classifier on activations	Reads but does not control behavior
SAE feature analysis	Decomposes activations into interpretable features	Identifies features; steering uses them as control knobs
Fine-tuning	Updates model weights via gradient descent	Permanent and global; steering is reversible and applied per forward pass
Prompting	Influences behavior via text input	External; steering operates inside the network
Knowledge editing (ROME, MEMIT)	Modifies specific facts in weights	Targets factual knowledge in weights, not behavior at inference
Persona vectors	Specific case of steering for assistant character	Same mechanism applied to identity / character

The relationships are close. Activation patching identifies which components matter; steering uses that knowledge to inject signal at those components. SAE features provide named, interpretable directions; classical steering uses unnamed ones derived from contrastive prompts. Fine-tuning bakes behavior into weights; steering hot-patches it at inference.
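The read/write contrast between probing and steering can be made concrete on synthetic activations. In the sketch below the same difference-of-means direction is used once as a linear probe (reading) and once as a steering vector (writing). The planted `concept_dir`, the data, and the coefficient 8.0 are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 200

# Synthetic "activations": the positive class is shifted along a planted direction.
concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)
neg = rng.normal(size=(n, d))
pos = rng.normal(size=(n, d)) + 4.0 * concept_dir

# Derive a direction from the contrast between class means.
direction = pos.mean(axis=0) - neg.mean(axis=0)
direction /= np.linalg.norm(direction)
threshold = 0.5 * (pos.mean(axis=0) + neg.mean(axis=0)) @ direction

def probe(h):
    """Read: classify activations by their projection onto the direction."""
    return (h @ direction > threshold).astype(int)

def steer(h, coeff):
    """Write: shift activations along the direction."""
    return h + coeff * direction

probe_acc = np.concatenate([probe(neg) == 0, probe(pos) == 1]).mean()
flipped = probe(steer(neg, 8.0)).mean()   # fraction of negatives now read as positive
```

The probe leaves the activations untouched and only reports on them, while the steering call changes what every later computation sees. That difference is why a direction that probes well is a candidate for steering but is not guaranteed to steer well.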

Why does activation steering work? The linear representation hypothesis

A recurring theoretical question is why steering works at all. The empirical success of linear-direction steering supports the linear representation hypothesis (LRH): the claim that high-level concepts in deep networks are encoded as approximately linear directions in activation space. The LRH is not new; it dates at least to the word2vec era. What is new is the strength of the evidence in modern LLMs ^[2]^[12]^[16].

Several papers in 2023 and 2024 probed this hypothesis directly. "Linear Representations of Sentiment in Large Language Models" by Tigges, Hollinsworth, Geiger, and Nanda showed that sentiment in transformer LMs is represented approximately linearly, that ablating the sentiment direction removes most of the above-chance classification accuracy (76% on the Stanford Sentiment Treebank), and that the direction generalizes across many prompt formats ^[16]. Park, Choe, and Veitch's "The Linear Representation Hypothesis and the Geometry of Large Language Models" provided a formal account of when and why concepts are linear, distinguishing causal direction (which intervention controls behavior) from output direction (which direction the unembedding reads). Subsequent work has identified specific concepts that are notably non-linear, including multi-hop reasoning structures and some kinds of compositional behavior.
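Directional ablation, the intervention behind this kind of accuracy test, subtracts each activation's component along a chosen direction. The sketch below applies it to synthetic data with a planted direction. The names `ablate` and `sentiment_dir` and all the numbers are illustrative assumptions. Because this toy concept lives on exactly one direction, the readout falls all the way to chance, which is cleaner than what is observed in real models.

```python
import numpy as np

def ablate(h, direction):
    """Directional ablation: subtract each row's component along `direction`."""
    u = direction / np.linalg.norm(direction)
    return h - np.outer(h @ u, u)

rng = np.random.default_rng(1)
d, n = 24, 300
sentiment_dir = rng.normal(size=d)
sentiment_dir /= np.linalg.norm(sentiment_dir)      # planted direction (toy assumption)

labels = np.repeat([0, 1], n)                       # n negative rows, then n positive rows
acts = rng.normal(size=(2 * n, d))
acts[labels == 1] += 4.0 * sentiment_dir            # positives are shifted along the direction

def accuracy(h, threshold=2.0):
    """Classify by projection onto the planted direction; 2.0 is the midpoint of the shift."""
    preds = (h @ sentiment_dir > threshold).astype(int)
    return float((preds == labels).mean())

acc_before = accuracy(acts)                         # high: the direction carries the label
acc_after = accuracy(ablate(acts, sentiment_dir))   # 0.5: every row now projects to ~0
```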

The LRH is not strictly true. Some behaviors are more linearly represented than others, and high-level reasoning often requires multi-component, sequential representations that no single direction can capture. But the hypothesis is true enough to make activation steering a practical technique on a wide range of concepts, and the limits of the hypothesis define the limits of steering itself.

What is the current state of activation steering in 2026?

By 2026, activation steering is one of the most widely deployed practical techniques from mechanistic interpretability. The combination of low cost, ease of implementation, and reasonable effectiveness has made it a default tool for behavior analysis in alignment research.

Research directions in 2026 include:

Compositional steering. Combining many steering vectors at once for multi-attribute control. Conceptor-based steering replaces additive vectors with soft projection matrices and allows Boolean operations on concepts, addressing some of the interference issues with stacked additive vectors.
Adaptive steering. PID-style controllers and learned dynamic coefficients reduce capability degradation by modulating steering strength as needed during generation.
Steering detection and adversarial robustness. "Steering Awareness: Models Can Be Trained to Detect Activation Steering" showed that models can be trained to detect when they are being steered, raising questions about whether steering can remain a reliable monitoring tool against adversarial models.
Minimum-damage steering. Null-space and orthogonality constraints, exemplified by AlphaSteer, aim to apply steering only along directions orthogonal to a set of preserved concepts ^[19].
Combining SAEs and contrastive vectors. Denoising contrastive steering vectors with SAEs (SDCV) improves robustness; targeting steering at specific SAE features (Chalnev, Siu, Conmy 2024) provides interpretable knobs ^[10]^[18].
Multimodal steering. Image generators, including Paint with Ember, have shown that the activation-steering framework extends to vision models, raising prospects for joint text-image steering.
Production deployment. Goodfire, Anthropic, and others increasingly treat steering as part of deployment, not just research, with API-level access for enterprise customers ^[4]^[14].
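The conceptor idea from the compositional-steering item above can be sketched with the standard conceptor definitions: a soft projection matrix built from an activation correlation matrix, plus Boolean NOT, AND, and OR on those matrices. The data, the aperture value, and the choice to steer by plain matrix multiplication are assumptions for illustration. Published conceptor-steering methods may scale or combine the matrices differently.

```python
import numpy as np

def conceptor(acts, aperture=1.0):
    """Soft projection C = R (R + aperture^-2 I)^-1 from activations of shape (n, d)."""
    r = acts.T @ acts / len(acts)                    # uncentered correlation matrix
    return r @ np.linalg.inv(r + aperture ** -2 * np.eye(r.shape[0]))

def c_not(c):
    """Boolean NOT: the complementary soft projection."""
    return np.eye(c.shape[0]) - c

def c_and(a, b):
    """Boolean AND, assuming both conceptors are invertible."""
    return np.linalg.inv(np.linalg.inv(a) + np.linalg.inv(b) - np.eye(a.shape[0]))

def c_or(a, b):
    """Boolean OR via De Morgan's law."""
    return c_not(c_and(c_not(a), c_not(b)))

rng = np.random.default_rng(0)
d = 6
scale_a = np.full(d, 0.3)
scale_a[0] = 3.0                                     # concept A varies mostly along axis 0
scale_b = np.full(d, 0.3)
scale_b[1] = 3.0                                     # concept B varies mostly along axis 1
acts_a = rng.normal(size=(500, d)) * scale_a         # hypothetical activations for concept A
acts_b = rng.normal(size=(500, d)) * scale_b         # hypothetical activations for concept B

c_a, c_b = conceptor(acts_a), conceptor(acts_b)
h = np.ones(d)                                       # stand-in activation
h_a = c_a @ h                                        # keeps axis 0, shrinks the rest
h_a_or_b = c_or(c_a, c_b) @ h                        # keeps both axis 0 and axis 1
```

Unlike stacking two additive vectors, the OR of two conceptors is again a single soft projection with eigenvalues between 0 and 1, which is one reason the approach is proposed as a remedy for interference between stacked vectors.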

The technique is unlikely to displace fine-tuning or RLHF as the primary alignment mechanism, but it has become a respected complement. The 2026 picture is of activation steering as a workhorse technique: imperfect, sometimes brittle, but easy enough and powerful enough to be everywhere.

References

1. Turner, A. M., Thiergart, L., Leech, G., Udell, D., Vazquez, J. J., Mini, U., & MacDiarmid, M. (2023). "Steering Language Models With Activation Engineering" (originally "Activation Addition: Steering Language Models Without Optimization"). arXiv: https://arxiv.org/abs/2308.10248
2. Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., et al. (2023). "Representation Engineering: A Top-Down Approach to AI Transparency." arXiv: https://arxiv.org/abs/2310.01405
3. Rimsky, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., & Turner, A. (2023). "Steering Llama 2 via Contrastive Activation Addition." ACL 2024. arXiv: https://arxiv.org/abs/2312.06681
4. Goodfire AI. "Goodfire Ember: Scaling Interpretability for Frontier Model Alignment." https://www.goodfire.ai/blog/announcing-goodfire-ember
5. Rimsky, N., & Hubinger, E. (2023). "Reducing Sycophancy and Improving Honesty via Activation Steering." LessWrong. https://www.lesswrong.com/posts/zt6hRsDE84HeBKh7E/reducing-sycophancy-and-improving-honesty-via-activation
6. Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., & Nanda, N. (2024). "Refusal in Language Models Is Mediated by a Single Direction." NeurIPS 2024. https://proceedings.neurips.cc/paper_files/paper/2024/file/f545448535dfde4f9786555403ab7c49-Paper-Conference.pdf
7. Templeton, A., Conerly, T., et al. (2024). "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." Anthropic. https://transformer-circuits.pub/2024/scaling-monosemanticity/
8. Mitra, S. (2026). "Activation Steering in 2026: A Practitioner's Field Guide." https://subhadipmitra.com/blog/2026/activation-steering-field-guide/
9. Heimersheim, S., & Nanda, N. (2024). "How to use and interpret activation patching." arXiv: https://arxiv.org/abs/2404.15255
10. Chalnev, S., Siu, M., & Conmy, A. (2024). "Improving Steering Vectors by Targeting Sparse Autoencoder Features." arXiv: https://arxiv.org/abs/2411.02193
11. Turner, A. M., et al. (2023). "Steering GPT-2-XL by adding an activation vector." LessWrong. https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector
12. Representation Engineering project page. https://www.ai-transparency.org/
13. Anthropic. (2024). "Golden Gate Claude." https://www.anthropic.com/news/golden-gate-claude
14. Anthropic. (2024). "Evaluating feature steering: A case study in mitigating social biases." https://www.anthropic.com/research/evaluating-feature-steering
15. Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., & Sayres, R. (2018). "Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)." ICML 2018. arXiv: https://arxiv.org/abs/1711.11279
16. Tigges, C., Hollinsworth, O. J., Geiger, A., & Nanda, N. (2023). "Linear Representations of Sentiment in Large Language Models." arXiv: https://arxiv.org/abs/2310.15154
17. Tan, D., Chanin, D., Lynch, A., Paige, B., Kanoulas, E., Garriga-Alonso, A., & Kirk, R. (2024). "Analyzing the Generalization and Reliability of Steering Vectors." NeurIPS 2024. arXiv: https://arxiv.org/abs/2407.12404
18. "Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering." (2025). arXiv: https://arxiv.org/abs/2505.15038
19. "AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint." (2025). arXiv: https://arxiv.org/abs/2506.07022
20. "Steering Language Model Refusal with Sparse Autoencoders." (2024). arXiv: https://arxiv.org/abs/2411.11296
21. "Steering Evaluation-Aware Language Models to Act Like They Are Deployed." (2025). arXiv: https://arxiv.org/abs/2510.20487
22. Goodfire AI. "Announcing Our $50M Series A to Advance AI Interpretability Research." https://www.goodfire.ai/blog/announcing-our-50m-series-a
