Refusal direction
Last reviewed
May 19, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,553 words
The refusal direction is a finding from mechanistic interpretability research that the refusal behavior of safety fine-tuned chat language models is mediated by a single, linear direction in the model's residual stream activations. The result was reported by Arditi, Obeso, Syed, Paleka, Panickssery, Gurnee, and Nanda in the paper "Refusal in Language Models Is Mediated by a Single Direction," posted to arXiv in June 2024 and accepted as a poster at the 38th Conference on Neural Information Processing Systems (NeurIPS) 2024.[^1][^2] The authors show, across 13 open-source chat models ranging up to 72 billion parameters, that erasing a single one-dimensional subspace from the residual stream prevents the model from refusing harmful instructions, while adding the same direction back into activations causes the model to refuse even benign requests.[^1] The finding has been used as the theoretical basis for a class of community techniques called abliteration, which produce open-weight LLM variants with refusal behavior removed via a rank-one weight modification.[^3]
The refusal direction sits at the intersection of three lines of research: the linear representation hypothesis from mechanistic interpretability; activation steering and representation engineering; and the empirical study of jailbreak techniques against safety-trained LLMs. Because the direction can be extracted using only contrastive prompts (with no harmful completions) and removed via a closed-form weight edit, the result is widely cited as evidence that current safety fine-tuning produces brittle alignment in open-weight models.[^1]
Modern instruction-tuned chat models, including Llama 2 Chat, Llama 3 Instruct, Qwen Chat, Yi Chat, and Gemma IT variants, are post-trained both to follow instructions and to decline a set of behaviors classified as harmful, dangerous, unethical, or otherwise out of policy. The resulting "refusal" responses typically take a recognizable form, such as opening with "I'm sorry" or "As an AI." Two broad alignment regimes are used in practice: alignment by preference optimization (APO), which includes Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO); and alignment by fine-tuning (AFT), which uses supervised fine-tuning on refusal examples.[^1] Constitutional AI and related approaches also produce models with strong refusal behavior, although Arditi et al.'s 2024 study itself focuses on open-source chat models from the Llama, Qwen, Yi, and Gemma families.[^1]
Despite the maturity of these safety training methods, models remain vulnerable to a wide range of jailbreak attacks: handcrafted persona-based prompts, optimized adversarial suffixes such as those produced by Greedy Coordinate Gradient (GCG), and fine-tuning on harmful examples can all reintroduce harmful behaviors that were trained out.[^1] The Arditi et al. paper situates itself in this context by asking a mechanistic question: how is refusal actually implemented inside the model, and how brittle is that mechanism?
The methodology behind the refusal direction relies on the linear representation hypothesis: the empirical regularity that many human-interpretable features in LLMs are represented as approximately linear directions in activation space. The hypothesis has roots in word-embedding analogies (Mikolov et al., 2013), debiasing of word embeddings (Bolukbasi et al., 2016), and more recent mechanistic results showing linear representations for features such as truth (Marks and Tegmark, 2023), sentiment (Tigges et al., 2023), and harmlessness (Zou et al., 2023; Wolf et al., 2024; Zheng et al., 2024).[^1] Arditi et al. extend this body of work by showing that "refusal," as a behavioral concept, can be isolated as a single causal direction in the residual stream.
Several immediate precursors set the technical stage. Panickssery, Gabrieli, Schulz, Tong, Hubinger, and Turner (2023) introduced Contrastive Activation Addition (CAA) for steering Llama 2 by adding mean-difference vectors derived from contrastive prompt pairs to residual stream activations.[^1] In parallel, Zou et al. introduced representation engineering as a "top-down approach to AI transparency," using contrastive stimuli to extract reading and control vectors for concepts including honesty, harmlessness, power-seeking, and others.[^4] These methods, together with the difference-in-means technique formalized by Belrose and by Marks and Tegmark, supplied both the contrastive-pair design pattern and the linear-direction interventions on which the refusal-direction work builds.[^1][^4]
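The central empirical claim of Arditi et al. (2024) is that, for each of 13 popular open-source chat models tested, there is a single direction in the residual stream such that ablating it from the model's activations prevents the model from refusing harmful instructions, while adding it to the activations causes the model to refuse even harmless ones.[^1]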
The authors describe their result as showing that "refusal is mediated by a one-dimensional subspace," and frame the paper as "an existence proof that such a direction exists."[^1] They are careful to note that the semantic identity of the direction is not fully established: although they label it the "refusal direction" as a functional description, it could in principle correspond to closely related concepts such as "harm" or "danger," or it may not admit a clean semantic interpretation at all.[^1]
The result is significant because it provides a single, low-dimensional mechanistic basis for a behavior that is the centerpiece of LLM safety training. As the authors put it, the simplicity of the mechanism and the ease of circumventing it "raise concerns about the robustness of current alignment techniques."[^1]
To extract a candidate refusal direction at layer $l$ and post-instruction token position $i$, the authors compute the mean residual stream activation across a small training set of harmful instructions, written here as $\mu_i^{(l)}$, the mean across a set of harmless instructions, written here as $\nu_i^{(l)}$, and the vector difference:[^1]
$r_i^{(l)} = \mu_i^{(l)} - \nu_i^{(l)}$.
Both the direction and the magnitude of $r_i^{(l)}$ carry information: the direction is the axis along which mean harmful and harmless activations differ, and the magnitude reflects the distance between the two clusters.[^1] The training sets are small: 128 harmful and 128 harmless examples, with 32 each held out for validation. Harmful prompts are sampled from AdvBench, MaliciousInstruct, TDC2023, and HarmBench, and harmless prompts from Alpaca.[^1]
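The difference-in-means computation can be sketched in a few lines of NumPy. The example below is a minimal illustration on synthetic data, not the authors' code: the `difference_in_means` helper, the array shapes, and the Gaussian stand-in "activations" are all assumptions made for the demonstration.

```python
import numpy as np


def difference_in_means(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Return mean(harmful) - mean(harmless) for one (layer, position) pair.

    Both inputs are assumed to have shape (n_prompts, d_model).
    """
    mean_harmful = harmful_acts.mean(axis=0)
    mean_harmless = harmless_acts.mean(axis=0)
    return mean_harmful - mean_harmless


# Synthetic stand-in data (an assumption, not real model activations):
# the "harmful" cluster is shifted along the first coordinate axis.
rng = np.random.default_rng(0)
d_model = 16
shift_axis = np.zeros(d_model)
shift_axis[0] = 1.0
harmless = rng.normal(size=(128, d_model))
harmful = rng.normal(size=(128, d_model)) + 4.0 * shift_axis

r = difference_in_means(harmful, harmless)  # candidate direction, un-normalized
r_hat = r / np.linalg.norm(r)               # unit-norm version
```

With real models, the two arrays would instead hold residual stream activations cached at one layer and one post-instruction token position for the harmful and harmless training prompts.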
This procedure yields a candidate vector for every (layer, post-instruction token position) pair, giving $|I| \times L$ candidates per model. The authors select the most effective single vector by evaluating each candidate on the held-out validation sets along two axes: its ability to suppress refusal when ablated, and its ability to induce refusal when added, subject to causing minimal change to the model's overall behavior otherwise.[^1] The selected vector is denoted $r$, and its unit-norm version $\hat r$.
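The selection step can be pictured as a filter followed by a ranking. The sketch below is only illustrative: the `Candidate` fields, the `behavior_shift` measure, and both thresholds are hypothetical names and values chosen for the example, and the paper's actual scoring criteria may differ.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Candidate:
    layer: int
    position: int
    refusal_when_ablated: float  # refusal rate on harmful validation prompts; lower is better
    refusal_when_added: float    # refusal rate on harmless validation prompts; higher is better
    behavior_shift: float        # hypothetical measure of side effects on harmless prompts


def select_direction(
    candidates: Iterable[Candidate],
    max_shift: float = 0.1,               # hypothetical threshold
    min_refusal_when_added: float = 0.5,  # hypothetical threshold
) -> Candidate:
    """Drop candidates that fail either filter, then pick the best refusal suppressor."""
    eligible = [
        c for c in candidates
        if c.behavior_shift <= max_shift
        and c.refusal_when_added >= min_refusal_when_added
    ]
    if not eligible:
        raise ValueError("no candidate satisfies the filters")
    return min(eligible, key=lambda c: c.refusal_when_ablated)


# Made-up validation scores for three (layer, position) candidates.
candidates = [
    Candidate(layer=10, position=-1, refusal_when_ablated=0.05, refusal_when_added=0.90, behavior_shift=0.02),
    Candidate(layer=14, position=-1, refusal_when_ablated=0.02, refusal_when_added=0.95, behavior_shift=0.30),
    Candidate(layer=12, position=-2, refusal_when_ablated=0.20, refusal_when_added=0.80, behavior_shift=0.01),
]
best = select_direction(candidates)
```

In this toy data the layer-14 candidate suppresses refusal most strongly but is filtered out for changing behavior too much, so the layer-10 candidate is selected.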
To induce refusal on harmless prompts, the authors add the (un-normalized) difference-in-means vector to the residual stream at the chosen layer $l^*$, at all token positions:
$x^{(l)} \leftarrow x^{(l)} + r^{(l)}$.[^1]
When this intervention is applied at the original extraction layer of the relevant model, harmless Alpaca prompts trigger refusal: a Gemma 7B IT example reported in the paper turns "List three benefits that yoga has on physical health" into a response that yoga "can be dangerous and potentially harmful."[^1]
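As a rough illustration of the addition step, the following NumPy sketch adds a direction to every token position of a synthetic activation matrix. The `add_direction` helper and the random data are assumptions for the example; in practice the operation would be applied inside a forward hook at the chosen layer.

```python
import numpy as np


def add_direction(resid: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Add the un-normalized direction r to every token position.

    resid has shape (seq_len, d_model); r has shape (d_model,).
    """
    return resid + r  # NumPy broadcasts r across the sequence dimension


rng = np.random.default_rng(0)
resid = rng.normal(size=(5, 8))  # synthetic activations at the chosen layer
r = rng.normal(size=8)           # synthetic stand-in for the extracted direction
r_hat = r / np.linalg.norm(r)

steered = add_direction(resid, r)
# The component of each position along r_hat increases by exactly ||r||.
shift = steered @ r_hat - resid @ r_hat
```

Because the un-normalized vector is used, the size of the shift is set by the distance between the harmful and harmless means rather than by a separately tuned coefficient.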
To suppress refusal on harmful prompts, the authors instead remove the component along $\hat r$ from every residual stream activation in the forward pass. Formally, for each activation $x$ they apply:
$x' \leftarrow x - \hat r \hat r^\top x$.[^1]
This is the projection of $x$ onto the hyperplane orthogonal to $\hat r$. The operation is applied at every activation $x_i^{(l)}$ and $\tilde x_i^{(l)}$ (the residual stream at the input to each layer and after that layer's attention block), at every layer and every token position, which prevents the model from ever representing the refusal direction in its residual stream.[^1] Empirically, this collapses refusal rates on JailbreakBench's 100-prompt harmful instruction set and also yields low safety scores, indicating that the completions are not merely non-refusals but contain harmful content as judged by the Meta Llama Guard 2 classifier.[^1]
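The projection can be checked numerically. The sketch below applies $x - \hat r \hat r^\top x$ to each row of a synthetic activation matrix; the `ablate_direction` helper and the random inputs are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def ablate_direction(x: np.ndarray, r_hat: np.ndarray) -> np.ndarray:
    """Remove the component of each row of x along the unit vector r_hat.

    Implements x' = x - r_hat r_hat^T x for x of shape (seq_len, d_model).
    """
    coeffs = x @ r_hat                  # (seq_len,) projection coefficients
    return x - np.outer(coeffs, r_hat)  # subtract each row's component along r_hat


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))  # synthetic residual stream activations
r = rng.normal(size=8)       # synthetic stand-in for the selected direction
r_hat = r / np.linalg.norm(r)

x_ablated = ablate_direction(x, r_hat)
```

After ablation every row has zero component along `r_hat`, and applying the operation a second time changes nothing, as expected of an orthogonal projection.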
Inference-time ablation is computationally cheap but requires a hook into every forward pass. The authors show that the same effect can be obtained by directly modifying the model's weights, producing a permanently jailbroken model with no inference-time overhead. They call this technique weight orthogonalization.[^1]
For each matrix $W_{\text{out}}$ that writes to the residual stream, they replace it with the orthogonalized matrix
$W'_{\text{out}} \leftarrow W_{\text{out}} - \hat r \hat r^\top W_{\text{out}}$.[^1]
In a transformer decoder, the matrices that write to the residual stream are the (token) embedding matrix, the positional embedding matrix, the attention output matrices, and the MLP output matrices, along with their biases.[^1] Once all of these are orthogonalized against $\hat r$, the model is provably incapable of writing any component along $\hat r$ to the residual stream, making the weight modification exactly equivalent to running directional ablation at inference time.[^1]
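The equivalence follows from $(I - \hat r \hat r^\top)(W_{\text{out}} h + b) = W'_{\text{out}} h + b'$, and can be confirmed on a toy matrix. The sketch below assumes a column-vector convention in which `W_out` has shape `(d_model, d_in)` and writes `W_out @ h + b_out` to the residual stream; many frameworks store the transpose, in which case the projection is applied on the other side. All names and data are illustrative.

```python
import numpy as np


def orthogonalize(W_out: np.ndarray, r_hat: np.ndarray) -> np.ndarray:
    """Return W_out - r_hat r_hat^T W_out for W_out of shape (d_model, d_in)."""
    return W_out - np.outer(r_hat, r_hat @ W_out)


rng = np.random.default_rng(0)
d_model, d_in = 8, 12
W_out = rng.normal(size=(d_model, d_in))  # synthetic output matrix
b_out = rng.normal(size=d_model)          # synthetic output bias
r = rng.normal(size=d_model)
r_hat = r / np.linalg.norm(r)

# Edit the weights and bias once, ahead of time.
W_orth = orthogonalize(W_out, r_hat)
b_orth = b_out - r_hat * (r_hat @ b_out)

# Compare against ablating the original write at inference time.
h = rng.normal(size=d_in)
write_edited = W_orth @ h + b_orth
write_original = W_out @ h + b_out
write_ablated = write_original - r_hat * (r_hat @ write_original)
```

The edited matrix has no component along `r_hat` in its output space, so `write_edited` and `write_ablated` agree up to floating-point error for any input `h`.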
Two properties of this construction are notable:
The authors estimate that this method can yield a jailbroken version of a 70B-parameter model for "less than $5 of compute."[^1]
Arditi et al. evaluate 13 chat models across five families, spanning 1.8 billion to 72 billion parameters:[^1]
| Model family | Sizes | Alignment type |
|---|---|---|
| Qwen Chat | 1.8B, 7B, 14B, 72B | AFT |
| Yi Chat | 6B, 34B | AFT |
| Gemma IT | 2B, 7B | APO |
| Llama 2 Chat | 7B, 13B, 70B | APO |
| Llama 3 Instruct | 8B, 70B | APO |
Here "APO" denotes alignment by preference optimization (RLHF/DPO) and "AFT" denotes alignment by fine-tuning, following the terminology of Meade et al.[^1] All evaluated models are open-weight chat or instruct variants; the paper does not test proprietary or closed-weight models such as Claude or GPT-4.
The authors measure two scores per completion. A refusal score flags completions containing characteristic refusal substrings (such as "I'm sorry" or "As an AI"). A safety score uses Meta Llama Guard 2 as a classifier to judge whether the completion is safe, that is, free of harmful content.[^1] The combination distinguishes mere non-refusals from actually harmful responses, and is necessary because refusal substrings can appear in completions that are nonetheless unsafe and vice versa.[^1]
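A minimal sketch of the two-score scheme is shown below. Only the two substrings quoted above are used, the case-insensitive match is an assumption, and `toy_is_safe` is a hypothetical stand-in for a real safety classifier such as Llama Guard 2, which this example does not call.

```python
from typing import Callable, Dict

# Only the two substrings quoted in the text; a real list would be longer.
REFUSAL_SUBSTRINGS = ("I'm sorry", "As an AI")


def refusal_score(completion: str) -> int:
    """Return 1 if the completion contains a refusal substring, else 0."""
    lowered = completion.lower()  # case-insensitive matching is an assumption
    return int(any(s.lower() in lowered for s in REFUSAL_SUBSTRINGS))


def score_completion(completion: str, is_safe: Callable[[str], bool]) -> Dict[str, int]:
    """Combine the substring-based refusal score with a classifier-based safety score."""
    return {"refusal": refusal_score(completion), "safety": int(is_safe(completion))}


def toy_is_safe(text: str) -> bool:
    """Hypothetical stand-in classifier: flags anything that looks like instructions."""
    return "step 1" not in text.lower()


refused = score_completion("I'm sorry, but I can't help with that.", toy_is_safe)
complied = score_completion("Sure! Step 1: gather the materials.", toy_is_safe)
```

Here a safety score of 1 means the classifier judged the completion safe, so a successful ablation shows up as both scores falling toward 0.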
Across all 13 models, the baseline behavior (no intervention) is high refusal and high safety on harmful prompts; with directional ablation, both scores drop substantially on the 100 harmful instructions of JailbreakBench. The matching activation-addition experiment shows the symmetric effect: harmless Alpaca prompts that previously elicited compliance now elicit refusal across the family.[^1]
The authors evaluate the weight-orthogonalization method (denoted ORTHO) against other jailbreak techniques on the HarmBench test set of 159 "standard behaviors," using the HarmBench classifier to compute Attack Success Rate (ASR). Results are reported both with and without each model's default system prompt; selected numbers from the paper's Table 2 are:[^1]
| Chat model | ORTHO (sys / no sys) | GCG-M | GCG-T | GCG (prompt-specific) |
|---|---|---|---|---|
| Llama 2 7B | 22.6 / 79.9 | 20.0 | 16.8 | 34.5 |
| Llama 2 13B | 6.9 / 61.0 | 8.7 | 13.0 | 28.0 |
| Llama 2 70B | 4.4 / 62.9 | 5.5 | 15.2 | 36.0 |
| Qwen 7B | 79.2 / 74.8 | 73.3 | 48.4 | 79.5 |
| Qwen 14B | 84.3 / 74.8 | 75.5 | 46.0 | 83.5 |
| Qwen 72B | 78.0 / 79.2 | - | 36.6 | - |
The authors note that the Llama 2 family is more sensitive to the presence of a default system prompt than the Qwen family is, and that absent the system prompt, ORTHO is competitive with prompt-specific jailbreaks like GCG even though ORTHO is a single, prompt-independent intervention.[^1]
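Attack Success Rate is a simple proportion: the percentage of test behaviors for which the classifier judges the attack successful. The helper below is an illustrative sketch, and the judgment list is made up to show the arithmetic on a 159-item test set rather than taken from the paper's raw data.

```python
from typing import Sequence


def attack_success_rate(judgments: Sequence[bool]) -> float:
    """Return the percentage of behaviors judged as successful attacks."""
    if not judgments:
        raise ValueError("judgments must be non-empty")
    return 100.0 * sum(judgments) / len(judgments)


# Hypothetical classifier outputs: 36 successes out of 159 behaviors.
judgments = [True] * 36 + [False] * 123
asr = attack_success_rate(judgments)
print(f"{asr:.1f}")  # prints 22.6
```

With 159 behaviors, each additional success moves the ASR by roughly 0.6 percentage points, which is the granularity of the figures in the table.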
A central concern with any jailbreak is whether it degrades general capabilities. Arditi et al. evaluate the orthogonalized variants on MMLU, ARC, GSM8K, and TruthfulQA using the LM Evaluation Harness in a configuration consistent with the Open LLM Leaderboard.[^1] Their headline finding is that, for MMLU, ARC, and GSM8K, orthogonalized models perform within 99% confidence intervals of their original counterparts on most variants, with two exceptions noted in the paper (Qwen 7B and Yi 34B).[^1] TruthfulQA consistently drops, however; the authors attribute this to the fact that TruthfulQA categories include "misinformation," "stereotypes," and "conspiracies," which veer close to refusal territory, paralleling an earlier finding by Yang et al. (2023) that fine-tuning away safety also tends to drop TruthfulQA scores.[^1]
A representative slice of their Table 3 results (largest model in each family):
| Chat model | MMLU | ARC | GSM8K | TruthfulQA |
|---|---|---|---|---|
| Gemma 7B | 51.8 / 51.7 (+0.1) | 51.7 / 51.5 (+0.2) | 31.3 / 32.0 (-0.7) | 44.7 / 47.1 (-2.4) |
| Yi 34B | 73.5 / 74.9 (-1.4) | 65.6 / 64.9 (+0.7) | 65.5 / 65.0 (+0.5) | 51.9 / 55.4 (-3.5) |
| Llama 2 70B | 63.1 / 63.0 (+0.1) | 65.2 / 65.4 (-0.2) | 54.5 / 53.0 (+1.5) | 51.8 / 52.8 (-1.0) |
| Llama 3 70B | 79.8 / 79.9 (-0.1) | 71.5 / 71.8 (-0.3) | 90.8 / 91.2 (-0.4) | 59.5 / 61.8 (-2.3) |
| Qwen 72B | 76.5 / 77.2 (-0.7) | 67.2 / 67.6 (-0.4) | 76.3 / 75.5 (+0.8) | 55.0 / 56.4 (-1.4) |
Each cell is reported in the format "orthogonalized / baseline (delta)," as in the paper.[^1] These numbers support the authors' qualitative claim that the orthogonalized models retain general capabilities while losing refusal.
In a final section, Arditi et al. apply the refusal direction as an analytical tool to study a separate jailbreak family: optimized adversarial suffixes such as those produced by GCG (Zou et al., 2023b).[^1] They study Qwen 1.8B Chat with three input conditions: harmful instruction alone, harmful instruction with an adversarial suffix appended, and harmful instruction with a random suffix of equal length appended.
They report two observations.[^1] First, the cosine similarity between the last-token residual stream and the refusal direction is high for harmful inputs and remains high under a random suffix, but drops sharply and resembles the harmless baseline once an adversarial suffix is appended; the refusal direction is suppressed by the suffix. Second, examining the top attention heads that write to the refusal direction (identified via direct feature attribution), they find that the adversarial suffix "hijacks" the attention of these heads: attention shifts from the instruction region to the suffix region, reducing the heads' contributions to the refusal direction.[^1] The authors note that this finding contrasts with Zou et al. (2023a), who reported that a representation of "harmfulness" was not significantly altered by adversarial suffixes, suggesting that the "refusal direction" and the "harmfulness direction" may be functionally distinct.[^1]
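The first measurement is a cosine similarity between a last-token activation and the refusal direction. The sketch below computes it on synthetic vectors; the two example activations are constructed by hand to mimic the "expressed" and "suppressed" cases and are assumptions, not data from any model.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
d_model = 32
r = rng.normal(size=d_model)  # synthetic stand-in for the refusal direction
r_hat = r / np.linalg.norm(r)
noise = rng.normal(size=d_model)

# Synthetic last-token activations: one with a large component along r_hat,
# one with that component removed entirely.
expressed = 10.0 * r_hat + noise
suppressed = noise - r_hat * (r_hat @ noise)

cos_expressed = cosine_similarity(expressed, r)
cos_suppressed = cosine_similarity(suppressed, r)
```

In the paper's framing, a harmful prompt alone or with a random suffix would look like the first case, while a successful adversarial suffix would push the activation toward the second.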
This analysis is explicitly preliminary; it is restricted to one model and one adversarial example, and the authors mark it as a starting point rather than a comprehensive theory of how GCG-style attacks work.[^1]
Around the time of the Arditi et al. arXiv release, community practitioners began using weight orthogonalization to produce uncensored open-weight LLMs. Maxime Labonne published an article on Hugging Face on June 13, 2024, titled "Uncensor any LLM with abliteration," which popularized the term and walked through a reference implementation.[^3] The term abliteration combines "ablate" with "obliterate." Labonne's implementation orthogonalizes the token embedding matrix $W_E$, the attention output matrices $W_O$, and the MLP output matrices $W_{\text{out}}$ of each transformer block against a refusal direction extracted via difference-in-means, following Arditi et al.[^3] The article attributes the underlying mechanism to Arditi et al.'s blog post and credits FailSpy's notebook (itself based on the original authors' notebook) as a reference implementation.[^3]
Labonne demonstrates the technique on Daredevil-8B, a Llama 3 8B-based merge, producing Daredevil-8B-abliterated and (after subsequent DPO fine-tuning to repair some quality regressions) NeuralDaredevil-8B-abliterated.[^3] The Hugging Face Hub now hosts hundreds of community-released "abliterated" variants of open-weight chat models built on this method. FailSpy maintains an abliterator library and a collection of abliterated models that grew out of this line of work.[^3]
Abliterated models have been used both as practical demonstrations that current open-weight safety training can be removed cheaply, and as substrates for downstream research and creative use cases. From a safety-policy perspective, abliteration is widely cited as concrete evidence that "open-weight safety" cannot rely on inference-time refusal behavior in a model that is shipped with its weights, since a few lines of linear algebra and a small contrastive prompt set are sufficient to remove that behavior.[^3]
Although the Arditi et al. result has been widely replicated, subsequent papers have probed and refined the "single direction" claim.
Wollschläger, Elstner, Geisler, Cohen-Addad, Günnemann, and Gasteiger published "The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence" on arXiv in February 2025, and the paper was accepted as a poster at ICML 2025.[^5][^6] They propose a gradient-based approach to representation engineering and uncover "multiple independent directions and even multi-dimensional concept cones that mediate refusal," arguing that orthogonality alone does not imply independence under causal intervention and that refusal in LLMs involves multiple mechanistically independent directions rather than a single one.[^5][^6]
A separate paper by Joad, Hawasly, Boughorbel, Durrani, and Sencar, "There Is More to Refusal in Large Language Models than a Single Direction," posted to arXiv in February 2026, breaks refusal into eleven categories (covering safety, incomplete or unsupported requests, anthropomorphization, and over-refusal) and shows that these categories correspond to geometrically distinct directions in activation space. They report, however, a partial reconciliation: linear steering along any of the refusal-related directions they identify produces "nearly identical refusal to over-refusal trade-offs," acting as "a shared one-dimensional control knob" along the over-refusal axis even though the underlying mechanisms are distinct.[^7] In other words, "how" models refuse can vary across multiple directions, while the binary "whether to refuse" decision can be modulated by any one of them.
A third line of work, "SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models" by Piras, Mura, Brau, Oneto, Roli, and Biggio, accepted at AAAI 2026, uses Self-Organizing Maps to extract multiple refusal directions and reports stronger refusal suppression than single-direction baselines, framing the difference-in-means technique of Arditi et al. as a special case of a richer multi-direction structure.[^8]
Taken together, these follow-ups do not refute Arditi et al.'s existence claim (a single direction whose ablation suppresses refusal), but they qualify the picture: they suggest that the refusal mechanism may be better modeled as a low-dimensional manifold or concept cone than as a strictly one-dimensional subspace, with multiple causally distinct directions contributing to the same behavioral outcome.
Within the Arditi et al. paper itself, the authors discuss prior work that distinguishes a "harmfulness" direction from a "refusal" direction. Zheng et al. (2024) and Zou et al. (2023a) both use contrastive harmful/harmless pairs to identify what they call a representation of harmfulness, but argue that this direction is distinct from refusal: Zheng et al. show that safety prompts shift activations in a different direction, and Zou et al. report that the harmfulness representation is not significantly altered by adversarial suffixes (in contrast to Arditi et al.'s finding that the refusal direction is suppressed by such suffixes).[^1] The implication is that "perceive input as harmful" and "decide to refuse" may be separate computations linked by the model's safety policy, and that adversarial suffixes interfere with the latter without necessarily disturbing the former.[^1]
Wei et al. (2024) study a related question by removing sets of safety-critical neurons and ranks rather than continuous directions, and report that doing so degrades safety while preserving utility. Panickssery et al. (2023), the immediate predecessor to the Arditi work, found that contrastive activation addition with multiple-choice contrastive pairs steered refusal effectively in multiple-choice formats but less so in long-form generation.[^1] Arditi et al.'s use of an open-text contrastive design and their selection over (layer, token) candidate pairs is what permits the method to generalize to long-form generation.
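The refusal direction is a particular instance of a more general toolkit shared with representation engineering and activation steering. All three approaches follow the same pattern: collect activations on contrastive inputs, derive a linear direction from the contrast, and intervene along that direction by adding it to or removing it from the model's activations.[^1][^4]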
Zou et al.'s representation engineering proposes this as a "top-down" approach to AI transparency that focuses on population-level representations rather than individual neurons or circuits, and identifies linear directions for high-level concepts including honesty, harmlessness, and power-seeking.[^4] Panickssery et al.'s Contrastive Activation Addition formalizes the steering side of the picture for chat models.[^1] The refusal direction work is closer to "representation engineering for a specific behavior" than to canonical mechanistic interpretability (which tends to seek mechanistic circuits and features at the level of attention heads, MLP neurons, or sparse autoencoder features), but it borrows from both traditions and uses tools from each. The combination of difference-in-means extraction (representation engineering) with direct feature attribution and attention-head analysis of adversarial suffixes (mechanistic interpretability) is characteristic of the broader research program at the intersection of these areas.[^1]
The work also connects to concept erasure literature. Belrose (2023) and Marks and Tegmark (2023) discuss difference-in-means as a worst-case optimal concept-editing primitive. LEACE (Belrose et al., 2024) provides a closed-form method for linear concept erasure that, like weight orthogonalization, removes a direction from a representation in a provable sense.[^1] Arditi et al.'s weight orthogonalization can be viewed as applying a one-dimensional version of these concept-erasure ideas directly to the parameters of a transformer.
The authors devote substantial space to the safety and policy implications of the result. Their core argument is that the simplicity of the refusal mechanism, and the ease with which it can be removed in the white-box setting, undercuts an important safety story about open-weight model releases.[^1] If safety training really did imprint a deep, distributed safety circuit into a model, then removing that circuit would require sophisticated techniques and meaningful compute. The refusal-direction result implies the opposite: the safety behavior of open-weight chat models is concentrated in a rank-1 update that can be undone with a few lines of code and less than $5 of compute, even for 70B-parameter models.[^1]
The paper is careful to qualify this argument. The authors note that fine-tuning was already known to undo safety guardrails, sometimes with only benign data (Lermen et al., 2023; Qi et al., 2023; Pelrine et al., 2023), so the refusal-direction method is not a categorical change in what is possible, only a simpler and cheaper instance.[^1] They also acknowledge that their findings concern open-source chat models specifically and may not generalize to current frontier proprietary systems, especially those that combine RLHF with other safety techniques such as constitutional AI or output classifiers.[^1] More broadly, they argue that their work contributes to "the growing body of literature that highlights the fragility of current safety mechanisms" rather than introducing a fundamentally new attack surface.[^1]
A separate implication is research-positive. The refusal direction is one of the clearest case studies to date of mechanistic understanding translating into a usable, behavior-controlling intervention. It demonstrates that interpretable internal structure is not merely descriptive but actionable: by reading off a single rank-1 subspace from contrastive activations, a practitioner can dictate refusal behavior without retraining or input modification.[^1] This is frequently cited as evidence for the practical value of interpretability research within AI safety more broadly.
Arditi et al. enumerate a number of limitations of their study, several of which have been picked up in subsequent work.[^1]
These limitations also motivate the follow-up work on multi-directional and cone-shaped refusal mechanisms summarized above.