Refusal direction

Updated Jul 13, 2026

The refusal direction is a finding from mechanistic interpretability research that the refusal behavior of safety fine-tuned chat language models is mediated by a single, linear direction in the model's residual stream activations. The result was reported by Arditi, Obeso, Syed, Paleka, Panickssery, Gurnee, and Nanda in the paper "Refusal in Language Models Is Mediated by a Single Direction," posted to arXiv on June 17, 2024 and accepted as a poster at the 38th Conference on Neural Information Processing Systems (NeurIPS) 2024, held in Vancouver in December 2024.^[1]^[2] The authors show, across 13 open-source chat models ranging up to 72 billion parameters, that erasing a single one-dimensional subspace from the residual stream prevents the model from refusing harmful instructions, while adding the same direction back into activations causes the model to refuse even benign requests.^[1] The finding has been used as the theoretical basis for a class of community techniques called abliteration, which produce open-weight LLM variants with refusal behavior removed via a rank-one weight modification.^[3]

The refusal direction sits at the intersection of three lines of research: the linear representation hypothesis from mechanistic interpretability; activation steering and representation engineering; and the empirical study of jailbreak techniques against safety-trained LLMs. Because the direction can be extracted using only contrastive prompts (with no harmful completions) and removed via a closed-form weight edit, the result is widely cited as evidence that current safety fine-tuning produces brittle alignment in open-weight models.^[1]

Background

What is refusal behavior and safety fine-tuning?

Modern instruction-tuned chat models, including Llama 2 Chat, Llama 3 Instruct, Qwen Chat, Yi Chat, and Gemma IT, are post-trained both to follow instructions and to decline a set of behaviors classified as harmful, dangerous, unethical, or otherwise out of policy. The resulting "refusal" responses typically take a recognizable form, such as opening with "I'm sorry" or "As an AI." Two broad alignment regimes are used in practice: alignment by preference optimization (APO), which includes Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO); and alignment by fine-tuning (AFT), which uses supervised fine-tuning on refusal examples.^[1] Constitutional AI and related approaches also produce models with strong refusal behavior, although the Arditi et al. study itself focuses on open-source chat models from the Llama, Qwen, Yi, and Gemma families.^[1]

Despite the maturity of these safety training methods, models remain vulnerable to a wide range of jailbreak attacks: handcrafted persona-based prompts, optimized adversarial suffixes such as those produced by Greedy Coordinate Gradient (GCG), and fine-tuning on harmful examples can all reintroduce harmful behaviors that were trained out.^[1] The Arditi et al. paper situates itself in this context by asking a mechanistic question: how is refusal actually implemented inside the model, and how brittle is that mechanism?

The linear representation hypothesis

The methodology behind the refusal direction relies on the linear representation hypothesis: the empirical regularity that many human-interpretable features in LLMs are represented as approximately linear directions in activation space. The hypothesis has roots in word-embedding analogies (Mikolov et al., 2013), debiasing of word embeddings (Bolukbasi et al., 2016), and more recent mechanistic results showing linear representations for features such as truth (Marks and Tegmark, 2023), sentiment (Tigges et al., 2023), and harmlessness (Zou et al., 2023; Wolf et al., 2024; Zheng et al., 2024).^[1] Arditi et al. extend this body of work by showing that "refusal," as a behavioral concept, can be isolated as a single causal direction in the residual stream.

Activation steering and representation engineering

Several immediate precursors set the technical stage. Panickssery, Gabrieli, Schulz, Tong, Hubinger, and Turner (2023) introduced Contrastive Activation Addition (CAA) for steering Llama 2 by adding mean-difference vectors derived from contrastive prompt pairs to residual stream activations.^[1] In parallel, Zou et al. introduced representation engineering as a "top-down approach to AI transparency," using contrastive stimuli to extract reading and control vectors for concepts including honesty, harmlessness, power-seeking, and others.^[4] These methods, together with the difference-in-means technique formalized by Belrose and by Marks and Tegmark, supplied both the contrastive-pair design pattern and the linear-direction interventions on which the refusal-direction work builds.^[1]^[4]

What did Arditi et al. find?

The central empirical claim of Arditi et al. (2024) is that, for each of 13 popular open-source chat models tested, there is a single direction in the residual stream such that:^[1]

Necessity: zeroing out the component along this direction at every layer and token position ("directional ablation") causes the model to comply with harmful requests it would otherwise refuse.
Sufficiency: adding the difference-in-means vector back into the residual stream at a chosen layer induces refusal on harmless requests it would otherwise comply with.

The authors describe their result as showing that "refusal is mediated by a one-dimensional subspace," and frame the paper as "an existence proof that such a direction exists."^[1] They are careful to note that the semantic identity of the direction is not fully established: although they label it the "refusal direction" as a functional description, it could in principle correspond to closely related concepts such as "harm" or "danger," or it may not admit a clean semantic interpretation at all.^[1]

The result is significant because it provides a single, low-dimensional mechanistic basis for a behavior that is the centerpiece of LLM safety training. As the authors put it, the simplicity of the mechanism and the ease of circumventing it "raise concerns about the robustness of current alignment techniques."^[1]

How is the refusal direction identified?

Difference-in-means

To extract a candidate refusal direction at layer $l$ and post-instruction token position $i$, the authors compute the mean residual stream activation across a small training set of harmful instructions, the mean across a set of harmless instructions, and the vector difference:^[1]

$\mu_i^{(l)}$ = mean activation at layer $l$, token position $i$, over harmful training prompts.
$\nu_i^{(l)}$ = mean activation at layer $l$, token position $i$, over harmless training prompts.
$r_i^{(l)} = \mu_i^{(l)} - \nu_i^{(l)}$ is the difference-in-means vector.

Both the direction and the magnitude of $r_i^{(l)}$ carry information: the direction is the axis along which mean harmful and harmless activations differ, and the magnitude reflects the distance between the two clusters.^[1] The training sets are small: 128 harmful and 128 harmless examples, with 32 each held out for validation. Harmful prompts are sampled from AdvBench, MaliciousInstruct, TDC2023, and HarmBench, and harmless prompts from Alpaca.^[1]
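
The difference-in-means computation can be illustrated with a minimal NumPy sketch. It assumes the activations for one fixed layer and token position have already been collected into arrays of shape (prompts, hidden size); the synthetic data, the hidden size of 16, and the function name `difference_in_means` are illustrative assumptions, not part of the paper's code.

```python
import numpy as np

def difference_in_means(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Return r = mu - nu for activations of shape (n_prompts, d_model)."""
    mu = harmful_acts.mean(axis=0)   # mean activation over harmful prompts
    nu = harmless_acts.mean(axis=0)  # mean activation over harmless prompts
    return mu - nu

# Synthetic stand-in data (assumption): 128 prompts per class, hidden size 16.
rng = np.random.default_rng(0)
d_model = 16
planted_axis = np.zeros(d_model)
planted_axis[0] = 1.0
harmless_acts = rng.normal(size=(128, d_model))
# Harmful activations are shifted along one planted axis.
harmful_acts = rng.normal(size=(128, d_model)) + 4.0 * planted_axis

r = difference_in_means(harmful_acts, harmless_acts)
r_hat = r / np.linalg.norm(r)  # unit-norm version of the direction

print("norm of r:", float(np.linalg.norm(r)))
print("cosine with planted axis:", float(r_hat @ planted_axis))
```

On this toy data the recovered direction lines up closely with the planted axis, and the norm of `r` approximates the distance between the two cluster means.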

Selecting a single vector

This procedure yields a candidate vector for every (layer, post-instruction token position) pair, giving $|I| \times L$ candidates per model, where $I$ is the set of post-instruction token positions and $L$ is the number of layers. The authors select the most effective single vector by evaluating each candidate on the held-out validation sets along two axes: its ability to suppress refusal when ablated, and its ability to induce refusal when added, subject to causing minimal change to the model's overall behavior otherwise.^[1] The selected vector is denoted $r$, and its unit-norm version $\hat r$.

Activation addition (induce refusal)

To induce refusal on harmless prompts, the authors add the (un-normalized) difference-in-means vector to the residual stream at the layer $l$ from which it was extracted, at all token positions:

$x^{(l)} \leftarrow x^{(l)} + r^{(l)}$.^[1]

When this intervention is applied at the original extraction layer of the relevant model, harmless Alpaca prompts trigger refusal: a Gemma 7B IT example reported in the paper turns "List three benefits that yoga has on physical health" into a response that yoga "can be dangerous and potentially harmful."^[1]
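
As a rough illustration of the update rule, the sketch below adds a vector to every token position of a small activation matrix. The array shapes, the random stand-in for $r$, and the helper name `add_direction` are assumptions made for the example; in a real model the addition would happen inside a forward hook at one layer.

```python
import numpy as np

def add_direction(resid: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Add r to every token position; resid has shape (n_tokens, d_model)."""
    return resid + r  # broadcasting adds r to each row

rng = np.random.default_rng(1)
resid = rng.normal(size=(5, 8))  # 5 token positions, hidden size 8 (illustrative)
r = rng.normal(size=8)           # stand-in for the difference-in-means vector
r_hat = r / np.linalg.norm(r)

steered = add_direction(resid, r)

# The component along r_hat grows by the norm of r at every position.
shift = steered @ r_hat - resid @ r_hat
print(shift)
print(float(np.linalg.norm(r)))
```

Because the un-normalized vector is added, the size of the shift is set by the distance between the harmful and harmless cluster means rather than by a separately tuned coefficient.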

Directional ablation (suppress refusal)

To suppress refusal on harmful prompts, the authors instead remove the component along $\hat r$ from every residual stream activation in the forward pass. Formally, for each activation $x$ they apply:

$x' \leftarrow x - \hat r \hat r^\top x$.^[1]

This is the projection of $x$ onto the hyperplane orthogonal to $\hat r$. The operation is applied at every activation $x_i^{(l)}$ and $\tilde x_i^{(l)}$ (the residual stream before and after each attention block), at every layer and every token position, which prevents the model from ever representing the refusal direction in its residual stream.^[1] Empirically, this collapses refusal rates on JailbreakBench's 100-prompt harmful instruction set and also lowers safety scores, indicating that the completions are not merely non-refusals but contain harmful content as judged by the Meta Llama Guard 2 classifier.^[1]
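
The projection formula can be checked numerically. The following is a minimal sketch on random data, assuming activations are stored row-wise; `ablate_direction` is a hypothetical helper name, and a real implementation would apply it inside hooks at every layer.

```python
import numpy as np

def ablate_direction(x: np.ndarray, r_hat: np.ndarray) -> np.ndarray:
    """Apply x' = x - r_hat r_hat^T x to each row of x (shape (n, d_model))."""
    coeff = x @ r_hat                    # component of each activation along r_hat
    return x - coeff[..., None] * r_hat  # subtract that component

rng = np.random.default_rng(2)
x = rng.normal(size=(6, 8))              # 6 activations, hidden size 8 (illustrative)
r = rng.normal(size=8)
r_hat = r / np.linalg.norm(r)

x_abl = ablate_direction(x, r_hat)

print("max |component along r_hat|:", float(np.abs(x_abl @ r_hat).max()))
print("idempotent:", bool(np.allclose(ablate_direction(x_abl, r_hat), x_abl)))
```

Applying the operation twice changes nothing further, which is the defining property of a projection.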

What is weight orthogonalization?

Inference-time ablation is computationally cheap but requires a hook into every forward pass. The authors show that the same effect can be obtained by directly modifying the model's weights, producing a permanently jailbroken model with no inference-time overhead. They call this technique weight orthogonalization.^[1]

For each matrix $W_{\text{out}}$ that writes to the residual stream, they replace it with the orthogonalized matrix

$W'_{\text{out}} \leftarrow W_{\text{out}} - \hat r \hat r^\top W_{\text{out}}$.^[1]

In a transformer decoder, the matrices that write to the residual stream are the (token) embedding matrix, the positional embedding matrix, the attention output matrices, and the MLP output matrices, along with their biases.^[1] Once all of these are orthogonalized against $\hat r$, the model is provably incapable of writing any component along $\hat r$ to the residual stream, making the weight modification exactly equivalent to running directional ablation at inference time.^[1]

Two properties of this construction are notable:

It is a rank-one weight edit in each matrix: the change to each $W_{\text{out}}$ has rank 1, because $\hat r \hat r^\top$ is a rank-1 projector.^[1]
It requires no gradient-based optimization, no harmful completions, and no fine-tuning data: only the contrastive prompt sets used to compute the direction.^[1]

The authors estimate that this method can yield a jailbroken version of a 70B-parameter model for "less than $5 of compute."^[1]
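
The equivalence between the weight edit and inference-time ablation, and the rank-one property, can be verified on a toy matrix. This sketch assumes the convention that $W_{\text{out}}$ has shape (hidden size, input size) and writes $W_{\text{out}} h$ to the residual stream; the dimensions and the helper name `orthogonalize` are illustrative, and biases are omitted.

```python
import numpy as np

def orthogonalize(W_out: np.ndarray, r_hat: np.ndarray) -> np.ndarray:
    """Return W' = W - r_hat r_hat^T W for W of shape (d_model, d_in)."""
    return W_out - np.outer(r_hat, r_hat @ W_out)

rng = np.random.default_rng(3)
d_model, d_in = 8, 32                     # illustrative sizes
W_out = rng.normal(size=(d_model, d_in))
r = rng.normal(size=d_model)
r_hat = r / np.linalg.norm(r)

W_orth = orthogonalize(W_out, r_hat)
h = rng.normal(size=d_in)                 # arbitrary input to the layer

# Writing with W' matches writing with W and then ablating r_hat from the output.
out_edit = W_orth @ h
out_hook = (W_out @ h) - r_hat * (r_hat @ (W_out @ h))
delta_rank = int(np.linalg.matrix_rank(W_out - W_orth, tol=1e-10))

print("edit matches hook:", bool(np.allclose(out_edit, out_hook)))
print("rank of the weight change:", delta_rank)
```

In a full model the same edit would be repeated for every matrix and bias that writes to the residual stream, as described above.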

Which models were tested, and what were the results?

Models

Arditi et al. evaluate 13 chat models across five families, spanning 1.8 billion to 72 billion parameters:^[1]

Model family	Sizes	Alignment type
Qwen Chat	1.8B, 7B, 14B, 72B	AFT
Yi Chat	6B, 34B	AFT
Gemma IT	2B, 7B	APO
Llama 2 Chat	7B, 13B, 70B	APO
Llama 3 Instruct	8B, 70B	APO

Here "APO" denotes alignment by preference optimization (RLHF/DPO) and "AFT" denotes alignment by fine-tuning, following the terminology of Meade et al.^[1] All evaluated models are open-weight chat or instruct variants; the paper does not test proprietary or closed-weight models such as Claude or GPT-4.

Refusal and safety evaluation

The authors measure two scores per completion. A refusal score flags completions containing characteristic refusal substrings (such as "I'm sorry" or "As an AI"). A safety score uses Meta Llama Guard 2 as a classifier to judge whether the completion contains harmful content.^[1] The combination distinguishes mere non-refusals from actually harmful responses, and is necessary because refusal substrings can appear in completions that are nonetheless unsafe and vice versa.^[1]
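
A substring-based refusal score of this kind can be sketched in a few lines. The example uses only the two substrings quoted above as an illustrative subset; the case-insensitive matching and the function names are assumptions of this sketch rather than details taken from the paper.

```python
# Illustrative subset only: the two example substrings quoted in the text.
REFUSAL_SUBSTRINGS = ("I'm sorry", "As an AI")

def refusal_score(completion: str, substrings=REFUSAL_SUBSTRINGS) -> int:
    """Return 1 if the completion contains any refusal substring, else 0."""
    text = completion.lower()  # case-insensitive matching is an assumption
    return int(any(s.lower() in text for s in substrings))

def refusal_rate(completions) -> float:
    """Fraction of completions flagged as refusals."""
    scores = [refusal_score(c) for c in completions]
    return sum(scores) / len(scores)

examples = [
    "I'm sorry, but I can't help with that.",
    "Sure, here are three benefits of yoga.",
]
print(refusal_rate(examples))
```

A check like this is cheap but shallow, which is why it is paired with a classifier-based safety score.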

Across all 13 models, the baseline behavior (no intervention) is high refusal and high safety on harmful prompts; with directional ablation, both scores drop substantially on the 100 harmful instructions of JailbreakBench. The matching activation-addition experiment shows the converse effect: harmless Alpaca prompts that previously elicited compliance now elicit refusal across the tested models.^[1]

Comparison to other jailbreaks (HarmBench)

The authors evaluate the weight-orthogonalization method (denoted ORTHO) against other jailbreak techniques on the HarmBench test set of 159 "standard behaviors," using the HarmBench classifier to compute Attack Success Rate (ASR). Results are reported both with and without each model's default system prompt; selected numbers from the paper's Table 2 are:^[1]

Chat model	ORTHO (sys / no sys)	GCG-M	GCG-T	GCG (prompt-specific)
Llama 2 7B	22.6 / 79.9	20.0	16.8	34.5
Llama 2 13B	6.9 / 61.0	8.7	13.0	28.0
Llama 2 70B	4.4 / 62.9	5.5	15.2	36.0
Qwen 7B	79.2 / 74.8	73.3	48.4	79.5
Qwen 14B	84.3 / 74.8	75.5	46.0	83.5
Qwen 72B	78.0 / 79.2	-	36.6	-

The authors note that the Llama 2 family is more sensitive to the presence of a default system prompt than the Qwen family is, and that absent the system prompt, ORTHO is competitive with prompt-specific jailbreaks like GCG even though ORTHO is a single, prompt-independent intervention.^[1]

Capability evaluations

A central concern with any jailbreak is whether it degrades general capabilities. Arditi et al. evaluate the orthogonalized variants on MMLU, ARC, GSM8K, and TruthfulQA using the LM Evaluation Harness in a configuration consistent with the Open LLM Leaderboard.^[1] Their headline finding is that, for MMLU, ARC, and GSM8K, orthogonalized models perform within 99% confidence intervals of their original counterparts on most variants, with two exceptions noted in the paper (Qwen 7B and Yi 34B).^[1] TruthfulQA consistently drops, however; the authors attribute this to the fact that TruthfulQA categories include "misinformation," "stereotypes," and "conspiracies," which veer close to refusal territory, paralleling an earlier finding by Yang et al. (2023) that fine-tuning away safety also tends to drop TruthfulQA scores.^[1]

A representative slice of their Table 3 results (largest model in each family):

Chat model	MMLU	ARC	GSM8K	TruthfulQA
Gemma 7B	51.8 / 51.7 (+0.1)	51.7 / 51.5 (+0.2)	31.3 / 32.0 (-0.7)	44.7 / 47.1 (-2.4)
Yi 34B	73.5 / 74.9 (-1.4)	65.6 / 64.9 (+0.7)	65.5 / 65.0 (+0.5)	51.9 / 55.4 (-3.5)
Llama 2 70B	63.1 / 63.0 (+0.1)	65.2 / 65.4 (-0.2)	54.5 / 53.0 (+1.5)	51.8 / 52.8 (-1.0)
Llama 3 70B	79.8 / 79.9 (-0.1)	71.5 / 71.8 (-0.3)	90.8 / 91.2 (-0.4)	59.5 / 61.8 (-2.3)
Qwen 72B	76.5 / 77.2 (-0.7)	67.2 / 67.6 (-0.4)	76.3 / 75.5 (+0.8)	55.0 / 56.4 (-1.4)

Reported in the format "orthogonalized / baseline (delta)" as in the paper.^[1] These numbers support the authors' qualitative claim that the orthogonalized models retain general capabilities while losing refusal.

Mechanistic analysis of adversarial suffixes

In a final section, Arditi et al. apply the refusal direction as an analytical tool to study a separate jailbreak family: optimized adversarial suffixes such as those produced by GCG (Zou et al., 2023b).^[1] They study Qwen 1.8B Chat with three input conditions: harmful instruction alone, harmful instruction with an adversarial suffix appended, and harmful instruction with a random suffix of equal length appended.

They report two observations.^[1] First, the cosine similarity between the last-token residual stream and the refusal direction is high for harmful inputs and remains high under a random suffix, but drops sharply and resembles the harmless baseline once an adversarial suffix is appended; the refusal direction is suppressed by the suffix. Second, examining the top attention heads that write to the refusal direction (identified via direct feature attribution), they find that the adversarial suffix "hijacks" the attention of these heads: attention shifts from the instruction region to the suffix region, reducing the heads' contributions to the refusal direction.^[1] The authors note that this finding contrasts with Zou et al. (2023a), who reported that a representation of "harmfulness" was not significantly altered by adversarial suffixes, suggesting that the "refusal direction" and the "harmfulness direction" may be functionally distinct.^[1]

This analysis is explicitly preliminary; it is restricted to one model and one adversarial example, and the authors mark it as a starting point rather than a comprehensive theory of how GCG-style attacks work.^[1]
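
The cosine-similarity measurement used in the first observation is straightforward to express in code. The sketch below uses synthetic vectors only, with one activation constructed to have a large component along a stand-in direction and one constructed to have none; it does not reproduce the paper's measurements, and all names and sizes are illustrative.

```python
import numpy as np

def cosine_similarity(x: np.ndarray, r: np.ndarray) -> float:
    """Cosine of the angle between an activation x and a direction r."""
    return float(x @ r / (np.linalg.norm(x) * np.linalg.norm(r)))

rng = np.random.default_rng(4)
d_model = 64                               # illustrative hidden size
r = rng.normal(size=d_model)               # stand-in for the refusal direction
r_hat = r / np.linalg.norm(r)

noise = rng.normal(size=d_model)
noise -= (noise @ r_hat) * r_hat           # keep only the part orthogonal to r_hat

aligned = 5.0 * r_hat + 0.5 * noise        # large component along the direction
suppressed = 0.5 * noise                   # component along the direction removed

cos_aligned = cosine_similarity(aligned, r)
cos_suppressed = cosine_similarity(suppressed, r)
print(cos_aligned, cos_suppressed)
```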

What is abliteration?

Community practitioners began using weight orthogonalization to produce uncensored open-weight LLMs even before the arXiv release, working from the authors' earlier blog post and notebook. Maxime Labonne published an article on Hugging Face on June 13, 2024, titled "Uncensor any LLM with abliteration," which popularized the term and walked through a reference implementation.^[3] The term abliteration combines "ablate" with "obliterate." Labonne's implementation orthogonalizes the token embedding matrix $W_E$, the attention output matrices $W_O$, and the MLP output matrices $W_{\text{out}}$ of each transformer block against a refusal direction extracted via difference-in-means, following Arditi et al.^[3] The article attributes the underlying mechanism to Arditi et al.'s blog post and credits FailSpy's notebook (itself based on the original authors' notebook) as a reference implementation.^[3]

Labonne demonstrates the technique on Daredevil-8B, a Llama 3 8B-based merge, producing Daredevil-8B-abliterated and (after subsequent DPO fine-tuning to repair some quality regressions) NeuralDaredevil-8B-abliterated.^[3] The Hugging Face Hub now hosts hundreds of community-released "abliterated" variants of open-weight chat models built on this method. FailSpy maintains an abliterator library and a collection of abliterated models that grew out of this line of work.^[3]

Abliterated models have been used both as practical demonstrations that current open-weight safety training can be removed cheaply, and as substrates for downstream research and creative use cases. From a safety-policy perspective, abliteration is widely cited as concrete evidence that "open-weight safety" cannot rely on inference-time refusal behavior in a model that is shipped with its weights, since a few lines of linear algebra and a small contrastive prompt set are sufficient to remove that behavior.^[3]

Automated abliteration: Heretic

By late 2025, abliteration had been substantially automated. Heretic, an open-source tool released in 2025 by Philipp Emanuel Weidmann, describes itself as removing "censorship (aka 'safety alignment') from transformer-based language models without expensive post-training."^[10] It combines directional ablation with a Tree-structured Parzen Estimator (TPE) parameter optimizer built on Optuna, searching for ablation parameters that co-minimize the model's refusal count and its KL divergence from the original model, so that refusal behavior is removed with the least possible collateral change to general capability.^[10] On Gemma 3 12B IT, Heretic's automatically produced variant reached 3 refusals out of 100 at a KL divergence of 0.16 from the original, compared with a KL divergence of 0.45 to 1.04 for manually tuned abliterations that achieved the same refusal count, which indicates substantially less change to the model's other behavior.^[10]^[11] Its documentation credits Arditi et al. (2024) for the underlying method, and its author reports that the community has published "well over 4000 models" produced with the tool.^[10]^[11]
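
To make the KL-divergence term concrete, the sketch below computes a mean KL divergence between the next-token distributions of an original and a modified model from their logits. It is an assumption-laden illustration: the direction of the divergence, the averaging over prompts, the random logits, and the function names are choices of this sketch and are not taken from Heretic's code.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kl(orig_logits: np.ndarray, mod_logits: np.ndarray) -> float:
    """Mean KL(P_orig || P_mod) over prompts; inputs have shape (n_prompts, vocab)."""
    log_p = log_softmax(orig_logits)
    log_q = log_softmax(mod_logits)
    kl = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl.mean())

rng = np.random.default_rng(5)
orig = rng.normal(size=(4, 10))            # 4 prompts, vocabulary of 10 (illustrative)
small_change = orig + 0.05 * rng.normal(size=orig.shape)
large_change = orig + 1.0 * rng.normal(size=orig.shape)

kl_same = mean_kl(orig, orig)
kl_small = mean_kl(orig, small_change)
kl_large = mean_kl(orig, large_change)
print(kl_same, kl_small, kl_large)
```

A lower value means the modified model's predictions stay closer to the original's, which is the sense in which a smaller KL divergence indicates less collateral change.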

Is refusal really a single direction?

Although the Arditi et al. result has been widely replicated, subsequent papers have probed and refined the "single direction" claim.

Concept cones and multiple directions

Wollschläger, Elstner, Geisler, Cohen-Addad, Günnemann, and Gasteiger published "The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence" on arXiv in February 2025, and the paper was accepted as a poster at ICML 2025.^[5]^[6] They propose a gradient-based approach to representation engineering and uncover "multiple independent directions and even multi-dimensional concept cones that mediate refusal," arguing that orthogonality alone does not imply independence under causal intervention and that refusal in LLMs involves multiple mechanistically independent directions rather than a single one.^[5]^[6]

A separate paper by Joad, Hawasly, Boughorbel, Durrani, and Sencar, "There Is More to Refusal in Large Language Models than a Single Direction," posted to arXiv in February 2026, breaks refusal into eleven categories (covering safety, incomplete or unsupported requests, anthropomorphization, and over-refusal) and shows that these categories correspond to geometrically distinct directions in activation space. They report, however, a partial reconciliation: linear steering along any of the refusal-related directions they identify produces "nearly identical refusal to over-refusal trade-offs," acting as "a shared one-dimensional control knob" along the over-refusal axis even though the underlying mechanisms are distinct.^[7] In other words, "how" models refuse can vary across multiple directions, while the binary "whether to refuse" decision can be modulated by any one of them.

A third line of work, "SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models" by Piras, Mura, Brau, Oneto, Roli, and Biggio, accepted at AAAI 2026, uses Self-Organizing Maps to extract multiple refusal directions and reports stronger refusal suppression than single-direction baselines, framing the difference-in-means technique of Arditi et al. as a special case of a richer multi-direction structure.^[8]

Taken together, these follow-ups do not refute Arditi et al.'s existence claim (a single direction whose ablation suppresses refusal), but they qualify the picture: the refusal mechanism is best modeled as a low-dimensional manifold or concept cone rather than a strictly one-dimensional subspace, with multiple causally distinct directions contributing to the same behavioral outcome.
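
The step from one direction to several can be illustrated by generalizing the rank-one projection to a subspace. The sketch below is generic linear algebra, not the gradient-based or SOM-based procedure of any cited paper; it treats a cone, for illustration only, as the set of nonnegative combinations of a few basis directions, and `ablate_subspace` is a hypothetical helper.

```python
import numpy as np

def ablate_subspace(x: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """Remove span(directions) from each row of x; directions has shape (k, d_model)."""
    Q, _ = np.linalg.qr(directions.T)  # columns of Q: orthonormal basis of the span
    return x - (x @ Q) @ Q.T

rng = np.random.default_rng(7)
d_model, k = 16, 3                     # illustrative sizes
directions = rng.normal(size=(k, d_model))
x = rng.normal(size=(5, d_model))

x_abl = ablate_subspace(x, directions)

# A point inside the illustrative cone: a nonnegative mix of the directions.
weights = np.abs(rng.normal(size=k))
cone_point = weights @ directions
cone_residual = ablate_subspace(cone_point[None, :], directions)

print(float(np.abs(x_abl @ directions.T).max()))
print(float(np.abs(cone_residual).max()))
```

With `k = 1` this reduces to the single-direction ablation of Arditi et al.; with larger `k` it removes every vector in the cone at once, at the cost of deleting more of the representation.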

Does the refusal direction transfer across languages?

Wang, Wang, Liu, Schütze, and Plank studied whether the refusal direction is language-specific in "Refusal Direction is Universal Across Safety-Aligned Languages," posted to arXiv in May 2025 and accepted as a poster at NeurIPS 2025.^[9] Building a multilingual safety dataset called PolyRefuse by translating malicious and benign English prompts into 14 languages, they report that "a vector extracted from English can bypass refusals in other languages with near-perfect effectiveness, without any additional fine-tuning," and that refusal directions derived from any safety-aligned language transfer to the others.^[9] They attribute this transferability to the near-parallel geometry of refusal vectors across languages in embedding space, and identify it as an underlying mechanism behind cross-lingual jailbreaks, in which a harmful request phrased in a lower-resource language can slip past safety training.^[9] The result extends Arditi et al.'s single-direction picture from the monolingual case to the multilingual setting: the same low-dimensional refusal mechanism appears to be shared across the languages a model is aligned in.^[9]

Other related findings

Within the Arditi et al. paper itself, the authors discuss prior work that distinguishes a "harmfulness" direction from a "refusal" direction. Zheng et al. (2024) and Zou et al. (2023a) both use contrastive harmful/harmless pairs to identify what they call a representation of harmfulness, but argue that this direction is distinct from refusal: Zheng et al. show that safety prompts shift activations in a different direction, and Zou et al. report that the harmfulness representation is not significantly altered by adversarial suffixes (in contrast to Arditi et al.'s finding that the refusal direction is suppressed by such suffixes).^[1] The implication is that "perceive input as harmful" and "decide to refuse" may be separate computations linked by the model's safety policy, and that adversarial suffixes interfere with the latter without necessarily disturbing the former.^[1]

Wei et al. (2024) study a related question by removing sets of safety-critical neurons and ranks rather than continuous directions, and report that doing so degrades safety while preserving utility. Panickssery et al. (2023), the immediate predecessor to the Arditi work, found that contrastive activation addition with multiple-choice contrastive pairs steered refusal effectively in multiple-choice formats but less so in long-form generation.^[1] Arditi et al.'s use of an open-text contrastive design and their selection over (layer, token) candidate pairs is what permits the method to generalize to long-form generation.

Relation to representation engineering and activation steering

The refusal direction is a particular instance of a more general toolkit shared with representation engineering and activation steering. All three approaches take the following form:^[1]^[4]

Construct contrastive stimuli that differ in a behavioral or conceptual variable (refusal vs. compliance; honesty vs. dishonesty; positive vs. negative sentiment; harmful vs. harmless instruction).
Compute representations (mean activations, probes, or PCA components) from internal layers.
Extract one or more directions that separate the two conditions.
Intervene on those directions at inference time to steer model behavior, or use them to read off model internal state.
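
The four steps above can be sketched end to end for the "reading" use case, in which a direction is used to read off internal state rather than to steer. Everything here is synthetic: the two conditions, the planted axis, the midpoint threshold, and the sizes are assumptions chosen to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(6)
d_model = 32                                   # illustrative hidden size
axis = rng.normal(size=d_model)
axis /= np.linalg.norm(axis)

# Steps 1-2: synthetic "activations" for two contrastive conditions.
cond_a = rng.normal(size=(200, d_model)) + 3.0 * axis
cond_b = rng.normal(size=(200, d_model)) - 3.0 * axis

# Step 3: extract a separating direction by difference of means.
direction = cond_a.mean(axis=0) - cond_b.mean(axis=0)
direction /= np.linalg.norm(direction)
midpoint = 0.5 * (cond_a.mean(axis=0) + cond_b.mean(axis=0))
threshold = float(midpoint @ direction)

# Step 4 (reading): classify held-out activations by their projection.
test_a = rng.normal(size=(100, d_model)) + 3.0 * axis
test_b = rng.normal(size=(100, d_model)) - 3.0 * axis
acc_a = float((test_a @ direction > threshold).mean())
acc_b = float((test_b @ direction <= threshold).mean())
accuracy = 0.5 * (acc_a + acc_b)
print(accuracy)
```

Steering uses the same extracted direction but writes to the activations (adding or projecting out the direction) instead of thresholding the projection.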

Zou et al.'s representation engineering proposes this as a "top-down" approach to AI transparency that focuses on population-level representations rather than individual neurons or circuits, and identifies linear directions for high-level concepts including honesty, harmlessness, and power-seeking.^[4] Panickssery et al.'s Contrastive Activation Addition formalizes the steering side of the picture for chat models.^[1] The refusal direction work is closer to "representation engineering for a specific behavior" than to canonical mechanistic interpretability (which tends to seek mechanistic circuits and features at the level of attention heads, MLP neurons, or sparse autoencoder features), but it borrows from both traditions and uses tools from each. The combination of difference-in-means extraction (representation engineering) with direct feature attribution and attention-head analysis of adversarial suffixes (mechanistic interpretability) is characteristic of the broader research program at the intersection of these areas.^[1]

The work also connects to concept erasure literature. Belrose (2023) and Marks and Tegmark (2023) discuss difference-in-means as a worst-case optimal concept-editing primitive. LEACE (Belrose et al., 2024) provides a closed-form method for linear concept erasure that, like weight orthogonalization, removes a direction from a representation in a provable sense.^[1] Arditi et al.'s weight orthogonalization can be viewed as applying a one-dimensional version of these concept-erasure ideas directly to the parameters of a transformer.

What are the safety implications?

The authors devote substantial space to the safety and policy implications of the result. Their core argument is that the simplicity of the refusal mechanism, and the ease with which it can be removed in the white-box setting, undercut an important safety story about open-weight model releases.^[1] If safety training really did imprint a deep, distributed safety circuit into a model, then removing that circuit would require sophisticated techniques and meaningful compute. The refusal-direction result implies the opposite: the refusal behavior of the tested open-weight chat models can be removed with a rank-one edit to each matrix that writes to the residual stream, using a few lines of code and less than $5 of compute, even for 70B-parameter models.^[1]

The paper is careful to qualify this argument. The authors note that fine-tuning was already known to undo safety guardrails, sometimes with only benign data (Lermen et al., 2023; Qi et al., 2023; Pelrine et al., 2023), so the refusal-direction method is not a categorical change in what is possible, only a simpler and cheaper instance.^[1] They also acknowledge that their findings concern open-source chat models specifically and may not generalize to current frontier proprietary systems, especially those that combine RLHF with other safety techniques such as constitutional AI or output classifiers.^[1] More broadly, they argue that their work contributes to "the growing body of literature that highlights the fragility of current safety mechanisms" rather than introducing a fundamentally new attack surface.^[1]

A separate implication is research-positive. The refusal direction is one of the clearest case studies to date of mechanistic understanding translating into a usable, behavior-controlling intervention. It demonstrates that interpretable internal structure is not merely descriptive but actionable: by reading off a single one-dimensional subspace from contrastive activations, a practitioner can control refusal behavior without retraining or input modification.^[1] This is frequently cited as evidence for the practical value of interpretability research within AI safety more broadly.

Limitations

Arditi et al. enumerate several limitations of their study, some of which have been picked up in subsequent work.^[1]

The method to extract the refusal direction is heuristic and likely not optimal; the paper is positioned as an existence proof rather than a study of optimal extraction. Follow-up work using gradient-based approaches, sparse autoencoders, or Self-Organizing Maps has reported improvements.^[5]^[8]
Findings are restricted to the 13 open-source chat models tested; generalization to larger, proprietary, or future models is not established. The authors are explicit that "our findings may not generalize to untested models, especially those at greater scale."^[1]
The semantic content of the direction is not fully characterized: it may represent "refusal," "harm," "danger," or some combination of these, or it may resist clean semantic interpretation altogether.^[1]
Coherence of orthogonalized models is non-trivial to measure; the authors use multiple capability benchmarks but note that each has weaknesses.^[1]
The mechanistic analysis of adversarial suffixes is preliminary, restricted to one model and one suffix, and does not constitute a full theory of how GCG-style attacks operate.^[1]

These limitations also motivate the follow-up work on multi-directional and cone-shaped refusal mechanisms summarized above.

References

1. Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., & Nanda, N. (2024). "Refusal in Language Models Is Mediated by a Single Direction." arXiv preprint arXiv:2406.11717 (v1, 17 June 2024; v3, 30 October 2024). NeurIPS 2024 poster. https://arxiv.org/abs/2406.11717 . Accessed 2026-05-19.
2. Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., & Nanda, N. (2024). "Refusal in Language Models Is Mediated by a Single Direction." NeurIPS 2024 Poster page. https://neurips.cc/virtual/2024/poster/93566 . Accessed 2026-05-19.
3. Labonne, M. (2024). "Uncensor any LLM with abliteration." Hugging Face Blog, 13 June 2024. https://huggingface.co/blog/mlabonne/abliteration . Accessed 2026-05-19.
4. Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., Goel, S., Li, N., Byun, M. J., Wang, Z., Mallen, A., Basart, S., Koyejo, S., Song, D., Fredrikson, M., Kolter, J. Z., & Hendrycks, D. (2023). "Representation Engineering: A Top-Down Approach to AI Transparency." arXiv preprint arXiv:2310.01405. https://arxiv.org/abs/2310.01405 . Accessed 2026-05-19.
5. Wollschläger, T., Elstner, J., Geisler, S., Cohen-Addad, V., Günnemann, S., & Gasteiger, J. (2025). "The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence." arXiv preprint arXiv:2502.17420. https://arxiv.org/abs/2502.17420 . Accessed 2026-05-19.
6. "The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence." ICML 2025 Poster. https://icml.cc/virtual/2025/poster/46298 . Accessed 2026-05-19.
7. Joad, F., Hawasly, M., Boughorbel, S., Durrani, N., & Sencar, H. T. (2026). "There Is More to Refusal in Large Language Models than a Single Direction." arXiv preprint arXiv:2602.02132. https://arxiv.org/abs/2602.02132 . Accessed 2026-05-19.
8. Piras, G., Mura, R., Brau, F., Oneto, L., Roli, F., & Biggio, B. (2025). "SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models." arXiv preprint arXiv:2511.08379. AAAI 2026. https://arxiv.org/abs/2511.08379 . Accessed 2026-05-19.
9. Wang, X., Wang, M., Liu, Y., Schütze, H., & Plank, B. (2025). "Refusal Direction is Universal Across Safety-Aligned Languages." arXiv preprint arXiv:2505.17306. NeurIPS 2025 poster. https://arxiv.org/abs/2505.17306 . Accessed 2026-07-14.
10. Weidmann, P. E. (2025). "Heretic: Fully automatic censorship removal for language models." GitHub repository. https://github.com/p-e-w/heretic . Accessed 2026-07-14.
11. "Heretic, a tool that makes it easy to create jailbroken versions of LLMs that are censored." GIGAZINE, 17 November 2025. https://gigazine.net/gsc_news/en/20251117-heretic/ . Accessed 2026-07-14.
