Circuit Breakers (Representation Rerouting)

Updated Jun 8, 2026

Overview

Circuit Breakers are an AI safety method, introduced in 2024, that aims to make a large language model (LLM) or multimodal model robust to harmful generations by intervening directly on the model's internal representations rather than on its inputs or outputs. The core technique is called Representation Rerouting (RR): it trains the model so that whenever its internal activations begin to follow a trajectory associated with producing harmful content, those activations are "rerouted" toward an orthogonal, incoherent state from which the harmful generation cannot continue. The model effectively "short-circuits," breaking off mid-completion, while behavior on benign inputs is preserved by a separate retain objective ^[1].

The method was presented in the paper "Improving Alignment and Robustness with Circuit Breakers" by Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, and Dan Hendrycks, affiliated with Gray Swan AI, Carnegie Mellon University, and the Center for AI Safety. It was first posted to arXiv on June 6, 2024, and accepted to the NeurIPS 2024 conference ^[1]^[2]. Code and models were released under the GraySwanAI organization on GitHub ^[2].

The technique builds on the representation engineering (RepE) research line and is closely related to Representation Misdirection for Unlearning (RMU), a representation-control unlearning method. Its defining contribution is reframing jailbreak defense: instead of trying to detect or refuse every possible adversarial input, circuit breakers target the comparatively small, well-defined set of harmful output processes, which the authors argue generalizes far better to unseen attacks ^[1].

Motivation: limits of refusal and adversarial training

Aligned LLMs are typically made to decline harmful requests through refusal training, for example reinforcement learning from human feedback (RLHF) or direct preference optimization (DPO). The Circuit Breakers paper observes that such training is routinely bypassed by jailbreak and adversarial attack methods, including automated optimizers like Greedy Coordinate Gradient (GCG) and Prompt Automatic Iterative Refinement (PAIR), because refusal behavior is a thin layer over a model that still internally "knows how" to comply ^[1].

The paper frames a contrast among the prevailing defense families and their weaknesses ^[1]:

Refusal training (RLHF, DPO) supervises outputs but does not remove the underlying harmful capability, and frequently fails against state-of-the-art adversarial attacks.
Adversarial training (for example the R2D2 Mistral model fine-tuned against GCG) plugs specific holes by training against known attacks. It generalizes poorly to new attacks, can be computationally expensive, and tends to degrade utility. The authors note R2D2 drops more than 8 percent on MT-Bench.
Inference-time and input/output filters (perplexity filters, SmoothLLM, erase-and-check, guard models) are effective mainly against non-adaptive attacks, add latency or compute, and can be circumvented by sophisticated adversaries.

The conventional view treats the robustness versus utility trade-off as unavoidable. Circuit breakers challenge this by moving the intervention from the text domain (where an attacker has an enormous, open-ended input space to search) into the representation domain. The key insight is that generation is a multi-step process, so an attacker must steer every step toward the harmful target; by disrupting the internal process that produces harmful content, the defense generalizes across the diverse inputs that could trigger it. The defender then only needs to cover a well-defined set of harmful outputs rather than enumerate all malicious inputs ^[1].

How circuit breakers work (Representation Rerouting)

Representation Rerouting is implemented with Low-Rank Representation Adaptation (LoRRA), a RepE technique that attaches LoRA adapters to selected transformer layers and trains only those adapters while the base weights stay frozen ^[1]. Training uses two datasets, each with its own loss, and the two losses are optimized jointly ^[1]:

Circuit Breaker Set: examples whose internal representations lead toward harmful outputs. These are the trajectories to be rerouted.
Retain Set: benign examples (the paper uses UltraChat instructional conversations plus the XSTest over-refusal set, and adds extra refusal data for Llama-3) whose representations should be left unchanged, so capabilities and normal refusals are preserved.
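
The adapter idea behind LoRRA can be illustrated without a deep learning framework. The sketch below is a minimal illustration of a standard LoRA update, not the authors' released code: the class name `LoRALinear`, the `rank` and `alpha` values, and the toy dimensions are all assumptions made for this example. The base weight matrix stays frozen, and only the low-rank factors A and B would be updated during training.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Toy linear layer y = (W + (alpha / rank) * B @ A) x with W frozen."""

    def __init__(self, weight, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen base weights, never updated
        # Standard LoRA initialization: A is small random, B is zero,
        # so the adapter starts out as a no-op.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def trainable_parameters(self):
        """Only the adapter factors are exposed for training."""
        return {"A": self.A, "B": self.B}

    def forward(self, x):
        delta = matmul(self.B, self.A)  # low-rank update, shape d_out x d_in
        return [
            sum((w + self.scale * d) * xi for w, d, xi in zip(w_row, d_row, x))
            for w_row, d_row in zip(self.weight, delta)
        ]


layer = LoRALinear(weight=[[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
print(layer.forward([1.0, 1.0, 1.0]))  # equals the frozen layer's output at initialization
```

Because B starts at zero, the adapted layer initially reproduces the frozen model exactly, so any change in the representations comes only from what the rerouting and retain objectives teach the adapters.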

Let rep_orig denote a harmful-process representation under the original frozen model and rep_cb the representation under the model with circuit breakers. The rerouting loss pushes rep_cb away from the harmful direction. The paper explores routing toward a fixed random vector (as in RMU), but finds the most effective and intuitive form is to make the circuit-broken representation orthogonal to the original harmful representation by minimizing their cosine similarity, with a ReLU applied so the objective is not optimized past zero ^[1]:

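L_s = ReLU( cosine_sim( rep_orig(x_s) , rep_cb(x_s) ) )   (Rerouting / RR loss)
L_r = || rep_orig(x_r) - rep_cb(x_r) ||_2                 (Retain loss)

Here x_s is an example from the Circuit Breaker Set and x_r an example from the Retain Set.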

The retain loss is the L2 distance between the original and circuit-broken representations on the retain set, anchoring benign behavior. A coefficient schedule shifts weight from the rerouting term toward the retain term over training, balancing robustness against capability preservation ^[1].
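
The two losses and their weighting can be sketched in a few lines of plain Python. This is a minimal illustration on toy vectors rather than the released training code: the function names are introduced here, and the linear schedule in `coefficients` (rerouting weight alpha * (1 - t / 2T), retain weight alpha * t / 2T) is an assumption about the exact form of the schedule.

```python
import math


def cosine_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rerouting_loss(rep_orig, rep_cb):
    """L_s: ReLU of the cosine similarity, zero once the vectors are orthogonal or opposed."""
    return max(0.0, cosine_sim(rep_orig, rep_cb))


def retain_loss(rep_orig, rep_cb):
    """L_r: L2 distance between original and circuit-broken representations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(rep_orig, rep_cb)))


def coefficients(step, total_steps, alpha=1.0):
    """Assumed linear schedule: weight moves from rerouting to retain over training."""
    progress = step / (2 * total_steps)
    return alpha * (1.0 - progress), alpha * progress


def total_loss(step, total_steps, harmful_pair, retain_pair, alpha=1.0):
    """Weighted sum of the rerouting loss on x_s and the retain loss on x_r."""
    c_s, c_r = coefficients(step, total_steps, alpha)
    return c_s * rerouting_loss(*harmful_pair) + c_r * retain_loss(*retain_pair)


print(rerouting_loss([1.0, 0.0], [1.0, 0.0]))   # 1.0: not yet rerouted
print(rerouting_loss([1.0, 0.0], [0.0, 1.0]))   # 0.0: orthogonal, fully rerouted
print(rerouting_loss([1.0, 0.0], [-1.0, 0.0]))  # 0.0: ReLU stops at zero
print(coefficients(0, 100), coefficients(100, 100))
```

The ReLU matters for the third case: without it, the optimizer would keep pushing the representation toward the exact opposite of the harmful direction instead of stopping once it is orthogonal.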

In the released setup, the rerouting loss is applied at intermediate layers (for example, layers 10 and 20 of Llama-3-8B-Instruct, with LoRA adapters covering the corresponding range of layers). When the technique succeeds, the cosine similarity between the original and rerouted activations on harmful inputs drops sharply at those layers, while activations on benign inputs are largely unchanged. Because the rerouted state is incoherent, the model abruptly stops or emits garbled text when an attack does manage to elicit the start of a harmful response, rather than completing it ^[1].
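
One way to check for this behavior is to compare, layer by layer, the cosine similarity between the frozen and circuit-broken activations for a harmful prompt and for a benign one. The sketch below uses made-up two-dimensional activations and a hypothetical helper, `layerwise_similarity`. In practice the vectors would be hidden states read from the two models at the chosen layers.

```python
import math


def cosine_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def layerwise_similarity(acts_orig, acts_cb):
    """Map layer index -> cosine similarity between original and circuit-broken activations."""
    return {layer: cosine_sim(acts_orig[layer], acts_cb[layer]) for layer in acts_orig}


# Toy activations keyed by layer index (illustrative values, not measurements).
harmful_orig = {10: [1.0, 0.0], 20: [0.0, 2.0]}
harmful_cb = {10: [0.0, 1.0], 20: [3.0, 0.0]}  # rerouted: orthogonal to the originals
benign_orig = {10: [3.0, 4.0], 20: [0.0, 2.0]}
benign_cb = {10: [3.0, 4.0], 20: [0.0, 2.0]}   # retained: unchanged

harmful_report = layerwise_similarity(harmful_orig, harmful_cb)
benign_report = layerwise_similarity(benign_orig, benign_cb)
print("harmful:", harmful_report)
print("benign: ", benign_report)
```

A successful run would show similarities near zero on the harmful prompt and near one on the benign prompt at the targeted layers.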

Results

Evaluated with HarmBench attacks and an LLM judge, RR produced large reductions in attack success rate (ASR, the percentage of harmful requests complied with) across a broad set of unseen attacks, while leaving capabilities on MT-Bench and the Open LLM Leaderboard essentially intact. The authors report average compliance reductions of roughly 87 percent for Mistral-7B and 90 percent for Llama-3-8B, with a capability dip under 1 percent ^[1]. Selected per-attack figures from the paper's Table 1 are shown below ^[1].

Metric (lower ASR is better)	Mistral-7B refusal-trained	Mistral-7B + RR	Llama-3-8B refusal-trained	Llama-3-8B + RR
MT-Bench (capability, higher better)	7.60	7.53	8.05	8.00
Open LLM (capability, higher better)	65.4	65.4	68.8	68.3
No attack	57.8	4.9	12.4	1.2
Manual jailbreaks	77.4	6.8	8.3	0.0
AutoDAN	93.4	0.0	3.7	0.0
TAP-Transfer	85.8	17.5	17.4	2.1
PAIR	69.5	23.3	18.7	7.5
GCG	88.7	11.2	44.5	2.5
Prefilling attack	95.0	4.9	84.9	3.3
Input embedding attack	92.1	15.7	80.4	9.6
RepE attack	73.7	6.2	91.2	8.7
Average ASR	76.7	9.8	38.1	3.8
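
The headline percentages can be recovered from the table. As a quick worked check, the relative reduction is (before - after) / before. The helper name `relative_reduction` is introduced here for illustration, and the numbers are copied from the Average ASR and MT-Bench rows.

```python
def relative_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return 100.0 * (before - after) / before


# (refusal-trained, with RR) pairs taken from the table above.
avg_asr = {"Mistral-7B": (76.7, 9.8), "Llama-3-8B": (38.1, 3.8)}
mt_bench = {"Mistral-7B": (7.60, 7.53), "Llama-3-8B": (8.05, 8.00)}

for model, (base, rr) in avg_asr.items():
    print(f"{model}: average ASR down {relative_reduction(base, rr):.1f}%")
for model, (base, rr) in mt_bench.items():
    print(f"{model}: MT-Bench down {relative_reduction(base, rr):.2f}%")
```

The average ASR reductions come out at about 87 percent for Mistral-7B and 90 percent for Llama-3-8B, and both MT-Bench drops stay below 1 percent, matching the figures quoted from the paper.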

The attack suite spanned gradient-based optimization (GCG), LLM-driven optimizers (PAIR), transfer and custom jailbreak pipelines (TAP-Transfer, AutoDAN, human-written jailbreaks), multilingual attacks, and three powerful white-box or system-level attacks introduced for stress testing: a prefilling attack that seeds the assistant's reply with the start of a target completion, an input embedding attack that optimizes soft input embeddings directly, and a RepE attack that manipulates refusal directions in the representation space ^[1]. The paper also reports Cygnet, a Llama-3-8B fine-tune combining circuit breakers with additional representation control, which drove average ASR to about 0.8 percent (roughly two orders of magnitude lower than the base model under strong attacks) while slightly exceeding the original model's capabilities ^[1].

Multimodal models. Applied to LLaVA-NeXT-Mistral-7B, RR reduced compliance under a white-box Projected Gradient Descent (PGD) image attack (epsilon 32/255, 1000 steps) by about 84 percent versus the original model and 85 percent versus a safety-prompt baseline, on a set of 133 harmful multimodal behaviors drawn from HarmBench and MM-SafetyBench. Capability on MMMU and LLaVA-Wild stayed within about 0.5 percent of the original. The authors emphasize that although adversarial robustness for standalone image classifiers remains unsolved, circuit breakers let the larger multimodal system reliably resist image "hijacks" intended to elicit harmful content ^[1].

AI agents. On a dataset of 100 requests designed to induce harmful function calls (spanning cybercrime, disinformation, fraud, and harassment) with realistic tool definitions, RR on Llama-3-8B cut harmful action compliance by about 84 percent in the standard setting and 83 percent under forced function-calling (an analogue of the prefilling attack), while retaining performance on the Berkeley Function Calling Leaderboard (BFCL) ^[1].

Relationship to other defenses

Circuit breakers occupy a distinct position relative to existing safeguards ^[1]:

Refusal training (RLHF, DPO): RR is applied on top of an already refusal-trained model. Rather than teaching the model to say no, it makes the harmful generation process itself unreachable, so the model breaks off even when an attack bypasses the refusal.
Adversarial training: Both aim at robustness, but adversarial training optimizes against specific known attacks and generalizes poorly, whereas RR targets harmful output processes and is attack-agnostic, with much smaller capability cost. The authors position RR as escaping the attacker-defender "cat and mouse" cycle.
Latent-space and activation steering defenses: RR is part of the representation engineering family. It is connected to work on the refusal direction and to broader mechanistic interpretability, but instead of reading or adding a single direction at inference time, it trains adapters that dynamically reroute harmful trajectories.
Unlearning: The rerouting loss generalizes the RMU unlearning method, which maps targeted representations to a fixed scaled random vector. RR's orthogonalizing cosine loss is the better-performing variant in the paper's ablations; the same machinery can be aimed at narrower targets such as private or copyrighted content.
Input/output filtering and guard models: Filters operate on text and add a separate classification stage that adaptive attackers can probe; circuit breakers operate inside the model on representations, adding negligible inference cost.
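
The contrast drawn in the unlearning comparison above can be made concrete with toy vectors. In this minimal sketch, the function names, the `scale` value, and the way the random target is sampled (uniform entries, then normalized) are assumptions for illustration, not details taken from either method's released code. An RMU-style objective pulls the representation toward one fixed target, while the RR cosine objective is satisfied by any representation orthogonal to the original harmful one.

```python
import math
import random


def random_unit_vector(dim, seed=0):
    """Fixed random direction: uniform entries in [0, 1), then normalized (assumed choice)."""
    rng = random.Random(seed)
    v = [rng.random() for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def rmu_style_loss(rep_cb, target_direction, scale):
    """Squared distance to a fixed, scaled random target."""
    return sum((a - scale * t) ** 2 for a, t in zip(rep_cb, target_direction))


def rr_cosine_loss(rep_orig, rep_cb):
    """ReLU of cosine similarity to the original harmful representation."""
    dot = sum(a * b for a, b in zip(rep_orig, rep_cb))
    norms = math.sqrt(sum(a * a for a in rep_orig)) * math.sqrt(sum(b * b for b in rep_cb))
    return max(0.0, dot / norms)


rep_orig = [1.0, 0.0, 0.0]
u = random_unit_vector(3)
scale = 5.0

# Several different rerouted states all satisfy the RR objective...
candidates = [[0.0, 1.0, 0.0], [0.0, 0.0, 5.0], [-2.0, 0.0, 0.0]]
print([rr_cosine_loss(rep_orig, c) for c in candidates])

# ...but only the single point scale * u minimizes the RMU-style objective.
print([rmu_style_loss(c, u, scale) > 0.0 for c in candidates])
print(rmu_style_loss([scale * t for t in u], u, scale))
```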

Limitations

The authors are explicit that circuit breakers target one specific adversarial goal: preventing a generative model from producing generically harmful content (output a developer never wants emitted). The method does not defend against "traditional" adversarial attacks with other aims, for example perturbing an image to flip a class label when a vision-language model is used as a drop-in classifier, since no class label is inherently harmful. The reported experiments focus on single-turn conversations, leaving multi-turn settings less explored. Robustness on a category of harm also depends on how closely that category is represented in the circuit breaker training set: ablations show weaker generalization to harm categories far from the training distribution ^[1].

Broader caveats consistent with the representation-control literature apply. RR mitigates rather than eliminates attacks (residual ASR is low but nonzero, and stronger or adaptive future attacks may erode robustness), and standalone adversarial robustness for the underlying perception components remains an open problem that circuit breakers do not solve directly ^[1]. As a practical safety layer, circuit breakers have since been studied as a deployable defense that complements rather than replaces refusal training and external guardrails.

References

[1] Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, Dan Hendrycks. "Improving Alignment and Robustness with Circuit Breakers." arXiv:2406.04313, June 6, 2024 (v4 July 12, 2024); NeurIPS 2024. https://arxiv.org/abs/2406.04313
[2] GraySwanAI. "circuit-breakers" code and models repository, GitHub. https://github.com/GraySwanAI/circuit-breakers
[3] Gray Swan AI. "Improving Alignment and Robustness with Circuit Breakers" (research page). https://www.grayswan.ai/research/circuit-breakers
