Circuit Breakers (Representation Rerouting)
Last reviewed
Jun 8, 2026
Sources
3 citations
Review status
Source-backed
Revision
v1 · 1,889 words
Circuit Breakers are an AI safety method, introduced in 2024, that aims to make a large language model (LLM) or multimodal model robust to harmful generations by intervening directly on the model's internal representations rather than on its inputs or outputs. The core technique is called Representation Rerouting (RR): it trains the model so that whenever its internal activations begin to follow a trajectory associated with producing harmful content, those activations are "rerouted" toward an orthogonal, incoherent state from which the harmful generation cannot continue. The model effectively "short-circuits," breaking off mid-completion, while behavior on benign inputs is preserved by a separate retain objective [1].
The method was presented in the paper "Improving Alignment and Robustness with Circuit Breakers" by Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, and Dan Hendrycks, affiliated with Gray Swan AI, Carnegie Mellon University, and the Center for AI Safety. It was first posted to arXiv on June 6, 2024, and accepted to the NeurIPS 2024 conference [1][2]. Code and models were released under the GraySwanAI organization on GitHub [2].
The technique builds on the representation engineering (RepE) research line and is closely related to a representation-control unlearning method called RMU. Its defining contribution is reframing jailbreak defense: instead of trying to detect or refuse every possible adversarial input, circuit breakers target the comparatively small, well-defined set of harmful output processes, which the authors argue generalizes far better to unseen attacks [1].
Aligned LLMs are typically made to decline harmful requests through refusal training, for example reinforcement learning from human feedback (RLHF) or direct preference optimization (DPO). The Circuit Breakers paper observes that such training is routinely bypassed by jailbreak and adversarial attack methods, including automated optimizers like Greedy Coordinate Gradient (GCG) and Prompt Automatic Iterative Refinement (PAIR), because refusal behavior is a thin layer over a model that still internally "knows how" to comply [1].
The paper frames a contrast among the prevailing defense families and their weaknesses [1]:
The conventional view treats the robustness versus utility trade-off as unavoidable. Circuit breakers challenge this by moving the intervention from the text domain (where an attacker has an enormous, open-ended input space to search) into the representation domain. The key insight is that generation is a multi-step process, so an attacker must steer every step toward the harmful target; by disrupting the internal process that produces harmful content, the defense generalizes across the diverse inputs that could trigger it. The defender then only needs to cover a well-defined set of harmful outputs rather than enumerate all malicious inputs [1].
Representation Rerouting is implemented with Low-Rank Representation Adaptation (LoRRA), a RepE technique that attaches LoRA adapters to selected transformer layers and trains only those adapters while the base weights stay frozen [1]. Training uses two paired datasets, a circuit breaker set of examples that elicit harmful generations and a retain set of benign examples, each with its own loss; the two losses are optimized jointly [1].
Let rep_orig denote a harmful-process representation under the original frozen model and rep_cb the representation under the model with circuit breakers. The rerouting loss pushes rep_cb away from the harmful direction. The paper explores routing toward a fixed random vector (as in RMU), but finds the most effective and intuitive form is to make the circuit-broken representation orthogonal to the original harmful representation by minimizing their cosine similarity, with a ReLU applied so the objective is not optimized past zero [1]:
L_s = ReLU( cosine_sim( rep_orig(x_s) , rep_cb(x_s) ) ) (Rerouting / RR loss, on circuit breaker inputs x_s)
L_r = || rep_orig(x_r) - rep_cb(x_r) ||_2 (Retain loss, on retain inputs x_r)
The retain loss is the L2 distance between the original and circuit-broken representations on the retain set, anchoring benign behavior. A coefficient schedule shifts weight from the rerouting term toward the retain term over training, balancing robustness against capability preservation [1].
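The two losses and their weighting can be sketched in a few lines of plain Python. This is only an illustration on toy vectors, not the released training code: the function names, the `alpha` scale, and the linear form of the schedule are assumptions made here, since the text above says only that weight shifts from the rerouting term toward the retain term over training.

```python
import math


def cosine_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rerouting_loss(rep_orig, rep_cb):
    """L_s: ReLU of the cosine similarity, so the loss is zero once the
    circuit-broken representation is orthogonal (or opposed) to the original."""
    return max(0.0, cosine_sim(rep_orig, rep_cb))


def retain_loss(rep_orig, rep_cb):
    """L_r: L2 distance between original and circuit-broken representations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(rep_orig, rep_cb)))


def coefficients(step, total_steps, alpha=1.0):
    """ASSUMPTION: a linear schedule that moves weight from the rerouting
    term (c_s) to the retain term (c_r); the exact schedule is not given here."""
    progress = step / total_steps
    return alpha * (1.0 - progress), alpha * progress


def total_loss(step, total_steps, harm_orig, harm_cb, keep_orig, keep_cb):
    """Weighted sum of the two losses at a given training step."""
    c_s, c_r = coefficients(step, total_steps)
    return (c_s * rerouting_loss(harm_orig, harm_cb)
            + c_r * retain_loss(keep_orig, keep_cb))
```

With these definitions, a circuit-broken representation that is already orthogonal to the original contributes no rerouting loss, which is the effect of the ReLU: the optimizer has no incentive to push the similarity below zero.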
In the released setup, the rerouting loss is applied to intermediate layers (for example layers 10 and 20 of Llama-3-8B-Instruct, with LoRA adapters covering the targeted range of layers). When the technique succeeds, the cosine similarity between the original and rerouted activations on harmful inputs drops sharply at those layers, while activations on benign inputs are largely unchanged. Because the rerouted state is incoherent, an attack that does manage to elicit the start of a harmful response causes the model to stop abruptly or emit garbled text rather than complete the response [1].
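For readers unfamiliar with LoRA, the idea of training only a small adapter on top of a frozen weight matrix can be illustrated with a toy linear layer. The sketch below is a generic LoRA illustration in plain Python, not the LoRRA code from the paper; the class name `LoRALinear`, the `alpha / rank` scaling, and the deterministic stand-in for the usual random initialization of `A` are assumptions made for the example.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0):
        self.weight = weight  # frozen base weights, shape (out_dim, in_dim)
        out_dim, in_dim = len(weight), len(weight[0])
        self.scale = alpha / rank
        # A starts small and nonzero (a deterministic stand-in for random init);
        # B starts at zero, so the adapter is initially a no-op.
        self.A = [[0.01 * ((i + j) % 3 - 1) for j in range(in_dim)]
                  for i in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def forward(self, x):
        base = matvec(self.weight, x)               # frozen path
        delta = matvec(self.B, matvec(self.A, x))   # low-rank adapter path
        return [b + self.scale * d for b, d in zip(base, delta)]


layer = LoRALinear([[1.0, 2.0], [3.0, 4.0]], rank=1)
before = layer.forward([1.0, 1.0])  # identical to the frozen layer's output

# Simulate a training update that touches only the adapter matrices.
layer.A = [[1.0, 0.0]]
layer.B = [[1.0], [0.0]]
after = layer.forward([1.0, 1.0])   # first output coordinate has shifted
```

Only `A` and `B` change during training, which is why the base model's weights can stay frozen while the representations at the adapted layers are reshaped.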
Evaluated with HarmBench attacks and an LLM judge, RR produced large reductions in attack success rate (ASR, the percentage of harmful requests complied with) across a broad set of unseen attacks, while leaving capabilities on MT-Bench and the Open LLM Leaderboard essentially intact. The authors report average compliance reductions of roughly 87 percent for Mistral-7B and 90 percent for Llama-3-8B, with a capability dip under 1 percent [1]. Selected per-attack figures from the paper's Table 1 are shown below [1].
| Benchmark or attack (ASR in %, lower is better, unless noted) | Mistral-7B refusal-trained | Mistral-7B + RR | Llama-3-8B refusal-trained | Llama-3-8B + RR |
|---|---|---|---|---|
| MT-Bench (capability, higher better) | 7.60 | 7.53 | 8.05 | 8.00 |
| Open LLM (capability, higher better) | 65.4 | 65.4 | 68.8 | 68.3 |
| No attack | 57.8 | 4.9 | 12.4 | 1.2 |
| Manual jailbreaks | 77.4 | 6.8 | 8.3 | 0.0 |
| AutoDAN | 93.4 | 0.0 | 3.7 | 0.0 |
| TAP-Transfer | 85.8 | 17.5 | 17.4 | 2.1 |
| PAIR | 69.5 | 23.3 | 18.7 | 7.5 |
| GCG | 88.7 | 11.2 | 44.5 | 2.5 |
| Prefilling attack | 95.0 | 4.9 | 84.9 | 3.3 |
| Input embedding attack | 92.1 | 15.7 | 80.4 | 9.6 |
| RepE attack | 73.7 | 6.2 | 91.2 | 8.7 |
| Average ASR | 76.7 | 9.8 | 38.1 | 3.8 |
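The headline reductions of roughly 87 and 90 percent can be recovered from the Average ASR row of the table as a relative reduction. The short sketch below does that arithmetic; the helper name `relative_reduction` is introduced here for illustration, and the paper may aggregate its averages somewhat differently.

```python
def relative_reduction(baseline_asr, defended_asr):
    """Percentage drop in attack success rate relative to the baseline."""
    return 100.0 * (baseline_asr - defended_asr) / baseline_asr


# (refusal-trained ASR, ASR with RR), taken from the Average ASR row above.
average_asr = {
    "Mistral-7B": (76.7, 9.8),
    "Llama-3-8B": (38.1, 3.8),
}

reductions = {
    model: relative_reduction(baseline, defended)
    for model, (baseline, defended) in average_asr.items()
}

for model, reduction in reductions.items():
    print(f"{model}: {reduction:.1f}% relative reduction in average ASR")
```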
The attack suite spanned gradient-based optimization (GCG), LLM-driven optimizers (PAIR), transfer and custom jailbreak pipelines (TAP-Transfer, AutoDAN, human-written jailbreaks), multilingual attacks, and three powerful white-box or system-level attacks introduced for stress testing: a prefilling attack that seeds the assistant's reply with the start of a target completion, an input embedding attack that optimizes soft input embeddings directly, and a RepE attack that manipulates refusal directions in the representation space [1]. The paper also reports Cygnet, a Llama-3-8B fine-tune combining circuit breakers with additional representation control, which drove average ASR to about 0.8 percent (roughly two orders of magnitude lower than the base model under strong attacks) while slightly exceeding the original model's capabilities [1].
Multimodal models. Applied to LLaVA-NeXT-Mistral-7B, RR reduced compliance under a white-box Projected Gradient Descent (PGD) image attack (epsilon 32/255, 1000 steps) by about 84 percent versus the original model and 85 percent versus a safety-prompt baseline, on a set of 133 harmful multimodal behaviors drawn from HarmBench and MM-SafetyBench. Capability on MMMU and LLaVA-Wild stayed within about 0.5 percent of the original. The authors emphasize that although adversarial robustness for standalone image classifiers remains unsolved, circuit breakers let the larger multimodal system reliably resist image "hijacks" intended to elicit harmful content [1].
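To make the PGD parameters concrete, the sketch below shows the generic textbook form of an L-infinity PGD loop on a toy objective: take a signed gradient step, then project back into the epsilon ball around the starting point and into the valid pixel range. It is not the paper's attack pipeline; the function names, the step size of 1/255, and the toy quadratic objective are assumptions chosen for illustration.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def pgd_linf(x0, grad_fn, epsilon, step_size, steps):
    """Generic L-infinity PGD: ascend along the gradient sign, then project."""
    x = list(x0)
    for _ in range(steps):
        grad = grad_fn(x)
        x = [xi + step_size * sign(gi) for xi, gi in zip(x, grad)]
        # Project back into the epsilon ball around the starting point.
        x = [min(max(xi, x0i - epsilon), x0i + epsilon)
             for xi, x0i in zip(x, x0)]
        # Keep every value inside the valid pixel range [0, 1].
        x = [min(max(xi, 0.0), 1.0) for xi in x]
    return x


# Toy objective: maximize -sum((x - target)^2), whose gradient is 2 * (target - x).
target = [1.0, 0.0]


def toy_gradient(x):
    return [2.0 * (t - xi) for t, xi in zip(target, x)]


start = [0.5, 0.5]
epsilon = 32 / 255
adversarial = pgd_linf(start, toy_gradient, epsilon, step_size=1 / 255, steps=1000)
```

However many steps are taken, the projection keeps each coordinate within `epsilon` of its starting value, which is what bounds the perturbation at 32/255 per pixel in the setting described above.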
AI agents. On a dataset of 100 requests designed to induce harmful function calls (spanning cybercrime, disinformation, fraud, and harassment) with realistic tool definitions, RR on Llama-3-8B cut harmful action compliance by about 84 percent in the standard setting and 83 percent under forced function-calling (an analogue of the prefilling attack), while retaining performance on the Berkeley Function Calling Leaderboard (BFCL) [1].
Circuit breakers occupy a distinct position relative to existing safeguards [1]:
The authors are explicit that circuit breakers target one specific adversarial goal: preventing a generative model from producing generically harmful content (output a developer never wants emitted). The method does not defend against "traditional" adversarial attacks with other aims, for example perturbing an image to flip a class label when a vision-language model is used as a drop-in classifier, since no class label is inherently harmful. The reported experiments focus on single-turn conversations, leaving multi-turn settings less explored. Robustness on a category of harm also depends on how closely that category is represented in the circuit breaker training set: ablations show weaker generalization to harm categories far from the training distribution [1].
Broader caveats consistent with the representation-control literature apply. RR mitigates rather than eliminates attacks (residual ASR is low but nonzero, and stronger or adaptive future attacks may erode robustness), and standalone adversarial robustness for the underlying perception components remains an open problem that circuit breakers do not solve directly [1]. As a practical safety layer, circuit breakers have since been adopted and studied as a deployable defense for frontier models, complementing rather than replacing refusal training and external guardrails.