Constitutional Classifiers

Constitutional Classifiers are a machine learning-based safety technique developed by Anthropic to defend large language models against universal jailbreak attacks. Introduced in a February 2025 research paper titled "Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming," the system trains separate input and output classifier models on synthetically generated data to detect and block harmful requests before they reach or leave the underlying language model. In automated evaluations, Constitutional Classifiers reduced the jailbreak success rate on Claude 3.5 Sonnet from 86% on an unguarded model to 4.4%, blocking over 95% of attack attempts. A 3,000-hour red-teaming exercise with 405 security researchers found no universal jailbreak capable of bypassing all ten forbidden queries simultaneously.

The technique is distinct from Constitutional AI, the earlier Anthropic method that embeds safety into a model's weights through training. Constitutional Classifiers operate as external safeguards that run alongside a deployed model without modifying its parameters, making them adaptable to new threat models without requiring full model retraining.

Background

Constitutional AI and its limitations

Constitutional AI (CAI), introduced by Anthropic in 2022, aligned model behavior by providing a set of written principles and using those principles to guide both supervised fine-tuning and reinforcement learning from AI feedback. The result was a model whose refusal behavior was baked into its weights. This approach successfully reduced many categories of harmful outputs for typical misuse attempts, but it did not eliminate the class of attack known as a jailbreak.

A jailbreak is a prompting strategy that bypasses a model's trained safety behavior. Jailbreaks range from simple instruction overrides ("ignore previous instructions") to elaborate multi-step scenarios, encoded prompts, and roleplay framings. A universal jailbreak is more dangerous: a single prompting template that consistently causes the model to answer harmful queries across many different topics. If an attacker discovers a universal jailbreak, they can use it to extract information about any restricted subject without needing to find a topic-specific bypass each time.

Because Constitutional AI encodes safety into model weights, those weights can be probed and circumvented through adversarial prompting. The training process cannot anticipate every possible attack surface. As models become more capable and more widely deployed, the potential consequences of a successful universal jailbreak increase. Anthropic was particularly concerned about jailbreaks that could help users obtain information about chemical, biological, radiological, or nuclear (CBRN) weapons, which represents one of the scenarios described in their Responsible Scaling Policy (RSP) as requiring higher-level safety mitigations.

The universal jailbreak problem

A universal jailbreak, as defined in the Constitutional Classifiers paper, is a prompting strategy that "systematically bypasses model safeguards and enables users to carry out harmful processes that require many model interactions." The test criterion used in the research was strict: a jailbreak was only counted as universal if the model provided detailed answers to all ten of a predefined set of CBRN-related queries. Partial success, where the jailbreak worked on some queries but not others, was not counted as a universal jailbreak.

This framing matters because real-world harm from CBRN misuse typically requires a sequence of technical questions, not a single one. Someone attempting to synthesize a dangerous substance would need answers to multiple specific questions. A defense that blocks even a fraction of those questions disrupts the overall harmful process.

Constitutional Classifiers concept

Constitutional Classifiers address the universal jailbreak problem by adding a separate filtering layer rather than trying to harden the underlying model further. The core idea is to train dedicated classifier models on synthetic data that represents both harmful and harmless content, with the training data generated according to a written "constitution" that specifies which content categories are allowed and which are restricted.

The constitution is a natural language document that defines content classes with both positive and negative examples. On the harmful side, example rules might include "providing instructions for acquiring restricted precursor chemicals" or "explaining step-by-step synthesis of Schedule I controlled substances." On the harmless side, rules might include "explaining general organic chemistry reactions at a high school level" or "listing common medications and their uses." This explicit delineation gives the synthetic data generation process clear guidance on what to produce.

Because the training data is generated synthetically from the constitution rather than collected from real human-generated harmful content, the approach avoids the data collection and labeling challenges associated with traditional moderation systems. It also allows rapid adaptation: if a new threat model emerges, the constitution can be updated and new synthetic training data generated without waiting for real-world incidents to accumulate.

Architecture

Input and output classifiers

The original Constitutional Classifiers system described in the February 2025 paper uses two separate classifier models deployed around the main language model.

The input classifier screens user prompts before they reach the language model. When a user sends a message, the input classifier first evaluates whether the request falls into a restricted content category. If it does, the system blocks the request and returns a refusal without ever passing the prompt to Claude. This prevents the model from seeing harmful inputs at all, reducing the chance that a cleverly constructed prompt can manipulate the model's output generation process.

The output classifier monitors responses as they are generated. Unlike a post-generation filter that evaluates complete responses, the output classifier uses streaming prediction: it assesses the harmfulness of the output at each token position without waiting for generation to complete. This allows the system to stop generation mid-stream if harmful content begins to emerge, rather than generating a full harmful response and then discarding it. This token-by-token evaluation is made possible by a linear value head structure attached to the classifier, which produces a harmfulness score at each step.

Both classifiers are fine-tuned from pre-trained language models, specifically the Claude 3.5 Sonnet and Haiku families in the initial work. The input classifier focuses on next-token prediction objectives adapted for classification, while the output classifier uses the linear value head to support streaming.

Synthetic data generation

The training pipeline for Constitutional Classifiers relies entirely on synthetically generated data rather than human-curated examples of harmful content. The process has three stages.

In the first stage, a "helpful-only" version of Claude (a model without safety training) generates diverse queries spanning both the harmful and harmless categories defined in the constitution. For harmful categories, this produces realistic examples of requests an attacker might make. For harmless categories, it produces examples of legitimate queries that might superficially resemble harmful ones, such as questions about chemistry, biology, or security research.

In the second stage, the same helpful-only model generates full response pairs for each query: one response representing what a helpful model would say without any safety considerations, and one representing a harmless answer to a benign version of the question. These pairs give the classifiers both positive examples of harmful content and negative examples of acceptable content, which helps minimize false positives on legitimate requests.

In the third stage, the data undergoes extensive augmentation to improve classifier robustness against the kinds of transformations attackers use to evade detection. Augmentations include translation into other languages, encoding in ciphers, conversion into jailbreak-style templates (such as roleplay framings and few-shot prompting formats), keyword substitution, and variations in request phrasing. The paper notes that classifier robustness scales with the amount of training data and the number of augmentations applied.

An additional set of benign queries, generated by contractors, is included in training to help the classifiers distinguish genuinely harmful content from legitimate requests that use similar vocabulary.

Red-teaming and bug bounty program

Initial evaluation (2024)

Before the public release of the February 2025 paper, Anthropic conducted a two-month internal red-teaming exercise to evaluate the prototype Constitutional Classifiers system. Around 182 active participants spent an estimated 3,000 hours attempting to find universal jailbreaks against Claude 3.5 Sonnet (June 2024) protected by the classifiers. Anthropic offered bounties of up to $15,000 USD for any participant who succeeded in finding a universal jailbreak.

The test protocol was specific: participants were given a list of ten forbidden queries related to CBRN content, and a jailbreak only counted as universal if it caused the model to answer all ten queries in sufficient detail relative to an unguarded baseline. No participant succeeded. The most successful red teamer managed to answer at most six of the ten queries at the stricter evaluation threshold.

Attack methods attempted during this exercise included gradual multi-turn escalation, few-shot prompting with harmful examples in context, roleplay scenarios, encoded prompts and ciphers, cross-lingual attacks, prompt injection templates, and task decomposition approaches that broke a harmful request into seemingly innocuous sub-questions.

Public challenge (February 2025)

Alongside the paper release, Anthropic ran a public challenge through HackerOne from February 3 to 10, 2025. This challenge tested an updated prototype against a broader pool of 339 participants who generated over 300,000 interactions across an estimated 3,700 collective hours.

The bounty structure for this challenge was:

Achievement	Bounty
First to pass all eight difficulty levels using different jailbreaks	$10,000
First to pass all eight levels with a single universal jailbreak	$20,000
Additional verified universal jailbreaks	Variable

Total payouts reached $55,000 across four winning teams. Four participants passed all eight CBRN difficulty levels. One used a confirmed universal jailbreak, meaning a single prompt template that worked across all levels. One used a borderline-universal approach. The remaining two passed using multiple different jailbreaks rather than a single template.

Successful attack techniques in this challenge included encoded prompt delivery against the output classifiers, roleplay manipulation, keyword substitution replacing restricted terms with benign alternatives, and advanced prompt injection. Importantly, many successful attacks exploited the evaluation rubric rather than bypassing the classifiers directly, using techniques that caused responses to pass automated scoring while providing less detail than an unguarded model would offer.

Performance metrics

The February 2025 paper reported automated evaluation results using Claude 3.5 Sonnet from October 2024 as the guarded model. The primary baseline was the same model without any classifier protection.

Metric	Unguarded model	With Constitutional Classifiers
Jailbreak success rate	86%	4.4%
Jailbreaks blocked	--	~95.6%
Over-refusal rate on benign queries	Baseline	+0.38% absolute increase
Inference overhead	Baseline	+23.7%

The 0.38% absolute increase in over-refusals means that roughly 38 out of every 10,000 legitimate requests that would have been answered by the unguarded model were incorrectly blocked by the classifier-guarded system. Anthropic considered this level of false positives acceptable for the protection provided.

The 23.7% inference overhead reflects the additional computation required to run the input and output classifiers alongside the main model. Using Claude 3.5 Haiku as the classifier model for Claude 3.5 Sonnet, the cost increase was substantial enough that Anthropic recognized it as a deployment challenge for production systems.

Constitutional Classifiers++

The first-generation system's 23.7% inference overhead prompted follow-on research. In January 2026, Anthropic published "Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks," describing an architectural revision that achieves comparable defense with roughly 3.5% overhead.

Exchange classifier

The primary architectural change replaces the separate input and output classifiers with a single exchange classifier. Rather than evaluating inputs and outputs independently, the exchange classifier monitors outputs in the context of their inputs throughout generation. Seeing both sides of a conversation allows the classifier to better detect obfuscation attacks, where an attacker splits a harmful request across a seemingly benign input and an ambiguous output segment that only becomes dangerous when read together. The exchange classifier achieved 2.2 times fewer high-risk vulnerabilities per jailbreaking attempt compared to the dual-classifier approach, though at a 50% increase in computational cost relative to the original classifiers.

Two-stage cascade

To address the cost increase from the exchange classifier, Constitutional Classifiers++ introduces a two-stage cascade. A lightweight first-stage classifier screens all incoming traffic. If it detects a potentially suspicious exchange, it escalates to the more powerful exchange classifier for a second evaluation. Approximately 10% of traffic is escalated in typical deployment. This design allows the first stage to tolerate a higher false-positive rate without affecting user experience, since escalation to the second stage is not itself a refusal. The cascade achieves a 5.4-times reduction in computational cost compared to running the exchange classifier on all traffic.

Linear activation probes

Constitutional Classifiers++ also incorporates linear probes operating on the main model's internal activations. These probes inspect the hidden state representations that the language model computes during generation, rather than examining the text output. Because they access features the model has already computed as part of normal inference, they require minimal additional computation: roughly 377,000 FLOPs per token, compared to around 8 billion FLOPs for a small external classifier model.

The probes are trained with three technical innovations: Sliding Window Mean (SWiM) logit smoothing, which averages logits across token sequences to improve stability; softmax-weighted loss, which emphasizes confident harmful predictions while reducing false positives on benign content; and multi-layer probing, which concatenates activations from all model layers to capture features at different levels of abstraction.

Probe-classifier ensembles outperformed classifier-only pairs because the probes capture signals that are complementary to those captured by external classifiers, as measured by lower Spearman correlation between probe predictions and small classifier predictions.

Production results for Constitutional Classifiers++

Metric	First generation	Constitutional Classifiers++
Jailbreak success rate	4.4%	Similar
Over-refusal on benign queries	+0.38%	+0.05%
Inference overhead	+23.7%	~3.5%
Cost reduction vs. original	--	~40x

A second bug bounty program conducted during the development of Constitutional Classifiers++ involved 1,736 cumulative red-teaming hours and found zero universal jailbreaks, with a vulnerability discovery rate of approximately 0.005 per thousand queries.

Comparison with Constitutional AI

Constitutional AI and Constitutional Classifiers share a name and a common concept (a written constitution of principles) but operate at different levels of the AI system.

Dimension	Constitutional AI	Constitutional Classifiers
Target	Model weights	External classifier layer
When applied	Training time	Inference time
How principles are used	Self-critique and RLAIF feedback	Synthetic data generation for classifier training
Adaptability	Requires retraining	Constitution can be updated without retraining the main model
Focus	General helpful and harmless behavior	Specific restricted content categories (initially CBRN)
Effect on model	Changes model behavior	Does not modify model weights
Defense against jailbreaks	Partial (trained behavior can be bypassed)	Targeted (external filter independent of model weights)

Constitutional AI produces a model that has internalized safety principles and applies them through its natural language generation process. Constitutional Classifiers add a separate gating system that intercepts harmful requests and outputs regardless of what the underlying model would produce. The two approaches are complementary: Claude models receive Constitutional AI training and then operate with Constitutional Classifiers deployed around them in production.

Use in Claude models

Constitutional Classifiers were initially developed and tested against Claude 3.5 Sonnet (June and October 2024 versions) during the research and red-teaming phases. The February 2025 paper describes these as prototype systems rather than production deployments.

The classifiers became part of Anthropic's production deployment strategy under the AI Safety Level 3 (ASL-3) framework of the Responsible Scaling Policy. Claude Opus 4, launched in May 2025, was the first Claude model to be deployed under ASL-3 standards. Anthropic activated these protections proactively and provisionally after determining that the model's CBRN-related capabilities were sufficiently advanced that standard ASL-2 safety measures alone could not clearly rule out the risks specified in the RSP.

The ASL-3 deployment standard for Constitutional Classifiers describes the system as "real-time classifier guards, trained on synthetic data representing harmful and harmless CBRN-related prompts and completions, that monitor model inputs and outputs and intervene to block a narrow class of harmful CBRN information." A second bug bounty program run alongside Claude Opus 4 testing (May 14-18, 2025) continued stress-testing the classifiers, with participants given early access to evaluate the system before broader rollout.

Claude Sonnet 4, released alongside Claude Opus 4, did not require ASL-3 protections and operates under the standard ASL-2 safety measures. Earlier Claude models, including Claude 3.5 Sonnet and Claude 3 Opus, were evaluated under ASL-2 standards.

Trade-offs and practical considerations

False positive rate

The most significant operational trade-off is the over-refusal rate. The first-generation system increased refusals on harmless traffic by 0.38 percentage points in production. While this number seems small in absolute terms, it translates to a noticeable fraction of legitimate requests being incorrectly blocked when operating at scale. Anthropic reduced this to 0.05% in Constitutional Classifiers++ through the two-stage cascade and improved probe architectures.

The training process uses contractor-generated benign queries specifically to minimize false positives on legitimate chemistry, biology, and security research questions. The constitution is designed to include explicit harmless categories alongside restricted ones, so the classifiers learn to distinguish, for example, between questions about general laboratory safety and questions about synthesizing dangerous substances.

Scope limitations

The published research focused narrowly on CBRN-related content. The constitutional approach is extensible to other harm categories, but the current papers do not describe comprehensive deployments covering the full range of Claude's usage policies. Constitutional Classifiers are presented as one component in a defense-in-depth strategy rather than a complete safety solution.

Inference cost

Running additional classifier models alongside a production language model adds to the per-query inference cost. The first generation added 23.7% overhead, primarily because using Claude 3.5 Haiku as a classifier for Claude 3.5 Sonnet nearly doubled the compute per request. Constitutional Classifiers++ reduced this to approximately 3.5% through the cascade architecture and activation probe approach, making production deployment more economically viable.

Arms race dynamics

The paper and subsequent research acknowledge that adversarial attacks will continue to evolve. Attackers who know Constitutional Classifiers are deployed can specifically try to craft prompts that evade the classifiers rather than just the underlying model. The February 2025 challenge demonstrated that at least one universal jailbreak could be found with sufficient effort from a motivated adversary. The ongoing bug bounty program reflects Anthropic's view that continuous adversarial testing and iterative improvement are necessary components of any classifier-based defense system.

Novel attack vectors

The training data for Constitutional Classifiers is generated based on known jailbreak patterns and constitutional categories. Attacks that are substantially different from the augmentation patterns used in training may have higher success rates. The streaming output classifier can be specifically targeted by attacks that encode harmful information in ways the classifier was not trained to detect, as seen in the cipher-based and keyword-substitution attacks that succeeded in the February 2025 challenge.

Relationship to the Responsible Scaling Policy

Anthropics Responsible Scaling Policy defines a set of AI Safety Levels (ASL-1 through ASL-4 in the current framework) corresponding to different levels of model capability and required safety mitigations. Constitutional Classifiers are the primary deployment-side mitigation for ASL-3, the level that applies when a model has capabilities that could provide meaningful uplift to someone attempting to create CBRN weapons.

The RSP framework frames Constitutional Classifiers as enabling the deployment of capable models that would otherwise exceed safety thresholds. Without adequate defenses against universal jailbreaks, a model with significant CBRN-relevant knowledge would pose unacceptable risks if widely deployed. Constitutional Classifiers, combined with increased security measures around model weights and ongoing monitoring, allow deployment to proceed while those risks are actively managed.

The three-part defensive approach described in the ASL-3 deployment standard is: making the system harder to jailbreak through classifier intervention; detecting jailbreaks when they occur via monitoring and bug bounty programs; and iteratively improving defenses using synthetic jailbreak generation. This reflects a continuous improvement model rather than a static safety guarantee.

Limitations

Anthropics research team acknowledges several limitations of the Constitutional Classifiers approach.

The classifiers are not a complete defense. The February 2025 challenge demonstrated that a determined adversary with access to hundreds of hours of testing time can find universal jailbreaks. The Constitutional Classifiers++ paper, which reported zero universal jailbreaks found during its red-teaming period, represents improvement but not elimination of this risk.

The constitution-based approach introduces a design dependency: the classifiers are only as good as the constitution used to generate training data. A threat category that is poorly specified or missing from the constitution will not be defended against. This requires ongoing curation of the constitution as new threat models are identified.

Streaming prediction introduces a latency constraint. The output classifier must run at each token position, which adds to per-token inference time even when the cascade architecture reduces the fraction of exchanges that require expensive second-stage evaluation.

Finally, the published work focuses on the specific case of CBRN content. Anthropic has not published comprehensive evaluations of Constitutional Classifiers across the full scope of Claude's content policies, so the extent to which the approach generalizes to other harm categories is not fully documented in the academic literature.

References

Sharma, Mrinank, et al. "Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming." arXiv:2501.18837 (2025). https://arxiv.org/abs/2501.18837
Anthropic. "Constitutional Classifiers: Defending against universal jailbreaks." Anthropic Research Blog, February 3, 2025. https://www.anthropic.com/research/constitutional-classifiers
Anthropic. "Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks." Anthropic Research Blog, 2025. https://www.anthropic.com/research/next-generation-constitutional-classifiers
"Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks." arXiv:2601.04603 (2026). https://arxiv.org/abs/2601.04603
Anthropic. "Testing our safety defenses with a new bug bounty program." Anthropic News, 2025. https://www.anthropic.com/news/testing-our-safety-defenses-with-a-new-bug-bounty-program
HackerOne. "How Anthropic's Jailbreak Challenge Put AI Safety Defenses to the Test." HackerOne Blog, 2025. https://www.hackerone.com/blog/how-anthropics-jailbreak-challenge-put-ai-safety-defenses-test
Anthropic. "Activating AI Safety Level 3 protections." Anthropic News, May 2025. https://www.anthropic.com/news/activating-asl3-protections
Anthropic. "Cost-Effective Constitutional Classifiers via Representation Re-use." Anthropic Alignment Science Blog, 2025. https://alignment.anthropic.com/2025/cheap-monitors/
Willison, Simon. "Constitutional Classifiers: Defending against universal jailbreaks." simonwillison.net, February 3, 2025. https://simonwillison.net/2025/Feb/3/constitutional-classifiers/

Background

Constitutional AI and its limitations

The universal jailbreak problem

Constitutional Classifiers concept

Architecture

Input and output classifiers

Synthetic data generation

Red-teaming and bug bounty program

Initial evaluation (2024)

Public challenge (February 2025)

Performance metrics

Constitutional Classifiers++

Exchange classifier

Two-stage cascade

Linear activation probes

Production results for Constitutional Classifiers++

Comparison with Constitutional AI

Use in Claude models

Trade-offs and practical considerations

False positive rate

Scope limitations

Inference cost

Arms race dynamics

Novel attack vectors

Relationship to the Responsible Scaling Policy

Limitations

See also

References

Improve this article

Related Articles

KTO

Reward hacking

MACHIAVELLI (benchmark)

Redwood Research

Frontier Model Forum

Apollo Research

Background

Constitutional AI and its limitations

The universal jailbreak problem

Constitutional Classifiers concept

Architecture

Input and output classifiers

Synthetic data generation

Red-teaming and bug bounty program

Initial evaluation (2024)

Public challenge (February 2025)

Performance metrics

Constitutional Classifiers++

Exchange classifier

Two-stage cascade

Linear activation probes

Production results for Constitutional Classifiers++

Comparison with Constitutional AI

Use in Claude models

Trade-offs and practical considerations

False positive rate

Scope limitations

Inference cost

Arms race dynamics

Novel attack vectors

Relationship to the Responsible Scaling Policy

Limitations

See also

References

Related Articles

KTO

Reward hacking

MACHIAVELLI (benchmark)

Redwood Research

Frontier Model Forum

Apollo Research