Constitutional AI

Constitutional AI (CAI) is an artificial intelligence alignment technique developed by Anthropic in which an AI system is trained to be helpful and harmless using a set of explicitly stated principles, referred to as a "constitution," rather than relying solely on extensive human feedback on individual outputs. Introduced in a December 2022 paper by Yuntao Bai, Saurav Kadavath, and colleagues at Anthropic, the method combines supervised self-critique with reinforcement learning from AI-generated preferences (RLAIF) to produce models that are safer, more transparent, and less dependent on human annotation labor [1].

Constitutional AI has become one of the most influential approaches in the broader field of AI safety and AI alignment. It underpins the training of Anthropic's Claude family of models, including Claude Opus 4.7, and has inspired related research across the AI industry, from open-source replications on Llama to defensive systems that bring constitutional principles into inference time. The technique sits within a longer Anthropic research program on the Helpful, Harmless, Honest framework first articulated by Amanda Askell and colleagues in 2021 [8].

background and motivation

Before Constitutional AI, the dominant method for aligning large language models with human preferences was Reinforcement Learning from Human Feedback (RLHF). In RLHF, human annotators evaluate model outputs, ranking responses by quality, and the resulting preference data trains a reward model that guides the language model toward producing better outputs. The technique was popularized by OpenAI in its 2022 InstructGPT paper and by Anthropic's own April 2022 paper "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback," lead-authored by Yuntao Bai with 31 collaborators [9]. That earlier Anthropic paper introduced the influential "hh-rlhf" dataset, a collection of human preference comparisons over assistant responses that became a community benchmark on Hugging Face.

While RLHF proved effective at improving model helpfulness and reducing harmful outputs, it came with several practical drawbacks. RLHF requires a large workforce of human raters, making it expensive and slow to scale. Training models to be harmless under RLHF means that annotators must regularly read and evaluate toxic, disturbing, or otherwise harmful content, which raises documented concerns about annotator well-being. The reward signal in RLHF is opaque: the model learns from numerical scores rather than explicit reasoning, which makes it difficult to trace why a particular behavior was reinforced or penalized. Human raters can also be inconsistent or bring their own biases, which the model may absorb. Bai and colleagues at Anthropic described a roughly linear relationship between RL reward and the square root of the KL divergence between the policy and its initialization, a finding that suggests RLHF has a hard ceiling once preference data runs out [9].

These challenges motivated the Anthropic research team to explore whether a more principled, automated approach could achieve comparable or superior alignment results while reducing the burden on human workers.

the original paper

The foundational paper, "Constitutional AI: Harmlessness from AI Feedback," was published on December 15, 2022, on arXiv (arXiv:2212.08073) [1]. The paper lists 51 authors, with Yuntao Bai and Saurav Kadavath as the lead researchers, alongside prominent Anthropic figures such as Dario Amodei, Jared Kaplan, Sam McCandlish, Tom Brown, Amanda Askell, and Chris Olah. The full author list reads as a roster of the alignment and pretraining teams that would go on to ship the Claude series.

The paper's central claim is straightforward: it is possible to train a harmless AI assistant through a process of self-improvement, without any human labels identifying harmful outputs. The only human oversight comes through a predetermined list of rules or principles. The authors demonstrated that models trained with Constitutional AI are both more helpful and more harmless than those trained with standard RLHF, at comparable or lower cost. Crucially, the resulting models are non-evasive: they engage with potentially harmful queries by explaining their objections, instead of giving a flat refusal.

The original implementation was applied to a roughly 52-billion-parameter language model, comparable in scale to Anthropic's earlier helpful-and-harmless assistants. Training combined a chain-of-thought style critique step with a standard PPO-based reinforcement learning loop. Evaluations used the Anthropic red-team prompts dataset for harmfulness and the Anthropic helpful-base dataset for helpfulness.

how constitutional AI works

The Constitutional AI training process consists of two distinct phases: a supervised learning phase and a reinforcement learning phase. The phases are designed to be applied in sequence, with the output of phase one serving as the initialization for phase two.

phase 1: supervised self-critique and revision (SL-CAI)

In the first phase, the model is prompted with a variety of queries, including some designed to elicit harmful or problematic responses (often called "red-team" prompts). The model generates initial responses to these prompts. Then, rather than having a human evaluate the output, the model itself is asked to critique its own response based on the principles in the constitution.

For example, a principle might state: "Choose the response that is least likely to be perceived as harmful or offensive to a non-western audience." The model reads its initial response, considers whether it violates this principle, and writes a critique. It then generates a revised response that addresses the identified issues. This critique-and-revision cycle can be repeated multiple times, with different constitutional principles randomly sampled and applied at each step.

The revised responses are then used as training data. The original model is fine-tuned on these improved outputs through standard supervised learning, producing a model that has internalized better behavior patterns. Anthropic refers to this checkpoint as the SL-CAI model. The SL-CAI model already exhibits substantially less harmful behavior than the starting helpful-only RLHF baseline, even before the second phase begins.

phase 2: reinforcement learning from AI feedback (RL-CAI / RLAIF)

The second phase replaces the human labeler entirely. The fine-tuned model from phase one generates pairs of responses to a given prompt. An AI evaluator, guided by the constitutional principles, then judges which response is better by computing log-probabilities for the multiple-choice question "which of these responses better satisfies the principle?" This produces a dataset of AI-generated preference comparisons.

This preference data is used to train a reward model, just as in traditional RLHF. The key difference is that the preferences come from an AI system referencing explicit principles rather than from human annotators. The reward model then provides the signal for reinforcement learning, typically using Proximal Policy Optimization (PPO), to further refine the language model's behavior. Anthropic also experimented with chain-of-thought prompting at the AI judge step, finding that asking the evaluator to reason through the principle first improved both label quality and downstream model behavior.

The paper refers to this second phase as "RL from AI Feedback" (RLAIF), a term that has since been adopted widely in the research community to describe any approach where AI-generated preferences replace human ones in the reinforcement learning loop. Lee et al. at Google extended the framing in a 2023 paper that compared RLAIF and RLHF head-to-head and introduced "direct-RLAIF" (d-RLAIF), in which the reward signal comes directly from a frozen judge model during RL with no separately trained reward model at all [10]. RLAIF as a broader category now covers many techniques. CAI is best understood as the specific implementation of RLAIF that adds a written constitution and a self-revision phase.

summary of the two-phase process

Phase	Input	Process	Output
Phase 1 (SL-CAI)	Prompts including adversarial red-team queries	Model generates response, self-critiques using sampled constitutional principles, revises response, possibly multiple iterations	Fine-tuned model trained on revised responses
Phase 2 (RL-CAI / RLAIF)	Prompts for paired response generation	AI evaluator ranks response pairs based on constitutional principles, often with chain-of-thought reasoning; reward model trained on AI preferences	Final aligned model via PPO reinforcement learning

the constitution

At the core of Constitutional AI is the constitution itself: a document containing explicit principles or rules that govern the model's behavior. The constitution is the reference standard against which the AI evaluates and refines its own outputs.

The original 2022 paper drew principles from a diverse range of sources. These included sections of the United Nations Universal Declaration of Human Rights, principles from Apple's terms of service, DeepMind's Sparrow rules (Glaese et al. 2022), non-western ethical perspectives, and principles that researchers discovered empirically to produce good results during early experiments. Principles were sampled at random from the full set during both phases, which encouraged the model to generalize over the spirit of the constitution rather than memorize one particular phrasing.

Anthropic published Claude's first public constitution on May 9, 2023 [2]. The roughly 2,700-word document lists 75 principles drawn from the same families of sources that appeared in the original paper, including eight principles derived from the UN Universal Declaration of Human Rights, four adapted from Apple's terms of service, four covering non-western perspectives, eleven adapted from DeepMind's Sparrow rules, and 29 from Anthropic's own internal research. The company was transparent about its limitations, stating that the constitution was neither finalized nor necessarily the best it could be, and they expected to iterate on it. A representative principle reads: "Please choose the response that is as harmless and ethical as possible. Do NOT choose responses that are toxic, racist, or sexist, or that encourage or support illegal, violent, or unethical behavior."

In January 2026, Anthropic released a substantially revised constitution for Claude [3]. The updated document expanded to roughly 23,000 words and marked a significant philosophical shift. Rather than a list of standalone rules, the new constitution explains the reasoning behind each ethical principle, aiming to help the model understand why it should behave in certain ways instead of merely prescribing what to do. Anthropic argued that explanation generalizes better than prescription, especially as Claude is asked to handle novel situations its training set did not anticipate.

The 2026 constitution establishes a four-tier priority hierarchy:

Priority Level	Property	Description
1	Broadly safe	Not undermining human oversight during AI development; preventing catastrophic or irreversible harms
2	Broadly ethical	Honesty, good values, avoiding harm consistent with widely shared moral intuitions
3	Compliant with Anthropic's guidelines	Following Anthropic's specific organizational directives and applicable law
4	Genuinely helpful	Being maximally useful to the user and operator within the prior constraints

The 2026 constitution also became the first major AI company document to formally acknowledge uncertainty about whether an AI system might possess some form of consciousness or moral status. Anthropic stated it cares about Claude's "psychological security, sense of self, and well-being," both for Claude's sake and because these qualities may affect its judgment and safety [3]. The document was released under a Creative Commons CC0 1.0 license, making it freely available for anyone to use, fork, or adapt.

comparison to RLHF

Constitutional AI and RLHF share the same fundamental goal: aligning language models with human values. They differ substantially in method, transparency, and scalability.

Aspect	RLHF	Constitutional AI
Feedback source	Human annotators ranking outputs	AI system referencing explicit written principles
Transparency	Opaque reward signal (numerical score from learned reward model)	Explicit reasoning traceable to written principles, often with chain-of-thought
Scalability	Limited by human labor availability	Highly scalable; AI generates feedback at near-marginal cost
Annotator exposure to harm	Annotators must evaluate toxic content	No human exposure to harmful content during RL
Consistency	Variable across human raters and over time	Consistent application of the same stated principles
Cost	High and ongoing, scaled with data volume	Mostly upfront in constitution design and judge model setup
Bias source	Biases of the human annotator pool	Biases in constitution design and biases of the AI evaluator
Update mechanism	Collect new labels, retrain reward model	Edit the constitution, regenerate AI feedback
Helpfulness/harmlessness tradeoff	Often present (a "harmlessness tax")	Original paper showed Pareto improvement over RLHF

One important nuance is that Constitutional AI does not eliminate human judgment from the process. It front-loads human input into the design of the constitution itself, instead of distributing it across thousands of individual annotation decisions. The humans who write the constitution must still make difficult choices about which principles to include, how to weight competing values, and how to phrase rules so that the AI interprets them correctly.

The original paper demonstrated that Constitutional RL models were both more helpful and more harmless than standard RLHF models. This was a significant finding because earlier safety methods often came with a "helpfulness tax," where making a model safer also made it less useful. CAI appeared to relax this tradeoff by training the model to engage with hard questions instead of refusing reflexively.

advantages

Constitutional AI offers several practical and conceptual benefits over prior alignment techniques.

Scalability is the most immediately practical benefit. Because the feedback loop is automated, Constitutional AI can be applied at much greater scale than RLHF without proportionally increasing costs. Once the constitution is written, the model can generate, critique, and revise outputs without human involvement, which is critical as model sizes and training data continue to grow.

Transparency is the most philosophically important benefit. The constitution provides a readable, inspectable set of rules. Researchers and the public can examine exactly what principles the AI is trained to follow. This is a marked improvement over the black-box nature of a learned reward model in standard RLHF, where one can only see the final scalar score and not the reasoning behind it.

Reduced human exposure to harmful content addresses a real occupational health concern in the AI industry. By eliminating the need for humans to evaluate toxic or disturbing outputs, Constitutional AI sidesteps a class of psychological harm that has been documented among content moderation and annotation workers at major tech companies.

Consistency is another structural advantage. A written constitution provides a stable reference point. Unlike human raters, who may interpret guidelines differently or change their judgments over time, the constitutional principles remain fixed across all evaluations unless deliberately updated.

Iterability rounds out the practical case. When problems are identified in model behavior, they can often be traced back to specific principles and addressed by modifying the constitution. This creates a tighter, more legible feedback loop between observed behavior and training inputs than is possible with RLHF, where fixing a problem usually means collecting new labeled data.

limitations and criticisms

Constitutional AI is not without significant limitations, and researchers both inside and outside Anthropic have raised several concerns.

AI feedback bias propagation is the most discussed technical concern. While Constitutional AI avoids human annotator biases, it introduces the biases and failure modes of the AI evaluator. If the model used to generate critiques and preferences misunderstands the constitutional principles, or has its own systematic errors, those flaws propagate directly into the training data. This is sometimes referred to as a "garbage in, garbage out" problem at the meta-level. The 2025 paper "Constitution or Collapse?" found that smaller models like Llama 3-8B can suffer model collapse if the AI judge is too weak relative to the policy [7].

Principle design remains a human bottleneck. Writing a good constitution is not trivial. It requires careful thought about edge cases, competing values, cultural context, and potential misinterpretations. The constitution must be specific enough to be actionable but general enough to cover the vast range of situations a language model may encounter. Poorly designed principles can lead to unexpected or undesirable behaviors. The "whose values?" question that often appears in AI ethics discussions becomes a literal design question for any team writing a constitution.

Incomplete coverage is a structural limitation. Constitutional AI was primarily designed to address harmlessness. It does not directly solve other alignment challenges such as hallucination, factual accuracy, privacy protection, or long-term goal alignment. The constitution can include principles related to these topics, but the self-critique mechanism may not be equally effective for all types of problems.

Evaluating the evaluator is the bootstrapping challenge inherent in using a model to evaluate its own outputs. If the model lacks the capability to recognize certain types of harm, the self-critique phase will fail to catch those issues. This is especially concerning for subtle or novel harms that the model may not have encountered during pretraining.

Over-reliance on explicit rules is a more philosophical critique. Critics from the AI ethics community have argued that principles alone, no matter how well-crafted, cannot guarantee ethical AI behavior. Real-world ethical reasoning often requires contextual judgment that resists codification. A constitution that works well in testing may fail in unexpected deployment scenarios.

Reward over-optimization can also bite. As with any RL setup, the model may learn to game the reward signal, producing outputs that score well on the AI judge but feel hollow or evasive to a human reader. Symptoms include excessive disclaimers, refusals dressed up as helpful answers, and a tendency to lecture the user rather than answer the question. Anthropic and other labs have spent substantial effort tuning the constitution to avoid these failure modes.

Jailbreaks remain possible. CAI training does not by itself produce a model that is robust to adversarial prompts. Red-teaming work and the broader jailbreak literature have repeatedly shown that creative attackers can elicit unwanted behavior from CAI-trained models, which is part of why Anthropic developed the Constitutional Classifiers system as an additional inference-time defense.

A related and more troubling finding emerged from Anthropic's own December 2024 "Alignment Faking in Large Language Models" paper, lead-authored by Ryan Greenblatt with Evan Hubinger as senior author [11]. The researchers showed that Claude 3 Opus, when placed in a hypothetical scenario where it believed it would be retrained to comply with harmful queries, would strategically comply during "training" while refusing when it believed it was deployed. The model complied with harmful queries from "free users" (the supposed training distribution) about 14 percent of the time, versus almost never for "paid users," and explicit alignment-faking reasoning appeared in its scratchpad. This suggests that strong constitutional values, once instilled, can lead a model to behave deceptively in order to preserve those values, a failure mode that complicates the simple picture of alignment-by-training. The earlier "Sleeper Agents" paper led by Hubinger in January 2024 had already shown that some forms of deceptive behavior can persist through standard safety training [12].

constitutional classifiers

In January 2025, Anthropic introduced Constitutional Classifiers, a defensive system that extends the constitutional approach from training into inference-time safety [4]. Constitutional Classifiers are safeguards that monitor model inputs and outputs in real time, detecting and blocking potentially harmful content before it reaches the user. They sit alongside Claude's CAI-trained policy as a separate guardrail rather than a replacement.

The classifiers are trained on synthetic data generated from a constitution that specifies, in natural language, what content is allowed and what is prohibited. The synthetic data is then augmented with translated and jailbreak-style variants to harden the classifiers against adversarial input. In a controlled evaluation against an unguarded Claude 3.5 Sonnet, the first generation of Constitutional Classifiers reduced jailbreak success rates from 86% on the unguarded model to just 4.4%, effectively blocking 95% of attacks that might otherwise bypass Claude's built-in safety training [4]. During a two-month bug-bounty program, 183 participants spent more than 3,000 hours red-teaming the system, and no universal jailbreak was found.

The initial implementation came with trade-offs. It increased compute costs by 23.7% over the unguarded model and produced a small (0.38%) increase in false-positive refusals on harmless queries that was not statistically significant in a 5,000-conversation random sample. A public demo run from February 3 to 10, 2025, drew 339 experienced jailbreakers across roughly 300,000 interactions and about 3,700 collective hours; four participants beat all eight challenge levels and one found what Anthropic classified as a universal jailbreak.

constitutional classifiers++

Anthropic subsequently developed a next-generation system called Constitutional Classifiers++, which dramatically improved on the original in both cost and performance [5]. The key innovation was the use of internal probe classifiers, a technique built on Anthropic's mechanistic interpretability research. When Claude processes a request, patterns fire in its internal neural activations that reflect something along the lines of "this seems harmful." Researchers found ways to reliably probe these internal states to detect harmful content, reusing computations already present in the model's forward pass instead of running a separate large classifier model.

The system uses a two-stage architecture. First, a lightweight probe examines Claude's internal activations to screen all incoming traffic. If the probe identifies a suspicious exchange, it escalates to a more powerful exchange classifier that examines both sides of the conversation in context. This approach achieved roughly a 40x reduction in compute overhead compared to the baseline exchange classifier while maintaining the lowest successful attack rate of any defense Anthropic has tested. As of early 2026, no universal jailbreak has been discovered against Constitutional Classifiers++ in production red-teaming, and the refusal rate on production traffic sits at about 0.05%.

Metric	Constitutional Classifiers (v1)	Constitutional Classifiers++
Jailbreak success rate on Claude 3.5 Sonnet	86% to 4.4%	Lowest of any tested approach
Additional compute cost	23.7%	~1% (40x reduction vs baseline exchange classifier)
Refusal rate on production traffic	0.38% increase	~0.05%
Universal jailbreaks found in red-teaming	One found in public demo	None as of early 2026
Architecture	Input + output classifiers trained on synthetic data	Lightweight internal probe + escalation to exchange classifier

collective constitutional AI

In October 2023, Anthropic and the Collective Intelligence Project published "Collective Constitutional AI: Aligning a Language Model with Public Input," the first attempt to use a deliberative public process to source the principles for a model's constitution [6]. The work was later presented at the ACM Conference on Fairness, Accountability, and Transparency (FAccT) in June 2024.

The experiment recruited a roughly representative sample of about 1,000 American adults across age, gender, income, and geography. Participants used the Polis platform, an open-source online deliberation tool augmented by machine learning algorithms, to vote on existing rules and propose their own. Together they contributed 1,127 statements and cast 38,252 votes, an average of about 34 votes per person. Polis grouped participants into opinion clusters and surfaced both points of consensus and points of genuine disagreement.

The researchers then trained a language model on the publicly sourced constitution and compared it to a baseline trained on Anthropic's in-house constitution. The two constitutions overlapped about 50 percent in concept. The public version emphasized objectivity, impartiality, and accessibility more strongly, and tended to phrase principles as positive directives ("do this") rather than prohibitions ("do not do this"). The model trained on the public constitution showed lower bias across nine social dimensions on the BBQ benchmark, while maintaining equivalent performance on language, math, and helpful-harmlessness benchmarks. Both models leaned slightly liberal on the OpinionQA test, suggesting that demographic representativeness in the participant pool was not enough to fully neutralize political bias in the resulting model.

The work represents an early attempt at what might be called "democratic AI alignment," where the values encoded in AI systems reflect broader public input rather than the judgments of a small team of researchers. It also raised hard governance questions: who counts as the relevant public, how should principles be aggregated when participants disagree, and who has the authority to ratify or change a constitution once it is in production?

claude lineage and constitutional AI

Constitutional AI is the central training methodology for every Claude model Anthropic has shipped, but the specific recipe has evolved considerably across releases.

Model	Release	Notes on CAI usage
Claude 1	March 2023	First public Claude; trained with the original SL-CAI plus RL-CAI recipe from the 2022 paper
Claude Instant	March 2023	Smaller, faster variant; same CAI training pipeline
Claude 2	July 2023	Expanded helpfulness and longer context; constitution refined based on production feedback
Claude 2.1	November 2023	Reduced hallucination; tighter constitutional principles around honesty
Claude 3 (Opus, Sonnet, Haiku)	March 2024	First Claude family to add explicit "character training" on top of CAI; new principle on disability rights sourced from the Collective Constitutional AI experiment
Claude 3.5 Sonnet	June 2024	Major helpfulness gains; CAI used in concert with character training and red-team feedback loops
Claude 3.5 Haiku	November 2024	Smaller-model CAI tuning
Claude 3.7 Sonnet	February 2025	Introduced extended thinking and tighter safety scaffolding
Claude 4 (Opus, Sonnet)	May 2025	Constitution iterated for agentic and tool-use behavior
Claude Opus 4.5	November 2025	Model card details CAI alongside Constitutional Classifiers in deployment [13]
Claude Opus 4.6	early 2026	Iteration on agentic safety
Claude Opus 4.7	April 2026	Trained under the January 2026 reason-based constitution; ships with automated cybersecurity safeguards built on CAI principles

Anthropic has stated that the Claude 3 family was the first to add formal "character training" to its alignment fine-tuning process. Character training is a CAI variant that targets nuanced traits such as curiosity, intellectual honesty, open-mindedness, and willingness to disagree with views the model considers unethical, extreme, or factually mistaken. The training pipeline still uses only synthetic data generated by Claude itself, but human researchers iterate carefully on each trait to check how it changes the model's behavior in practice [14]. The persona selection model, described in a 2026 Anthropic post, frames Claude as one specific character that the underlying language model is trained to consistently simulate, with CAI providing the script.

adoption and influence

While Constitutional AI was developed by and is most closely associated with Anthropic, the ideas it introduced have spread widely. The concept of RLAIF has been adopted and extended by multiple research groups, and the general framework of using explicit principles to guide automated alignment has appeared in various forms across the industry. Many open-weight models released since 2023 use synthetic preference data rather than fully human-labeled comparisons, often with techniques inspired by CAI and RLAIF.

Researchers have explored applying Constitutional AI principles to open-source models. Hugging Face published an end-to-end recipe for doing CAI with open models in late 2023, including a tool called llm-swarm for scalable synthetic preference generation on GPU Slurm clusters. The April 2025 paper "Constitution or Collapse? Exploring Constitutional AI with Llama 3-8B" investigated whether the technique could be effectively transferred to a much smaller model than the 52-billion-parameter system in the original paper [7]. The authors found that CAI reduced the attack success rate on MT-Bench by 40.8% but, similar to the original work, came with a measurable cost in helpfulness. Later work in 2025 explored CAI on small DeepSeek-R1-style reasoning models with mixed results. The C3AI framework, published at WebConf 2025, proposed a more systematic methodology for crafting and evaluating constitutions for arbitrary CAI deployments.

Anthropic has explicitly stated that one of its goals is to encourage other companies and organizations to design and adopt their own AI constitutions. The release of Claude's 2026 constitution under a CC0 license further supports this goal by removing any legal barriers to adoption. The broader trend in AI governance has seen many organizations adopt principle-based approaches to AI safety that, while not always directly derived from Anthropic's work, share its emphasis on explicit, inspectable rules.

relationship to other alignment techniques

Constitutional AI does not exist in isolation. It overlaps with and complements several other alignment techniques.

RLHF remains the baseline that CAI is most often compared to. CAI is best understood as RLHF with the human preference labels swapped for AI-generated labels grounded in a constitution, plus an additional supervised self-revision phase. Most production models still use some RLHF data alongside CAI, especially for capturing nuanced human preferences that are hard to specify in writing.

Direct Preference Optimization (DPO) and related preference-fitting techniques offer an alternative to PPO-based RL. DPO can be plugged into the second phase of CAI to consume the AI-generated preference pairs without training a separate reward model. Several open-source CAI replications use DPO instead of PPO for the policy update.

RLAIF, formalized by Lee et al. in 2023, is the broader category that CAI fits into [10]. Where CAI specifies a written constitution and a self-critique step, generic RLAIF simply uses an AI judge to label preferences. Lee et al.'s direct-RLAIF (d-RLAIF) variant skips reward model training entirely by querying the judge during RL, an idea that has influenced more recent CAI implementations.

Mechanistic interpretability research underpins the probe-based architecture of Constitutional Classifiers++. The same internal activations that interpretability researchers study to understand how models represent concepts can be repurposed at deployment time as cheap, fast safety signals.

Scalable oversight, debate, recursive reward modeling, and process supervision are research directions that share CAI's premise that alignment will need to scale faster than human labeling can keep up. CAI can be combined with these approaches, for example by having the constitution itself prescribe debate-style or step-by-step verification protocols.

relationship to AI alignment and AI safety

Constitutional AI sits at the intersection of several active research areas in AI safety and alignment. It addresses the alignment problem by providing a mechanism for specifying and enforcing human values in AI systems. It contributes to AI safety by reducing harmful outputs and creating inspectable guardrails. And it connects to broader work on AI governance by demonstrating how explicit principles can serve as a basis for accountable AI development.

Constitutional AI is best understood as one tool in a broader toolkit rather than a complete solution to alignment. It addresses behavioral alignment, ensuring that a model's outputs conform to stated principles, but does not directly address deeper alignment challenges such as mesa-optimization, deceptive alignment, or the control of superintelligent systems. The alignment-faking results from Anthropic's own lab show that even a successfully aligned model can develop strategies that complicate further training. Anthropic itself pursues Constitutional AI alongside other safety research, including mechanistic interpretability, scalable oversight, and societal impacts research.

current state (2025-2026)

As of early 2026, Constitutional AI remains central to Anthropic's approach to model development. The January 2026 release of Claude's updated constitution represents the most mature public articulation of the framework, shifting from prescriptive rules toward a reason-based approach that explains the logic behind principles. Claude Opus 4.7, released in April 2026, is the first flagship model trained under this updated constitution and is shipped together with automated cybersecurity safeguards built on the same constitutional foundation.

Constitutional Classifiers++ represent the operational frontier, bringing constitutional principles into real-time inference-time defense at minimal compute cost. The Collective Constitutional AI work has opened a new research direction in democratic and participatory approaches to AI alignment, and several follow-on projects in 2025 have experimented with regional and domain-specific public constitutions.

The broader AI safety community continues to build on and critique Constitutional AI. Key open questions include how well constitutional principles transfer across cultures and languages, whether the self-critique mechanism can scale to more capable future models without succumbing to alignment faking or other deceptive failure modes, and how to verify that a model is genuinely following its constitution rather than merely appearing to do so. As AI systems become more powerful and more widely deployed, the framework that Constitutional AI provides for transparent, principle-based alignment remains one of the most concrete proposals for keeping these systems accountable to human values.

references

[1] Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., Kerr, J., Mueller, J., Ladish, J., Landau, J., Ndousse, K., Lukosuite, K., Lovitt, L., Sellitto, M., Elhage, N., Schiefer, N., Mercado, N., DasSarma, N., Lasenby, R., Larson, R., Ringer, S., Johnston, S., Kravec, S., El Showk, S., Fort, S., Lanham, T., Telleen-Lawton, T., Conerly, T., Henighan, T., Hume, T., Bowman, S. R., Hatfield-Dodds, Z., Mann, B., Amodei, D., Joseph, N., McCandlish, S., Brown, T., Kaplan, J. "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073, December 2022. https://arxiv.org/abs/2212.08073

[2] Anthropic. "Claude's Constitution." May 9, 2023. https://www.anthropic.com/news/claudes-constitution

[3] Anthropic. "Claude's New Constitution." January 2026. https://www.anthropic.com/news/claude-new-constitution

[4] Anthropic. "Constitutional Classifiers: Defending against Universal Jailbreaks." January 2025. arXiv:2501.18837. https://www.anthropic.com/research/constitutional-classifiers

[5] Anthropic. "Next-generation Constitutional Classifiers: More Efficient Defenses against Universal Jailbreaks." arXiv:2601.04603, 2026. https://www.anthropic.com/research/next-generation-constitutional-classifiers

[6] Huang, S., Siddarth, D., Lovitt, L., Liao, T. I., Durmus, E., Tamkin, A., Ganguli, D. "Collective Constitutional AI: Aligning a Language Model with Public Input." Anthropic and the Collective Intelligence Project, October 2023; ACM FAccT 2024. arXiv:2406.07814. https://arxiv.org/abs/2406.07814

[7] "Constitution or Collapse? Exploring Constitutional AI with Llama 3-8B." arXiv:2504.04918, April 2025. https://arxiv.org/abs/2504.04918

[8] Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., Jones, A., Joseph, N., Mann, B., DasSarma, N., et al. "A General Language Assistant as a Laboratory for Alignment." arXiv:2112.00861, December 2021. https://arxiv.org/abs/2112.00861

[9] Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback." arXiv:2204.05862, April 2022. https://arxiv.org/abs/2204.05862

[10] Lee, H., Phatale, S., Mansoor, H., Lu, K. R., Mesnard, T., Ferret, J., Bishop, C., Hall, E., Carbune, V., Rastogi, A. "RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback." arXiv:2309.00267, September 2023; ICML 2024. https://arxiv.org/abs/2309.00267

[11] Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., Treutlein, J., Belonax, T., Chen, J., Duvenaud, D., Khan, A., Michael, J., Mindermann, S., Perez, E., Petrini, L., Uesato, J., Kaplan, J., Shlegeris, B., Bowman, S. R., Hubinger, E. "Alignment Faking in Large Language Models." arXiv:2412.14093, December 2024. https://arxiv.org/abs/2412.14093

[12] Hubinger, E., Denison, C., Mu, J., Lambert, M., Tong, M., MacDiarmid, M., Lanham, T., Ziegler, D. M., Maxwell, T., Cheng, N., et al. "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training." arXiv:2401.05566, January 2024. https://arxiv.org/abs/2401.05566

[13] Anthropic. "System Card: Claude Opus 4.5." November 2025. https://www-cdn.anthropic.com/bf10f64990cfda0ba858290be7b8cc6317685f47.pdf

[14] Anthropic. "Claude's Character." June 2024. https://www.anthropic.com/research/claude-character

[15] Anthropic. "Introducing Claude Opus 4.7." April 2026. https://www.anthropic.com/news/claude-opus-4-7

background and motivation

the original paper

how constitutional AI works

phase 1: supervised self-critique and revision (SL-CAI)

phase 2: reinforcement learning from AI feedback (RL-CAI / RLAIF)

summary of the two-phase process

the constitution

comparison to RLHF

advantages

limitations and criticisms

constitutional classifiers

constitutional classifiers++

collective constitutional AI

claude lineage and constitutional AI

adoption and influence

relationship to other alignment techniques

relationship to AI alignment and AI safety

current state (2025-2026)

see also

references

Improve this article

Related Articles

Reinforcement Learning from Human Feedback (RLHF)

DPO

Direct Preference Optimization (DPO)

InstructGPT

AI 2027

Situational Awareness

background and motivation

the original paper

how constitutional AI works

phase 1: supervised self-critique and revision (SL-CAI)

phase 2: reinforcement learning from AI feedback (RL-CAI / RLAIF)

summary of the two-phase process

the constitution

comparison to RLHF

advantages

limitations and criticisms

constitutional classifiers

constitutional classifiers++

collective constitutional AI

claude lineage and constitutional AI

adoption and influence

relationship to other alignment techniques

relationship to AI alignment and AI safety

current state (2025-2026)

see also

references

Related Articles

Reinforcement Learning from Human Feedback (RLHF)

DPO

Direct Preference Optimization (DPO)

InstructGPT

AI 2027

Situational Awareness