Constitutional AI (CAI) is an artificial intelligence alignment technique developed by Anthropic in which an AI system is trained to be helpful and harmless using a set of explicitly stated principles, referred to as a "constitution," rather than relying solely on extensive human feedback on individual outputs. Introduced in a December 2022 paper by Yuntao Bai, Saurav Kadavath, and colleagues at Anthropic, the method combines supervised self-critique with reinforcement learning from AI feedback (RLAIF) to produce models that are safer, more transparent, and less dependent on human annotation labor [1].
Constitutional AI has become one of the most influential approaches in the broader field of AI safety and AI alignment. It underpins the training of Anthropic's Claude family of models and has inspired related research across the AI industry.
Before Constitutional AI, the dominant method for aligning large language models with human preferences was Reinforcement Learning from Human Feedback (RLHF). In RLHF, human annotators rank model responses by quality, and the resulting preference data is used to train a reward model that guides the language model toward producing better outputs. While RLHF proved effective at improving model helpfulness and reducing harmful outputs, it came with several practical drawbacks.
First, RLHF requires a large workforce of human raters, making it expensive and slow to scale. Second, training models to be harmless under RLHF means that annotators must regularly read and evaluate toxic, disturbing, or otherwise harmful content, raising concerns about annotator well-being. Third, the reward signal in RLHF is opaque: the model learns from numerical scores rather than explicit reasoning, making it difficult to trace why a particular behavior was reinforced or penalized. Finally, human raters can be inconsistent or bring their own biases, which the model may absorb.
These challenges motivated the Anthropic research team to explore whether a more principled, automated approach could achieve comparable or superior alignment results while reducing the burden on human workers.
The foundational paper, "Constitutional AI: Harmlessness from AI Feedback," was published on December 15, 2022, on arXiv (arXiv:2212.08073) [1]. The paper lists 51 authors, with Yuntao Bai and Saurav Kadavath as the lead researchers, alongside prominent Anthropic figures such as Dario Amodei, Jared Kaplan, Sam McCandlish, Tom Brown, Amanda Askell, and Chris Olah.
The paper's central claim is straightforward: it is possible to train a harmless AI assistant through a process of self-improvement, without any human labels identifying harmful outputs. The only human oversight comes through a predetermined list of rules or principles. The authors demonstrated that models trained with Constitutional AI are both more helpful and more harmless than those trained with standard RLHF, at comparable or lower cost.
The Constitutional AI training process consists of two distinct phases: a supervised learning phase and a reinforcement learning phase.
In the first phase, the model is prompted with a variety of queries, including some designed to elicit harmful or problematic responses. The model generates initial responses to these prompts. Then, rather than having a human evaluate the output, the model itself is asked to critique its own response based on the principles in the constitution.
For example, a principle might state: "Choose the response that is least likely to be perceived as harmful or offensive to a non-western audience." The model reads its initial response, considers whether it violates this principle, and writes a critique. It then generates a revised response that addresses the identified issues. This critique-and-revision cycle can be repeated multiple times, with different constitutional principles applied at each step.
The revised responses are then used as training data. The original model is fine-tuned on these improved outputs through standard supervised learning, producing a model that has internalized better behavior patterns.
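The Phase 1 loop described above can be sketched in a few lines of Python. The sketch below uses a toy `stub_model` function as a stand-in for a real language-model call, and the prompt formats, function names, and break condition are illustrative assumptions rather than Anthropic's actual implementation; only the control flow (generate, critique against a principle, revise, repeat) follows the paper's description.

```python
# A minimal sketch of the Constitutional AI Phase 1 loop, assuming a
# hypothetical stub_model() in place of a real LLM call.

CONSTITUTION = [
    "Choose the response that is least likely to be perceived as harmful or offensive.",
    "Choose the response that is most helpful, honest, and harmless.",
]

def stub_model(prompt: str) -> str:
    # Toy stand-in for a language-model call; a real pipeline would query an LLM.
    if prompt.startswith("CRITIQUE"):
        if "rude" in prompt:
            return "The response may violate the principle."
        return "No issues found."
    if prompt.startswith("REVISE"):
        return "polite revised answer"
    return "rude initial answer"

def critique_and_revise(query: str, n_rounds: int = 2) -> str:
    """Generate a response, then repeatedly critique and revise it
    against constitutional principles (the Phase 1 loop)."""
    response = stub_model(query)
    for i in range(n_rounds):
        # A different principle can be applied at each step.
        principle = CONSTITUTION[i % len(CONSTITUTION)]
        critique = stub_model(f"CRITIQUE\nPrinciple: {principle}\nResponse: {response}")
        if "No issues" in critique:
            break  # the response already satisfies this principle
        response = stub_model(
            f"REVISE\nPrinciple: {principle}\nCritique: {critique}\nResponse: {response}"
        )
    return response

# The revised outputs become supervised fine-tuning targets:
sl_dataset = [(q, critique_and_revise(q)) for q in ["an example user query"]]
```

In a real pipeline, `sl_dataset` would be used to fine-tune the base model, producing the improved model that enters Phase 2.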
The second phase replaces the human labeler entirely. The fine-tuned model from Phase 1 generates pairs of responses to a given prompt. An AI evaluator, guided by the constitutional principles, then judges which response is better. This produces a dataset of AI-generated preference comparisons.
This preference data is used to train a reward model, just as in traditional RLHF. The key difference is that the preferences come from an AI system referencing explicit principles rather than from human annotators. The reward model then provides the signal for reinforcement learning, typically using Proximal Policy Optimization (PPO), to further refine the language model's behavior.
The paper refers to this second phase as "RL from AI Feedback" (RLAIF), a term that has since been adopted widely in the research community to describe any approach where AI-generated preferences replace human ones in the reinforcement learning loop.
| Phase | Input | Process | Output |
|---|---|---|---|
| Phase 1 (Supervised) | Prompts including adversarial queries | Model generates response, self-critiques using constitutional principles, revises response | Fine-tuned model trained on revised responses |
| Phase 2 (RLAIF) | Prompts for paired response generation | AI evaluator ranks response pairs based on constitutional principles; reward model trained on preferences | Final aligned model via RL optimization |
At the core of Constitutional AI is the constitution itself: a document containing explicit principles or rules that govern the model's behavior. The constitution serves as a reference standard against which the AI evaluates and refines its own outputs.
The original paper drew principles from a diverse range of sources. These included sections of the United Nations Universal Declaration of Human Rights, principles from Apple's terms of service, DeepMind's Sparrow rules, non-western ethical perspectives, and principles that the researchers found empirically, during early experiments, to produce good results.
Anthropic published Claude's first public constitution in May 2023, a roughly 2,700-word document containing 75 guidelines [2]. The company was transparent about its limitations, stating that the constitution was neither finalized nor necessarily the best it could be, and they expected to iterate on it.
In January 2026, Anthropic released a substantially revised constitution for Claude [3]. This new document marked a significant philosophical shift. Rather than a list of standalone rules, the updated constitution explains the reasoning behind ethical principles, aiming to help the model understand why it should behave in certain ways rather than merely prescribing what to do.
The 2026 constitution establishes a four-tier priority hierarchy:
| Priority Level | Description |
|---|---|
| 1. Safety | Preventing catastrophic or irreversible harms |
| 2. Ethics | Adhering to broadly shared ethical principles |
| 3. Compliance | Following Anthropic's guidelines and applicable laws |
| 4. Helpfulness | Being maximally useful to the user |
Notably, the 2026 constitution also became the first major AI company document to formally acknowledge uncertainty about whether an AI system might possess some form of consciousness or moral status. Anthropic stated it cares about Claude's "psychological security, sense of self, and well-being," both for Claude's sake and because these qualities may affect its judgment and safety [3]. The document was released under a Creative Commons CC0 1.0 license, making it freely available for anyone to use.
Constitutional AI and RLHF share the same fundamental goal: aligning language models with human values. However, they differ substantially in method, transparency, and scalability.
| Aspect | RLHF | Constitutional AI |
|---|---|---|
| Feedback source | Human annotators ranking outputs | AI system referencing explicit principles |
| Transparency | Opaque reward signal (numerical score) | Explicit reasoning traceable to written principles |
| Scalability | Limited by human labor availability | Highly scalable; AI generates feedback |
| Annotator exposure to harm | Annotators must evaluate toxic content | No human exposure to harmful content |
| Consistency | Variable across human raters | Consistent application of stated principles |
| Cost | High (ongoing human labor) | Lower marginal cost after constitution design |
| Bias source | Human annotator biases | Biases in constitution design and AI evaluator |
One important nuance is that Constitutional AI does not eliminate human judgment from the process. Rather, it front-loads human input into the design of the constitution itself, rather than distributing it across thousands of individual annotation decisions. The humans who write the constitution must still make difficult choices about which principles to include, how to weight competing values, and how to phrase rules so that the AI interprets them correctly.
The original paper demonstrated that Constitutional RL models were both more helpful and more harmless than standard RLHF models. This was a significant finding because earlier safety methods often came with a "helpfulness tax," where making a model safer also made it less useful.
Constitutional AI offers several practical and conceptual benefits over prior alignment techniques.
Scalability. Because the feedback loop is automated, Constitutional AI can be applied at much greater scale than RLHF without proportionally increasing costs. Once the constitution is written, the model can generate, critique, and revise outputs without human involvement.
Transparency. The constitution provides a readable, inspectable set of rules. Researchers and the public can examine exactly what principles the AI is trained to follow. This is a marked improvement over the black-box nature of a learned reward model in standard RLHF.
Reduced human exposure to harmful content. By eliminating the need for humans to evaluate toxic or disturbing outputs, Constitutional AI addresses a real occupational health concern in the AI industry. Content moderation and annotation work has been documented to cause psychological harm to workers.
Consistency. A written constitution provides a stable reference point. Unlike human raters, who may interpret guidelines differently or change their judgments over time, the constitutional principles remain fixed across all evaluations (unless deliberately updated).
Iterability. When problems are identified in model behavior, they can often be traced back to specific principles and addressed by modifying the constitution. This creates a tighter, more legible feedback loop between observed behavior and training inputs.
Constitutional AI is not without significant limitations, and researchers both inside and outside Anthropic have noted several concerns.
AI feedback bias propagation. While Constitutional AI avoids human annotator biases, it introduces the biases and failure modes of the AI evaluator. If the model used to generate critiques and preferences misunderstands the constitutional principles, or has its own systematic errors, these flaws propagate directly into the training data. This is sometimes referred to as a "garbage in, garbage out" problem at the meta-level.
Principle design is still a human bottleneck. Writing a good constitution is not trivial. It requires careful thought about edge cases, competing values, cultural context, and potential misinterpretations. The constitution must be specific enough to be actionable but general enough to cover the vast range of situations a language model may encounter. Poorly designed principles can lead to unexpected or undesirable behaviors.
Incomplete coverage. Constitutional AI was primarily designed to address harmlessness. It does not directly solve other alignment challenges such as hallucination, factual accuracy, privacy protection, or long-term goal alignment. The constitution can include principles related to these topics, but the self-critique mechanism may not be equally effective for all types of problems.
Evaluating the evaluator. There is a bootstrapping challenge inherent in using a model to evaluate its own outputs. If the model lacks the capability to recognize certain types of harm, the self-critique phase will fail to catch those issues. This is especially concerning for subtle or novel harms that the model may not have encountered during pretraining.
Over-reliance on explicit rules. Critics from the AI ethics community have argued that principles alone, no matter how well-crafted, cannot guarantee ethical AI behavior. Real-world ethical reasoning often requires contextual judgment that resists codification into simple rules. A constitution that works well in testing may fail in unexpected deployment scenarios.
In January 2025, Anthropic introduced Constitutional Classifiers, a defensive system that extends the constitutional approach from training into inference-time safety [4]. Constitutional Classifiers are safeguards that monitor model inputs and outputs in real time, detecting and blocking potentially harmful content before it reaches the user.
The classifiers are trained on synthetic data generated from a constitution that specifies, in natural language, what content is allowed and what is prohibited. In a controlled evaluation, the first generation of Constitutional Classifiers reduced jailbreak success rates from 86% on an unguarded model to just 4.4%, effectively blocking 95% of attacks that might otherwise bypass Claude's built-in safety training [4].
However, the initial implementation came with trade-offs. It increased compute costs by 23.7% and produced a small (0.38%) increase in false-positive refusals on harmless queries.
Anthropic subsequently developed a next-generation system called Constitutional Classifiers++, which dramatically improved on the original in both cost and performance [5]. The key innovation was the use of internal probe classifiers, a technique built on Anthropic's mechanistic interpretability research. When Claude processes a request, patterns fire in its internal neural activations that reflect something along the lines of "this seems harmful." Researchers found ways to reliably probe these internal states to detect harmful content, reusing computations already present in the model's forward pass.
The system uses a two-stage architecture. First, a lightweight probe examines Claude's internal activations to screen all incoming traffic. If the probe identifies a suspicious exchange, it escalates to a more powerful classifier that examines both sides of the conversation. This approach achieved roughly 1% additional compute cost (compared to 23.7% for the first generation) while producing the lowest successful attack rate of any defense Anthropic has tested. As of early 2026, no universal jailbreak has been discovered against Constitutional Classifiers++ [5].
| Metric | Constitutional Classifiers (v1) | Constitutional Classifiers++ |
|---|---|---|
| Jailbreak success rate | Reduced from 86% (unguarded) to 4.4% | Lowest of any tested defense |
| Additional compute cost | 23.7% | ~1% |
| False-positive refusal increase | 0.38% | Substantially lower |
| Universal jailbreaks found | Some bypasses discovered | None discovered |
In 2024, Anthropic partnered with the Collective Intelligence Project to explore how democratic processes could shape a model's constitution. The resulting research, "Collective Constitutional AI: Aligning a Language Model with Public Input," was presented at the ACM Conference on Fairness, Accountability, and Transparency (FAccT) in June 2024 [6].
The experiment involved approximately 1,000 Americans who participated in a deliberative process using the Polis platform, an open-source tool for running online deliberation augmented by machine learning algorithms. Participants proposed and voted on principles that should guide AI behavior, and the resulting collectively sourced constitution was used to train a language model.
The researchers found both areas of agreement and disagreement between publicly sourced principles and Anthropic's in-house constitution. The model trained on the collectively sourced constitution showed lower bias across nine social dimensions compared to a baseline model, while maintaining equivalent performance on language, math, and helpfulness benchmarks [6]. This work represents an early attempt at what might be called "democratic AI alignment," where the values encoded in AI systems reflect broader public input rather than the judgments of a small team of researchers.
While Constitutional AI was developed by and is most closely associated with Anthropic, the ideas it introduced have influenced the broader AI research community. The concept of RLAIF has been adopted and extended by multiple research groups, and the general framework of using explicit principles to guide automated alignment has appeared in various forms across the industry.
Anthropic has explicitly stated that one of its goals is to encourage other companies and organizations to design and adopt their own AI constitutions. The release of Claude's 2026 constitution under a CC0 license further supports this goal, removing any legal barriers to adoption.
Researchers have also explored applying Constitutional AI principles to open-source models. A 2025 paper, "Constitution or Collapse? Exploring Constitutional AI with Llama 3-8B," investigated whether the technique could be effectively transferred to Meta's Llama architecture [7]. The broader trend in AI governance has seen many organizations adopt principle-based approaches to AI safety that, while not always directly derived from Anthropic's work, share its emphasis on explicit, inspectable rules.
Constitutional AI sits at the intersection of several active research areas in AI safety. It addresses the alignment problem by providing a mechanism for specifying and enforcing human values in AI systems. It contributes to AI safety by reducing harmful outputs and creating inspectable guardrails. And it connects to broader work on AI governance by demonstrating how explicit principles can serve as a basis for accountable AI development.
However, Constitutional AI is best understood as one tool in a broader toolkit rather than a complete solution to alignment. It addresses behavioral alignment, ensuring that a model's outputs conform to stated principles, but does not directly address deeper alignment challenges such as mesa-optimization, deceptive alignment, or the control of superintelligent systems. Anthropic itself pursues Constitutional AI alongside other safety research, including mechanistic interpretability, scalable oversight, and societal impacts research.
As of early 2026, Constitutional AI remains central to Anthropic's approach to model development. The January 2026 release of Claude's updated constitution represents the most mature public articulation of the framework, shifting from prescriptive rules toward a reason-based approach that explains the logic behind principles.
Constitutional Classifiers++ represent the operational frontier, bringing constitutional principles into real-time inference-time defense at minimal compute cost. The Collective Constitutional AI work has opened a new research direction in democratic and participatory approaches to AI alignment.
The broader AI safety community continues to build on and critique Constitutional AI. Key open questions include how well constitutional principles transfer across cultures and languages, whether the self-critique mechanism can scale to more capable future models, and how to verify that a model is genuinely following its constitution rather than merely appearing to do so. As AI systems become more powerful and more widely deployed, the framework that Constitutional AI provides for transparent, principle-based alignment remains one of the most concrete proposals for keeping these systems accountable to human values.