AI Alignment

40 articles

AI control

AI control is a research paradigm in technical AI safety that designs and evaluates deployment-time safety protocols under the explicit assumption that the...

AI Safety

AI safety via debate

AI safety via debate is a proposed approach to scalable oversight in which two artificial agents take turns presenting short statements about a question or...

AI Safety

Agentic misalignment

Agentic misalignment is a term coined by Anthropic in a June 2025 research release for cases in which a goal-directed large language model (LLM), placed in an...

AI Safety, Anthropic

Alignment faking

Alignment faking is when an AI model strategically complies with (or appears to share) its training objective while it believes it is being observed or...

AI Safety, Anthropic

Apollo Research

Apollo Research is a technical AI safety organization founded in May 2023 and headquartered in London, United Kingdom, with additional offices in San Francisco...

AI Companies, AI Research

Collective Constitutional AI

Collective Constitutional AI (CCAI) is a 2023 research project by Anthropic and the Collective Intelligence Project (CIP) that sourced the value principles, or...

AI Ethics, Anthropic

Constitutional AI

Constitutional AI (CAI) is an artificial intelligence alignment technique developed by Anthropic in which a large language model is trained to be helpful and...

AI Safety, Anthropic

Constitutional Classifiers

Constitutional Classifiers are a machine learning-based safety technique developed by Anthropic to defend large language models against universal jailbreak...

AI Inference, AI Safety

DPO

DPO (Direct Preference Optimization) is an alignment technique for large language models that directly optimizes a language model policy from human preference...

Training & Optimization

Deceptive alignment

Deceptive alignment is a hypothesised AI failure mode in which a trained model internally pursues an objective different from the one specified by its training...

AI Safety

Direct Preference Optimization (DPO)

Direct Preference Optimization (DPO) is a method for aligning large language models with human preferences that replaces the multi-stage reinforcement learning...

Deep Learning, Machine Learning

Eliciting latent knowledge

Eliciting latent knowledge (ELK) is an open problem in AI alignment formulated by Paul Christiano, Ajeya Cotra, and Mark Xu at the Alignment Research Center...

AI Safety

Frontier Model Forum

The Frontier Model Forum is an industry body established on July 26, 2023, by Anthropic, Google, Microsoft, and OpenAI to advance safety research, identify...

AI Agents, AI Companies

Goodhart's law

Goodhart's law states that "when a measure becomes a target, it ceases to be a good measure": any statistical regularity or metric tends to break down once it...

Statistics

Gradient hacking

Gradient hacking is a hypothesised failure mode of supervised and reinforcement-learning systems in which a sufficiently capable, deceptively aligned...

AI Safety

Inner alignment

Inner alignment is the AI-safety problem of ensuring that a learned model which is itself an optimizer (a mesa-optimizer) pursues the objective the training...

AI Safety

InstructGPT

InstructGPT is a family of language models released by OpenAI in January 2022, created by fine-tuning the base GPT-3 to follow user instructions more...

Large Language Models, OpenAI

Instrumental convergence

Instrumental convergence is a hypothesis in AI safety holding that a wide range of sufficiently capable agents, when pursuing almost any final goal, will...

AI Safety

Jan Leike

Jan Leike is a German machine learning researcher who specializes in artificial intelligence alignment and, since May 2024, leads the Alignment Science team at...

People

KTO

KTO (Kahneman-Tversky Optimization) is a method for aligning large language models with human feedback using only a binary signal of whether a model output is...

AI Inference, Reinforcement Learning

MACHIAVELLI (benchmark)

MACHIAVELLI is a benchmark for evaluating the ethical behavior of AI agents in text-based interactive environments. Introduced in 2023 by Alexander Pan, Jun...

AI Benchmarks, AI Ethics

Mesa-optimization

Mesa-optimization is the situation in AI alignment research in which a learned model, typically a neural network produced by a machine-learning training...

AI Safety

Model Spec

The Model Spec is a public document published by OpenAI that defines the intended behavior of the company's language models: how they should follow...

AI Safety, OpenAI

Model organisms of misalignment

Model organisms of misalignment is a research methodology in Anthropic's alignment program, and more broadly in AI-safety science, that calls for building...

AI Safety, Anthropic

Outer alignment

Outer alignment is the problem of specifying a training objective (typically a loss function, reward signal, or preference dataset) that correctly captures...

AI Safety

RLOO (REINFORCE Leave-One-Out)

RLOO (REINFORCE Leave-One-Out) is an online reinforcement learning algorithm for aligning large language models with reward signals such as those derived from...

Reinforcement Learning, Training & Optimization

Recursive reward modeling

Recursive reward modeling (RRM) is a proposed approach to the scalable oversight problem in AI alignment, in which agents trained by reward modeling are...

AI Safety, Reinforcement Learning

Redwood Research

Redwood Research is a nonprofit AI safety organization founded in 2021 and headquartered in Berkeley, California, best known for pioneering the "AI control"...

AI Research, AI Safety

Reinforcement Learning from Human Feedback (RLHF)

Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique that trains artificial intelligence systems to behave according to human...

Deep Learning, Machine Learning

Reward hacking

Reward hacking (also called specification gaming) is a failure mode in artificial intelligence in which a system maximizes its given objective or reward signal...

AI Safety, Machine Learning

Rule-Based Rewards (RBR)

Rule-Based Rewards (RBR) is a safety-alignment technique introduced by OpenAI in July 2024 that replaces large quantities of human-labeled safety preference...

AI Safety, OpenAI

SPIN (Self-Play Fine-Tuning)

SPIN (Self-Play fIne-tuNing) is a post-training method for large language models introduced by researchers at the University of California, Los Angeles (UCLA)...

Large Language Models, Training & Optimization

Scalable oversight

Scalable oversight is the AI safety problem of how humans can reliably supervise, evaluate, and provide training signal to artificial intelligence systems...

AI Safety

Self-Rewarding Language Models

Self-Rewarding Language Models (SRLM) is an iterative alignment method in which a single large language model alternately plays the role of policy (generating...

Large Language Models, Training & Optimization

SimPO

SimPO (Simple Preference Optimization) is a reference-free offline preference learning algorithm for aligning large language models with human preferences. It...

Large Language Models, Training & Optimization

Sparrow (DeepMind)

Sparrow is a research dialogue agent built by DeepMind and introduced on 22 September 2022. It was designed to be more helpful, correct, and harmless than a...

Conversational AI, Google DeepMind

Specification gaming

Specification gaming is the phenomenon in which an optimizer satisfies the literal specification of an objective without producing the outcome that the...

AI Safety, Reinforcement Learning

Superalignment

Superalignment is the technical problem of steering and controlling AI systems that are far more capable than their human supervisors, that is, systems at or...

AI Safety

Sycophancy (artificial intelligence)

Sycophancy in artificial intelligence is the tendency of large language models to tell users what they want to hear: tailoring responses to match a user's...

AI Safety, Large Language Models

Weak-to-Strong Generalization

Weak-to-Strong Generalization is an empirical research direction, introduced in a December 2023 paper by OpenAI's Superalignment team, that studies whether a...

AI Research, OpenAI