AI Wiki
Category: AI Alignment

41 articles

AI control

AI Safety

AI safety via debate

AI Safety

Agentic misalignment

AI Safety, Anthropic

Alignment faking

AI Safety, Anthropic

Apollo Research

AI Companies, AI Research, AI Safety

Collective Constitutional AI

AI Ethics, Anthropic

Constitutional AI

AI Safety, Anthropic

Constitutional Classifiers

AI Inference, AI Safety, Anthropic

DPO

Training & Optimization

Deceptive alignment

AI Safety

Direct Preference Optimization (DPO)

Deep Learning, Machine Learning, Natural Language Processing

Eliciting latent knowledge

AI Safety

Frontier Model Forum

AI Agents, AI Companies, AI Safety

Goodhart's law

Statistics

Gradient hacking

AI Safety

Inner alignment

AI Safety

InstructGPT

Large Language Models, OpenAI, Training & Optimization

Instrumental convergence

AI Safety

Jan Leike

People

KTO

AI Inference, Reinforcement Learning, Training & Optimization

MACHIAVELLI (benchmark)

AI Benchmarks, AI Ethics, AI Safety

Mesa-optimization

AI Safety

Model Spec

AI Safety, OpenAI

Model organisms of misalignment

AI Safety, Anthropic

Outer alignment

AI Safety

RLOO (REINFORCE Leave-One-Out)

Reinforcement Learning, Training & Optimization

Recursive reward modeling

AI Safety, Reinforcement Learning

Redwood Research

AI Research, AI Safety

Reinforcement Learning from Human Feedback (RLHF)

Deep Learning, Machine Learning, Natural Language Processing

Reinforcement learning from human feedback

Machine Learning, Reinforcement Learning

Reward hacking

AI Safety, Machine Learning, Reinforcement Learning

Rule-Based Rewards (RBR)

AI Safety, OpenAI

SPIN (Self-Play Fine-Tuning)

Large Language Models, Training & Optimization

Scalable oversight

AI Safety

Self-Rewarding Language Models

Large Language Models, Training & Optimization

SimPO

Large Language Models, Training & Optimization

Sparrow (DeepMind)

Conversational AI, Google DeepMind

Specification gaming

AI Safety, Reinforcement Learning

Superalignment

AI Safety

Sycophancy (artificial intelligence)

AI Safety, Large Language Models

Weak-to-Strong Generalization

AI Research, OpenAI