AI Alignment
41 articles
AI control
AI Safety
AI safety via debate
AI Safety
Agentic misalignment
AI Safety, Anthropic
Alignment faking
AI Safety, Anthropic
Apollo Research
AI Companies, AI Research, AI Safety
Collective Constitutional AI
AI Ethics, Anthropic
Constitutional AI
AI Safety, Anthropic
Constitutional Classifiers
AI Inference, AI Safety, Anthropic
DPO
Training & Optimization
Deceptive alignment
AI Safety
Direct Preference Optimization (DPO)
Deep Learning, Machine Learning, Natural Language Processing
Eliciting latent knowledge
AI Safety
Frontier Model Forum
AI Agents, AI Companies, AI Safety
Goodhart's law
Statistics
Gradient hacking
AI Safety
Inner alignment
AI Safety
InstructGPT
Large Language Models, OpenAI, Training & Optimization
Instrumental convergence
AI Safety
Jan Leike
People
KTO
AI Inference, Reinforcement Learning, Training & Optimization
MACHIAVELLI (benchmark)
AI Benchmarks, AI Ethics, AI Safety
Mesa-optimization
AI Safety
Model Spec
AI Safety, OpenAI
Model organisms of misalignment
AI Safety, Anthropic
Outer alignment
AI Safety
RLOO (REINFORCE Leave-One-Out)
Reinforcement Learning, Training & Optimization
Recursive reward modeling
AI Safety, Reinforcement Learning
Redwood Research
AI Research, AI Safety
Reinforcement Learning from Human Feedback (RLHF)
Deep Learning, Machine Learning, Natural Language Processing
Reinforcement learning from human feedback
Machine Learning, Reinforcement Learning
Reward hacking
AI Safety, Machine Learning, Reinforcement Learning
Rule-Based Rewards (RBR)
AI Safety, OpenAI
SPIN (Self-Play Fine-Tuning)
Large Language Models, Training & Optimization
Scalable oversight
AI Safety
Self-Rewarding Language Models
Large Language Models, Training & Optimization
SimPO
Large Language Models, Training & Optimization
Sparrow (DeepMind)
Conversational AI, Google DeepMind
Specification gaming
AI Safety, Reinforcement Learning
Superalignment
AI Safety
Sycophancy (artificial intelligence)
AI Safety, Large Language Models
Weak-to-Strong Generalization
AI Research, OpenAI