Paul Christiano
Last reviewed
May 2, 2026
Sources
15 citations
Review status
Source-backed
Revision
v1 · 2,601 words
Paul Christiano is an American researcher in AI safety and AI alignment, best known as one of the principal architects of Reinforcement Learning from Human Feedback (RLHF). He founded the Alignment Research Center (ARC) in April 2021 after leaving OpenAI, where he had led the language model alignment team. Since April 16, 2024 he has served as Head of AI Safety at the US AI Safety Institute, housed at the National Institute of Standards and Technology (NIST). His theoretical work includes Iterated Distillation and Amplification, Eliciting Latent Knowledge, and mechanistic anomaly detection.
| Attribute | Details |
|---|---|
| Fields | Artificial intelligence, AI alignment, theoretical computer science |
| Alma mater | Massachusetts Institute of Technology (B.S., Mathematics, 2012); University of California, Berkeley (Ph.D., 2017) |
| Doctoral advisor | Umesh Vazirani |
| Known for | RLHF, Iterated Distillation and Amplification, Eliciting Latent Knowledge, founding ARC |
| Employer | US AI Safety Institute (NIST) |
| Spouse | Ajeya Cotra |
Christiano grew up in California and attended The Harker School in San Jose. In 2008, while still in high school, he competed at the 49th International Mathematical Olympiad in Madrid as a member of the United States team and won a silver medal. He has occasionally referenced this background in interviews when discussing his early interest in formal problem solving.
He went on to study at the Massachusetts Institute of Technology, graduating with a Bachelor of Science in mathematics in 2012. As an undergraduate he published in theoretical computer science, working on data structures, quantum cryptography, and combinatorial optimization. One of his MIT-era results was a faster algorithm for approximating maximum flow in undirected graphs, which drew attention in the algorithms community at the time.
For graduate school he moved to the University of California, Berkeley, where he completed a Ph.D. in 2017 under the supervision of Umesh Vazirani. His dissertation, titled "Manipulation-Resistant Online Learning," examined how online learning algorithms can be designed so that honest users still receive strong performance guarantees when other users in the same system behave adversarially. The thesis covered prediction with expert advice, contextual bandits, and collaborative filtering, and proposed algorithms that let honest participants do nearly as well as if they had pooled their data privately and used a traditional learner.
During graduate school he also collaborated with Katja Grace at Berkeley on AI Impacts, working on a methodology for comparing the computational power of supercomputers and brains using a metric called traversed edges per second. He was active in the LessWrong and effective altruism communities during this period, and ran a blog called The Sideways View at ai-alignment.com that hosted most of his early writing on the alignment problem.
Christiano joined OpenAI in early 2017, shortly before completing his Ph.D., and worked there until the end of January 2021. He started on the safety team and eventually led the language model alignment group, which became part of the broader effort that produced techniques used in InstructGPT and later ChatGPT.
His first major project at OpenAI was a paper that has since become foundational to modern large language model training. Together with Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei, he published "Deep Reinforcement Learning from Human Preferences" at NeurIPS 2017. The paper showed that an agent could learn complex behaviors in Atari games and simulated robotics from non-expert humans' comparisons between pairs of trajectory segments, with feedback on less than 1% of the agent's interactions with the environment. That technique, now generally called RLHF, is the same basic recipe used to fine-tune ChatGPT, Claude, and most other instruction-following systems shipped after 2022.
While at OpenAI he also developed the research agenda that came to be called Iterated Distillation and Amplification, and co-authored "AI Safety via Debate" (2018) with Geoffrey Irving and Dario Amodei. He has said that his departure was driven mostly by a desire to work on more theoretical and conceptual problems than were a good fit at a frontier lab, and that he might have left sooner had Geoffrey Irving's departure not left him as the natural person to manage the team.
In April 2021 Christiano announced the Alignment Research Center, a Berkeley nonprofit focused on theoretical alignment problems that he felt were not being addressed inside frontier labs. Early funding included a $265,000 grant from Open Philanthropy in March 2022. ARC also received a $1.25 million grant from the FTX Future Fund, which it returned after FTX collapsed in late 2022.
ARC's theory team, led by Christiano, focused on producing what the organization calls "heuristic arguments": a formal language for reasoning about why a neural network produces the outputs it does. The goal is to use these arguments as the basis for mechanistic anomaly detection, a technique for flagging inputs that cause a network to behave for unusual internal reasons even when its outputs look fine.
In late 2021 Christiano, Ajeya Cotra, and Mark Xu published the technical report "Eliciting latent knowledge: How to tell if your eyes deceive you," which became one of the most discussed alignment documents of the year on the AI Alignment Forum and LessWrong.
In 2022 ARC hired Beth Barnes, who previously worked on the OpenAI alignment team, to start a new project called ARC Evals. ARC Evals built capability evaluations for frontier models and ran the pre-release dangerous-capability evaluation of GPT-4 in March 2023, in partnership with OpenAI. The same team also evaluated early versions of Claude for Anthropic and partnered with the UK's Frontier AI Taskforce.
On September 19, 2023 ARC Evals announced that it was spinning out as an independent nonprofit, and on December 4, 2023 it formally relaunched as METR, short for Model Evaluation and Threat Research. Beth Barnes continued as the head of the new organization. Christiano had originally been slated to serve as a board member and advisor at METR, but declined that role after taking the position at the US AI Safety Institute, citing the need to avoid conflicts of interest.
On April 16, 2024 US Commerce Secretary Gina Raimondo announced an expanded leadership team for the US AI Safety Institute, the new body inside NIST tasked with carrying out the safety provisions of the Biden administration's October 2023 executive order on AI. Christiano was named Head of AI Safety. In that role he is responsible for designing and conducting tests of frontier AI models, with a particular focus on capabilities of national security concern, and for advising the institute on risk mitigations for frontier systems.
His appointment drew internal pushback. According to reporting in March and April 2024, some NIST staff and scientists said they were considering resigning over the choice, citing concerns that Christiano's connection to the effective altruism community and his stated views on existential risk would skew the agency's priorities. The appointment proceeded.
When Christiano joined the AI Safety Institute, he stepped down from his prior role as a trustee of Anthropic's Long-Term Benefit Trust, on which he had been one of the five founding trustees.
Reinforcement Learning from Human Feedback is the technique of training a model to maximize a reward signal that is itself learned from human comparisons between candidate outputs. The 2017 NeurIPS paper that Christiano led at OpenAI was not the first attempt at preference-based reinforcement learning, but it was the first to demonstrate that the approach could scale to deep neural networks tackling complex environments using only a small fraction of human-labeled trajectories. In one of the paper's experiments, the team trained a simulated robot to do a backflip using fewer than a thousand human comparisons. They also reported results on Atari benchmarks where the agent matched or exceeded the performance of a hand-engineered reward function.
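The reward-learning step described above can be illustrated with a minimal sketch. It assumes a linear reward model and a simulated labeler, and it fits the model with a Bradley-Terry style cross-entropy loss, a common formulation for learning from pairwise comparisons. The function names, the two-dimensional features, and the hidden preference vector are all hypothetical. The sketch stops short of the reinforcement learning stage, in which a policy would be optimized against the learned reward.

```python
import math
import random

def reward(weights, features):
    """Linear reward r(x) = w . x (an illustrative stand-in for a neural network)."""
    return sum(w * f for w, f in zip(weights, features))

def preference_prob(weights, a, b):
    """Bradley-Terry probability that a is preferred to b under the current reward."""
    return 1.0 / (1.0 + math.exp(reward(weights, b) - reward(weights, a)))

def fit_reward_model(comparisons, dim, lr=0.5, epochs=200):
    """Gradient descent on the cross-entropy loss over (preferred, rejected) pairs."""
    weights = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for preferred, rejected in comparisons:
            p = preference_prob(weights, preferred, rejected)
            # Gradient of -log p with respect to w is -(1 - p) * (preferred - rejected).
            for i in range(dim):
                grad[i] -= (1.0 - p) * (preferred[i] - rejected[i])
        weights = [w - lr * g / len(comparisons) for w, g in zip(weights, grad)]
    return weights

def make_comparisons(n, seed=0):
    """Simulated labeler (assumption): prefers the item with the larger hidden score."""
    rng = random.Random(seed)
    hidden = [2.0, -1.0]  # hypothetical "true" preferences, never shown to the learner
    pairs = []
    for _ in range(n):
        a = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        b = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        pairs.append((a, b) if reward(hidden, a) >= reward(hidden, b) else (b, a))
    return pairs

comparisons = make_comparisons(200)
learned = fit_reward_model(comparisons, dim=2)
```

The learner sees only which item in each pair was preferred, never the hidden scores, yet the fitted weights recover the direction of the labeler's preferences. That is the property that lets a small number of comparisons stand in for a hand-written reward function.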
The practical impact arrived a few years later. OpenAI's InstructGPT (2022) and then ChatGPT used a version of RLHF on top of large language models to make them follow instructions and refuse harmful requests. Anthropic, DeepMind, and most other major labs adopted similar pipelines, often combined with related techniques such as constitutional AI or direct preference optimization. The 2017 paper is now one of the most cited works in modern alignment, and Christiano is regularly described as one of its principal architects. In its 2023 TIME100 AI profile, TIME magazine credited him as such, while also noting his subsequent shift toward more theoretical work at ARC.
IDA is a research agenda Christiano proposed while at OpenAI. The basic idea is to scale capability without losing alignment by alternating two operations. In the amplification step, a slow but trusted process (for example, a human asking many copies of a current model to break a problem into subproblems and answer them) produces a more capable but more expensive system. In the distillation step, a smaller model is trained to imitate that amplified system, recovering speed at the cost of some power. Repeating this loop is supposed to reach high capability while keeping each step's reasoning legible and auditable.
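The amplify-then-distill loop can be sketched on a toy task. The example below is an illustration under strong assumptions, not Christiano's formulation: the task is summing short tuples of bits, the starting model can only handle two numbers, the overseer's decomposition is a fixed split into halves, and distillation is plain memorization standing in for supervised learning. All names are hypothetical.

```python
from itertools import product

def base_model(numbers):
    """Weak starting model (hypothetical): only sums tuples of at most two numbers."""
    return sum(numbers) if len(numbers) <= 2 else None

def amplify(model):
    """Amplification: a slower, stronger system built from several calls to `model`."""
    def amplified(numbers):
        direct = model(numbers)
        if direct is not None:
            return direct
        mid = len(numbers) // 2
        left, right = model(numbers[:mid]), model(numbers[mid:])
        if left is None or right is None:
            return None  # the subproblems are still beyond the current model
        return left + right  # the overseer combines the two sub-answers
    return amplified

def distill(amplified, training_inputs):
    """Distillation: fit a fast model to the amplified system's answers.
    Memorizing a lookup table stands in for supervised learning here."""
    table = {}
    for x in training_inputs:
        answer = amplified(x)
        if answer is not None:
            table[x] = answer
    return lambda numbers: table.get(tuple(numbers))

# Training distribution (assumption): every bit tuple of length 1 through 8.
training_inputs = [t for n in range(1, 9) for t in product((0, 1), repeat=n)]

model = base_model
for _ in range(2):  # each round doubles the tuple length the fast model can handle
    model = distill(amplify(model), training_inputs)
```

After two rounds the distilled model answers length-8 inputs with a single lookup, although the base model could only handle two numbers. In this toy the distillation step is lossless; with a learned model it would be approximate, which is the loss of power the paragraph above refers to.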
The agenda has structural similarities to expert iteration and to AlphaGo Zero's self-play loop, and it has been a major reference point for later work on scalable oversight, including AI safety via debate.
The ELK report, co-authored with Ajeya Cotra and Mark Xu in December 2021, frames a problem that Christiano considers central to alignment: how do you train a model to honestly tell you what it knows, even in cases where its understanding of the world conflicts with what an ordinary human supervisor would believe? The report's running example is an AI system guarding a diamond in a vault: the system predicts what the vault's security camera will show and may internally represent what is really happening in the vault, but it is rewarded only for outcomes whose footage looks fine to a human reviewer.
The report does not claim to solve the problem. Instead it presents the question in detail, walks through more than a dozen candidate training strategies, and for each strategy describes a counterexample where the strategy would fail. The structure is explicitly a builder-versus-breaker game, and the document was offered as a public puzzle, with ARC running prizes for new proposals.
Mechanistic anomaly detection is the more recent direction of ARC's theory work. The premise is that even if you cannot fully interpret a network, you can hope to construct a heuristic argument that explains why it performs well on the training distribution. Once you have such an argument, you can flag any new input where the argument no longer applies, on the grounds that the network's behavior on that input is being driven by some other internal mechanism that you have not vetted. This is intended as a route around the failure modes that motivated ELK.
Christiano has been unusually willing to put numbers on his beliefs about catastrophic AI risk. In an April 2023 conversation on the Bankless podcast he said there is roughly a "10 to 20 percent chance of AI takeover, with many or most humans dead," and added that he thought there was "something like a 50/50 chance of doom shortly after you have AI systems that are human level." In a 2023 post on the AI Alignment Forum titled "My views on doom," he laid out a more granular breakdown of where he thinks the probability mass sits.
His probability estimates differ from those of Eliezer Yudkowsky mostly in expecting a slower transition rather than a different end state. Yudkowsky tends to expect a sharp takeoff and very limited time for course correction; Christiano expects something more gradual, with several years of human-level systems doing economically useful work before the situation becomes critical. He has called this picture "slow takeoff" in his writing, and it has been influential among researchers who try to plan for AI governance and evaluation regimes.
He takes existential risk from AI seriously enough to have organized his career around reducing it, but he has also pushed back against framings that treat catastrophe as inevitable or that ignore the possibility of incremental, technical interventions. His current job at NIST reflects that view: it is a bet that careful empirical evaluation of frontier systems can buy useful time even before the deeper alignment problems are solved.
| Year | Title | Co-authors | Venue |
|---|---|---|---|
| 2017 | Deep Reinforcement Learning from Human Preferences | Jan Leike, Tom Brown, Miljan Martic, Shane Legg, Dario Amodei | NeurIPS 2017 |
| 2017 | Manipulation-Resistant Online Learning (Ph.D. thesis) | (sole author) | UC Berkeley |
| 2018 | AI Safety via Debate | Geoffrey Irving, Dario Amodei | arXiv |
| 2018 | Supervising Strong Learners by Amplifying Weak Experts | Buck Shlegeris, Dario Amodei | arXiv |
| 2021 | Eliciting Latent Knowledge: How to Tell If Your Eyes Deceive You | Ajeya Cotra, Mark Xu | ARC technical report |
| 2022 | Mechanistic Anomaly Detection and ELK | (sole author) | ai-alignment.com / Alignment Forum |
| Year | Position |
|---|---|
| 2017 to 2021 | Researcher and head of language model alignment, OpenAI |
| 2021 to 2024 | Founder and head of theory, Alignment Research Center |
| 2023 | Initial trustee, Anthropic Long-Term Benefit Trust (stepped down 2024) |
| 2023 | Member, UK Frontier AI Taskforce advisory board |
| 2023 | Named to TIME 100 Most Influential People in AI |
| 2024 to present | Head of AI Safety, US AI Safety Institute (NIST) |
In the September 2023 TIME100 AI list, the magazine credited Christiano as one of the principal architects of RLHF and described him as among the most respected researchers in alignment.
Christiano is married to Ajeya Cotra, an alignment researcher who was previously a senior program officer at Open Philanthropy and who was a co-author on the ELK report. She has also been on the technical staff at METR.