Paul Christiano

Paul Christiano is an American researcher in AI safety and AI alignment, best known as one of the principal architects of Reinforcement Learning from Human Feedback (RLHF). He founded the Alignment Research Center (ARC) in April 2021 after leaving OpenAI, where he had led the language model alignment team. Since April 16, 2024 he has served as Head of AI Safety at the US AI Safety Institute, housed at the National Institute of Standards and Technology (NIST). His theoretical work includes Iterated Distillation and Amplification, Eliciting Latent Knowledge, and mechanistic anomaly detection.


Fields	Artificial intelligence, AI alignment, theoretical computer science
Alma mater	Massachusetts Institute of Technology (B.S., Mathematics, 2012); University of California, Berkeley (Ph.D., 2017)
Doctoral advisor	Umesh Vazirani
Known for	RLHF, Iterated Distillation and Amplification, Eliciting Latent Knowledge, founding ARC
Employer	US AI Safety Institute (NIST)
Spouse	Ajeya Cotra

Early life and education

Christiano grew up in California and attended The Harker School in San Jose. In 2008, while still in high school, he competed at the 49th International Mathematical Olympiad in Madrid as a member of the United States team and won a silver medal. He has occasionally referenced this background in interviews when discussing his early interest in formal problem solving.

He went on to study at the Massachusetts Institute of Technology, graduating with a Bachelor of Science in mathematics in 2012. As an undergraduate he published in theoretical computer science, working on data structures, quantum cryptography, and combinatorial optimization. One of his MIT-era results was a faster algorithm for approximating maximum flow in undirected graphs, which received attention in the algorithms community at the time.

For graduate school he moved to the University of California, Berkeley, where he completed a Ph.D. in 2017 under the supervision of Umesh Vazirani. His dissertation, titled "Manipulation-Resistant Online Learning," examined how online learning algorithms can be designed so that honest users still receive strong performance guarantees when other users in the same system behave adversarially. The thesis covered prediction with expert advice, contextual bandits, and collaborative filtering, and proposed algorithms that let honest participants do nearly as well as if they had pooled their data privately and used a traditional learner.

During graduate school he also collaborated with Katja Grace at Berkeley on AI Impacts, working on a methodology for comparing the computational power of supercomputers and brains using a metric called traversed edges per second. He was active in the LessWrong and effective altruism communities during this period, and ran a blog called The Sideways View at ai-alignment.com that hosted most of his early writing on the alignment problem.

OpenAI (2017 to 2021)

Christiano joined OpenAI in early 2017, shortly before completing his Ph.D., and worked there until the end of January 2021. He started on the safety team and eventually led the language model alignment group, which became part of the broader effort that produced techniques used in InstructGPT and later ChatGPT.

His first major project at OpenAI was a paper that has since become foundational to modern large language model training. Together with Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei, he published "Deep Reinforcement Learning from Human Preferences" at NeurIPS 2017. The paper showed that an agent could learn complex behaviors in Atari games and simulated robotics by comparing pairs of trajectories rated by non-expert humans, using feedback on roughly 0.1% of the agent's interactions. That technique, now generally called RLHF, is the same basic recipe used to fine-tune ChatGPT, Claude, and most other instruction-following systems shipped after 2022.

While at OpenAI he also developed the research agenda that came to be called Iterated Distillation and Amplification, and co-authored "AI Safety via Debate" (2018) with Geoffrey Irving and Dario Amodei. He has said that his departure was driven mostly by a desire to work on problems that were more theoretical and conceptual than a frontier lab could comfortably accommodate, and that he might have left sooner if Geoffrey Irving's departure had not left him as the natural person to manage the team.

Alignment Research Center (2021 to 2024)

In April 2021 Christiano announced the Alignment Research Center, a Berkeley nonprofit focused on theoretical alignment problems that he felt were not being addressed inside frontier labs. Early funding included a $265,000 grant from Open Philanthropy in March 2022. ARC also received a $1.25 million grant from the FTX Future Fund, which it returned after the collapse of FTX in late 2022.

ARC's theory team, led by Christiano, focused on producing what the organization calls "heuristic arguments": a formal language for reasoning about why a neural network produces the outputs it does. The goal is to use these arguments as the basis for mechanistic anomaly detection, a technique for flagging inputs on which a network behaves as it does for unusual internal reasons, even when its outputs look fine.

In late 2021 Christiano, Ajeya Cotra, and Mark Xu published the technical report "Eliciting latent knowledge: How to tell if your eyes deceive you," which became one of the most discussed alignment documents of the year on the AI Alignment Forum and LessWrong.

METR spinoff

In 2022 ARC hired Beth Barnes, who had previously worked on the OpenAI alignment team, to start a new project called ARC Evals. ARC Evals built capability evaluations for frontier models and, in partnership with OpenAI, ran the pre-release dangerous-capability evaluation of GPT-4 ahead of the model's release in March 2023. The same team also evaluated early versions of Claude for Anthropic and partnered with the UK's Frontier AI Taskforce.

On September 19, 2023 ARC Evals announced that it was spinning out as an independent nonprofit, and on December 4, 2023 it formally relaunched as METR, short for Model Evaluation and Threat Research. Beth Barnes continued as the head of the new organization. Christiano had originally been slated to serve as a board member and advisor at METR, but he declined that role after taking the position at the US AI Safety Institute, citing the need to avoid conflicts of interest.

US AI Safety Institute (2024 to present)

On April 16, 2024 US Commerce Secretary Gina Raimondo announced an expanded leadership team for the US AI Safety Institute, the new body inside NIST tasked with carrying out the safety provisions of the Biden administration's October 2023 executive order on AI. Christiano was named Head of AI Safety. In that role he is responsible for designing and conducting tests of frontier AI models, with a particular focus on capabilities of national security concern, and for advising the institute on risk mitigations for frontier systems.

His appointment drew internal pushback. According to reporting in March and April 2024, some NIST staff and scientists said they were considering resigning over the choice, citing concerns that Christiano's connection to the effective altruism community and his stated views on existential risk would skew the agency's priorities. The appointment proceeded.

When Christiano joined the AI Safety Institute, he stepped down from his prior role as a trustee of Anthropic's Long-Term Benefit Trust, on which he had been one of the five founding trustees.

Research contributions

RLHF

Reinforcement Learning from Human Feedback is the technique of training a model to maximize a reward signal that is itself learned from human comparisons between candidate outputs. The 2017 NeurIPS paper that Christiano led at OpenAI was not the first attempt at preference-based reinforcement learning, but it was the first to demonstrate that the approach could scale to deep neural networks tackling complex environments using only a small fraction of human-labeled trajectories. In one of the paper's experiments, the team trained a simulated robot to do a backflip using fewer than a thousand human comparisons. They also reported results on Atari benchmarks where the agent matched or exceeded the performance of a hand-engineered reward function.

The practical impact arrived a few years later. OpenAI's InstructGPT (2022) and then ChatGPT used a version of RLHF on top of large language models to make them follow instructions and refuse harmful requests. Anthropic, DeepMind, and most other major labs adopted similar pipelines, often combined with related techniques such as constitutional AI or direct preference optimization. The 2017 paper is now one of the most cited works in modern alignment, and Christiano is regularly described as one of RLHF's principal architects. TIME magazine credited him as such in its 2023 TIME100 AI profile, while also noting his subsequent shift toward more theoretical work at ARC.
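The reward-learning step at the core of the technique can be illustrated with a small sketch. It assumes a linear reward model over hand-made features and a simulated labeler; the 2017 paper used deep networks, real human raters, and a separate reinforcement learning step that is omitted here. All function names and the toy data generator are hypothetical. The preference probability is a sigmoid of the difference in summed predicted reward between two trajectory segments, fitted by cross-entropy against the comparison labels.

```python
import math
import random


def feature_sum(segment):
    """Add up the per-step feature vectors of one trajectory segment."""
    return [sum(col) for col in zip(*segment)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def prefer_prob(w, seg_a, seg_b):
    """Probability that seg_a is preferred, given linear reward r(x) = w . x.

    A softmax over the two summed rewards reduces to a sigmoid of their
    difference.
    """
    fa, fb = feature_sum(seg_a), feature_sum(seg_b)
    diff = sum(wi * (a - b) for wi, a, b in zip(w, fa, fb))
    return sigmoid(diff)


def fit_reward(comparisons, dim, lr=0.1, epochs=200):
    """Fit reward weights by gradient descent on cross-entropy loss.

    comparisons: list of (seg_a, seg_b, label), label 1.0 if seg_a was
    preferred and 0.0 otherwise.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for seg_a, seg_b, label in comparisons:
            fa, fb = feature_sum(seg_a), feature_sum(seg_b)
            err = prefer_prob(w, seg_a, seg_b) - label
            for i in range(dim):
                grad[i] += err * (fa[i] - fb[i])
        w = [wi - lr * g / len(comparisons) for wi, g in zip(w, grad)]
    return w


def make_toy_comparisons(n, seed=0, true_w=(1.0, -1.0), steps=3):
    """Hypothetical stand-in for human raters: prefer the higher true return."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        seg_a = [[rng.uniform(-1, 1) for _ in true_w] for _ in range(steps)]
        seg_b = [[rng.uniform(-1, 1) for _ in true_w] for _ in range(steps)]
        ret_a = sum(w * f for w, f in zip(true_w, feature_sum(seg_a)))
        ret_b = sum(w * f for w, f in zip(true_w, feature_sum(seg_b)))
        data.append((seg_a, seg_b, 1.0 if ret_a > ret_b else 0.0))
    return data
```

Calling `fit_reward(make_toy_comparisons(200), dim=2)` should recover weights with the same signs as the hidden `true_w`. In a full pipeline the learned reward would then be handed to a reinforcement learning algorithm as its training signal.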

Iterated Distillation and Amplification (IDA)

IDA is a research agenda Christiano proposed while at OpenAI. The basic idea is to scale capability without losing alignment by alternating two operations. In the amplification step, a slow but trusted process (for example, a human asking many copies of a current model to break a problem into subproblems and answer them) produces a more capable but more expensive system. In the distillation step, a smaller model is trained to imitate that amplified system, recovering speed at the cost of some power. Repeating this loop is supposed to reach high capability while keeping each step's reasoning legible and auditable.

The agenda has structural similarities to expert iteration and to AlphaGo Zero's self-play loop, and it has been a major reference point for later work on scalable oversight, including AI safety via debate.
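The amplify-then-distill loop can be made concrete with a deliberately tiny sketch. The task (summing a list of bits), the fixed split-in-half decomposition, and the use of a memorized lookup table as a stand-in for training a smaller model are all assumptions chosen for illustration; none of them comes from Christiano's writing. The point is only the shape of the loop: each round, the overseer combines answers from the current model to solve problems twice as large, and the result is distilled into a new fast model.

```python
from itertools import product
from typing import Callable, Iterable, List

Solver = Callable[[List[int]], int]


def base_model(xs: List[int]) -> int:
    """Initial weak model: can only sum lists of length 0 or 1."""
    if len(xs) > 1:
        raise ValueError("too hard for this model")
    return xs[0] if xs else 0


def amplify(model: Solver) -> Solver:
    """Slow, trusted process: split the problem, ask the model, combine."""
    def amplified(xs: List[int]) -> int:
        if len(xs) <= 1:
            return model(xs)
        mid = len(xs) // 2
        # Each half is delegated to a copy of the current model.
        return model(xs[:mid]) + model(xs[mid:])
    return amplified


def distill(amplified: Solver, training_inputs: Iterable[List[int]]) -> Solver:
    """Fast imitation of the amplified system (here: a memorized table)."""
    table = {tuple(xs): amplified(xs) for xs in training_inputs}

    def distilled(xs: List[int]) -> int:
        key = tuple(xs)
        if key not in table:
            raise ValueError("outside the distilled model's training data")
        return table[key]
    return distilled


def all_bit_lists(max_len: int) -> List[List[int]]:
    """Every list of 0s and 1s with length up to max_len."""
    return [list(bits)
            for n in range(max_len + 1)
            for bits in product((0, 1), repeat=n)]


def run_ida(rounds: int) -> Solver:
    """Alternate amplification and distillation for a number of rounds."""
    model: Solver = base_model
    for r in range(1, rounds + 1):
        amplified = amplify(model)
        model = distill(amplified, all_bit_lists(2 ** r))
    return model
```

After three rounds, `run_ida(3)` answers questions about lists of up to eight bits, which `base_model` cannot handle, while every answer traces back to a chain of simple, checkable combination steps.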

Eliciting Latent Knowledge (ELK)

The ELK report, co-authored with Ajeya Cotra and Mark Xu in December 2021, frames a problem that Christiano considers central to alignment: how do you train a model to honestly tell you what it knows, even in cases where its observations of the world conflict with what an ordinary human supervisor would believe? The classic example is a network that controls a security camera and has internal representations of what is really happening in a vault, but is rewarded only for producing images that look fine to a human reviewer.

The report does not claim to solve the problem. Instead it presents the question in detail, walks through more than a dozen candidate training strategies, and for each strategy describes a counterexample where the strategy would fail. The structure is explicitly a builder-versus-breaker game, and the document was offered as a public puzzle, with ARC running prizes for new proposals.

Mechanistic anomaly detection

Mechanistic anomaly detection is the more recent direction of ARC's theory work. The premise is that even if you cannot fully interpret a network, you can hope to construct a heuristic argument that explains why it performs well on the training distribution. Once you have such an argument, you can flag any new input where the argument no longer applies, on the grounds that the network's behavior on that input is being driven by some other internal mechanism that you have not vetted. This is intended as a route around the failure modes that motivated ELK.
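A toy sketch can convey the flavor of the idea, with heavy caveats. It is not ARC's formalism: the heuristic argument is replaced by a crude record of which hidden units drove each output on trusted data, and the two-unit "network", its unit names, and the input fields are hypothetical, loosely modeled on the vault camera example from the ELK report. The sketch flags an input whose output looks normal but arises from a mechanism never seen on trusted data.

```python
def forward(x):
    """Toy 'network': returns (output, hidden units that drove the output).

    output is True when the camera image shows a diamond in the vault.
    """
    hidden = {
        # Normal mechanism: the diamond really is there and is seen.
        "diamond_seen": x["diamond_present"] and not x["camera_tampered"],
        # Unvetted mechanism: a tampered camera shows a fake image.
        "fake_image": x["camera_tampered"],
    }
    output = hidden["diamond_seen"] or hidden["fake_image"]
    active = frozenset(name for name, on in hidden.items() if on)
    return output, active


def build_explanation(trusted_inputs):
    """Stand-in for a heuristic argument: the (output, mechanism) pairs
    observed on trusted data."""
    return {forward(x) for x in trusted_inputs}


def is_anomalous(x, explanation):
    """Flag inputs whose output is not accounted for by the explanation."""
    return forward(x) not in explanation


trusted = [
    {"diamond_present": True, "camera_tampered": False},
    {"diamond_present": False, "camera_tampered": False},
]
explanation = build_explanation(trusted)

# Same visible output as a normal "diamond present" case, different cause.
tampered = {"diamond_present": False, "camera_tampered": True}
```

Here `forward(tampered)[0]` equals the output for the first trusted input, yet `is_anomalous(tampered, explanation)` is true because the output comes from the `fake_image` unit, which never fired on trusted data.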

AI risk views

Christiano has been unusually willing to put numbers on his beliefs about catastrophic AI risk. In an April 2023 conversation on the Bankless podcast he said there is roughly a "10 to 20 percent chance of AI takeover, with many or most humans dead," and added that he thought there was "something like a 50/50 chance of doom shortly after you have AI systems that are human level." He has also written about his views on doom on the AI Alignment Forum, in a 2023 post titled "My views on doom," where he laid out a more granular breakdown of where he thinks the probability mass sits.

His outlook differs from Eliezer Yudkowsky's mostly in the expected pace of the transition rather than in the end state. Yudkowsky tends to expect a sharp takeoff and very limited time for course correction; Christiano expects something more gradual, with several years of human-level systems doing economically useful work before the situation becomes critical. He has called this picture "slow takeoff" in his writing, and it has been influential among researchers who try to plan for AI governance and evaluation regimes.

He takes existential risk from AI seriously enough to have organized his career around reducing it, but he has also pushed back against framings that treat catastrophe as inevitable or that ignore the possibility of incremental, technical interventions. His current job at NIST reflects that view: it is a bet that careful empirical evaluation of frontier systems can buy useful time even before the deeper alignment problems are solved.

Selected publications

Year	Title	Co-authors	Venue
2017	Deep Reinforcement Learning from Human Preferences	Jan Leike, Tom Brown, Miljan Martic, Shane Legg, Dario Amodei	NeurIPS 2017
2017	Manipulation-Resistant Online Learning (Ph.D. thesis)	(sole author)	UC Berkeley
2018	AI Safety via Debate	Geoffrey Irving, Dario Amodei	arXiv
2018	Supervising Strong Learners by Amplifying Weak Experts	Buck Shlegeris, Dario Amodei	arXiv
2021	Eliciting Latent Knowledge: How to Tell If Your Eyes Deceive You	Ajeya Cotra, Mark Xu	ARC technical report
2022	Mechanistic Anomaly Detection and ELK	(sole author)	ARC blog / Alignment Forum

Roles and recognition

Year	Position or recognition
2017 to 2021	Researcher and head of language model alignment, OpenAI
2021 to 2024	Founder and head of theory, Alignment Research Center
2023	Initial trustee, Anthropic Long-Term Benefit Trust (stepped down 2024)
2023	Member, UK Frontier AI Taskforce advisory board
2023	Named to TIME 100 Most Influential People in AI
2024 to present	Head of AI Safety, US AI Safety Institute (NIST)

In the September 2023 TIME100 AI list, the magazine credited Christiano as one of the principal architects of RLHF and described him as among the most respected researchers in alignment.

Personal

Christiano is married to Ajeya Cotra, an alignment researcher who was previously a senior program officer at Open Philanthropy and who was a co-author on the ELK report. She has also been on the technical staff at METR.

References

Wikipedia. "Paul Christiano." https://en.wikipedia.org/wiki/Paul_Christiano
NIST. "Paul Christiano." https://www.nist.gov/people/paul-christiano
NIST. "U.S. Commerce Secretary Gina Raimondo Announces Expansion of U.S. AI Safety Institute Leadership Team." April 16, 2024. https://www.nist.gov/news-events/news/2024/04/us-commerce-secretary-gina-raimondo-announces-expansion-us-ai-safety
Christiano, P., Leike, J., Brown, T., Martic, M., Legg, S., Amodei, D. "Deep Reinforcement Learning from Human Preferences." NeurIPS 2017. https://arxiv.org/abs/1706.03741
Christiano, P. "Manipulation-Resistant Online Learning." Ph.D. thesis, UC Berkeley, 2017. https://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-107.html
Christiano, P., Cotra, A., Xu, M. "Eliciting Latent Knowledge: How to Tell If Your Eyes Deceive You." ARC, December 2021. https://www.alignment.org/blog/arcs-first-technical-report-eliciting-latent-knowledge/
Christiano, P. "Announcing the Alignment Research Center." April 2021. https://ai-alignment.com/announcing-the-alignment-research-center-a9b07f77431b
METR. "ARC Evals is now METR." December 4, 2023. https://metr.org/blog/2023-12-04-metr-announcement/
METR. "ARC Evals is spinning out from ARC." September 19, 2023. https://metr.org/blog/2023-09-19-spin-out-announcement/
TIME. "Paul Christiano: The 100 Most Influential People in AI 2023." September 7, 2023. https://time.com/collection/time100-ai/6309030/paul-christiano/
Anthropic. "The Long-Term Benefit Trust." September 2023. https://www.anthropic.com/news/the-long-term-benefit-trust
Bankless Podcast. "Paul Christiano on AI Alignment." April 2023.
Christiano, P. "My views on 'doom'." LessWrong / AI Alignment Forum. https://www.lesswrong.com/posts/xWMqsvHapP3nwdSW8/my-views-on-doom
Christiano, P. "Mechanistic anomaly detection and ELK." Alignment Research Center blog. https://www.alignment.org/blog/mechanistic-anomaly-detection-and-elk/
80,000 Hours. "Paul Christiano on how OpenAI is developing real solutions to the 'AI alignment problem'." https://80000hours.org/podcast/episodes/paul-christiano-ai-alignment-solutions/
