Jan Leike

Updated Jun 28, 2026

Jan Leike is a German machine learning researcher who specializes in artificial intelligence alignment and has led the Alignment Science team at Anthropic since May 2024.^[4]^[22] He co-led OpenAI's Superalignment team with chief scientist Ilya Sutskever from July 2023 until his resignation in May 2024, when he wrote in a widely cited public thread that at OpenAI "safety culture and processes have taken a backseat to shiny products."^[13]^[18]^[19] He completed his PhD in reinforcement learning theory at the Australian National University under Marcus Hutter and then worked as a research scientist at DeepMind. He is best known as a co-author of the foundational 2017 paper "Deep Reinforcement Learning from Human Preferences" (Christiano, Leike, Brown et al.), which introduced the form of preference learning that underpins Reinforcement Learning from Human Feedback (RLHF), and as a senior author on the InstructGPT paper (Ouyang et al., 2022), which translated those techniques into modern instruction-tuned language models.^[6]^[7]

Who is Jan Leike?

Jan Leike (born 1986 or 1987) is one of the most prominent AI alignment researchers of the deep learning era, known for three things: the empirical foundations of RLHF, his leadership of OpenAI's Superalignment program (planned as a four-year effort but dissolved within a year of its announcement), and his high-profile May 2024 resignation over what he described as OpenAI's deprioritization of safety.^[2]^[13]^[19] As of 2026 he leads Anthropic's Alignment Science team, which reports to chief science officer Jared Kaplan and works on scalable oversight, weak-to-strong generalization, automated alignment research, and alignment auditing.^[4]^[22] He summarizes his own research question as: "how can we train AI systems to follow human intent on tasks that are difficult for humans to evaluate directly?"^[8]

Background and education

Leike was born in 1986 or 1987 and grew up in Germany.^[2] He completed undergraduate and master's studies in computer science at the University of Freiburg before moving to Canberra to pursue a PhD at the Australian National University.^[2] His doctoral work was supervised by Marcus Hutter, a theorist known for the AIXI framework of universal intelligence. Leike's dissertation, titled "Nonparametric General Reinforcement Learning" (2016), addressed convergence and bounded rationality questions arising in general reinforcement learning.^[8]^[2] Several papers from his doctoral period appear on his publication list, including "Indefinitely Oscillating Martingales" (Leike and Hutter, 2014) and "Bad Universal Priors and Notions of Optimality" (Leike and Hutter, 2015).^[8]

After defending his thesis in 2016, Leike held a six-month postdoctoral fellowship at the Future of Humanity Institute (FHI) at Oxford, where he began the transition from theoretical work on general reinforcement learning toward empirical AI safety questions.^[2] He has cited that postdoc as a pivot point in his research direction, moving from formal models of agents toward concrete machine learning systems where alignment questions could be tested in practice.^[2]

What did Jan Leike do at DeepMind?

In 2017, Leike joined DeepMind in London, where he worked for roughly four years as a research scientist on the safety team alongside Shane Legg and others.^[9]^[2] His DeepMind period is associated with the early prototyping of reinforcement learning from human feedback in deep reinforcement learning settings.^[9] He was a co-author on the seminal 2017 paper "Deep Reinforcement Learning from Human Preferences," led by Paul Christiano (then at OpenAI) and including Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei as additional authors.^[6] The paper, posted to arXiv on June 12, 2017 (arXiv:1706.03741) and presented at NeurIPS 2017, demonstrated that deep RL agents could learn complex behaviors, including simulated robot locomotion and Atari games, from comparison-based human preferences over short trajectory segments, using feedback on less than one percent of the agent's interactions.^[6]^[10]

In 2018, Leike was the first author of "Scalable agent alignment via reward modeling: a research direction" (Leike, Krueger, Everitt, Martic, Maini, Legg), posted on arXiv on November 19, 2018.^[11] The paper laid out a research agenda for recursive reward modeling: train a reward model from human feedback, use it to train a policy, then recursively bootstrap reward models for tasks that exceed direct human evaluation by decomposing them into subtasks that can themselves be supervised by reward-modeled agents.^[11] This document became one of the most cited statements of the "model-supervises-model" paradigm that later motivated work on debate, iterated amplification, and weak-to-strong generalization.^[11]^[9]

What did Jan Leike do at OpenAI?

Leike joined OpenAI in early 2021 as part of the alignment team.^[2] He became Head of Alignment, leading the team responsible for developing the techniques used to align OpenAI's deployed language models.^[1]^[3] His most visible product-level contribution from this period is as a senior author on "Training language models to follow instructions with human feedback," posted to arXiv on March 4, 2022 (arXiv:2203.02155).^[7] The paper, led by Long Ouyang with Jeff Wu, Xu Jiang, John Schulman, Paul Christiano, Jan Leike, and Ryan Lowe among the authors, introduced InstructGPT and established the now-standard three-stage RLHF recipe: supervised fine-tuning on demonstrations, reward modeling from human ranking data, and reinforcement learning with that reward signal.^[7] Notably, the 1.3 billion parameter InstructGPT model was preferred by human labelers over the 175 billion parameter base GPT-3 despite being roughly 100 times smaller, indicating that alignment training, rather than raw scale, drove the improvement.^[7]

Leike was also a co-author on "Self-critiquing models for assisting human evaluators" (Saunders, Yeh, Wu, Bills, Ouyang, Ward, Leike), submitted to arXiv on June 12, 2022.^[12] That paper trained language models to write natural language critiques of summaries and showed that the critiques helped human evaluators catch flaws they would otherwise miss, an empirical instantiation of one of the recursive reward modeling building blocks.^[12]

What was OpenAI's Superalignment team?

On July 5, 2023, OpenAI announced a new "Superalignment" effort to be co-led by Leike and chief scientist Ilya Sutskever.^[13]^[14] The team's stated goal was "to solve the core technical challenges of controlling superintelligent AI over the next four years," and OpenAI publicly committed to dedicating 20% of the computing resources the company had secured to date to that effort.^[13] The announcement laid out a three-part technical agenda: training AI systems using scalable human feedback, training AI systems to evaluate other AI systems, and ultimately building an automated alignment researcher of roughly human-level capability that would assist with the work.^[13] The four-year horizon and the size of the compute commitment were unusual public benchmarks for an industrial lab, and Leike discussed the program in extended interviews including an August 2023 episode of the 80,000 Hours Podcast.^[15]

The Superalignment program produced a number of papers during its short lifetime. The most prominent is "Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision," posted to arXiv on December 14, 2023 (arXiv:2312.09390), with authors Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu.^[16] The paper used pretrained models in the GPT-4 family to study a deliberately reversed supervisory setup: a weak model (e.g. GPT-2 level) generates labels and a strong model (e.g. GPT-4 level) is fine-tuned on them.^[16] The strong student consistently outperformed its weak teacher across NLP, chess, and reward modeling tasks, and an auxiliary confidence loss could recover something close to GPT-3.5-level performance from a GPT-2-level supervisor on NLP.^[16] The result was framed as an analog of the future situation in which humans must supervise systems they cannot fully evaluate.^[16]

Why did Jan Leike leave OpenAI?

Sutskever announced his departure from OpenAI on May 14, 2024.^[17] Leike followed several days later. Shortly after midnight Pacific time on May 17, 2024, he posted on X: "Yesterday was my last day as head of alignment, superalignment lead, and executive @OpenAI."^[18] In a thread that followed over the next hours, he stated that he had "been disagreeing with OpenAI leadership about the company's core priorities for quite some time, until we finally reached a breaking point," wrote that "over the past years, safety culture and processes have taken a backseat to shiny products," and called on OpenAI to "become a safety-first AGI company."^[18]^[19] He also reported that "over the past few months my team has been struggling for compute and it was getting harder to get crucial research done."^[20]

Within days, OpenAI confirmed that the Superalignment team would be dissolved and its members absorbed into other research groups.^[17]^[20] Subsequent reporting by Fortune, drawing on six sources familiar with the team's work, found that OpenAI never delivered anything close to the publicly committed 20% of secured compute to Superalignment; flex compute requests were repeatedly denied, and ambiguity about whether the commitment meant 20% each year or 20% over four years contributed to chronic shortfalls.^[20] Leike's departure was preceded and followed by other safety-related exits at OpenAI, including those of Daniel Kokotajlo and several other alignment researchers, during a broader period of leadership turbulence at the company.^[19]^[17]

Why did Jan Leike join Anthropic?

On May 28, 2024, Leike announced on X that he was joining Anthropic "to continue the superalignment mission" and that his "new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research."^[21]^[4] He was hired to lead a new effort that reports directly to Anthropic chief science officer Jared Kaplan, with existing Anthropic researchers working on scalable oversight moving under Leike's leadership as the team spun up.^[21]^[4] The team is generally referred to as the Alignment Science team in Anthropic's later communications and on Leike's personal page, where he describes its mission as investigating how to train AI systems to follow human intent on tasks that are difficult for humans to evaluate directly.^[8]^[22]

The hire was widely covered as a notable transfer of safety talent between the two leading frontier labs.^[4]^[21]^[23] CNBC, TechCrunch, Quartz, and Silicon Republic all framed it in the context of Leike's pointed criticism of OpenAI's safety culture earlier that month, and noted that Anthropic, founded by former OpenAI staff including Dario Amodei and Daniela Amodei, had positioned itself as a safety-focused alternative.^[4]^[21]^[23]

According to publication lists and the team's own descriptions, the Anthropic Alignment Science group works on a set of overlapping problems: scalable oversight for tasks beyond human evaluation, weak-to-strong generalization of alignment properties from weaker to stronger models, robustness against jailbreaks and adversarial input, and automated alignment research in which language models themselves help propose and evaluate alignment techniques.^[22]^[8] Leike's publication page also lists the 2024 paper "LLM Critics Help Catch LLM Bugs," which continues the critique-style scalable oversight line; that paper (McAleese et al.) came out of his work at OpenAI rather than Anthropic.^[8]

What is Jan Leike working on now?

As of 2026, Leike continues to lead Anthropic's Alignment Science team, whose stated goal he frames as "optimizing for a post-AGI future where humanity flourishes" and solving what he calls "the hard problem of alignment."^[8] A major 2025 line of work from the team is alignment auditing: building automated agents and validation protocols that probe deployed models for hidden or misaligned objectives. In 2025 the team published "Building and evaluating alignment auditing agents," describing AI systems that autonomously carry out alignment auditing workflows, alongside related work on detecting backdoored "sleeper agent" behavior and on auditing games as a red-team validation protocol.^[29]

In June and early July 2025, Anthropic's Alignment Science team ran a first-of-its-kind joint evaluation exercise with OpenAI: each lab tested the other's publicly released models with a selection of its strongest internal alignment evaluations, covering propensities such as sycophancy, self-preservation, whistleblowing, and willingness to assist with misuse, then exchanged findings before publication.^[30] Anthropic published the results on August 27, 2025, an unusual instance of two competing frontier labs cross-evaluating each other's systems for safety.^[30] The work reflects the institutional turn in Leike's research since leaving OpenAI: alongside technical alignment, he has increasingly emphasized external accountability, third-party assessment, and cross-lab safety evaluation as part of a credible AGI safety stack.^[30]^[22]

Key research contributions

The table below summarizes Leike's most cited papers, with arXiv identifiers and primary venues where available. Author order follows the papers themselves.

Year	Title	Lead authors	Venue / arXiv	Significance
2016	Nonparametric General Reinforcement Learning	Leike	PhD thesis, ANU	Doctoral dissertation on convergence in general RL; supervised by Marcus Hutter^[8]
2017	Deep Reinforcement Learning from Human Preferences	Christiano, Leike, Brown et al.	NeurIPS 2017; arXiv:1706.03741	Foundational deep Reinforcement Learning from Human Feedback (RLHF) paper; preference learning over trajectory segments^[6]
2018	Scalable agent alignment via reward modeling: a research direction	Leike, Krueger, Everitt et al.	arXiv:1811.07871	Defined the recursive reward modeling research program^[11]
2022	Training language models to follow instructions with human feedback (InstructGPT)	Ouyang et al., including Leike	arXiv:2203.02155; NeurIPS 2022	Three-stage RLHF recipe used in ChatGPT and later models^[7]
2022	Self-critiquing models for assisting human evaluators	Saunders et al., including Leike	arXiv:2206.05802	Empirical study of model-generated critiques for scalable oversight^[12]
2023	Weak-to-Strong Generalization	Burns, Izmailov, Kirchner et al., including Leike, Sutskever	arXiv:2312.09390; ICML 2024	Empirical setup for studying how strong models behave under weak supervision^[16]
2024	LLM Critics Help Catch LLM Bugs	McAleese et al., including Leike	2024	Continuation of critique-style oversight; produced at OpenAI^[8]
2025	Building and evaluating alignment auditing agents	Anthropic Alignment Science team	2025	Automated agents that audit models for hidden objectives^[29]

Deep Reinforcement Learning from Human Preferences

The 2017 NeurIPS paper introduced an explicit reward model trained from pairwise comparisons collected from human raters.^[6] Rather than asking annotators to score trajectories on a scalar reward, the system shows two short video segments of agent behavior and asks which is preferred, then fits a reward function that explains the comparisons under a Bradley-Terry-like model and uses that reward to train a deep RL policy.^[6] Experimental results showed that this could solve novel tasks ("do a backflip" with a simulated humanoid) within roughly an hour of human feedback time and could match RL trained against engineered rewards on a substantial fraction of Atari games using preference comparisons on under one percent of agent interactions.^[6]^[10] The pattern of reward modeling plus reinforcement learning, with humans (or human-trained assistants) supplying preferences, is the structural template later expanded in InstructGPT and in Constitutional AI and is the empirical heart of Reinforcement Learning from Human Feedback (RLHF).^[6]^[7]
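
The comparison-fitting step described above can be illustrated with a minimal sketch. It assumes the reward model has already produced a predicted reward for each step of two segments; the function names are hypothetical and are not taken from the paper's code. The probability that one segment is preferred is a logistic function of the difference in summed predicted rewards, and the reward model is trained to minimize the cross-entropy between that probability and the human's choice.

```python
import math
from typing import Sequence


def preference_probability(rewards_a: Sequence[float], rewards_b: Sequence[float]) -> float:
    """Probability that segment A is preferred, under a Bradley-Terry style model."""
    diff = sum(rewards_a) - sum(rewards_b)
    # Logistic function of the return difference, written to avoid overflow.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    exp_diff = math.exp(diff)
    return exp_diff / (1.0 + exp_diff)


def preference_loss(rewards_a: Sequence[float], rewards_b: Sequence[float],
                    human_prefers_a: bool) -> float:
    """Cross-entropy loss for a single human comparison between two segments."""
    if human_prefers_a:
        p = preference_probability(rewards_a, rewards_b)
    else:
        p = preference_probability(rewards_b, rewards_a)
    return -math.log(p)


# Illustrative values only: the model scores segment A higher than segment B.
segment_a = [0.4, 0.9, 0.7]
segment_b = [0.1, 0.2, 0.3]
loss_if_agree = preference_loss(segment_a, segment_b, human_prefers_a=True)
loss_if_disagree = preference_loss(segment_a, segment_b, human_prefers_a=False)
```

In this sketch the loss is small when the human agrees with the reward model's ranking and large when the human disagrees, which is the signal that pushes the learned reward toward the rater's preferences.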

Recursive reward modeling

The 2018 DeepMind technical report on scalable agent alignment via reward modeling sketched a recursive structure for using reward modeling beyond the regime where humans can directly evaluate full trajectories.^[11] If a task is too complex for a human to grade end to end, decompose it into simpler sub-tasks that can be reward-modeled, and use the resulting evaluator agents to provide training signal for harder tasks; the construction continues recursively up the difficulty ladder.^[11] The paper explicitly framed this as a research direction with open problems, including reward hacking, distributional shift, and what it called the "ought-is gap" between learned reward and human values.^[11] Recursive reward modeling, along with debate and iterated amplification, is one of the canonical "model-supervises-model" proposals discussed in the literature on scalable oversight.^[11]^[24]

InstructGPT and the three-stage recipe

In the InstructGPT paper, Leike is the next-to-last author and is identified with OpenAI's alignment team; in the team's public communications, he and Ryan Lowe are commonly described as the alignment leads who carried the project to release.^[7] The three-stage recipe (supervised fine-tuning on demonstrations, reward model trained on rankings, reinforcement learning against the reward model with a KL penalty toward the supervised policy) is the standard pattern that downstream labs would adapt for ChatGPT, Claude, and most other instruction-tuned models that followed.^[7] InstructGPT itself was deployed as the default in OpenAI's API in early 2022, predating the release of ChatGPT later that year.^[7]
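
The third stage of the recipe can be sketched as a reward computation. This is a minimal illustration, assuming per-token log-probabilities for one sampled response are available from both the policy being trained and the frozen supervised model; the function name and the value of `beta` are hypothetical choices for illustration, not settings confirmed by the source.

```python
from typing import Sequence


def kl_penalized_reward(reward_model_score: float,
                        policy_logprobs: Sequence[float],
                        sft_logprobs: Sequence[float],
                        beta: float = 0.1) -> float:
    """Reward model score minus a penalty for drifting away from the supervised policy.

    The summed per-token log-ratio is a single-sample estimate of the KL
    divergence between the policy and the supervised (SFT) model on this response.
    """
    if len(policy_logprobs) != len(sft_logprobs):
        raise ValueError("both models must score the same tokens")
    log_ratio = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    return reward_model_score - beta * log_ratio


# Illustrative values only: the policy is more confident than the SFT model
# on this response, so the penalty lowers the reward used for RL.
reward = kl_penalized_reward(
    reward_model_score=1.0,
    policy_logprobs=[-0.1, -0.2],
    sft_logprobs=[-0.5, -0.7],
    beta=0.1,
)
```

The penalty term is what keeps the optimized policy close to the supervised model, which limits how far it can exploit weaknesses in the learned reward model.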

Weak-to-strong generalization

The 2023 Superalignment paper deliberately reversed the conventional supervisory hierarchy used by RLHF: instead of humans (strong supervisors) training models (weaker students), the authors used weak language models to label data for stronger pretrained models.^[16] The motivation was explicit: in the longer-term superhuman case, the supervisor (a human) will be weaker than the supervised model, so it is useful to study what happens when imperfect, weaker-than-target labels are used.^[16] The strong students systematically outperformed their weak teachers across natural language understanding, chess move selection, and reward modeling tasks, though they generally fell short of what the same strong model achieves with ground-truth supervision; this remaining gap could be narrowed further with auxiliary objectives such as confidence regularization.^[16] The paper became a touchstone in the alignment literature for empirical study of post-human-level supervision.^[16]^[22]
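
One way to quantify how much of the gap is closed is the "performance gap recovered" ratio used in the weak-to-strong paper: the improvement of the weakly supervised strong student over its weak teacher, divided by the improvement a strong model trained on ground-truth labels achieves over that teacher. The sketch below is a minimal illustration; the function name is hypothetical and the accuracy values are made up rather than results from the paper.

```python
def performance_gap_recovered(weak_accuracy: float,
                              weak_to_strong_accuracy: float,
                              strong_ceiling_accuracy: float) -> float:
    """Fraction of the weak-to-ceiling gap recovered by the weakly supervised student.

    0.0 means the student is no better than its weak teacher;
    1.0 means it matches the strong model trained on ground-truth labels.
    """
    gap = strong_ceiling_accuracy - weak_accuracy
    if gap <= 0:
        raise ValueError("the strong ceiling must exceed the weak supervisor")
    return (weak_to_strong_accuracy - weak_accuracy) / gap


# Illustrative values only, not figures reported in the paper.
pgr = performance_gap_recovered(
    weak_accuracy=0.60,
    weak_to_strong_accuracy=0.72,
    strong_ceiling_accuracy=0.80,
)
```

With these made-up numbers the student recovers about 60% of the gap: it beats its teacher by 12 points out of a possible 20.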

What is Jan Leike's research vision?

In interviews, papers, and his personal site, Leike has articulated a research vision built around several overlapping themes.^[15]^[2]^[8]

First, the central problem is supervising AI on tasks that humans cannot directly evaluate. He has summarized the question as: "how can we train AI systems to follow human intent on tasks that are difficult for humans to evaluate directly?"^[8] This frames most of his work on reward modeling, recursive reward modeling, critiques, and scalable oversight as different empirical and conceptual angles on the same question.^[11]^[12]^[16] The framing distinguishes alignment from capability: the worry is not that systems will be too weak, but that systems will eventually be capable enough that humans cannot recognize good from bad behavior without help.^[15]

Second, model-supervises-model paradigms are treated as a primary engineering approach rather than an exotic addition.^[11]^[15] Recursive reward modeling, AI-assisted human evaluation through critiques, weak-to-strong generalization, and the more general program of building "automated alignment researchers" all rely on using language models themselves to extend the reach of human oversight.^[11]^[12]^[16]^[13] The Superalignment announcement made this explicit: the goal was to build a roughly human-level automated alignment researcher and then use compute to scale that researcher up rather than relying solely on direct human labor for alignment progress.^[13]

Third, scalable oversight research is empirical and benefits from experimental setups even before genuinely superhuman models exist.^[15]^[16] The sandwiching experimental paradigm formalized by Bowman and collaborators in "Measuring Progress on Scalable Oversight for Large Language Models" (2022) and the weak-to-strong setup of the 2023 paper are both attempts to design today's experiments that bear on tomorrow's problem.^[25]^[16] In sandwiching, a model is positioned between a non-expert and an expert on a task, and the experiment tests whether scalable oversight techniques can help the non-expert match the expert's performance using the model; the analog target is humans trying to evaluate AI outputs that lie above their unaided ability.^[25]^[24]

Fourth, dangerous-capability evaluation and red teaming sit alongside alignment training as part of the same overall program of preparing for more capable systems.^[15]^[13] In his 80,000 Hours interview and in the OpenAI Superalignment announcement, Leike emphasized that solving alignment is necessary but not sufficient, and that monitoring, preparedness, and adversarial robustness are part of a serious AGI safety stack.^[15]^[13]^[19] At Anthropic, this orientation aligns with the broader institutional emphasis on the Responsible Scaling Policy framework and capability-gated commitments.^[28]

Fifth, Leike has consistently treated alignment as a technical research problem with empirical traction, while acknowledging that institutional and governance choices set the conditions under which research can succeed.^[19]^[20] His May 2024 resignation statement reads in part as a public argument that even technically sound alignment programs can be undermined by misaligned institutional incentives, an unusually pointed criticism for a researcher to make about a former employer.^[19]^[20]

Public statements and influence

Beyond research papers, Leike has been an unusually public alignment researcher. He maintains the website jan.leike.name with a publications list and a Substack blog "Musings on the Alignment Problem" where he posts longer essays.^[8]^[22] He has been interviewed on the 80,000 Hours Podcast (August 2023) and on Daniel Filan's AXRP podcast (July 2023), and has appeared in panel discussions and university talks on alignment.^[15]^[26]

Leike was named to TIME's list of the 100 most influential people in AI in both 2023 and 2024.^[8]^[2] In the aftermath of his May 2024 resignation, his X thread became one of the most widely cited statements of inside-lab discontent over commercial pressure versus safety priorities, and was reported on by Axios, CNN Business, CBS News, Fortune, CNBC, and others.^[19]^[27]^[18]^[20]

Criticisms and limitations

Like the work of other proponents of RLHF-style alignment, Leike's research program has attracted critique on several fronts.^[28] A standing objection to reward modeling more broadly is reward hacking: the trained model can game the learned reward model in ways that diverge from underlying human preferences.^[11] Critics within the alignment community have argued that recursive reward modeling and other model-supervises-model schemes inherit, and may amplify, errors in human judgment that the learned evaluators encode.^[24] The 2023 weak-to-strong generalization results, while encouraging, were also relatively narrow in domain (NLP, chess, reward modeling) and used pre-existing pretrained models rather than genuinely superhuman systems; the authors themselves describe the work as an analog rather than a solution.^[16]

Leike's resignation also surfaced criticism of the Superalignment program itself: even with public commitments of compute and personnel, the team struggled to obtain promised resources, and the public dissolution of the team less than a year after its announcement raised questions about whether four-year safety timelines inside fast-moving commercial labs are credible.^[20]^[17] His own statements on departure (about "shiny products" and "lost trust") are, in part, an internal critique of that institutional model.^[19]

How does Jan Leike compare to other alignment researchers?

Leike sits within a tightly interconnected network of alignment researchers whose careers have crossed between DeepMind, OpenAI, and Anthropic.^[28] The table below sketches a few of these adjacent positions.

Researcher	Past affiliations	Current affiliation	Joint work with Leike
Paul Christiano	OpenAI alignment, Alignment Research Center	US AI Safety Institute (now CAISI)	Lead author of 2017 RLHF paper with Leike^[6]
Ilya Sutskever	OpenAI co-founder and chief scientist	Safe Superintelligence Inc.	Co-led OpenAI Superalignment with Leike^[13]^[17]
Dario Amodei	OpenAI VP of research	Anthropic CEO	Co-author of 2017 RLHF paper^[6]
Dan Hendrycks	UC Berkeley	Center for AI Safety	Independent contemporary in alignment research^[28]
Andrej Karpathy	OpenAI co-founder, Tesla	Independent research and Eureka Labs	OpenAI colleague in 2023 and early 2024, after the InstructGPT period^[7]
John Schulman	OpenAI co-founder	Various (former OpenAI)	InstructGPT co-author^[7]
Sam Altman	OpenAI CEO	OpenAI	Leadership Leike publicly broke with in May 2024^[19]

Related research areas

The intellectual structure of Leike's career maps closely onto a cluster of overlapping topics in alignment research that have their own dedicated entries:^[11]^[16]^[25]

AI Alignment is the broad field within which Leike's career has unfolded.^[28]
AI safety is the umbrella for institutional and technical efforts including dangerous-capability evaluations and preparedness work.^[15]
Superalignment refers to the specific OpenAI initiative Leike co-led with Sutskever from 2023 to 2024.^[13]
Reinforcement Learning from Human Feedback (RLHF) is the deployed technique most directly traceable to the 2017 paper.^[6]
Recursive reward modeling is the research direction Leike proposed in 2018.^[11]
Scalable oversight is the broader empirical research area that includes recursive reward modeling, debate, and weak-to-strong generalization.^[25]
Weak-to-Strong Generalization is the specific Superalignment paper from December 2023.^[16]
AI safety via debate is an adjacent model-supervises-model proposal originally from OpenAI.^[28]
Constitutional AI is Anthropic's RLHF-adjacent technique that extends the same general scheme.^[28]
Responsible Scaling Policy is the institutional commitment framework Anthropic uses to gate capability releases on safety progress.^[28]
Interpretability and Mechanistic interpretability are complementary technical agendas often discussed in conjunction with scalable oversight.^[28]
Direct Preference Optimization (DPO) is a more recent alternative to PPO-style RLHF that retains the preference learning core of the 2017 paper.^[28]

References

[1] CNBC, "OpenAI dissolves Superalignment AI safety team", CNBC, 2024-05-17. https://www.cnbc.com/2024/05/17/openai-superalignment-sutskever-leike.html. Accessed 2026-05-20.
[2] Wikipedia contributors, "Jan Leike", Wikipedia, 2025. https://en.wikipedia.org/wiki/Jan_Leike. Accessed 2026-05-20.
[3] CBS News San Francisco, "OpenAI leader Jan Leike resigns, says safety has 'taken a backseat to shiny products'", CBS News, 2024-05-17. https://www.cbsnews.com/sanfrancisco/news/openai-exec-jan-leike-resigns-says-safety-has-taken-a-backseat/. Accessed 2026-05-20.
[4] CNBC, "OpenAI former safety leader Jan Leike joins rival AI startup Anthropic", CNBC, 2024-05-28. https://www.cnbc.com/2024/05/28/openai-safety-leader-jan-leike-joins-amazon-backed-anthropic.html. Accessed 2026-05-20.
[5] CIO, "Ex-OpenAI researcher Jan Leike joins Anthropic amid AI safety concerns", CIO, 2024-05-29. https://www.cio.com/article/2130038/ex-open-ai-researcher-jan-leike-joins-anthropic-amid-ai-safety-concerns.html. Accessed 2026-05-20.
[6] Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei, "Deep Reinforcement Learning from Human Preferences", arXiv, 2017-06-12. https://arxiv.org/abs/1706.03741. Accessed 2026-05-20.
[7] Long Ouyang et al., "Training language models to follow instructions with human feedback", arXiv, 2022-03-04. https://arxiv.org/abs/2203.02155. Accessed 2026-05-20.
[8] Jan Leike, "Jan Leike personal website", jan.leike.name, 2024. https://jan.leike.name/. Accessed 2026-06-28.
[9] DeepMind Safety Research, "Scalable agent alignment via reward modeling", Medium, 2018-11-19. https://deepmindsafetyresearch.medium.com/scalable-agent-alignment-via-reward-modeling-bf4ab06dfd84. Accessed 2026-05-20.
[10] NeurIPS, "Deep Reinforcement Learning from Human Preferences", NeurIPS Proceedings, 2017. https://papers.nips.cc/paper/2017/hash/d5e2c0adad503c91f91df240d0cd4e49-Abstract.html. Accessed 2026-05-20.
[11] Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg, "Scalable agent alignment via reward modeling: a research direction", arXiv, 2018-11-19. https://arxiv.org/abs/1811.07871. Accessed 2026-05-20.
[12] William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, Jan Leike, "Self-critiquing models for assisting human evaluators", arXiv, 2022-06-12. https://arxiv.org/abs/2206.05802. Accessed 2026-05-20.
[13] TechCrunch, "OpenAI is forming a new team to bring 'superintelligent' AI under control", TechCrunch, 2023-07-05. https://techcrunch.com/2023/07/05/openai-is-forming-a-new-team-to-bring-superintelligent-ai-under-control/. Accessed 2026-05-20.
[14] HPCwire AIwire, "OpenAI Launches Alignment Initiative Aimed at Mitigating 'Superintelligent' AI", AIwire, 2023-07-06. https://www.hpcwire.com/aiwire/2023/07/06/openai-launches-alignment-initiative-aimed-at-mitigating-superintelligent-ai/. Accessed 2026-05-20.
[15] 80,000 Hours, "Jan Leike on OpenAI's massive push to make superintelligence safe in 4 years or less", 80,000 Hours Podcast, 2023-08-07. https://80000hours.org/podcast/episodes/jan-leike-superalignment/. Accessed 2026-05-20.
[16] Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, Jeff Wu, "Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision", arXiv, 2023-12-14. https://arxiv.org/abs/2312.09390. Accessed 2026-05-20.
[17] Pure AI, "OpenAI Team that Polices AI Superintelligence Disbanded After Departures", Pure AI, 2024-05-20. https://pureai.com/articles/2024/05/20/openai-superintelligence-safety-disbanded.aspx. Accessed 2026-05-20.
[18] Jan Leike, "Yesterday was my last day as head of alignment, superalignment lead, and executive @OpenAI", X (Twitter), 2024-05-17. https://x.com/janleike/status/1791498174659715494. Accessed 2026-05-20.
[19] Fortune, "Top OpenAI researcher resigns, saying company prioritized 'shiny products' over AI safety", Fortune, 2024-05-17. https://fortune.com/2024/05/17/openai-researcher-resigns-safety/. Accessed 2026-05-20.
[20] Fortune, "OpenAI promised 20% of its computing power to combat the most dangerous kind of AI, but never delivered, sources say", Fortune, 2024-05-21. https://fortune.com/2024/05/21/openai-superalignment-20-compute-commitment-never-fulfilled-sutskever-leike-altman-brockman-murati/. Accessed 2026-05-20.
[21] TechCrunch, "Anthropic hires former OpenAI safety lead to head up new team", TechCrunch, 2024-05-28. https://techcrunch.com/2024/05/28/anthropic-hires-former-openai-safety-lead-to-head-up-new-team/. Accessed 2026-05-20.
[22] Crypto Briefing, "Jan Leike leads Anthropic's alignment science team, doubling down on AI safety research", Crypto Briefing, 2024-05-29. https://cryptobriefing.com/jan-leike-anthropic-alignment-science/. Accessed 2026-05-20.
[23] Quartz, "A former OpenAI safety leader went to rival AI company Anthropic", Quartz, 2024-05-28. https://qz.com/jan-leike-openai-superalignment-rival-anthropic-ai-safe-1851504247. Accessed 2026-05-20.
[24] Alignment Forum, "Scalable Oversight and Weak-to-Strong Generalization", Alignment Forum, 2023. https://www.alignmentforum.org/posts/hw2tGSsvLLyjFoLFS/scalable-oversight-and-weak-to-strong-generalization. Accessed 2026-05-20.
[25] Samuel R. Bowman et al., "Measuring Progress on Scalable Oversight for Large Language Models", arXiv, 2022-11-04. https://arxiv.org/abs/2211.03540. Accessed 2026-05-20.
[26] AXRP, "24 - Superalignment with Jan Leike", AXRP Podcast, 2023-07-27. https://axrp.net/episode/2023/07/27/episode-24-superalignment-jan-leike.html. Accessed 2026-05-20.
[27] Axios, "OpenAI's recent departures force leaders to reaffirm safety commitment", Axios, 2024-05-20. https://www.axios.com/2024/05/20/openai-safety-jan-leike-sam-altman. Accessed 2026-05-20.
[28] WinBuzzer, "Former OpenAI Safety Lead Jan Leike Joins Anthropic in Similar Role", WinBuzzer, 2024-05-28. https://winbuzzer.com/2024/05/28/jan-leike-joins-anthropic-to-lead-new-ai-safety-team-xcxwbn/. Accessed 2026-05-20.
[29] Anthropic Alignment Science, "Building and evaluating alignment auditing agents", Anthropic Alignment Science Blog, 2025. https://alignment.anthropic.com/2025/automated-auditing/. Accessed 2026-06-28.
[30] Anthropic, "Findings from a Pilot Anthropic - OpenAI Alignment Evaluation Exercise", Anthropic Alignment Science Blog, 2025-08-27. https://alignment.anthropic.com/2025/openai-findings/. Accessed 2026-06-28.
