Jan Leike
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,186 words
Jan Leike is a German machine learning researcher who specializes in artificial intelligence alignment. He completed his PhD in reinforcement learning theory at the Australian National University under Marcus Hutter, then worked as a research scientist at DeepMind and as Head of Alignment at OpenAI, where he co-led the Superalignment team with Ilya Sutskever from July 2023 until his resignation in May 2024.[1][2][3] After publicly criticizing OpenAI's safety culture the day after his last day there, Leike joined Anthropic later that month to lead a new alignment research effort focused on scalable oversight, weak-to-strong generalization, and automated alignment research.[4][5] He is best known as a co-author of the foundational paper "Deep Reinforcement Learning from Human Preferences" (Christiano, Leike, Brown et al., 2017), which introduced the form of preference learning that later underpinned Reinforcement Learning from Human Feedback (RLHF), and as a senior author on the InstructGPT paper (Ouyang et al., 2022), which translated those techniques into modern instruction-tuned language models.[6][7]
Leike was born in 1986 or 1987 and grew up in Germany.[2] He completed undergraduate and master's studies in computer science at the University of Freiburg before moving to Canberra to pursue a PhD at the Australian National University.[2] His doctoral work was supervised by Marcus Hutter, a theorist known for the AIXI framework of universal intelligence. Leike's dissertation, titled "Nonparametric General Reinforcement Learning" (2016), addressed convergence and bounded rationality questions arising in general reinforcement learning.[8][2] Several papers from his doctoral period appear on his publication list, including "Indefinitely Oscillating Martingales" (Leike and Hutter, 2014) and "Bad Universal Priors and Notions of Optimality" (Leike and Hutter, 2015).[8]
After defending his thesis in 2016, Leike held a six-month postdoctoral fellowship at the Future of Humanity Institute (FHI) at Oxford, where he began the transition from theoretical work on general reinforcement learning toward empirical AI safety questions.[2] He has cited that postdoc as a pivot point in his research direction, moving from formal models of agents toward concrete machine learning systems where alignment questions could be tested in practice.[2]
In 2017, Leike joined the safety team at DeepMind in London, where he worked for roughly four years as a research scientist alongside Shane Legg and others.[9][2] His DeepMind period is associated with the early prototyping of reinforcement learning from human feedback in deep reinforcement learning settings.[9] He was a co-author on the seminal 2017 paper "Deep Reinforcement Learning from Human Preferences," led by Paul Christiano (then at OpenAI) and including Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei as additional authors.[6] The paper, posted to arXiv on June 12, 2017 (arXiv:1706.03741) and presented at NeurIPS 2017, demonstrated that deep RL agents could learn complex behaviors, including simulated robot locomotion and Atari games, from comparison-based human preferences over short trajectory segments, using feedback on less than one percent of the agent's interactions.[6][10]
In 2018, Leike was the first author of "Scalable agent alignment via reward modeling: a research direction" (Leike, Krueger, Everitt, Martic, Maini, Legg), posted on arXiv on November 19, 2018.[11] The paper laid out a research agenda for recursive reward modeling: train a reward model from human feedback, use it to train a policy, then recursively bootstrap reward models for tasks that exceed direct human evaluation by decomposing them into subtasks that can themselves be supervised by reward-modeled agents.[11] This document became one of the most cited statements of the "model-supervises-model" paradigm that later motivated work on debate, iterated amplification, and weak-to-strong generalization.[11][9]
Leike joined OpenAI in early 2021 as part of the alignment team.[2] He became Head of Alignment, leading the team responsible for developing the techniques used to align OpenAI's deployed language models.[1][3] His most visible product-level contribution from this period is as a senior author on "Training language models to follow instructions with human feedback," posted to arXiv on March 4, 2022 (arXiv:2203.02155).[7] The paper, led by Long Ouyang with Jeff Wu, Xu Jiang, John Schulman, Paul Christiano, Jan Leike, and Ryan Lowe among the authors, introduced InstructGPT and established the now-standard three-stage RLHF recipe: supervised fine-tuning on demonstrations, reward modeling from human ranking data, and reinforcement learning with that reward signal.[7] Notably, the 1.3 billion parameter InstructGPT model was preferred by human labelers over the 175 billion parameter base GPT-3 despite being roughly 100 times smaller, indicating that alignment training, not raw scale alone, was driving the improvement.[7]
Leike was also a co-author on "Self-critiquing models for assisting human evaluators" (Saunders, Yeh, Wu, Bills, Ouyang, Ward, Leike), submitted to arXiv on June 12, 2022.[12] That paper trained language models to write natural language critiques of summaries and showed that the critiques helped human evaluators catch flaws they would otherwise miss, an empirical instantiation of one of the recursive reward modeling building blocks.[12]
On July 5, 2023, OpenAI announced a new "Superalignment" effort to be co-led by Leike and chief scientist Ilya Sutskever.[13][14] The team's stated goal was "to solve the core technical challenges of controlling superintelligent AI over the next four years," and OpenAI publicly committed to dedicating 20% of the computing resources the company had secured to date to that effort.[13] The announcement laid out a three-part technical agenda: training AI systems using scalable human feedback, training AI systems to evaluate other AI systems, and ultimately building an automated alignment researcher of roughly human-level capability that would assist with the work.[13] The four-year horizon and the size of the compute commitment were unusual public benchmarks for an industrial lab, and Leike discussed the program in extended interviews including an August 2023 episode of the 80,000 Hours Podcast.[15]
The Superalignment program produced a number of papers during its short lifetime. The most prominent is "Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision," posted to arXiv on December 14, 2023 (arXiv:2312.09390), with authors Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu.[16] The paper used pretrained models in the GPT-4 family to study a deliberately reversed supervisory setup: a weak model (e.g. GPT-2 level) generates labels and a strong model (e.g. GPT-4 level) is fine-tuned on them.[16] The strong student consistently outperformed its weak teacher across NLP, chess, and reward modeling tasks, and an auxiliary confidence loss could recover something close to GPT-3.5-level performance from a GPT-2-level supervisor on NLP.[16] The result was framed as an analog of the future situation in which humans must supervise systems they cannot fully evaluate.[16]
Sutskever announced his departure from OpenAI on May 14, 2024.[17] Leike followed several days later. Shortly after midnight Pacific time on May 17, 2024, he posted on X: "Yesterday was my last day as head of alignment, superalignment lead, and executive @OpenAI."[18] In a thread that followed over the next hours, he stated that he had "been disagreeing with OpenAI leadership about the company's core priorities for quite some time, until we finally reached a breaking point," wrote that "over the past years, safety culture and processes have taken a backseat to shiny products," and called on OpenAI to "become a safety-first AGI company."[18][19] He also reported that "over the past few months my team has been struggling for compute and it was getting harder to get crucial research done."[20]
Within days, OpenAI confirmed that the Superalignment team would be dissolved and its members absorbed into other research groups.[17][20] Subsequent reporting by Fortune, drawing on six sources familiar with the team's work, found that OpenAI never delivered anything close to the publicly committed 20% of secured compute to Superalignment; flex compute requests were repeatedly denied, and ambiguity about whether the commitment meant 20% each year or 20% over four years contributed to chronic shortfalls.[20] Leike's departure was preceded and followed by other safety-related exits at OpenAI, including those of Daniel Kokotajlo and several other alignment researchers, during a broader period of leadership turbulence at the company.[19][17]
On May 28, 2024, Leike announced on X that he was joining Anthropic "to continue the superalignment mission" and that his "new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research."[21][4] He was hired to lead a new effort that reports directly to Anthropic chief science officer Jared Kaplan, with existing Anthropic researchers working on scalable oversight moving under Leike's leadership as the team spun up.[21][4] The team is generally referred to as the Alignment Science team in Anthropic's later communications and on Leike's personal page, where he describes its mission as investigating how to train AI systems to follow human intent on tasks that are difficult for humans to evaluate directly.[8][22]
The hire was widely covered as a notable transfer of safety talent between the two leading frontier labs.[4][21][23] CNBC, TechCrunch, Quartz, and Silicon Republic all framed it in the context of Leike's pointed criticism of OpenAI's safety culture earlier that month, and noted that Anthropic, founded by former OpenAI staff including Dario Amodei and Daniela Amodei, had positioned itself as a safety-focused alternative.[4][21][23]
According to publication lists and the team's own descriptions, the Anthropic Alignment Science group works on a set of overlapping problems: scalable oversight for tasks beyond human evaluation, weak-to-strong generalization of alignment properties from weaker to stronger models, robustness against jailbreaks and adversarial input, and automated alignment research in which language models themselves help propose and evaluate alignment techniques.[22][8] Leike's publication page also lists the 2024 paper "LLM Critics Help Catch LLM Bugs," which continues the critique-style scalable oversight line; that paper (McAleese et al.) reports work carried out at OpenAI rather than at Anthropic.[8]
The table below summarizes Leike's most cited papers, with arXiv identifiers and primary venues. Authorship order follows the papers themselves, with Leike's name highlighted in context.
| Year | Title | Lead authors | Venue / arXiv | Significance |
|---|---|---|---|---|
| 2016 | Nonparametric General Reinforcement Learning | Leike | PhD thesis, ANU | Doctoral dissertation on convergence in general RL; supervised by Marcus Hutter[8] |
| 2017 | Deep Reinforcement Learning from Human Preferences | Christiano, Leike, Brown et al. | NeurIPS 2017; arXiv:1706.03741 | Foundational deep Reinforcement Learning from Human Feedback (RLHF) paper; preference learning over trajectory segments[6] |
| 2018 | Scalable agent alignment via reward modeling: a research direction | Leike, Krueger, Everitt et al. | arXiv:1811.07871 | Defined the recursive reward modeling research program[11] |
| 2022 | Training language models to follow instructions with human feedback (InstructGPT) | Ouyang et al., including Leike | arXiv:2203.02155; NeurIPS 2022 | Three-stage RLHF recipe used in ChatGPT and later models[7] |
| 2022 | Self-critiquing models for assisting human evaluators | Saunders et al., including Leike | arXiv:2206.05802 | Empirical study of model-generated critiques for scalable oversight[12] |
| 2023 | Weak-to-Strong Generalization | Burns, Izmailov, Kirchner et al., including Leike, Sutskever | arXiv:2312.09390; ICML 2024 | Empirical setup for studying how strong models behave under weak supervision[16] |
| 2024 | LLM Critics Help Catch LLM Bugs | McAleese et al., including Leike | 2024 | Continuation of critique-style oversight; work carried out at OpenAI[8] |
The 2017 NeurIPS paper introduced an explicit reward model trained from pairwise comparisons collected from human raters.[6] Rather than asking annotators to score trajectories on a scalar reward, the system shows two short video segments of agent behavior and asks which is preferred, then fits a reward function that explains the comparisons under a Bradley-Terry-like model and uses that reward to train a deep RL policy.[6] Experimental results showed that this could solve novel tasks ("do a backflip" with a simulated humanoid) within roughly an hour of human feedback time and could match RL trained against engineered rewards on a substantial fraction of Atari games using preference comparisons on under one percent of agent interactions.[6][10] The pattern of reward modeling plus reinforcement learning, with humans (or human-trained assistants) supplying preferences, is the structural template later expanded in InstructGPT and in Constitutional AI and is the empirical heart of Reinforcement Learning from Human Feedback (RLHF).[6][7]
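The reward-fitting step described above can be illustrated with a minimal sketch. It assumes a linear reward over hypothetical per-segment feature vectors and a simulated rater, neither of which comes from the paper, which used neural network reward models and real human comparisons. The function name `fit_preference_reward` and all data here are invented for illustration.

```python
import numpy as np

def fit_preference_reward(seg_a, seg_b, prefers_a, lr=0.5, steps=500):
    """Fit a linear reward r(x) = w @ x from pairwise segment comparisons.

    seg_a, seg_b: arrays of shape (n_pairs, n_features); each row is a
        hypothetical summed feature vector for one trajectory segment.
    prefers_a: array of shape (n_pairs,), 1.0 if the rater chose A, else 0.0.
    """
    diff = seg_a - seg_b
    w = np.zeros(diff.shape[1])
    for _ in range(steps):
        # Bradley-Terry: P(A preferred) = sigmoid(R(A) - R(B)).
        p_a = 1.0 / (1.0 + np.exp(-(diff @ w)))
        # Gradient of the mean cross-entropy loss with respect to w.
        grad = diff.T @ (p_a - prefers_a) / len(prefers_a)
        w -= lr * grad
    return w

# Synthetic data (an assumption for this sketch, not from the paper).
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])
seg_a = rng.normal(size=(400, 3))
seg_b = rng.normal(size=(400, 3))
# Simulated rater: samples each choice from the Bradley-Terry model.
p_true = 1.0 / (1.0 + np.exp(-((seg_a - seg_b) @ true_w)))
prefers_a = (rng.random(400) < p_true).astype(float)

w_hat = fit_preference_reward(seg_a, seg_b, prefers_a)
# Fraction of comparisons where the fitted reward agrees with the rater.
agreement = np.mean(((seg_a - seg_b) @ w_hat > 0) == (prefers_a == 1.0))
# Directional similarity between fitted and true reward weights.
cosine = w_hat @ true_w / (np.linalg.norm(w_hat) * np.linalg.norm(true_w))
```

In a full pipeline, the fitted reward would then serve as the training signal for a deep RL policy; that stage is omitted here.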
The 2018 DeepMind technical report on scalable agent alignment via reward modeling sketched a recursive structure for using reward modeling beyond the regime where humans can directly evaluate full trajectories.[11] If a task is too complex for a human to grade end to end, the proposal is to decompose it into simpler sub-tasks that can be reward-modeled and to use the resulting evaluator agents to provide training signal for harder tasks; the construction continues recursively up the difficulty ladder.[11] The paper explicitly framed this as a research direction with open problems, including reward hacking, distributional shift, and what it called the "ought-is gap" between learned reward and human values.[11] Recursive reward modeling, along with debate and iterated amplification, is one of the canonical "model-supervises-model" proposals discussed in the literature on scalable oversight.[11][24]
In the InstructGPT paper, Leike is the next-to-last author and is identified with OpenAI's alignment team; in the team's public communications, he and Ryan Lowe are commonly described as the alignment leads who carried the project to release.[7] The three-stage recipe (supervised fine-tuning on demonstrations, reward model trained on rankings, reinforcement learning against the reward model with a KL penalty toward the supervised policy) is the standard pattern that downstream labs would adapt for ChatGPT, Claude, and most other instruction-tuned models that followed.[7] InstructGPT itself was deployed as the default in OpenAI's API in early 2022, predating the release of ChatGPT later that year.[7]
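The KL penalty in the third stage can be sketched as a shaped reward: the reward model's score minus a coefficient times the log-ratio between the RL policy and the supervised policy on the sampled response. The snippet below is a minimal illustration under that reading, not the paper's training code. The function name `kl_shaped_reward`, the default `beta`, and the example numbers are assumptions, and the pretraining-mix term used in some InstructGPT variants is omitted.

```python
import numpy as np

def kl_shaped_reward(rm_score, logp_policy, logp_sft, beta=0.02):
    """Return the reward model score minus a KL penalty toward the SFT policy.

    rm_score: scalar score from the reward model for one sampled response.
    logp_policy, logp_sft: per-token log-probabilities of that response under
        the RL policy and the supervised (SFT) policy, respectively.
    beta: penalty coefficient (illustrative default, an assumption here).
    """
    logp_policy = np.asarray(logp_policy, dtype=float)
    logp_sft = np.asarray(logp_sft, dtype=float)
    # The summed per-token log-ratio is a single-sample estimate of
    # KL(policy || SFT) for this response.
    log_ratio = float(np.sum(logp_policy - logp_sft))
    return rm_score - beta * log_ratio

# Invented numbers: the policy assigns higher probability to its own sample
# than the SFT model does, so the penalty lowers the reward.
shaped = kl_shaped_reward(1.5, [-0.1, -0.2, -0.3], [-0.5, -0.4, -0.6], beta=0.1)
unchanged = kl_shaped_reward(1.5, [-0.5, -0.4], [-0.5, -0.4], beta=0.1)
```

The penalty discourages the policy from drifting far from the supervised model in pursuit of reward model score, which is one guard against over-optimizing an imperfect reward model.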
The 2023 Superalignment paper deliberately reversed the conventional supervisory hierarchy used by RLHF: instead of humans (strong supervisors) training models (weaker students), the authors used weak language models to label data for stronger pretrained models.[16] The motivation was explicit: in the longer-term superhuman case, the supervisor (a human) will be weaker than the supervised model, so it is useful to study what happens when imperfect, weaker-than-target labels are used.[16] The strong students systematically outperformed their weak teachers across natural language understanding, chess move selection, and reward modeling tasks; the remaining gap between a weakly supervised student and the same model's ceiling performance could be narrowed further with auxiliary objectives such as an auxiliary confidence loss.[16] The paper became a touchstone in the alignment literature for empirical study of post-human-level supervision.[16][22]
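The weak-to-strong paper summarizes results of this kind with a metric it calls performance gap recovered (PGR): the fraction of the gap between the weak supervisor and the strong model's ceiling that the weakly supervised student closes. A minimal sketch follows; the function name and the accuracy values are invented for illustration and are not results from the paper.

```python
def performance_gap_recovered(weak, weak_to_strong, strong_ceiling):
    """Fraction of the weak-to-ceiling gap closed by the weakly supervised student.

    weak: performance of the weak supervisor.
    weak_to_strong: performance of the strong model trained on weak labels.
    strong_ceiling: performance of the strong model trained on ground truth.
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak performance")
    return (weak_to_strong - weak) / gap

# Hypothetical accuracies: the student closes about 60% of the gap.
pgr = performance_gap_recovered(weak=0.60, weak_to_strong=0.72, strong_ceiling=0.80)
```

A PGR of 0 means the student does no better than its weak supervisor, and a PGR of 1 means weak supervision recovers the strong model's full ceiling performance.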
In interviews, papers, and his personal site, Leike has articulated a research vision built around several overlapping themes.[15][2][8]
First, the central problem is supervising AI on tasks that humans cannot directly evaluate. He has summarized the question as: "how can we train AI systems to follow human intent on tasks that are difficult for humans to evaluate directly?"[8] This frames most of his work on reward modeling, recursive reward modeling, critiques, and scalable oversight as different empirical and conceptual angles on the same question.[11][12][16] The framing distinguishes alignment from capability: the worry is not that systems will be too weak, but that systems will eventually be capable enough that humans cannot recognize good from bad behavior without help.[15]
Second, model-supervises-model paradigms are treated as a primary engineering approach rather than an exotic addition.[11][15] Recursive reward modeling, AI-assisted human evaluation through critiques, weak-to-strong generalization, and the more general program of building "automated alignment researchers" all rely on using language models themselves to extend the reach of human oversight.[11][12][16][13] The Superalignment announcement made this explicit: the goal was to build a roughly human-level automated alignment researcher and then use compute to scale that researcher up rather than relying solely on direct human labor for alignment progress.[13]
Third, scalable oversight research is empirical and benefits from experimental setups even before genuinely superhuman models exist.[15][16] The sandwiching experimental paradigm formalized by Bowman and collaborators in "Measuring Progress on Scalable Oversight for Large Language Models" (2022) and the weak-to-strong setup of the 2023 paper are both attempts to design experiments today that bear on tomorrow's problem.[25][16] In sandwiching, a model's capability on a task sits between that of a non-expert and that of an expert, and the experiment tests whether scalable oversight techniques can help the non-expert match the expert's performance using the model; the analog target is humans trying to evaluate AI outputs that lie above their unaided ability.[25][24]
Fourth, dangerous-capability evaluation and red teaming sit alongside alignment training as part of the same overall program of preparing for more capable systems.[15][13] In his 80,000 Hours interview and in the OpenAI Superalignment announcement, Leike emphasized that solving alignment is necessary but not sufficient, and that monitoring, preparedness, and adversarial robustness are part of a serious AGI safety stack.[15][13][19] At Anthropic, this orientation aligns with the broader institutional emphasis on the Responsible Scaling Policy framework and capability-gated commitments.[28]
Fifth, Leike has consistently treated alignment as a technical research problem with empirical traction, while acknowledging that institutional and governance choices set the conditions under which research can succeed.[19][20] His May 2024 resignation statement reads in part as a public argument that even technically sound alignment programs can be undermined by misaligned institutional incentives, an unusually pointed criticism for a researcher to make about a former employer.[19][20]
Beyond research papers, Leike has been an unusually public alignment researcher. He maintains the website jan.leike.name with a publications list and a Substack blog "Musings on the Alignment Problem" where he posts longer essays.[8][22] He has been interviewed on the 80,000 Hours Podcast (August 2023) and on Daniel Filan's AXRP podcast (July 2023), and has appeared in panel discussions and university talks on alignment.[15][26]
Leike was named to TIME's list of the 100 most influential people in AI in both 2023 and 2024.[8][2] In the aftermath of his May 2024 resignation, his X thread became one of the most widely cited statements of inside-lab discontent over commercial pressure versus safety priorities, and was reported on by Axios, CNN Business, CBS News, Fortune, CNBC, and others.[19][27][18][20]
Leike's research program, like the work of other proponents of RLHF-style alignment, has attracted critique on several fronts.[28] A standing objection to reward modeling more broadly is reward hacking: the trained model can game the learned reward model in ways that diverge from underlying human preferences.[11] Critics within the alignment community have argued that recursive reward modeling and other model-supervises-model schemes inherit, and may amplify, errors in human judgment that the learned evaluators encode.[24] The 2023 weak-to-strong generalization results, while encouraging, were also relatively narrow in domain (NLP, chess, reward modeling) and used pre-existing pretrained models rather than genuinely superhuman systems; the authors themselves describe the work as an analog rather than a solution.[16]
Leike's resignation also surfaced criticism of the Superalignment program itself: even with public commitments of compute and personnel, the team struggled to obtain promised resources, and the public dissolution of the team less than a year after its announcement raised questions about whether four-year safety timelines inside fast-moving commercial labs are credible.[20][17] His own statements on departure (about "shiny products" and "lost trust") are, in part, an internal critique of that institutional model.[19]
Leike sits within a tightly interconnected network of alignment researchers whose careers have crossed between DeepMind, OpenAI, and Anthropic.[28] The table below sketches a few of these adjacent positions.
| Researcher | Past affiliations | Current affiliation | Joint work with Leike |
|---|---|---|---|
| Paul Christiano | OpenAI alignment, Alignment Research Center | US AI Safety Institute (now CAISI) | Lead author of 2017 RLHF paper with Leike[6] |
| Ilya Sutskever | OpenAI co-founder and chief scientist | Safe Superintelligence Inc. | Co-led OpenAI Superalignment with Leike[13][17] |
| Dario Amodei | OpenAI VP of research | Anthropic CEO | Co-author of 2017 RLHF paper[6] |
| Dan Hendrycks | UC Berkeley | Center for AI Safety | Independent contemporary in alignment research[28] |
| Andrej Karpathy | OpenAI co-founder, Tesla | Independent research and Eureka Labs | OpenAI colleague during InstructGPT period[7] |
| John Schulman | OpenAI co-founder | Various (former OpenAI) | InstructGPT co-author[7] |
| Sam Altman | OpenAI CEO | OpenAI | Leadership Leike publicly broke with in May 2024[19] |
The intellectual structure of Leike's career maps closely onto a cluster of overlapping topics in alignment research that have their own dedicated entries:[11][16][25]