John Schulman
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 3,543 words
John Schulman is an American artificial intelligence researcher and one of the eleven original co-founders of OpenAI, the San Francisco research laboratory that built ChatGPT and the GPT family of large language models.[1][2] He is best known for his work in deep reinforcement learning, where he is the lead author of three of the field's most widely cited algorithmic papers: Trust Region Policy Optimization (TRPO, 2015), Generalized Advantage Estimation (GAE, 2015) and Proximal Policy Optimization (PPO, 2017).[3][4][5] PPO in particular became the dominant policy-gradient method of the late 2010s and the algorithmic engine behind both OpenAI Five and the reinforcement learning from human feedback (RLHF) pipeline used to fine-tune ChatGPT.[6][7]
At OpenAI between 2015 and 2024, Schulman led the team that applied reinforcement learning to large language models and co-led the post-training organization that shipped ChatGPT in November 2022, work for which he has been described in the press as a "ChatGPT architect."[8][9] After almost nine years at the company he announced on 5 August 2024 that he would be leaving OpenAI to join rival Anthropic in order to "deepen my focus on AI alignment" and "return to hands-on technical work."[10][11]
His tenure at Anthropic lasted only about five months. On 6 February 2025 he confirmed that he had left Anthropic the previous week, and several outlets reported that he was joining Thinking Machines Lab, the new artificial intelligence startup founded by former OpenAI chief technology officer Mira Murati.[12][13][14] Thinking Machines Lab publicly emerged from stealth on 18 February 2025 with Schulman listed as its chief scientist.[15][16]
| Born | 1987 or 1988[17] |
| Nationality | American |
| Education | BS Physics, California Institute of Technology (2010); PhD Electrical Engineering and Computer Sciences, University of California, Berkeley (2016)[17][18] |
| Doctoral advisor | Pieter Abbeel[17][18] |
| Known for | TRPO, GAE, PPO; reinforcement learning from human feedback; leading post-training of ChatGPT |
| Affiliations | OpenAI (2015–2024); Anthropic (2024–2025); Thinking Machines Lab (2025–present) |
Schulman was born in 1987 or 1988 in the United States and attended Great Neck South High School in Great Neck, New York; his family had moved to the town during the summer before he entered ninth grade.[17][19] According to his 2005 U.S. Physics Team biography, his interest in science began in childhood through documentaries and science fiction, and serious physics study started in eighth grade after he was inspired by the television programme BattleBots.[19] He was selected as one of the twenty-four members of the 2005 U.S. Physics Olympiad team while a junior at Great Neck South.[19]
He enrolled at the California Institute of Technology, graduating in 2010 with a Bachelor of Science in physics.[17][20] He then began graduate study at the University of California, Berkeley, where he initially intended to work in computational neuroscience. In a 2023 interview with Berkeley News he recalled that during his first-year lab rotations he joined the group of robotics professor Pieter Abbeel and was "really excited about that work," in particular Abbeel's projects on autonomous helicopter aerobatics and on robotic laundry folding, so he asked to switch from the neuroscience program into Berkeley's Department of Electrical Engineering and Computer Sciences.[20][21] He has cited that rotation as the moment he committed to artificial intelligence rather than neuroscience as his long-term research direction.[20]
Abbeel became his doctoral advisor, and Schulman completed his PhD in 2016 with a thesis titled "Optimizing Expectations: From Deep Reinforcement Learning to Stochastic Computation Graphs," filed as Berkeley EECS Technical Report EECS-2016-217 on 16 December 2016.[18] The thesis frames reinforcement learning as the optimization of an expected return with respect to the parameters of a policy. Chapter 3 develops the trust region policy optimization algorithm and proves a monotonic improvement guarantee for a related theoretical surrogate; Chapter 4 introduces generalized advantage estimation as a way to reduce the variance of policy-gradient estimates using a learned state-value function; and Chapter 5 presents a unifying calculus for gradient estimators of objectives mixing sampled random variables and differentiable operations, connecting reinforcement learning to variational inference and to memory-and-attention models.[18]
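The simplest gradient estimator of the kind the thesis generalises is the score-function (likelihood-ratio) identity. It is shown here as an illustration in generic notation, which is an assumption of this sketch rather than the thesis's own: x is a random variable drawn from a density p(x; θ), and f is a function that does not itself depend on θ.

```latex
\nabla_\theta \, \mathbb{E}_{x \sim p(\cdot\,;\theta)}\big[ f(x) \big]
  = \mathbb{E}_{x \sim p(\cdot\,;\theta)}\big[ f(x)\, \nabla_\theta \log p(x;\theta) \big]
```

Taking x to be a sampled trajectory and f its return gives the basic policy-gradient estimator, which is the sense in which reinforcement learning becomes the optimization of an expectation.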
In February 2015 Schulman, with co-authors Sergey Levine, Philipp Moritz, Michael I. Jordan and Pieter Abbeel, posted the paper "Trust Region Policy Optimization" (arXiv:1502.05477) on arXiv.[3] TRPO is a policy gradient algorithm that, at each iteration, optimizes a surrogate advantage objective subject to a constraint on the average Kullback–Leibler divergence between the new and old policies. The KL constraint defines a "trust region" inside which a theoretically motivated monotonic-improvement bound holds. Schulman and colleagues showed that a practical approximation of this scheme, using a conjugate-gradient solver against the Fisher information matrix and a backtracking line search, could train deep neural network policies for simulated locomotion in the MuJoCo physics simulator and for Atari games from raw screen images.[3] The paper appeared at the 32nd International Conference on Machine Learning later in 2015 and became a foundational reference for stable deep policy optimisation.[22] Before TRPO most policy-gradient methods either took fixed, often dangerous, step sizes in parameter space or relied on heuristic learning-rate tuning; the trust-region formulation gave practitioners a principled way to make as large an update as possible while bounding the risk of policy collapse.[3]
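The constrained update can be written out as follows. This is a sketch in commonly used notation, assumed here for illustration rather than quoted from the paper: π_θ is the policy, θ_old the parameters before the update, Â an advantage estimate, and δ the trust-region radius.

```latex
\begin{aligned}
\max_{\theta}\quad
  & \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}
    \left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, \hat{A}(s,a) \right] \\
\text{subject to}\quad
  & \mathbb{E}_{s \sim \pi_{\theta_{\text{old}}}}
    \left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\big\|\, \pi_\theta(\cdot \mid s) \right) \right] \le \delta
\end{aligned}
```

A smaller δ keeps each new policy closer to the old one, trading update size for stability.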
Four months after TRPO, in June 2015, Schulman, Moritz, Levine, Jordan and Abbeel posted "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (arXiv:1506.02438).[4] GAE introduces an exponentially weighted advantage estimator, parameterised by a discount factor γ and a trace decay parameter λ, that interpolates between the low-variance but biased one-step temporal-difference estimate (λ = 0) and the high-variance, low-bias Monte Carlo estimate (λ = 1). By tuning λ a practitioner can trade off bias and variance in the advantage estimate, which can substantially improve the sample efficiency of policy gradient methods. GAE is now standard in most modern actor-critic implementations, and is built into popular libraries such as Stable-Baselines3 and CleanRL.[4]
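The estimator can be sketched in a few lines of plain Python. This is a minimal illustration for a single trajectory, not the code of any particular library; the function name `gae_advantages` and its argument layout are hypothetical, and `values` is assumed to carry one extra bootstrap entry for the state after the final step (0.0 if the episode terminated).

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    rewards: T rewards r_0 .. r_{T-1}
    values:  T + 1 value estimates V(s_0) .. V(s_T); the last entry is the
             bootstrap value (assumed 0.0 for a terminal state)
    """
    T = len(rewards)
    if len(values) != T + 1:
        raise ValueError("values must have one more entry than rewards")

    advantages = [0.0] * T
    running = 0.0
    # Work backwards so each step reuses the discounted sum of later residuals.
    for t in reversed(range(T)):
        # One-step temporal-difference residual.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of residuals with decay gamma * lam.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With `lam=0.0` each advantage collapses to the one-step TD residual, and with `lam=1.0` it becomes the discounted Monte Carlo return minus the value baseline, which are the two ends of the bias-variance trade-off.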
In July 2017 Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford and Oleg Klimov, all then at OpenAI, released "Proximal Policy Optimization Algorithms" (arXiv:1707.06347).[5] PPO retains the trust-region intuition of TRPO but replaces the constrained natural-gradient step with a much simpler first-order objective. The most widely used variant, PPO-Clip, maximises a clipped probability-ratio surrogate that removes any incentive to push the ratio between the new and old action probabilities outside a small interval around one, typically with a clip parameter of 0.2.[5] Because PPO requires only a few lines of code on top of a standard policy gradient implementation, supports both discrete and continuous action spaces, scales easily to distributed training, and tends to be robust to hyperparameter choices, it rapidly became the de facto policy optimisation algorithm at OpenAI and in the broader research community.[5][7] OpenAI's blog post accompanying the release stated that "PPO has become the default reinforcement learning algorithm at OpenAI because of its ease of use and good performance," and the algorithm now appears as a baseline in most reinforcement-learning libraries and in many textbooks.[7] PPO is also the policy-gradient algorithm at the heart of modern reinforcement learning from human feedback pipelines for large language models, including the one used to fine-tune ChatGPT.[9]
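The clipped surrogate can be sketched as follows. This is a minimal, framework-free illustration of the objective for a batch of sampled actions, not OpenAI's implementation; the function name `ppo_clip_objective` and its list-based arguments are hypothetical, and a real implementation would compute this on tensors and differentiate it with respect to the policy parameters.

```python
import math


def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximised).

    new_logp:   log-probabilities of the sampled actions under the new policy
    old_logp:   log-probabilities of the same actions under the old policy
    advantages: advantage estimates for those actions
    """
    total = 0.0
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio between the new and old policies.
        ratio = math.exp(nl - ol)
        # Ratio restricted to the interval [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum makes the objective a pessimistic bound:
        # gains from moving the ratio outside the interval are ignored,
        # while losses are still counted in full.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

For a positive advantage, the objective stops growing once the ratio exceeds 1 + `clip_eps`; for a negative advantage, it stops improving once the ratio falls below 1 - `clip_eps`.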
Between 2017 and 2019 Schulman was one of the contributors to OpenAI Five, a team of five neural networks trained by self-play to play the multiplayer real-time strategy game Dota 2.[6] The system was trained with a massively scaled version of PPO on a custom distributed reinforcement-learning infrastructure that processed approximately two million frames every two seconds across thousands of GPUs. On 13 April 2019 OpenAI Five defeated OG, the reigning Dota 2 world champions, in a best-of-three exhibition match, the first time an AI system had beaten a world champion at an esports title.[6] The technical report "Dota 2 with Large Scale Deep Reinforcement Learning" (arXiv:1912.06680), authored by OpenAI with Christopher Berner as the lead author, lists Schulman among the project's authors.[23]
From around 2017 Schulman began collaborating with safety researchers at OpenAI and at DeepMind on what would become known as reinforcement learning from human feedback: training a reward model from pairwise human preferences over model outputs, and then using PPO to fine-tune a language model to maximise that reward.[9] He is a co-author of "Training language models to follow instructions with human feedback" (arXiv:2203.02155), the March 2022 paper that introduced InstructGPT.[9] InstructGPT applied a three-stage recipe of supervised fine-tuning on labeller-written demonstrations, reward modelling on pairwise preference data, and PPO fine-tuning against the learned reward, and showed that the resulting 1.3-billion-parameter model was preferred by human labellers to the 175-billion-parameter GPT-3 on the OpenAI prompt distribution, while also being more truthful and less toxic on standard benchmarks.[9] The same RLHF recipe was used to fine-tune the model behind ChatGPT, which OpenAI released as a "research preview" on 30 November 2022 and which reached an estimated 100 million users within two months, the fastest consumer software adoption on record at the time.[9][8]
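The reward-modelling stage can be illustrated with the loss for a single preference pair. The sketch below assumes a Bradley–Terry style logistic loss, a common formulation for pairwise preference data; the function name `pairwise_preference_loss` is hypothetical, and the two scalar rewards stand in for the outputs of a learned reward model on the preferred and the rejected response.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood that the chosen response beats the rejected one.

    Computes -log(sigmoid(reward_chosen - reward_rejected)) in a numerically
    stable way, so large reward gaps do not overflow.
    """
    diff = reward_chosen - reward_rejected
    if diff > 0:
        # -log(sigmoid(d)) = log(1 + exp(-d)); exp(-d) is small here.
        return math.log1p(math.exp(-diff))
    # For d <= 0 rewrite as -d + log(1 + exp(d)) to avoid a large exponent.
    return -diff + math.log1p(math.exp(diff))
```

The loss is small when the reward model scores the labeller-preferred response higher and grows roughly linearly when it scores the rejected response higher, which pushes the model to rank outputs the way labellers did.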
In a public lecture at UC Berkeley on 19 April 2023, "Reinforcement Learning from Human Feedback: Progress and Challenges," Schulman argued that hallucinations in language models are an unavoidable consequence of pure behaviour cloning: "even if you clone on 100% correct answers, you're teaching the model to hallucinate, because it doesn't have all of those facts," and that reinforcement learning with reward models that explicitly penalise confident fabrication is therefore a necessary part of the solution, rather than a peripheral safety measure.[8][24] This framing of truthfulness as a tractable RLHF objective influenced subsequent post-training work at multiple frontier laboratories.[24]
Schulman was one of the eleven original co-founders of OpenAI, which was announced on 11 December 2015 as a non-profit artificial intelligence research company. The founding group named in OpenAI's "Introducing OpenAI" blog post comprised Ilya Sutskever as research director, Greg Brockman as chief technology officer, and Trevor Blackwell, Vicki Cheung, Andrej Karpathy, Durk Kingma, John Schulman, Pamela Vagata and Wojciech Zaremba as founding members, with Sam Altman and Elon Musk serving as co-chairs.[1][2] At the time of the announcement Schulman was still completing his PhD at Berkeley, which he formally defended in 2016.[17][18] He has said in interviews that he was recruited at a small dinner organised by Altman and Brockman at the Rosewood Sand Hill hotel in Menlo Park in mid-2015, and that the prospect of a long-term, well-resourced research environment focused on artificial general intelligence was decisive in his choice to join.[20]
During his nine years at OpenAI, Schulman moved gradually from pure reinforcement learning research toward the application of RL to large language models. He led OpenAI's reinforcement learning team in its early years, during which the lab open-sourced the Gym benchmark suite and the Baselines reference implementations of TRPO, PPO and other algorithms.[7] He contributed to OpenAI Five between 2017 and 2019, and from 2022 to 2024 co-led the post-training team that fine-tuned the GPT series models for ChatGPT and the OpenAI API.[25][11] After the November 2022 launch of ChatGPT he became one of the most prominent public faces of OpenAI's alignment work, giving talks at Berkeley in April 2023, a Stanford CS25 lecture in 2023 on language model post-training, and a long-form interview on the Dwarkesh podcast in May 2024 in which he discussed reasoning, RLHF and the possibility of artificial general intelligence within a few years.[24][26]
Following the November 2023 board crisis that briefly removed Altman, OpenAI reorganised its safety teams. In mid-2024 several alignment researchers, including Jan Leike and Ilya Sutskever, left the company, and Schulman was named to a new safety and security committee led by members of the OpenAI board.[11][25] On 5 August 2024 he announced his own departure in a note posted to X (formerly Twitter): "I've made the difficult decision to leave OpenAI. This choice stems from my desire to deepen my focus on AI alignment, and to start a new chapter of my career where I can return to hands-on technical work."[10][11] In the same message he emphasised that he was not leaving because of disagreements over safety: "Company leaders have been very committed to investment in this area," he wrote, and predicted that "OpenAI and the teams I was part of will continue to thrive without me."[10]
Schulman's August 2024 statement also announced his next employer: "I've decided to pursue this goal at Anthropic, where I believe I can gain new perspectives and do research alongside people deeply engaged with the topics I'm most interested in."[10][11] He presented the move as a personal choice rather than a judgement on alignment work at OpenAI.[10] Bloomberg and TechCrunch both reported the move on 5–6 August 2024, alongside the news that fellow co-founder Greg Brockman was taking an extended sabbatical through the end of 2024.[11][27]
At Anthropic Schulman was hired into a research role focused on AI alignment, but he stayed only about five months. On 6 February 2025 he confirmed on X: "Confirming that I left Anthropic last week. Leaving wasn't easy because I enjoyed the stimulating research environment and the kind and talented people I was working with, but I decided to go with another opportunity that I found extremely compelling."[28][29] Anthropic's chief science officer Jared Kaplan told Bloomberg that he was "sad to see John go" but "fully support[ed] his decision to pursue new opportunities."[29][13]
The "extremely compelling" alternative was Thinking Machines Lab, a new artificial intelligence company being assembled by former OpenAI chief technology officer Mira Murati. Fortune and Bloomberg reported on 6 February 2025 that Schulman was joining Murati's startup; the company itself emerged from stealth on 18 February 2025 with a blog post and a leadership announcement.[14][15][16] Schulman was named chief scientist; Barret Zoph, formerly OpenAI's vice president of research, was named chief technology officer; and the founding team included Lilian Weng, Luke Metz, Andrew Tulloch, Jonathan Lachman, Sam Shleifer and Stephen Roller, most of them former OpenAI colleagues.[15][16]
Thinking Machines Lab framed its mission as making "AI systems more widely understood, customizable, and generally capable," with an explicit emphasis on closing the gap between frontier industrial laboratories and the broader research and developer community.[15][16] Press coverage at the time highlighted that the leadership team was "stacked with former OpenAI colleagues" and that the company intended to focus on more efficient post-training techniques rather than ever-larger pretraining runs.[15][16] In July 2025 the company closed a $2 billion seed financing round led by Andreessen Horowitz at a reported $12 billion valuation, one of the largest seed rounds in the history of the technology industry.[16] On 1 October 2025 it released its first product, Tinker, an API for fine-tuning open-weight language models on Thinking Machines' internal infrastructure that allows external developers to submit fine-tuning jobs without managing the underlying distributed computing.[16] In media coverage of the launch, Schulman was identified as the technical lead behind the company's post-training stack.[16]
Schulman has frequently spoken about both the practical mechanics of RLHF and the broader question of when and how artificial general intelligence may arrive. In his April 2023 Berkeley talk "Reinforcement Learning from Human Feedback: Progress and Challenges," he argued that even if a language model is trained with behaviour cloning on entirely correct answers, it will still learn to hallucinate because the model has no internal notion of whether it actually knows a particular fact; he proposed that reinforcement learning with a reward model that penalises confident fabrication is therefore a necessary part of any solution to hallucination, rather than merely a peripheral safety patch.[24][8] He has revisited this argument in subsequent interviews and lectures, framing it as the most important methodological reason why purely supervised fine-tuning will be insufficient for trustworthy assistants.[24]
On AGI timelines, in his May 2024 interview with Dwarkesh Patel, Schulman said it would be "reasonable" to plan for AGI arriving within a few years, while cautioning, "First of all, I don't think this is going to happen next year but it's still useful to have the conversation. It could be two or three years instead." He estimated that his own job, research engineering on frontier models, might be largely automatable in roughly five years, but argued that the bottleneck for fully autonomous AI systems is unlikely to dissolve in a single capability jump.[26] He has consistently characterised current language models as systems that try to produce outputs human judges will rate as correct rather than as agents with intrinsic drives, and has argued that the main near-term safety problems concern misuse and incorrect outputs rather than autonomous goal-directed behaviour.[26] On the topic of competing efforts to build "truth-seeking" chatbots, he told The Decoder in 2023 that proposals such as Elon Musk's "TruthGPT" were "complicated" because reasonable people will disagree about what counts as a true and balanced answer to politically charged questions.[31]
In 2025, the University of California, Berkeley awarded Schulman its Mark Bingham Award for Excellence in Achievement by Young Alumni, citing his contributions to deep reinforcement learning and to ChatGPT.[17]
Selected papers, listed by year: