John Schulman

Updated Jun 22, 2026

John Schulman is an American artificial intelligence researcher, one of the eleven original co-founders of OpenAI, and the lead developer of Proximal Policy Optimization (PPO), which OpenAI has described as "the default reinforcement learning algorithm at OpenAI."^[1]^[2]^[7] He is the lead author of three of deep reinforcement learning's most widely cited algorithmic papers: Trust Region Policy Optimization (TRPO, 2015), Generalized Advantage Estimation (GAE, 2015) and Proximal Policy Optimization (PPO, 2017).^[3]^[4]^[5] PPO became the dominant policy-gradient method of the late 2010s and the optimisation algorithm behind both OpenAI Five and the reinforcement learning from human feedback (RLHF) pipeline used to fine-tune ChatGPT.^[6]^[7]

At OpenAI between 2015 and 2024, Schulman led the team that applied reinforcement learning to large language models and co-led the post-training organization that shipped ChatGPT in November 2022, work for which he has been described in the press as a "ChatGPT architect."^[8]^[9] After almost nine years at the company he announced on 5 August 2024 that he would be leaving OpenAI to join rival Anthropic in order to "deepen my focus on AI alignment" and "return to hands-on technical work."^[10]^[11]

His tenure at Anthropic lasted only about five months. On 6 February 2025 he confirmed that he had left Anthropic the previous week, and several outlets reported that he was joining Thinking Machines Lab, the new artificial intelligence startup founded by former OpenAI chief technology officer Mira Murati.^[12]^[13]^[14] Thinking Machines Lab publicly emerged from stealth on 18 February 2025 with Schulman listed as its chief scientist, a role he still holds in 2026.^[15]^[16]^[32]

Key facts


Born	1987 or 1988^[17]
Nationality	American
Education	BS Physics, California Institute of Technology (2010); PhD Electrical Engineering and Computer Sciences, University of California, Berkeley (2016)^[17]^[18]
Doctoral advisor	Pieter Abbeel^[17]^[18]
Known for	TRPO, GAE, PPO; reinforcement learning from human feedback; leading post-training of ChatGPT
Affiliations	OpenAI (2015-2024); Anthropic (2024-2025); Thinking Machines Lab (2025-present)

Early life and education

Schulman was born in 1987 or 1988 in the United States. His family moved to Great Neck, New York, during the summer before he entered ninth grade, and he attended Great Neck South High School there.^[17]^[19] According to his 2005 U.S. Physics Team biography, his interest in science began in childhood through documentaries and science fiction, and he began studying physics seriously in eighth grade after being inspired by the television programme BattleBots.^[19] He was selected as one of the twenty-four members of the 2005 U.S. Physics Olympiad team while a junior at Great Neck South.^[19]

He enrolled at the California Institute of Technology, graduating in 2010 with a Bachelor of Science in physics.^[17]^[20] He then began graduate study at the University of California, Berkeley, where he initially intended to work in computational neuroscience. In a 2023 interview with Berkeley News he recalled that during his first-year lab rotations he joined the group of robotics professor Pieter Abbeel and was "really excited about that work," in particular Abbeel's projects on autonomous helicopter aerobatics and on robotic laundry folding, so he asked to switch from the neuroscience program into Berkeley's Department of Electrical Engineering and Computer Sciences.^[20]^[21] He has cited that rotation as the moment he committed to artificial intelligence rather than neuroscience as his long-term research direction.^[20]

Abbeel became his doctoral advisor, and Schulman completed his PhD in 2016 with a thesis titled "Optimizing Expectations: From Deep Reinforcement Learning to Stochastic Computation Graphs," filed as Berkeley EECS Technical Report EECS-2016-217 on 16 December 2016.^[18] The thesis frames reinforcement learning as the optimization of an expected return with respect to the parameters of a policy. Chapter 3 develops the trust region policy optimization algorithm and proves a monotonic improvement guarantee for a related theoretical surrogate; Chapter 4 introduces generalized advantage estimation as a way to reduce the variance of policy-gradient estimates using a learned state-value function; and Chapter 5 presents a unifying calculus for gradient estimators of objectives mixing sampled random variables and differentiable operations, connecting reinforcement learning to variational inference and to memory-and-attention models.^[18]
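The view of reinforcement learning as optimising an expectation can be illustrated with the score-function (likelihood-ratio) gradient estimator, one of the estimators the stochastic-computation-graph framework covers. The sketch below is a toy illustration, not code from the thesis: the two-armed bandit, its rewards, and the one-parameter sigmoid policy are all assumptions chosen so that the exact gradient is easy to check against the sampled estimate.

```python
import math
import random

REWARDS = (1.0, 0.0)  # hypothetical rewards for action 0 and action 1


def prob_action0(theta):
    """Probability of action 0 under a one-parameter sigmoid policy."""
    return 1.0 / (1.0 + math.exp(-theta))


def exact_gradient(theta):
    """d/dtheta of E[R] = p*R0 + (1-p)*R1, using dp/dtheta = p*(1-p)."""
    p = prob_action0(theta)
    return p * (1.0 - p) * (REWARDS[0] - REWARDS[1])


def score_function_estimate(theta, num_samples, rng):
    """Monte Carlo estimate of the same gradient.

    Averages R(a) * d log pi(a) / dtheta over sampled actions.
    """
    p = prob_action0(theta)
    total = 0.0
    for _ in range(num_samples):
        action = 0 if rng.random() < p else 1
        # d log p / dtheta = 1 - p, and d log (1 - p) / dtheta = -p
        score = (1.0 - p) if action == 0 else -p
        total += REWARDS[action] * score
    return total / num_samples
```

With enough samples the estimate approaches the exact value; the variance of this estimator is what baselines and advantage estimators are designed to reduce.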

What are John Schulman's main research contributions?

Trust Region Policy Optimization (TRPO)

In February 2015 Schulman, with co-authors Sergey Levine, Philipp Moritz, Michael I. Jordan and Pieter Abbeel, posted the paper "Trust Region Policy Optimization" (arXiv:1502.05477) on arXiv.^[3] TRPO is a policy gradient algorithm that, at each iteration, optimizes a surrogate advantage objective subject to a constraint on the average Kullback-Leibler (KL) divergence between the new and old policies. The KL constraint defines a "trust region" inside which a theoretically motivated monotonic-improvement bound holds. Schulman and colleagues showed that a practical approximation, which computes a natural-gradient step with a conjugate-gradient solver applied to the Fisher information matrix and then runs a backtracking line search, could train deep neural network policies for simulated locomotion in the MuJoCo physics simulator and for Atari games from raw screen images.^[3] The paper appeared at the 32nd International Conference on Machine Learning later in 2015 and became a foundational reference for stable deep policy optimisation.^[22] Before TRPO, most policy-gradient methods either took fixed step sizes in parameter space, which could destabilise training, or relied on heuristic learning-rate tuning; the trust-region formulation gave practitioners a principled way to make as large an update as possible while bounding the risk of policy collapse.^[3]
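The conjugate-gradient step mentioned above can be sketched in a few lines. The idea is to approximately solve F x = g for the step direction x using only matrix-vector products, so the Fisher matrix F never has to be inverted explicitly. This is a minimal illustration under stated assumptions, not the paper's implementation: the 2x2 matrix `FISHER`, the vector `gradient`, and the helper names are hypothetical stand-ins.

```python
def conjugate_gradient(matvec, b, iters=10, tol=1e-10):
    """Approximately solve A x = b for symmetric positive-definite A,
    given only a function that computes the product A @ v."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x (x starts at zero)
    p = list(b)  # current search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(iters):
        if rs_old < tol:
            break  # residual is already negligible
        Ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Hypothetical 2x2 stand-in for a Fisher information matrix.
FISHER = [[4.0, 1.0], [1.0, 3.0]]


def fisher_vector_product(v):
    """Multiply the stand-in Fisher matrix by a vector."""
    return [sum(a * vi for a, vi in zip(row, v)) for row in FISHER]


gradient = [1.0, 2.0]  # hypothetical policy-gradient vector
step_direction = conjugate_gradient(fisher_vector_product, gradient)
```

In TRPO the resulting direction is then rescaled to respect the KL bound and checked with a backtracking line search; neither step is shown here.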

Generalized Advantage Estimation (GAE)

Four months after TRPO, in June 2015, Schulman, Moritz, Levine, Jordan and Abbeel posted "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (arXiv:1506.02438).^[4] GAE introduces an exponentially weighted advantage estimator, parameterised by a discount factor gamma and a trace-decay parameter lambda, that interpolates between the low-variance but biased one-step temporal-difference residual (lambda = 0) and the high-variance Monte Carlo estimate, which does not depend on the accuracy of the value function (lambda = 1). By tuning lambda a practitioner can trade off bias against variance in the advantage estimate, which improves the sample efficiency of policy gradient methods. GAE is now standard in modern actor-critic implementations and is built into popular libraries such as Stable-Baselines3 and CleanRL.^[4]
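The estimator is usually computed with a single backward pass over a trajectory. The following is a minimal sketch rather than any library's actual code: the function name, argument layout, and default values of `gamma` and `lam` are illustrative assumptions.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Return GAE advantages for one trajectory.

    `values` must have len(rewards) + 1 entries; the last entry is the
    bootstrap value of the state reached after the final reward
    (0.0 if that state is terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference residual at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of current and future residuals.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=0.0` reduces each advantage to the one-step residual, while `lam=1.0` recovers the discounted return minus the value baseline, which are the two ends of the bias-variance trade-off described above.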

Proximal Policy Optimization (PPO)

In July 2017 Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford and Oleg Klimov, all then at OpenAI, released "Proximal Policy Optimization Algorithms" (arXiv:1707.06347).^[5] PPO retains the trust-region intuition of TRPO but replaces the constrained natural-gradient step with a much simpler first-order objective. The most widely used variant, PPO-Clip, maximises a clipped probability-ratio surrogate: once the ratio between the new and old action probabilities leaves a small interval around one in the direction that would increase the objective, the surrogate stops rewarding further movement. The width of the interval is set by a clip parameter, typically 0.2.^[5] Because PPO requires only a few lines of code on top of a standard policy gradient implementation, supports both discrete and continuous action spaces, scales easily to distributed training, and tends to be robust to hyperparameter choices, it rapidly became the de facto policy optimisation algorithm at OpenAI and in the broader research community.^[5]^[7] OpenAI's blog post accompanying the release stated that "PPO has become the default reinforcement learning algorithm at OpenAI because of its ease of use and good performance," and the algorithm is now a standard baseline in widely used reinforcement-learning libraries.^[7] PPO is also the policy-gradient algorithm at the heart of modern reinforcement learning from human feedback pipelines for large language models, including the one used to fine-tune ChatGPT.^[9]
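The clipped surrogate is compact enough to write out for a single sample. The sketch below is an illustration rather than a reference implementation: the function name and the use of log-probabilities as inputs are assumptions, and a real implementation would average this quantity over a batch and differentiate it with an automatic-differentiation library.

```python
import math


def ppo_clip_objective(new_logp, old_logp, advantage, clip_eps=0.2):
    """Clipped surrogate objective for a single (state, action) sample."""
    # Probability ratio between the new and old policies for this action.
    ratio = math.exp(new_logp - old_logp)
    # Ratio restricted to the interval [1 - clip_eps, 1 + clip_eps].
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum makes the objective a pessimistic bound.
    return min(ratio * advantage, clipped_ratio * advantage)
```

For a positive advantage, the objective stops growing once the ratio exceeds 1.2; for a negative advantage, it stops improving once the ratio falls below 0.8. In both cases the unclipped term still applies when the ratio moves in the harmful direction, so bad updates are fully penalised.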

OpenAI Five and large-scale RL

Between 2017 and 2019 Schulman was one of the contributors to OpenAI Five, a team of five neural networks trained by self-play to play the multiplayer real-time strategy game Dota 2.^[6] The system was trained with a massively scaled version of PPO on a custom distributed reinforcement-learning infrastructure that processed approximately two million frames every two seconds across thousands of GPUs. On 13 April 2019 OpenAI Five defeated OG, the reigning Dota 2 world champions, in a best-of-three exhibition match, the first time an AI system had beaten a world champion at an esports title.^[6] The technical report "Dota 2 with Large Scale Deep Reinforcement Learning" (arXiv:1912.06680), authored by OpenAI with Christopher Berner as the lead author, lists Schulman among the project's authors.^[23]

Reinforcement learning from human feedback and ChatGPT

From around 2017 Schulman began collaborating with safety researchers at OpenAI and at DeepMind on what would become known as reinforcement learning from human feedback: training a reward model from pairwise human preferences over model outputs, and then using PPO to fine-tune a language model to maximise that reward.^[9] He is a co-author of "Training language models to follow instructions with human feedback" (arXiv:2203.02155), the March 2022 paper that introduced InstructGPT.^[9] InstructGPT applied a three-stage recipe of supervised fine-tuning on labeller-written demonstrations, reward modelling on pairwise preference data, and PPO fine-tuning against the learned reward, and showed that the resulting 1.3-billion-parameter model was preferred by human labellers to the 175-billion-parameter GPT-3 on the OpenAI prompt distribution, while also being more truthful and less toxic on standard benchmarks.^[9] The same RLHF recipe was used to fine-tune the model behind ChatGPT, which OpenAI released as a "research preview" on 30 November 2022 and which reached an estimated 100 million users within two months, the fastest consumer software adoption on record at the time.^[9]^[8]
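The reward-modelling stage of this recipe is commonly trained with a pairwise logistic loss: the reward model is penalised when it scores the response a labeller rejected above the one the labeller preferred. The sketch below shows that loss for a single comparison. The function name and scalar inputs are illustrative assumptions; in practice the two rewards come from a neural network and the loss is averaged over many comparisons.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Return -log(sigmoid(reward_chosen - reward_rejected)), computed stably."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the two rewards are tied, approaches zero as the preferred response is scored increasingly higher, and grows roughly linearly when the ordering is wrong.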

In a public lecture at UC Berkeley on 19 April 2023, "Reinforcement Learning from Human Feedback: Progress and Challenges," Schulman argued that hallucinations in language models are an unavoidable consequence of pure behaviour cloning: "even if you clone on 100% correct answers, you're teaching the model to hallucinate, because it doesn't have all of those facts," and that reinforcement learning with reward models that explicitly penalise confident fabrication is therefore a necessary part of the solution, rather than a peripheral safety measure.^[8]^[24] This framing of truthfulness as a tractable RLHF objective influenced subsequent post-training work at multiple frontier laboratories.^[24]

Career at OpenAI (2015-2024)

Schulman was one of the eleven original co-founders of OpenAI, which was announced on 11 December 2015 as a non-profit artificial intelligence research company. The founding group named in OpenAI's "Introducing OpenAI" blog post comprised Ilya Sutskever as research director, Greg Brockman as chief technology officer, the researchers and engineers Trevor Blackwell, Vicki Cheung, Andrej Karpathy, Durk Kingma, John Schulman, Pamela Vagata and Wojciech Zaremba, and Sam Altman and Elon Musk as co-chairs.^[1]^[2] At the time of the announcement Schulman was still completing his PhD at Berkeley, which he finished in 2016.^[17]^[18] He has said in interviews that he was recruited at a small dinner organised by Altman and Brockman at the Rosewood Sand Hill hotel in Menlo Park in mid-2015, and that the prospect of a long-term, well-resourced research environment focused on artificial general intelligence was decisive in his choice to join.^[20]

During his almost nine years at OpenAI, Schulman moved gradually from pure reinforcement learning research toward the application of RL to large language models. He led OpenAI's reinforcement learning team in its early years, during which the lab open-sourced the Gym benchmark suite and the Baselines reference implementations of TRPO, PPO and other algorithms.^[7] He contributed to OpenAI Five between 2017 and 2019, and from 2022 to 2024 co-led the post-training team that fine-tuned the GPT series models for ChatGPT and the OpenAI API.^[25]^[11] After the November 2022 launch of ChatGPT he became one of the most prominent public faces of OpenAI's alignment work, giving a talk at Berkeley in April 2023, a Stanford CS25 lecture on language model post-training later that year, and a long-form interview on Dwarkesh Patel's podcast in May 2024 in which he discussed reasoning, RLHF and the possibility of artificial general intelligence within a few years.^[24]^[26]

Following the November 2023 board crisis that briefly removed Altman, OpenAI reorganised its safety teams. In mid-2024 several alignment researchers, including Jan Leike and Ilya Sutskever, left the company, and Schulman was named to a new internal safety and security committee led by members of the OpenAI board.^[11]^[25] On 5 August 2024 he announced his own departure in a note posted to X (formerly Twitter): "I've made the difficult decision to leave OpenAI. This choice stems from my desire to deepen my focus on AI alignment, and to start a new chapter of my career where I can return to hands-on technical work."^[10]^[11] In the same message he emphasised that he was not leaving because of disagreements over safety: "Company leaders have been very committed to investment in this area," he wrote, and predicted that "OpenAI and the teams I was part of will continue to thrive without me."^[10]

Why did John Schulman leave OpenAI for Anthropic?

Schulman's August 2024 statement also named his next employer: "I've decided to pursue this goal at Anthropic, where I believe I can gain new perspectives and do research alongside people deeply engaged with the topics I'm most interested in."^[10]^[11] He presented the move as a personal choice rather than a response to any perceived weakness in OpenAI's support for alignment work.^[10] Bloomberg and TechCrunch both reported the move on 5-6 August 2024, alongside the news that fellow co-founder Greg Brockman was taking an extended sabbatical through the end of 2024.^[11]^[27]

At Anthropic Schulman was hired into a research role focused on AI alignment, but he stayed only about five months. On 6 February 2025 he confirmed on X: "Confirming that I left Anthropic last week. Leaving wasn't easy because I enjoyed the stimulating research environment and the kind and talented people I was working with, but I decided to go with another opportunity that I found extremely compelling."^[28]^[29] Anthropic's chief science officer Jared Kaplan told Bloomberg that he was "sad to see John go" but "fully support[ed] his decision to pursue new opportunities."^[29]^[13]

What is John Schulman's role at Thinking Machines Lab?

The "extremely compelling" alternative was Thinking Machines Lab, a new artificial intelligence company being assembled by former OpenAI chief technology officer Mira Murati. Fortune and Bloomberg reported on 6 February 2025 that Schulman was joining Murati's startup; the company itself emerged from stealth on 18 February 2025 with a blog post and a leadership announcement.^[14]^[15]^[16] Schulman was named chief scientist; Barret Zoph, formerly OpenAI's vice president of research, was named chief technology officer; and the founding team included Lilian Weng, Luke Metz, Andrew Tulloch, Jonathan Lachman, Sam Shleifer and Stephen Roller, most of them former OpenAI colleagues.^[15]^[16]

Thinking Machines Lab framed its mission as making "AI systems more widely understood, customizable, and generally capable," with an explicit emphasis on closing the gap between frontier industrial laboratories and the broader research and developer community.^[15]^[16] Press coverage at the time highlighted that the leadership team was "stacked with former OpenAI colleagues" and that the company intended to focus on more efficient post-training techniques rather than ever-larger pretraining runs.^[15]^[16] In July 2025 the company closed a $2 billion seed financing round led by Andreessen Horowitz at a $12 billion valuation, with participation from Nvidia, Accel, ServiceNow, Cisco, AMD and Jane Street; multiple outlets, including TechCrunch and Crunchbase, described it as the largest seed round in the history of venture capital.^[16]^[33] On 1 October 2025 it released its first product, Tinker, an API for fine-tuning open-weight language models on Thinking Machines' internal infrastructure that allows external developers to submit fine-tuning jobs without managing the underlying distributed computing.^[16] In media coverage of the launch, Schulman was identified as the technical lead behind the company's post-training stack.^[16]

In January 2026, two of the company's named co-founders, chief technology officer Barret Zoph and Luke Metz, returned to OpenAI, alongside another early employee, Sam Schoenholz.^[32] Schulman remained at Thinking Machines Lab as chief scientist, leaving him as the most prominent of the original 2025 founding cohort still at the company.^[32] He has continued to speak publicly on behalf of the lab, indicating in late 2025 and early 2026 that Thinking Machines intended to release its own in-house models, building on the post-training and fine-tuning capabilities behind Tinker.^[32]

Views and public profile

Schulman has frequently spoken about both the practical mechanics of RLHF and the broader question of when and how artificial general intelligence may arrive. A recurring theme, first set out in his April 2023 Berkeley talk, is that a language model trained by behaviour cloning will learn to hallucinate even when every training answer is correct, because the model has no internal notion of whether it actually knows a particular fact; on this view, reinforcement learning with a reward model that penalises confident fabrication is a necessary part of any solution rather than a peripheral safety patch.^[24]^[8] He has revisited the argument in subsequent interviews and lectures, framing it as the most important methodological reason why purely supervised fine-tuning will be insufficient for trustworthy assistants.^[24]

On AGI timelines, in his May 2024 interview with Dwarkesh Patel, Schulman said it would be "reasonable" to plan for AGI arriving within one to two years while cautioning, "First of all, I don't think this is going to happen next year but it's still useful to have the conversation. It could be two or three years instead." He estimated that his own job, research engineering on frontier models, might be largely automatable in roughly five years, but argued that the bottleneck for fully autonomous AI systems is unlikely to dissolve in a single capability jump.^[26] He has consistently characterised current language models as systems that try to produce outputs human judges will rate as correct rather than as agents with intrinsic drives, and has argued that the main near-term safety problems concern misuse and incorrect outputs rather than autonomous goal-directed behaviour.^[26] On the topic of competing efforts to build "truth-seeking" chatbots, he told The Decoder in 2023 that proposals such as Elon Musk's "TruthGPT" were "complicated" because reasonable people will disagree about what counts as a true and balanced answer to politically charged questions.^[31]

In 2025, the University of California, Berkeley awarded Schulman its Mark Bingham Award for Excellence in Achievement by Young Alumni, citing his contributions to deep reinforcement learning and to ChatGPT.^[17]

Notable publications

Selected papers, listed by year:

Schulman, J., Levine, S., Moritz, P., Jordan, M. I., and Abbeel, P. (2015). Trust Region Policy Optimization. arXiv:1502.05477.^[3]
Schulman, J., Moritz, P., Levine, S., Jordan, M. I., and Abbeel, P. (2015). High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv:1506.02438.^[4]
Schulman, J., Heess, N., Weber, T., and Abbeel, P. (2015). Gradient Estimation Using Stochastic Computation Graphs. arXiv:1506.05254.^[30]
Schulman, J. (2016). Optimizing Expectations: From Deep Reinforcement Learning to Stochastic Computation Graphs. PhD thesis, University of California, Berkeley, EECS-2016-217.^[18]
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.^[5]
OpenAI, Berner, C., Brockman, G., Chan, B., Cheung, V., Dębiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., Józefowicz, R., Gray, S., Olsson, C., Pachocki, J., Petrov, M., Pinto, H. P. d. O., Raiman, J., Salimans, T., Schlatter, J., Schneider, J., Sidor, S., Sutskever, I., Tang, J., Wolski, F., Zhang, S., and Schulman, J., among others (2019). Dota 2 with Large Scale Deep Reinforcement Learning. arXiv:1912.06680.^[23]
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. (2022). Training Language Models to Follow Instructions with Human Feedback. arXiv:2203.02155.^[9]

References

1. OpenAI, "Introducing OpenAI," 11 December 2015. https://openai.com/index/introducing-openai/
2. Wikipedia contributors, "OpenAI." https://en.wikipedia.org/wiki/OpenAI
3. Schulman, J. et al., "Trust Region Policy Optimization," arXiv:1502.05477, 2015. https://arxiv.org/abs/1502.05477
4. Schulman, J. et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation," arXiv:1506.02438, 2015. https://arxiv.org/abs/1506.02438
5. Schulman, J. et al., "Proximal Policy Optimization Algorithms," arXiv:1707.06347, 2017. https://arxiv.org/abs/1707.06347
6. OpenAI, "OpenAI Five," updated 2019. https://openai.com/index/openai-five/
7. OpenAI, "Proximal Policy Optimization," 20 July 2017. https://openai.com/index/openai-baselines-ppo/
8. UC Berkeley News, "ChatGPT architect, Berkeley alum John Schulman on his journey with AI," 20 April 2023. https://news.berkeley.edu/2023/04/20/chatgpt-architect-berkeley-alum-john-schulman-on-his-journey-with-ai/
9. Ouyang, L. et al., "Training Language Models to Follow Instructions with Human Feedback," arXiv:2203.02155, 4 March 2022. https://arxiv.org/abs/2203.02155
10. John Schulman (@johnschulman2), X post, 5 August 2024. https://x.com/johnschulman2/status/1820610863499509855
11. TechCrunch, "OpenAI co-founder Schulman leaves for Anthropic, Brockman takes extended leave," 5 August 2024. https://techcrunch.com/2024/08/05/openai-co-founder-leaves-for-anthropic/
12. TechCrunch, "OpenAI co-founder John Schulman leaves Anthropic after just five months," 6 February 2025. https://techcrunch.com/2025/02/06/openai-co-founder-john-schulman-leaves-anthropic-after-just-five-months/
13. Bloomberg, "OpenAI Co-Founder John Schulman Leaves Rival Firm Anthropic," 6 February 2025. https://www.bloomberg.com/news/articles/2025-02-06/openai-co-founder-john-schulman-leaves-rival-firm-anthropic
14. Fortune, "OpenAI cofounder John Schulman is joining Mira Murati's startup after brief stint at Anthropic," 6 February 2025. https://fortune.com/2025/02/06/openai-john-schulman-mira-muratis-startup-anthropic/
15. Fortune, "Former OpenAI CTO Mira Murati unveils Thinking Machines Lab details and leadership team stacked with former OpenAI colleagues," 18 February 2025. https://fortune.com/2025/02/18/former-openai-cto-mira-murati-finally-unveils-her-thinking-machines-lab-startup-and-a-leadership-team-stacked-with-former-openai-colleagues/
16. Wikipedia contributors, "Thinking Machines Lab." https://en.wikipedia.org/wiki/Thinking_Machines_Lab
17. Wikipedia contributors, "John Schulman." https://en.wikipedia.org/wiki/John_Schulman
18. Schulman, J., "Optimizing Expectations: From Deep Reinforcement Learning to Stochastic Computation Graphs," PhD thesis, UC Berkeley EECS-2016-217, 16 December 2016. https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-217.html
19. American Association of Physics Teachers, "2005 U.S. Physics Olympiad Team biographies: John Schulman." https://www.aapt.org/olympiad2005/bio.cfm?StudentID=348
20. University of California, "ChatGPT architect, UC Berkeley alum John Schulman on his journey with AI." https://www.universityofcalifornia.edu/news/chatgpt-architect-uc-berkeley-alum-john-schulman-his-journey-ai
21. UC Berkeley CDSS, "ChatGPT architect, Berkeley alum John Schulman on his journey with AI." https://cdss.berkeley.edu/news/chatgpt-architect-berkeley-alum-john-schulman-his-journey-ai
22. Schulman, J. et al., "Trust Region Policy Optimization," Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015. https://proceedings.mlr.press/v37/schulman15.html
23. Berner, C. et al. (OpenAI), "Dota 2 with Large Scale Deep Reinforcement Learning," arXiv:1912.06680, 2019. https://arxiv.org/abs/1912.06680
24. UC Berkeley News, "Berkeley Talks: ChatGPT developer John Schulman on making AI more truthful," 24 April 2023. https://news.berkeley.edu/2023/04/24/berkeley-talks-chatgpt-developer-john-schulman/
25. CNBC, "OpenAI co-founder John Schulman says he will leave and join rival Anthropic," 6 August 2024. https://www.cnbc.com/2024/08/06/openai-co-founder-john-schulman-says-he-will-join-rival-anthropic.html
26. Dwarkesh Patel, "John Schulman (OpenAI Cofounder): Reasoning, RLHF, & Plan for 2027 AGI," May 2024. https://www.dwarkesh.com/p/john-schulman
27. Bloomberg, "OpenAI Co-Founder John Schulman Departs for AI Rival Anthropic," 6 August 2024. https://www.bloomberg.com/news/articles/2024-08-06/openai-co-founder-john-schulman-departs-for-ai-rival-anthropic
28. John Schulman (@johnschulman2), X post, 6 February 2025. https://x.com/johnschulman2/status/1887724101667856725
29. The Decoder, "OpenAI co-founder John Schulman's brief stint at Anthropic comes to an end," February 2025. https://the-decoder.com/openai-co-founder-john-schulmans-brief-stint-at-anthropic-comes-to-an-end/
30. Schulman, J., Heess, N., Weber, T., Abbeel, P., "Gradient Estimation Using Stochastic Computation Graphs," arXiv:1506.05254, 2015. https://arxiv.org/abs/1506.05254
31. The Decoder, "Elon Musk's 'TruthGPT' is complicated, says OpenAI co-founder," 2023. https://the-decoder.com/elon-musks-truthgpt-is-complicated-says-openai-co-founder/
32. TechCrunch, "Mira Murati's startup, Thinking Machines Lab, is losing two of its co-founders to OpenAI," 14 January 2026. https://techcrunch.com/2026/01/14/mira-muratis-startup-thinking-machines-lab-is-losing-two-of-its-co-founders-to-openai/
33. TechCrunch, "Mira Murati's Thinking Machines Lab is worth $12B in seed round," 15 July 2025. https://techcrunch.com/2025/07/15/mira-muratis-thinking-machines-lab-is-worth-12b-in-seed-round/
