InstructGPT
Last reviewed
Apr 30, 2026
Sources
11 citations
Review status
Source-backed
Revision
v1 · 3,648 words
InstructGPT is a family of language models released by OpenAI in January 2022 that take the base GPT-3 and fine-tune it to follow user instructions more helpfully, truthfully, and with less toxic output. The training recipe combines supervised fine-tuning on labeler-written demonstrations, a learned reward model trained on human pairwise comparisons, and reinforcement learning with Proximal Policy Optimization (PPO). This three-stage pipeline, sometimes simply called RLHF for large language models, was first announced in OpenAI's blog post "Aligning language models to follow instructions" on January 27, 2022, and described in detail in the paper Training language models to follow instructions with human feedback by Ouyang et al., presented at NeurIPS 2022 (arXiv:2203.02155).
InstructGPT is the direct technical and commercial predecessor of ChatGPT. OpenAI has stated that ChatGPT is a "sibling model" trained with the same RLHF method, with differences mostly in the data collection setup. Through 2024, almost every aligned production LLM, including GPT-3.5, GPT-4, Claude, Gemini, and the Llama-2-Chat and Llama-3-Instruct families, used some variant of the InstructGPT pipeline.
The base GPT-3 model released in 2020 was trained as an autoregressive next-token predictor on a very large corpus scraped from the open web. It was good at continuing text in the style of its training data, which is not the same thing as following an instruction. Asked "Explain the moon landing to a six year old," GPT-3 might continue with "Explain the theory of gravity to a six year old. Explain the theory of relativity to a six year old." because that is a plausible continuation in a homework worksheet. The model was, in the language of the InstructGPT paper, misaligned with the goals of users sending API queries.
Ouyang and colleagues frame this gap as the difference between optimizing a language modeling objective and optimizing for what users actually want. They borrow the terminology of "helpful, honest, and harmless" from earlier alignment work and treat instruction following as the operational target. The technical question is how to push GPT-3 in that direction without retraining from scratch.
The answer they settled on is a chain of three fine-tuning steps that wraps GPT-3 in a layer of human preferences. The lineage of this idea runs through several earlier OpenAI papers. Christiano et al. 2017, Deep reinforcement learning from human preferences, showed that an agent could learn to play Atari and perform simulated locomotion using human pairwise comparisons instead of a hand-coded reward, with feedback on less than 1% of the agent's interactions. Stiennon et al. 2020, Learning to summarize from human feedback, applied the same idea to summarizing Reddit posts and found that a 1.3B model trained with human feedback could outperform much larger supervised models. InstructGPT generalizes that approach from summarization to open-ended instruction following.
The heart of InstructGPT is a pipeline that takes a pretrained GPT-3 checkpoint and bolts three additional training stages on top.
For the first stage, supervised fine-tuning (SFT), OpenAI hired about 40 contractors through Upwork and Scale AI, screening them on agreement with researcher judgments and on demonstrated ability to identify sensitive content. Those labelers wrote example prompt and response pairs that demonstrated the desired behavior: useful answers, refusals where appropriate, neutral tone, no fabricated facts. They also produced demonstrations for prompts collected from real users of the OpenAI Playground, with consent.
The resulting SFT dataset contains about 13,000 training prompts. GPT-3 is then fine-tuned on these demonstrations with the standard cross-entropy language modeling loss for 16 epochs. The output is a model that has shifted toward the labeler-written style. This SFT model is the seed for the next two steps.
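As a rough illustration of the loss described above, the sketch below computes the mean next-token cross-entropy over a single demonstration. It is plain Python over toy logits. The function names and the three-token vocabulary are illustrative assumptions and are not taken from the paper's code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sft_loss(token_logits, target_ids):
    """Mean cross-entropy between predicted next-token distributions and
    the tokens of a labeler-written demonstration."""
    if not target_ids or len(token_logits) != len(target_ids):
        raise ValueError("need one logit vector per target token")
    nll = 0.0
    for logits, target in zip(token_logits, target_ids):
        nll -= log_softmax(logits)[target]
    return nll / len(target_ids)

# Toy demonstration: two positions, vocabulary of three tokens (hypothetical).
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 2]
print(sft_loss(logits, targets))
```

Fine-tuning lowers this quantity by raising the probability the model assigns to each demonstrated token.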
The second stage replaces hand-written demonstrations with human pairwise comparisons, which are cheaper to collect at scale. For each prompt, the SFT model samples between four and nine candidate completions. Labelers see all of them and rank them from best to worst. Each ranking is then expanded into all of its pairwise comparisons.
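The expansion of a ranking into pairwise comparisons can be sketched as follows. A ranking of K completions yields K(K-1)/2 pairs, so 6 pairs for K = 4 and 36 for K = 9. The helper name `ranking_to_pairs` is hypothetical, and the sketch assumes a strict ranking with no ties.

```python
from itertools import combinations

def ranking_to_pairs(ranked_completions):
    """Expand a best-to-worst ranking into (preferred, dispreferred) pairs.

    itertools.combinations keeps input order, so the first element of each
    pair is always the one the labeler ranked higher.
    """
    return list(combinations(ranked_completions, 2))

pairs = ranking_to_pairs(["A", "B", "C", "D"])  # "A" is best, "D" is worst
print(len(pairs))   # 6
print(pairs[0])     # ('A', 'B')
```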
The reward model is a separate transformer initialized from the SFT model with the language modeling head replaced by a scalar regression head. It is trained on roughly 33,000 ranking prompts, expanded into many more pairwise comparisons, with the Bradley-Terry pairwise loss. Concretely, the reward model takes a prompt and a completion and outputs a single number; training pushes the score for the labeler-preferred completion above the score for the dispreferred completion.
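A minimal sketch of the pairwise loss, assuming the reward model has already produced scalar scores for each completion: the loss for one pair is the negative log-sigmoid of the score difference, averaged over pairs. The function names and the example scores are illustrative, not drawn from the paper.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def pairwise_rm_loss(score_pairs):
    """Mean Bradley-Terry loss over (preferred_score, dispreferred_score) pairs."""
    if not score_pairs:
        raise ValueError("need at least one comparison")
    total = sum(log_sigmoid(r_w - r_l) for r_w, r_l in score_pairs)
    return -total / len(score_pairs)

# Hypothetical scores: the first pair is ordered correctly, the second is not.
print(pairwise_rm_loss([(1.5, 0.2), (-0.3, 0.4)]))
```

When the two scores are equal the loss is log 2, and it falls toward zero as the preferred completion's score pulls ahead.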
A notable choice: although they tested 175B reward models, OpenAI used 6B parameter reward models in the final pipeline. Larger reward models were unstable during RL and offered no measurable benefit, while costing far more compute. Using a smaller reward model is part of why the pipeline is feasible at all.
In the final stage, the SFT model is further fine-tuned to maximize the score given by the frozen reward model, using Proximal Policy Optimization (Schulman et al. 2017) as the RL algorithm. PPO is a policy gradient method with a clipped surrogate objective that prevents the new policy from drifting too far from the old policy in any single update, which keeps training stable.
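The clipped surrogate objective can be illustrated for a single sampled action (here, a token). In this sketch `ratio` is the new policy's probability divided by the old policy's probability for that action, and `advantage` is an advantage estimate. The clip range of 0.2 is a commonly used value and is an assumption here, as is the function name.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one sample (to be maximized).

    Taking the minimum of the unclipped and clipped terms removes any
    incentive to move the probability ratio outside [1 - eps, 1 + eps].
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: the gain is capped once the ratio exceeds 1.2.
print(clipped_surrogate(1.5, 1.0))
# Negative advantage: a ratio already below 0.8 earns no further credit.
print(clipped_surrogate(0.5, -1.0))
```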
Three details matter here. First, the PPO dataset has about 31,000 prompts drawn entirely from the OpenAI API. Second, the reward combines the reward model's score for the full completion with a per-token KL penalty, subtracted from that score, that measures how far the policy has moved from the SFT policy. The KL term, scaled by a coefficient called beta, prevents the policy from finding adversarial outputs that score highly on the reward model but look nothing like reasonable text. Without this anchor, RL would aggressively exploit reward model errors, a failure mode known as reward hacking or reward overoptimization. Third, the paper introduces a variant called PPO-ptx that mixes a small fraction of the original pretraining gradient back into the PPO update. PPO-ptx exists specifically to claw back performance on standard NLP benchmarks lost during alignment.
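One way to picture the KL-penalized reward and the PPO-ptx mixing term is the sketch below. It assumes the per-token KL is estimated by the log-probability ratio of each sampled token, and that the reward model score is credited at the final token, which is a common implementation convention rather than something the text above specifies. The function names, the beta and gamma values, and the example numbers are all illustrative.

```python
def shaped_rewards(rm_score, policy_logprobs, sft_logprobs, beta=0.02):
    """Per-token rewards: a KL penalty at every token, plus the reward
    model's score for the whole completion added at the last token."""
    if not policy_logprobs or len(policy_logprobs) != len(sft_logprobs):
        raise ValueError("need matching, non-empty log-probability lists")
    # log pi_RL(token) - log pi_SFT(token) is a per-sample KL estimate.
    rewards = [-beta * (p - s) for p, s in zip(policy_logprobs, sft_logprobs)]
    rewards[-1] += rm_score
    return rewards

def ppo_ptx_objective(rl_objective, pretrain_logprob_mean, gamma):
    """PPO-ptx sketch: add a weighted pretraining log-likelihood term."""
    return rl_objective + gamma * pretrain_logprob_mean

# Hypothetical three-token completion scored 1.0 by the reward model.
rewards = shaped_rewards(1.0, [-1.0, -2.0, -0.5], [-1.5, -2.0, -0.7], beta=0.1)
print(rewards)
```

A larger beta keeps the policy closer to the SFT model at the cost of a smaller achievable reward model score.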
The final PPO model, after this third stage, is what OpenAI calls InstructGPT.
| Step | Input | Method | Dataset | Output |
|---|---|---|---|---|
| 1. SFT | GPT-3 base | Supervised fine-tune on labeler demonstrations | ~13,000 prompt and response pairs | SFT model |
| 2. RM | SFT model | Train reward model on pairwise comparisons | ~33,000 ranking prompts (each expanded into pairs) | 6B reward model |
| 3. RL (PPO) | SFT model + frozen RM | PPO with per-token KL penalty against SFT | ~31,000 API prompts | InstructGPT |
OpenAI trained InstructGPT at three sizes that match the GPT-3 family: 1.3B, 6B, and 175B parameters. The reward model was 6B in every case, even when fine-tuning the 175B policy with PPO.
| Model | Parameters | Notes |
|---|---|---|
| InstructGPT 1.3B | 1.3 billion | Smallest variant; preferred over 175B GPT-3 in human evaluation |
| InstructGPT 6B | 6 billion | Same size as the reward model |
| InstructGPT 175B | 175 billion | Largest variant; deployed as text-davinci-001 in the OpenAI API |
The single most cited result in the paper is that the 1.3B InstructGPT model is preferred by labelers over the 175B base GPT-3 model, despite having more than 100 times fewer parameters. Alignment, in this case, bought more user-perceived quality than two orders of magnitude of additional parameters.
InstructGPT is, in some sense, more a product of its data than its architecture. The architecture is just GPT-3 with extra fine-tuning. The data is the new ingredient.
Prompts came from two sources. The bulk are real prompts submitted to the OpenAI API and the Playground, with users asked for permission to use their data for research. A smaller seed set was written by labelers themselves, used to bootstrap the early SFT data when the API was still new. The team filtered for personal information and tried to balance the distribution across task types, which in practice was dominated by generation, open-ended question answering, brainstorming, chat, rewriting, summarization, and classification.
The labelers themselves were a relatively small and homogeneous group of about 40 contractors, mostly English-speaking, selected through a screening procedure designed by the OpenAI team. The paper acknowledges this as a limitation: the values encoded into InstructGPT are roughly the values of those 40 contractors plus the OpenAI researchers, not a sample of humanity. Later work on cultural bias in RLHF systems makes much of this point, and it is one of the better-grounded critiques of the methodology.
The paper reports several headline findings. In human evaluation, outputs from the 175B InstructGPT are preferred to 175B GPT-3 outputs 85 ± 3 percent of the time on the API prompt distribution. Even against GPT-3 given a few-shot prompt designed to improve its instruction following, InstructGPT is still preferred about 71 percent of the time. The preference signal is robust across the prompt distribution, not driven by a few task categories.
On truthfulness, evaluated on the TruthfulQA benchmark and on the closed-domain summarization tasks where models can be checked against source text, InstructGPT shows clear improvements. The closed-domain hallucination rate falls from about 41 percent for GPT-3 to about 21 percent for InstructGPT, roughly half. On TruthfulQA, the gap is smaller but consistent.
On toxicity, when prompted to be respectful, InstructGPT generates about 25 percent fewer toxic outputs than GPT-3 on the RealToxicityPrompts benchmark. When the respectful instruction is removed, the gap shrinks. Toxicity is genuinely reduced, but not eliminated.
On bias, the picture is mixed. The model is not noticeably better than GPT-3 on the Winogender or CrowS-Pairs benchmarks. The paper is honest that alignment did not solve bias.
On standard NLP benchmarks like SQuAD, DROP, HellaSwag, and WMT 2015 French to English translation, the PPO model regresses compared to GPT-3. This regression has been called the alignment tax in later work. The PPO-ptx variant, which mixes pretraining gradients into PPO, recovers most of the lost benchmark performance while keeping most of the alignment gains.
| Metric | GPT-3 175B | InstructGPT 175B |
|---|---|---|
| Labeler preference (vs GPT-3) | baseline | preferred 85% of the time |
| Closed-domain hallucination rate | ~41% | ~21% |
| Toxic output (when prompted respectful) | baseline | ~25% fewer toxic outputs |
| Standard NLP benchmarks (SQuAD, DROP, etc.) | baseline | small regressions; recovered by PPO-ptx |
InstructGPT was initially deployed in the OpenAI API as the model named text-davinci-001, released alongside the January 2022 announcement. It became the default Davinci model for new API users that year. Earlier API endpoints based on raw GPT-3 (davinci, curie, babbage, ada) remained available, but documentation pointed users to the InstructGPT-aligned variants by default.
Later models in the same series extended the recipe. text-davinci-002, released in March 2022, used a refined SFT-only approach OpenAI labeled FeedME, which trained on demonstrations including examples from earlier human-feedback models. text-davinci-003, released on November 28, 2022, brought RLHF with PPO back into the picture with further data collection improvements.
Two days after text-davinci-003, on November 30, 2022, OpenAI launched ChatGPT. In OpenAI's own announcement, ChatGPT is described as a sibling model to InstructGPT, trained using the same methods as InstructGPT but with slight differences in the data collection setup. The most important difference: human trainers wrote multi-turn dialogues in which they played both the user and an idealized assistant, sometimes using model-written suggestions as scaffolding. The result was a model that handled conversation, follow-ups, and refusals in a way the single-turn InstructGPT models could not. That product reached roughly 100 million users in two months and changed both the public perception and the commercial trajectory of LLMs.
It is fair to say that without InstructGPT, ChatGPT in its November 2022 form would not exist. The same three-step pipeline, with dialogue data, is the entire technical bridge between them.
A few things make InstructGPT important beyond the immediate product line.
First, it demonstrated that RLHF works at scale on general open-ended language tasks. Christiano 2017 had shown the idea on Atari, Stiennon 2020 on summarization. InstructGPT pushed it to the full diversity of API queries that real users send to a 175B model. That generalization was not obvious in advance.
Second, InstructGPT made aligned LLMs commercially viable. The base GPT-3 was hard to use without prompt engineering and frequently produced outputs that were unhelpful, off-topic, or unsafe. The InstructGPT-aligned models made the OpenAI API approachable to non-experts and acceptable to enterprise customers. The path from "interesting research" to "profitable product" runs through this work.
Third, it set the template that almost every other lab followed. The SFT plus reward model plus PPO pipeline became the dominant alignment approach across OpenAI, Anthropic, Google DeepMind, Meta, and the open-weight community. Models like Vicuna, Alpaca, Llama-2-Chat, Mistral-Instruct, and most production chatbots through 2024 trace their training pipeline to InstructGPT in some recognizable form.
Fourth, it brought serious attention to the field of AI alignment. Before InstructGPT, alignment was largely a theoretical research area. After InstructGPT, it became a hiring priority at every major lab, with dedicated teams, public reports, and a growing literature.
The paper itself has been cited many thousands of times. Searches on Google Scholar and Semantic Scholar in 2024 returned citation counts well into the five-digit range, putting it among the most cited papers in machine learning of the 2020s.
Nothing about InstructGPT is perfect, and the paper itself is unusually candid about its limitations.
Reward hacking, sometimes called reward overoptimization, is the most fundamental issue. The reward model is a fallible learned approximation of human preferences. Optimizing too hard against it leads the policy to find outputs that score well on the model but that humans dislike on inspection. The KL penalty against the SFT policy mitigates this, but does not eliminate it, and choosing the KL coefficient is mostly empirical. Gao, Schulman, and Hilton 2022, Scaling Laws for Reward Model Overoptimization, characterized this trade-off in more detail.
Labeler bias is a second concern. The roughly 40 contractors are not representative of global users. Their judgments encode particular linguistic, cultural, and stylistic preferences that get baked into the model and propagate to every downstream system trained from it. Later work on alternative feedback sources, including AI feedback, partially aimed at this.
Sycophancy is a side effect that became easier to see in larger models trained with this pipeline. The reward model often gives higher scores to outputs that flatter the user or restate the user's premise back at them. RLHF therefore produces models that tend to agree with whoever is talking to them, whether or not the user is right. Anthropic's Discovering Language Model Behaviors with Model-Written Evaluations and follow-up work documented this in detail.
Hallucinations are reduced but not removed. InstructGPT still confabulates when asked about facts outside the prompt, and the closed-domain hallucination rate of 21 percent is much lower than 41 percent but still meaningful in a production setting.
The alignment tax on standard NLP benchmarks is real, even if PPO-ptx mostly recovers it. There are tasks where the aligned model is worse than the base model, and the paper is honest about that.
Finally, the cost is high. Hiring tens of skilled labelers, designing screening procedures, collecting tens of thousands of demonstrations and comparisons, training a separate reward model, and running PPO on a 175B policy is expensive. A great deal of subsequent research has gone into making this pipeline cheaper and simpler.
The most thorough public summary of these issues is Casper, Davies et al. 2023, Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback, published in TMLR with a Survey Certification. That paper organizes RLHF problems into challenges with feedback, challenges with the reward model, and challenges with the policy, and draws on over 250 references across the three.
The InstructGPT recipe is the starting point, not the end point. Several methods have since been proposed that either build on it or aim to replace pieces of it.
Direct Preference Optimization (DPO), introduced by Rafailov et al. at NeurIPS 2023 in the paper Direct Preference Optimization: Your Language Model is Secretly a Reward Model, removes the explicit reward model and the RL step. It reformulates the preference learning problem so that a single binary cross-entropy classification objective on pairwise comparisons is mathematically equivalent to optimizing a policy under an implicit reward model. DPO is much simpler to implement, much more stable to train, and roughly matches PPO-based RLHF on common benchmarks. It became a popular default for open-weight model alignment after 2023.
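The DPO objective for a single preference pair can be sketched as follows, assuming the summed log-probabilities of each completion under the policy and under a frozen reference model are already available. The argument names and the beta value of 0.1 are illustrative assumptions.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (preferred w, dispreferred l) completion pair.

    The implicit reward of a completion is beta times the log-ratio of
    policy to reference probability; the loss is a binary cross-entropy
    on the difference of the two implicit rewards.
    """
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)

# Hypothetical log-probabilities: the policy has shifted mass toward
# the preferred completion relative to the reference model.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

When the policy equals the reference model the loss is log 2, and it decreases as the policy favors the preferred completion more strongly than the reference does. No separate reward model or sampling loop appears anywhere in the computation.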
Constitutional AI, introduced by Bai et al. at Anthropic in 2022, replaces some or all human comparisons with comparisons generated by an LLM following a written constitution of principles. The technique trades human labeler cost for model inference cost, and it allows safety training to scale faster than human annotation can. Anthropic's Claude family is trained with this approach, sometimes called RLAIF, reinforcement learning from AI feedback.
RLAIF more broadly, including the Lee et al. 2023 paper from Google, generalizes the same idea: use a strong language model to produce preference labels at scale, with human oversight on a smaller validation set. Empirically, RLAIF is competitive with RLHF on summarization and other tasks.
A cluster of newer preference-optimization methods, including SLiC (Sequence Likelihood Calibration), RRHF, KTO (Kahneman-Tversky Optimization), IPO (Identity Preference Optimization), and ORPO (Odds Ratio Preference Optimization), explore other angles. They all share the InstructGPT goal of aligning a base model with human preferences but vary in whether they use a reward model, an RL step, an offline objective, or some hybrid.
| Method | Year | Reward model? | RL step? | Notes |
|---|---|---|---|---|
| RLHF (InstructGPT) | 2022 | Yes (separate RM) | Yes (PPO) | Original three-stage pipeline |
| DPO | 2023 | Implicit | No | Single classification objective on pairs |
| Constitutional AI / RLAIF | 2022 to 2023 | Yes | Yes | Uses AI feedback in place of human comparisons |
| SLiC | 2022 | Optional | No | Calibrates sequence likelihoods to preferences |
| KTO | 2024 | No | No | Uses prospect-theory-style loss on positive/negative examples |
| ORPO | 2024 | No | No | Combines SFT and preference optimization in a single stage |
None of these alternatives has fully displaced the InstructGPT recipe at frontier labs as of early 2026, though DPO and its variants are increasingly used either alongside or instead of PPO in production training pipelines.
It is hard to overstate how much of the post-2022 LLM landscape was shaped by this single paper. The vocabulary used to talk about model alignment (helpfulness, harmlessness, refusals, alignment tax, sycophancy, reward hacking) largely comes from InstructGPT and its immediate descendants. The practical training pipeline used by every major closed and open-weight chat model traces back to it. Even academic critiques of RLHF, including Casper et al., are organized around the InstructGPT pipeline as the implicit baseline.
There is a thread in alignment discussion that views InstructGPT as a mixed legacy: it made aligned models commercially valuable, which in turn poured resources into capability research, which in turn made the alignment problem harder rather than easier. That argument is contested, but it is worth holding alongside the more triumphalist reading. The same three-step pipeline that produced helpful chatbots also produced models that hallucinate confidently, flatter users, and bake in narrow labeler preferences. Both of those things are downstream of the same paper.
What is not contested is that InstructGPT marked the moment when RLHF became standard practice for language models. Before the paper appeared in March 2022, almost no production LLM used human-feedback reinforcement learning. Afterward, almost all of them did.