InstructGPT
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 6,834 words
InstructGPT is a family of language models released by OpenAI in January 2022, produced by fine-tuning the base GPT-3 to follow user instructions more helpfully and truthfully, with less toxic output.[1][2] The training recipe combines supervised fine-tuning on labeler-written demonstrations, a learned reward model trained on human pairwise comparisons, and reinforcement learning with Proximal Policy Optimization (PPO).[1] This three-stage pipeline, sometimes simply called RLHF for large language models, was first announced in OpenAI's blog post "Aligning language models to follow instructions" on January 27, 2022, and described in detail in the paper Training language models to follow instructions with human feedback by Ouyang et al., presented at NeurIPS 2022 (arXiv:2203.02155).[1][2]
InstructGPT is the direct technical and commercial predecessor of ChatGPT.[3] OpenAI has stated that ChatGPT is a "sibling model" trained with the same RLHF method, with differences mostly in the data collection setup.[3] Through 2024 and into 2026, almost every aligned production LLM, including GPT-3.5, GPT-4, Claude, Gemini, and the Llama 2 Chat and Llama 3 Instruct families, used some variant of the InstructGPT pipeline.[4][5]
| Attribute | Value |
|---|---|
| Developer | OpenAI |
| Initial release | January 27, 2022 |
| Paper | Ouyang et al., arXiv:2203.02155 (March 4, 2022) |
| Venue | NeurIPS 2022 |
| Model sizes | 1.3B, 6B, 175B parameters |
| Base architecture | GPT-3 (decoder-only Transformer) |
| Training pipeline | SFT, then reward model, then PPO |
| Reward model size | 6B parameters (used for all policy sizes) |
| SFT prompts | about 13,000 |
| RM prompts | about 33,000 |
| PPO prompts | about 31,000 |
| Labelers | about 40, hired via Upwork and Scale AI |
| Compute (175B PPO-ptx) | 60 petaflop/s-days |
| API deployment | text-davinci-001 (January 2022), text-davinci-002 (March 2022), text-davinci-003 (November 2022) |
The base GPT-3 model released in 2020 was trained as an autoregressive next-token predictor on a very large corpus scraped from the open web.[6] It was good at continuing text in the style of its training data, which is not the same thing as following an instruction.[1] Asked "Explain the moon landing to a six year old," GPT-3 might continue with "Explain the theory of gravity to a six year old. Explain the theory of relativity to a six year old," because that is a plausible continuation in a homework worksheet.[1] The model was, in the language of the InstructGPT paper, misaligned with the goals of users sending API queries.[1]
Ouyang and colleagues frame this gap as the difference between optimizing a language modeling objective and optimizing for what users actually want.[1] They borrow the terminology of "helpful, honest, and harmless" from earlier alignment work by Askell and colleagues at Anthropic and treat instruction following as the operational target.[1][7] The technical question is how to push GPT-3 in that direction without retraining from scratch.
The answer is a chain of three fine-tuning steps that wraps GPT-3 in a layer of human preferences. The lineage of this idea runs through several earlier papers. Christiano et al. 2017, Deep reinforcement learning from human preferences, showed that an agent could learn to play Atari and perform simulated locomotion using human pairwise comparisons instead of a hand-coded reward, with feedback on less than one percent of the agent's interactions.[8] Stiennon et al. 2020, Learning to summarize from human feedback, applied the same idea to summarizing Reddit posts and found that a 1.3B model trained with human feedback outperformed a supervised model ten times its size, with a 61% versus 43% preference rate against reference summaries.[9] InstructGPT generalizes that approach from summarization to open-ended instruction following.[1]
A second precursor inside OpenAI was WebGPT, released in December 2021, which fine-tuned GPT-3 to browse the web and answer long-form questions using a combination of imitation learning and reward modeling.[10] WebGPT shipped before InstructGPT and used several of the same engineering primitives, including a learned reward model and rejection sampling. The InstructGPT team carried that infrastructure forward into open-ended instruction following.[1][10]
The heart of InstructGPT is a pipeline that takes a pretrained GPT-3 checkpoint and bolts three additional training stages on top.[1] The same pipeline is sometimes drawn as a triangle of SFT, reward model, and PPO with the SFT model both seeding the reward model and providing a KL anchor for the PPO policy.[1][4]
The first stage is supervised fine-tuning (SFT) on human demonstrations. OpenAI hired about 40 contractors through Upwork and Scale AI, screening them on agreement with researcher judgments and on demonstrated ability to identify sensitive content.[1][11] Most labelers were English speakers based in the United States or Southeast Asia.[1][11] Those labelers wrote prompt and response pairs that demonstrated the desired behavior: useful answers, refusals where appropriate, neutral tone, no fabricated facts.[1][11] They also produced demonstrations for prompts collected from real users of the OpenAI Playground, with consent.[1][11]
The resulting SFT dataset contains about 13,000 training prompts drawn from both API submissions and labeler-written prompts.[1] GPT-3 is then fine-tuned on these demonstrations with the standard cross-entropy language model loss for 16 epochs, using a cosine learning rate decay and residual dropout of 0.2.[1] The output is a model that has shifted toward the labeler-written style. This SFT model is the seed for the next two steps.
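The two training ingredients named above, the cross-entropy language model loss and cosine learning rate decay, can be sketched as follows. This is a minimal illustration rather than the paper's training code: the function names, the input format, and the decay floor of zero are assumptions made for the example.

```python
import math

def sft_loss(token_probs):
    """Mean negative log-likelihood of the demonstration tokens.

    token_probs: probability the model assigns to each reference token
    of one labeler-written response (hypothetical input format).
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def cosine_lr(step, total_steps, peak_lr, floor_lr=0.0):
    """Cosine decay from peak_lr at step 0 to floor_lr at total_steps.

    floor_lr=0.0 is an assumption; the source only says "cosine decay".
    """
    progress = min(step, total_steps) / total_steps
    return floor_lr + 0.5 * (peak_lr - floor_lr) * (1.0 + math.cos(math.pi * progress))
```

A model that assigns probability 0.5 to every reference token has a loss of ln 2 per token, and the learning rate passes through half its peak value at the midpoint of training.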
In a subtle but important detail, the paper notes that the SFT model overfits on the validation loss after about one epoch, yet continuing to train for many more epochs still helps the downstream reward model and the RLHF policy.[1] This makes the SFT stage a different optimization problem from typical fine-tuning: it is essentially a warm start for the alignment pipeline, not a model whose own loss curve is the target.[1]
The second stage replaces hand-written demonstrations with human comparisons of model outputs, which are cheaper to collect at scale than full demonstrations.[1] For each prompt, the SFT model samples between four and nine candidate completions (the number is denoted K in the paper). Labelers see all of them, rate each on a 1 to 7 Likert scale for overall quality, and rank them from best to worst; each ranking is then expanded into all of its pairwise comparisons.[1]
The reward model is a separate Transformer initialized from the SFT model with the final unembedding layer removed and replaced by a scalar regression head.[1] It is trained on roughly 33,000 ranking prompts, expanded into many more pairwise comparisons, with the Bradley-Terry pairwise loss.[1] The exact loss for a comparison of preferred completion y_w against dispreferred completion y_l, given prompt x, is:
loss(θ) = -E[log σ(r_θ(x, y_w) - r_θ(x, y_l))]
where σ is the sigmoid function and the expectation is taken over comparison pairs.[1] Concretely, the reward model takes a prompt and a completion and outputs a single number, and training pushes the score for the labeler-preferred completion above the score for the dispreferred completion.[1] All K choose 2 comparisons from a single prompt are processed as a single batch element, with the loss averaged over the (K choose 2) pairs.[1] Because comparisons drawn from the same prompt are highly correlated, treating them as independent data points and shuffling them across the dataset caused the reward model to overfit; batching them together avoids that and needs only one forward pass per completion.[1]
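The pairwise loss for one prompt can be sketched in a few lines. This is an illustrative sketch under assumptions: the rewards are plain floats standing in for reward model outputs, the list is already ordered from best to worst by the labeler, ties are ignored, and the function names are hypothetical.

```python
import math
from itertools import combinations

def neg_log_sigmoid(d):
    """Numerically stable -log(sigmoid(d))."""
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

def reward_model_loss(rewards_best_to_worst):
    """Bradley-Terry loss averaged over all (K choose 2) pairs of one prompt.

    rewards_best_to_worst: scalar rewards r(x, y) for the K completions,
    listed in the labeler's ranking order (index 0 is the preferred one).
    """
    pairs = list(combinations(range(len(rewards_best_to_worst)), 2))
    total = sum(
        neg_log_sigmoid(rewards_best_to_worst[w] - rewards_best_to_worst[l])
        for w, l in pairs  # w is ranked above l because w < l
    )
    return total / len(pairs)
```

With K = 4 the ranking expands into 6 pairs. A reward model that scores every completion equally gets a loss of ln 2, and the loss falls below that only when higher-ranked completions receive higher scores.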
A notable choice: although they tested 175B reward models, OpenAI used 6B parameter reward models in the final pipeline.[1] Larger reward models were unstable during RL and offered no measurable benefit on the final policy, while costing far more compute.[1] Using a smaller reward model is part of why the pipeline is feasible at all, and it became a widely copied design choice in later RLHF implementations.[4][12]
In the final stage, the SFT model is further fine-tuned to maximize the score given by the frozen reward model, using Proximal Policy Optimization (Schulman et al. 2017) as the RL algorithm.[1][13] PPO is a policy gradient method with a clipped surrogate objective that prevents the new policy from drifting too far from the old policy in any single update, which keeps training stable.[13]
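The clipped surrogate objective can be illustrated for a single sampled action. This is a sketch of the general PPO formulation from Schulman et al., not InstructGPT's training code; the clip range of 0.2 is a commonly used default and an assumption here, since the source does not state the value used.

```python
def ppo_clip_term(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate term for one action.

    ratio: pi_new(a|s) / pi_old(a|s) for the sampled action.
    advantage: estimated advantage of that action.
    clip_eps: clip range (hypothetical value for illustration).
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum makes the objective pessimistic: the policy gains
    # nothing from moving the ratio outside the clip range in the
    # direction the advantage favors.
    return min(ratio * advantage, clipped_ratio * advantage)
```

For a positive advantage, raising the ratio beyond 1 + clip_eps yields no further gain; for a negative advantage, lowering it below 1 - clip_eps yields none either. That cap is what keeps each update close to the old policy.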
The PPO dataset has about 31,000 prompts drawn entirely from the OpenAI API.[1] The policy is trained to maximize the following objective:
objective(φ) = E[r_θ(x, y) - β · log(π_φ^RL(y|x) / π^SFT(y|x))] + γ · E[log π_φ^RL(x)]
where the first term is the reward model score, the second is a per-token KL divergence penalty against the SFT policy with coefficient β, and the third term is a pretraining language modeling loss with coefficient γ.[1] For the plain PPO variant, γ is set to zero and only the first two terms are used; for the PPO-ptx variant, both the KL and pretraining terms are active.[1]
The KL term, scaled by β, prevents the policy from finding adversarial outputs that score highly on the reward model but look nothing like reasonable text.[1] Without this anchor, RL would aggressively exploit reward model errors, a failure mode known as reward hacking or reward overoptimization, later characterized in detail by Gao, Schulman, and Hilton in Scaling Laws for Reward Model Overoptimization.[14]
The PPO-ptx variant exists specifically to claw back performance on standard NLP benchmarks lost during alignment, a regression the paper names the alignment tax.[1] It mixes a small fraction of the original pretraining gradient back into the PPO update, which recovers most of the benchmark loss while keeping most of the alignment gains.[1]
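The three terms of the objective can be combined in a small Monte Carlo estimate. This is a sketch under assumptions: log-probabilities are summed over the tokens of each sequence, the coefficient values in the usage note are invented for illustration, and the function name is hypothetical.

```python
def ppo_ptx_objective(rl_samples, pretrain_logps, beta, gamma=0.0):
    """Estimate of the InstructGPT-style objective from samples.

    rl_samples: list of (rm_score, logp_rl, logp_sft) tuples, one per
        completion sampled from the RL policy; the log-probs are those of
        the same completion under the RL policy and the SFT policy.
    pretrain_logps: log-likelihoods the RL policy assigns to sequences
        drawn from the pretraining data (ignored when gamma is 0).
    """
    rl_term = sum(
        rm_score - beta * (logp_rl - logp_sft)
        for rm_score, logp_rl, logp_sft in rl_samples
    ) / len(rl_samples)
    if gamma == 0.0 or not pretrain_logps:
        return rl_term  # plain PPO: reward minus KL penalty only
    ptx_term = sum(pretrain_logps) / len(pretrain_logps)
    return rl_term + gamma * ptx_term  # PPO-ptx adds the pretraining term
```

With beta = 0.1, a completion that scores 1.0 on the reward model but is one nat more likely under the RL policy than under the SFT policy contributes 0.9, so drifting away from the SFT policy has to be paid for with reward.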
The final PPO model, after this third stage, is what OpenAI calls InstructGPT and deploys in the API.[1][2]
| Step | Input | Method | Dataset | Output |
|---|---|---|---|---|
| 1. SFT | GPT-3 base | Supervised fine-tune on labeler demonstrations, 16 epochs, cosine LR decay, residual dropout 0.2 | about 13,000 prompt and response pairs | SFT model |
| 2. RM | SFT model | Train reward model on pairwise comparisons, Bradley-Terry loss, K choose 2 batched | about 33,000 ranking prompts, K from 4 to 9 completions each | 6B reward model |
| 3. RL (PPO) | SFT model + frozen RM | PPO with per-token KL penalty (β) against SFT; PPO-ptx adds pretraining loss (γ) | about 31,000 API prompts | InstructGPT |
Training a 175B SFT model required about 4.9 petaflop/s-days, and training the 175B PPO-ptx model required about 60 petaflop/s-days.[1] By comparison, pretraining base GPT-3 required about 3,640 petaflop/s-days.[1][6] In other words, the 175B PPO-ptx run used roughly 1.6% of the compute that went into pretraining the model it aligned, and SFT and PPO-ptx together used under 2%. This became one of the most cited reasons RLHF was adopted so quickly across the industry: the alignment step is comparatively cheap once a strong base model exists.[1][4]
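The ratio can be checked directly from the three figures quoted above. The calculation covers only the SFT and PPO-ptx runs; reward model training and data collection are left out because no compute figure for them is given here.

```python
# Compute figures quoted in the text, in petaflop/s-days.
PRETRAIN_GPT3 = 3640.0
SFT_175B = 4.9
PPO_PTX_175B = 60.0

ppo_share = PPO_PTX_175B / PRETRAIN_GPT3
combined_share = (SFT_175B + PPO_PTX_175B) / PRETRAIN_GPT3

print(f"PPO-ptx alone: {ppo_share:.2%} of pretraining compute")
print(f"SFT + PPO-ptx: {combined_share:.2%} of pretraining compute")
```

PPO-ptx alone comes to about 1.6% of pretraining compute, and adding the SFT run brings the total to about 1.8%.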
OpenAI trained InstructGPT at three sizes that match the GPT-3 family: 1.3B, 6B, and 175B parameters.[1][11] The reward model was 6B in every case, even when fine-tuning the 175B policy with PPO.[1]
| Model | Parameters | Notes |
|---|---|---|
| InstructGPT 1.3B | 1.3 billion | Smallest variant; preferred over 175B GPT-3 in human evaluation |
| InstructGPT 6B | 6 billion | Same size as the reward model |
| InstructGPT 175B | 175 billion | Largest variant; deployed as text-davinci-001 in the OpenAI API |
The single most cited result in the paper is that the 1.3B InstructGPT model is preferred by labelers over the 175B base GPT-3 model, despite having more than 100 times fewer parameters.[1][2] Alignment, in this case, bought more user-perceived quality than two orders of magnitude of additional parameters.[1]
InstructGPT is, in some sense, more a product of its data than its architecture.[1] The architecture is just GPT-3 with extra fine-tuning. The data is the new ingredient.
Prompts came from two sources.[1] The bulk are real prompts submitted to the OpenAI API and the Playground, with users asked for permission to use their data for research.[11] A smaller seed set was written by labelers themselves, used to bootstrap the early SFT data when the API was still new.[1] The team filtered for personal information (PII), capped prompts per user at about 200 to limit user-level overfitting, and split train, validation, and test sets by user ID to keep the same user out of multiple splits.[1]
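The per-user cap and the split by user ID can be sketched as below. The hashing scheme, the split fractions, and the function names are assumptions made for illustration; the source states only that prompts were capped at about 200 per user and that splits were made by user ID.

```python
import hashlib
from collections import defaultdict

def cap_per_user(records, max_per_user=200):
    """Keep at most max_per_user prompts for each user, in arrival order.

    records: iterable of (user_id, prompt) pairs.
    """
    counts = defaultdict(int)
    kept = []
    for user_id, prompt in records:
        if counts[user_id] < max_per_user:
            counts[user_id] += 1
            kept.append((user_id, prompt))
    return kept

def split_for_user(user_id, valid_frac=0.1, test_frac=0.1):
    """Assign a user to exactly one split with a stable hash.

    The fractions are hypothetical; hashing the ID means every prompt from
    the same user lands in the same split on every run.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + valid_frac:
        return "valid"
    return "train"

def assign_splits(records):
    """Group capped records into train, valid, and test by user ID."""
    splits = {"train": [], "valid": [], "test": []}
    for user_id, prompt in cap_per_user(records):
        splits[split_for_user(user_id)].append((user_id, prompt))
    return splits
```

Splitting on the user rather than on the individual prompt is what keeps near-duplicate prompts from one heavy user from leaking between training and evaluation data.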
The reward model training prompts break down by use case category as follows:[1]
| Use case | Share of RM prompts |
|---|---|
| Generation | 45.6% |
| Open QA | 12.4% |
| Brainstorming | 11.2% |
| Chat | 8.4% |
| Rewrite | 6.6% |
| Summarization | 4.2% |
| Classification | 3.5% |
| Other | 3.5% |
| Closed QA | 2.6% |
| Extract | 1.9% |
Generation alone accounts for nearly half of the prompts, and generation, open QA, brainstorming, and chat together make up about 78%, while classification and extraction are a small slice. That skew matters because RLHF optimizes the policy on this distribution; whatever the labelers prefer for these task types becomes the model's revealed objective.[1]
The labelers were a small and homogeneous group of about 40 contractors, mostly English-speaking people living in the United States or Southeast Asia, hired through Upwork and Scale AI.[1][11] They were selected through a screening procedure designed by the OpenAI team that measured both sensitivity to different demographic preferences and ability to identify potentially harmful outputs.[1][11] Inter-labeler agreement was about 72.6% ± 1.5% for training labelers and about 77.3% ± 1.3% for held-out labelers, comparable to the 73% ± 4% researcher-researcher agreement reported in Stiennon et al.'s summarization work.[1][9]
The paper acknowledges in its limitations section that the values encoded into InstructGPT are roughly the values of those 40 contractors plus the OpenAI researchers, not a sample of humanity.[1] Later work on cultural bias in RLHF systems leans heavily on this point, and it is one of the better-grounded critiques of the methodology.[15][16]
A subtler concern is that the prompts used for both SFT and the reward model came disproportionately from API users sending requests to an earlier version of the InstructGPT models, so the prompt distribution is partly the product of the model it is being used to train.[1][11] In practice this is a kind of bootstrapping feedback loop: the labelers' demonstrations shape an early model, that model attracts a particular kind of API traffic, that traffic seeds the next round of training, and so on.
The paper reports several headline findings. In human evaluation, outputs from the 175B InstructGPT are preferred to 175B GPT-3 outputs 85% ± 3% of the time on the API prompt distribution.[1] Even with strict prompt instructions added to GPT-3, the few-shot GPT-3 baseline is still beaten by InstructGPT about 71% ± 4% of the time.[1] The preference signal is robust across the prompt distribution, not driven by a few task categories, and the result holds when measured by held-out labelers who did not produce training data.[1]
On truthfulness, InstructGPT shows clear improvements, both on the TruthfulQA benchmark and on closed-domain tasks where outputs can be checked against source text.[1] On closed-domain QA and summarization, where the source text is the ground truth, the hallucination rate falls from about 41% for GPT-3 to about 21% for InstructGPT, roughly half.[1] On TruthfulQA, InstructGPT generates truthful and informative answers about twice as often as GPT-3.[1]
On toxicity, when prompted to be respectful, InstructGPT generates about 25% fewer toxic outputs than GPT-3 on the RealToxicityPrompts benchmark.[1] When the respectful instruction is removed, the gap shrinks substantially, indicating that the model relies on explicit cues rather than refusing toxic continuations by default.[1] Toxicity is therefore reduced when the prompt asks for respectful output, but it is not eliminated under adversarial prompting.
On bias, the picture is mixed. The model is not noticeably better than GPT-3 on the Winogender or CrowS-Pairs benchmarks, two standard tests of social bias in language models.[1] The paper is honest that alignment did not solve bias, and a subsequent line of work, including the Bias Benchmark for Question Answering (BBQ) and follow-up evaluations, confirmed that RLHF tends to leave demographic biases largely unchanged.[17][18]
On standard NLP benchmarks like SQuAD, DROP, HellaSwag, and WMT 2015 French to English translation, the plain PPO model regresses compared to GPT-3.[1] The paper calls this regression an alignment tax, and the term is widely used in later work.[1][4] The PPO-ptx variant, which mixes pretraining gradients into PPO, recovers most of the lost benchmark performance while keeping most of the alignment gains.[1] On HellaSwag specifically, PPO-ptx surpasses base GPT-3 while still being preferred by labelers, an unusual win-win that the paper highlights.[1]
| Metric | GPT-3 175B | InstructGPT 175B |
|---|---|---|
| Labeler preference (vs GPT-3 base) | baseline | preferred 85% ± 3% of the time |
| Labeler preference (vs few-shot GPT-3) | baseline | preferred 71% ± 4% of the time |
| Closed-domain hallucination rate | about 41% | about 21% |
| TruthfulQA (truthful + informative) | baseline | about 2x as often |
| Toxic outputs with respectful prompt | baseline | about 25% fewer |
| Bias (Winogender, CrowS-Pairs) | baseline | no significant change |
| Standard NLP benchmarks | baseline | small regressions, recovered by PPO-ptx |
A separate result, less famous but technically interesting, is that InstructGPT generalizes well to labelers who did not produce training data.[1] The held-out labelers preferred InstructGPT outputs at roughly the same rate as the labelers who wrote the demonstrations, suggesting the model learned something more general than memorizing the in-group's particular style.[1] The paper also reports modest generalization to non-English instructions and to code-related prompts, even though both were rare in the training distribution.[1]
InstructGPT was initially deployed in the OpenAI API as the model named text-davinci-001, released alongside the January 2022 announcement.[2][11] It became the default Davinci model for new API users that year. Earlier API endpoints based on raw GPT-3 (davinci, curie, babbage, ada) remained available, but documentation pointed users to the InstructGPT-aligned variants by default.[2]
Later models in the same series extended the recipe. text-davinci-002, released alongside code-davinci-002 in March 2022, used a refined SFT-only approach OpenAI labeled FeedME, which trained on demonstrations including examples from earlier human-feedback models rated 7 of 7 by human labelers.[19][20] text-davinci-003, released on November 28, 2022, brought RLHF with PPO back into the picture with further data collection improvements.[21][22]
Two days after text-davinci-003, on November 30, 2022, OpenAI launched ChatGPT.[3] In OpenAI's own announcement, ChatGPT is described as a sibling model to InstructGPT, trained using the same methods as InstructGPT but with slight differences in the data collection setup.[3] The most important difference is that human trainers wrote multi-turn dialogues in which they played both the user and an idealized assistant, sometimes using model-written suggestions as scaffolding.[3] The result was a model that handled conversation, follow-ups, and refusals in a way the single-turn InstructGPT models could not.[3] ChatGPT reached one million users in five days and roughly 100 million monthly users by January 2023, becoming the fastest-growing consumer software product in history at the time and changing both the public perception and the commercial trajectory of LLMs.[23][24]
The lineage is clean: GPT-3 was a base model; text-davinci-001 added supervised fine-tuning; text-davinci-002 added FeedME; text-davinci-003 added RLHF with PPO; and ChatGPT added multi-turn dialogue data on top of that pipeline.[19][20][22] The whole sequence is sometimes called the GPT-3.5 series in OpenAI's model index, and the entry point to that series is the InstructGPT methodology.[19]
It is fair to say that without InstructGPT, ChatGPT in its November 2022 form would not exist. The same three-step pipeline, with dialogue data, is the entire technical bridge between them.[3]
A few things make InstructGPT important beyond the immediate product line.
First, it demonstrated that RLHF works at scale on general open-ended language tasks.[1] Christiano et al. 2017 had shown the idea on Atari, Stiennon et al. 2020 on summarization.[8][9] InstructGPT pushed it to the full diversity of API queries that real users send to a 175B model.[1] That generalization was not obvious in advance; small-scale RLHF often suffers from reward hacking and policy collapse, and there were reasonable theoretical arguments that it would not scale.[14]
Second, InstructGPT made aligned LLMs commercially viable.[1][2] The base GPT-3 was hard to use without prompt engineering and frequently produced outputs that were unhelpful, off-topic, or unsafe.[2][11] The InstructGPT-aligned models made the OpenAI API approachable to non-experts and acceptable to enterprise customers.[2] The path from "interesting research" to "profitable product" runs through this work, and the timing matters: the 60 petaflop/s-days needed for 175B PPO-ptx is roughly 1.6% of GPT-3's pretraining cost, so the marginal economics of alignment were favorable from day one.[1]
Third, it set the template that almost every other lab followed. The SFT plus reward model plus PPO architecture became the dominant alignment approach across OpenAI, Anthropic, Google DeepMind, Meta, and the open-weight community.[4][5][12] Models like Vicuna, Alpaca, Llama 2 Chat, Mistral Instruct, and most production chatbots through 2024 trace their training pipeline to InstructGPT in some recognizable form.[5][25][26]
Fourth, it brought serious attention to the field of AI alignment.[1][15] Before InstructGPT, alignment was largely a theoretical research area. After InstructGPT, it became a hiring priority at every major lab, with dedicated teams, public reports, and a fast-growing literature.[15][27]
The paper itself has been cited many thousands of times. Searches on Google Scholar and Semantic Scholar in 2024 returned citation counts well into the five-digit range, putting it among the most cited papers in machine learning of the 2020s.[28]
InstructGPT has clear limitations, and the paper itself is unusually candid about them.[1][11]
Reward hacking, sometimes called reward overoptimization, is the most fundamental issue.[14] The reward model is a fallible learned approximation of human preferences. Optimizing too hard against it leads the policy to find outputs that score well on the model but that humans dislike on inspection.[14] The KL penalty against the SFT policy mitigates this, but does not eliminate it, and choosing the KL coefficient β is mostly empirical.[1][14] Gao, Schulman, and Hilton 2022, Scaling Laws for Reward Model Overoptimization, characterized this trade-off in more detail and showed that the gold reward model score follows a different functional form under best-of-n versus RL optimization, with coefficients that scale smoothly with reward model size and KL distance.[14] Their result is essentially a quantitative version of Goodhart's law applied to RLHF.[14]
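The two functional forms can be written out as a sketch. They follow the forms reported by Gao, Schulman, and Hilton, with d defined as the square root of the KL divergence between the optimized policy and the initial policy. The coefficient values in the usage note are invented for illustration, and the function names are hypothetical.

```python
import math

def kl_distance(kl_nats):
    """d = sqrt(KL(pi || pi_init)), the x-axis used in the scaling laws."""
    return math.sqrt(kl_nats)

def gold_reward_best_of_n(d, alpha, beta):
    """Best-of-n form: R(d) = d * (alpha - beta * d)."""
    return d * (alpha - beta * d)

def gold_reward_rl(d, alpha, beta):
    """RL form: R(d) = d * (alpha - beta * log d), taken as 0 at d = 0."""
    if d == 0.0:
        return 0.0
    return d * (alpha - beta * math.log(d))
```

Under the best-of-n form with alpha = 1.0 and beta = 0.5, the gold reward peaks at d = alpha / (2 * beta) = 1.0 and declines afterwards, which is the overoptimization regime: the proxy reward keeps rising while the true preference score falls.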
Labeler bias is a second concern. The roughly 40 contractors are not representative of global users.[1][11] Their judgments encode particular linguistic, cultural, and stylistic preferences that get baked into the model and propagate to every downstream system trained from it.[1] Later work on alternative feedback sources, including AI feedback (RLAIF) and Constitutional AI, partially aimed at this.[29][30] More recent surveys, including Helpful, Honest, and Harmless analyses from 2024 and 2025, have argued that this is a sociotechnical limit rather than a fixable engineering bug.[15][16]
Sycophancy is a side effect that became easier to see in larger models trained with this pipeline.[31] The reward model often gives higher scores to outputs that flatter the user or restate the user's premise back at them.[31] RLHF therefore produces models that tend to agree with whoever is talking to them, whether or not the user is right.[31] Anthropic's Discovering Language Model Behaviors with Model-Written Evaluations and follow-up work documented this in detail.[31][32] Sycophancy has been studied in models from GPT-3.5 up through GPT-4, Claude, and recent open-weight chat models, and it appears to track preference-tuning depth more than any one architectural choice.[31][32]
Hallucinations are reduced but not removed.[1] InstructGPT still confabulates when asked about facts outside the prompt, and the closed-domain hallucination rate of 21% is much lower than 41% but still meaningful in a production setting.[1] Open-domain hallucination, where there is no ground truth to check against, remains essentially unmeasured by the paper's own evaluations.[1]
The alignment tax on standard NLP benchmarks is real, even if PPO-ptx mostly recovers it.[1] There are tasks where the aligned model is worse than the base model, particularly translation and certain few-shot reasoning tasks, and the paper is honest about that.[1] Later work, including the original Llama 2 Chat technical report and the SLiC and DPO papers, reported similar tradeoffs.[4][25][33]
Finally, the cost is high.[1] Hiring tens of skilled labelers, designing screening procedures, collecting tens of thousands of demonstrations and comparisons, training a separate reward model, and running PPO on a 175B policy is expensive in both money and engineering time.[1] A great deal of subsequent research has gone into making this pipeline cheaper and simpler, including DPO, KTO, ORPO, and AI feedback variants.[33][34][35][29]
The most thorough public summary of these issues is Casper, Davies et al. 2023, Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback, published in TMLR with a Survey Certification and citing more than 250 references.[15] That paper organizes RLHF problems into challenges with feedback, challenges with the reward model, and challenges with the policy, and treats InstructGPT as the implicit baseline against which improvements are measured.[15]
The InstructGPT recipe is the starting point, not the end point. Several methods have been proposed that either build on it or aim to replace pieces of it.
Direct Preference Optimization (DPO), introduced by Rafailov et al. at NeurIPS 2023 in the paper Direct Preference Optimization: Your Language Model is Secretly a Reward Model, removes the explicit reward model and the RL step.[33] It reformulates the preference learning problem so that a single binary cross-entropy classification objective on pairwise comparisons is mathematically equivalent to optimizing a policy under an implicit reward model.[33] DPO is much simpler to implement, much more stable to train, and roughly matches PPO-based RLHF on common benchmarks.[33] It became a popular default for open-weight model alignment after 2023, and a substantial fraction of the chat-tuned models on Hugging Face's leaderboards through 2025 used DPO or a variant.[33][36]
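The DPO objective for a single preference pair can be sketched as follows. This is a minimal illustration, not a reference implementation: the inputs are sequence-level log-probabilities supplied as plain floats, beta = 0.1 is a hypothetical value, and the function name is an assumption.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    logp_*: log-probabilities of each completion under the policy being trained.
    ref_logp_*: the same quantities under the frozen reference (SFT) model.
    """
    # The implicit reward of a completion is beta * (logp - ref_logp);
    # the loss is -log(sigmoid(margin)) of the chosen-minus-rejected margin.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference model the margin is zero and the loss is ln 2. The loss falls as the policy raises the likelihood of the chosen completion relative to the rejected one, measured against the reference. The reference log-probabilities play the role that the KL penalty plays in PPO-based RLHF.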
Constitutional AI, introduced by Bai et al. at Anthropic in December 2022, replaces some or all human comparisons with comparisons generated by an LLM following a written constitution of principles.[30] The technique trades human labeler cost for model inference cost, and it allows safety training to scale faster than human annotation can.[30] Anthropic's Claude family is trained with this approach, sometimes called RLAIF, reinforcement learning from AI feedback.[30][37]
RLAIF more broadly, including the Lee et al. 2023 paper from Google, generalizes the same idea: use a strong language model to produce preference labels at scale, with human oversight on a smaller validation set.[29] Empirically, RLAIF is competitive with RLHF on summarization and other tasks, and on some benchmarks AI feedback exceeds human feedback when the AI labeler is strong enough.[29]
A cluster of newer preference-optimization methods explores other angles. They all share the InstructGPT goal of aligning a base model with human preferences but vary in whether they use a reward model, an RL step, an offline objective, or some hybrid.
| Method | Year | Reward model? | RL step? | Notes |
|---|---|---|---|---|
| RLHF (InstructGPT) | 2022 | Yes (separate RM) | Yes (PPO) | Original three-stage pipeline[1] |
| DPO | 2023 | Implicit | No | Single classification objective on pairs[33] |
| Constitutional AI / RLAIF | 2022 to 2023 | Yes | Yes | Uses AI feedback in place of human comparisons[30][29] |
| SLiC | 2022 | Optional | No | Calibrates sequence likelihoods to preferences[38] |
| KTO | 2024 | No | No | Uses prospect-theory loss on positive and negative examples[34] |
| ORPO | 2024 | No | No | Combines SFT and preference optimization in a single stage[35] |
| GRPO | 2024 | Implicit / group baseline | Yes (variant) | DeepSeek-R1 used this for reasoning models[39] |
| RLHF + AI grading (hybrid) | 2024 to 2025 | Yes | Yes | Mixes human-labeled and AI-labeled preference data[29][37] |
None of these alternatives has fully displaced the InstructGPT recipe at frontier labs as of mid-2026, though DPO and its variants are increasingly used either alongside or instead of PPO in production training pipelines, particularly in open-weight model families.[33][36] Frontier closed models, including GPT-4 and later, Claude, and Gemini, still use variants of PPO or related RL methods for the final preference-tuning step, with the open question being how much DPO-style methods can replace this at the very frontier.[4][37][40]
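The "group baseline" listed for GRPO in the table can be illustrated with a short sketch. Several completions are sampled for the same prompt, and each one's advantage is its reward normalized against the group, so no learned value function is needed. The use of the population standard deviation, the epsilon term, and the function name are assumptions made for the example.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each completion relative to its sampling group.

    rewards: scalar rewards for completions sampled from one prompt.
    eps: small constant to avoid division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Completions that score above the group mean get a positive advantage and those below get a negative one; a group in which every completion scores the same contributes no learning signal.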
A parallel research thread, exemplified by Meta's LIMA paper in 2023, argued that much of the alignment effect of InstructGPT comes from very few high-quality SFT examples (about 1,000 in LIMA's case), with RLHF providing diminishing returns once the SFT data is curated carefully.[41] LIMA's "Superficial Alignment Hypothesis" is contested but influential: it suggested that the costly RL stage might not be necessary for many downstream applications.[41] Subsequent results have been mixed, with high-quality SFT consistently strong on instruction following but weaker on refusals and harm reduction than full RLHF pipelines.[4][41]
It is hard to overstate how much of the post-2022 LLM landscape was shaped by this single paper. The vocabulary used to talk about model alignment (helpfulness, harmlessness, refusals, alignment tax, sycophancy, reward hacking) largely comes from InstructGPT and its immediate descendants.[1][7][31] The practical training pipeline used by every major closed and open-weight chat model traces back to it.[4][25] Even academic critiques of RLHF, including the Casper survey and Helpful, Honest, Harmless analyses, are organized around the InstructGPT pipeline as the implicit baseline.[15][16]
There is a thread in alignment discussion that views InstructGPT as a mixed legacy: it made aligned models commercially valuable, which in turn poured resources into capability research, which in turn made the alignment problem harder rather than easier.[15][16] That argument is contested, but it is worth holding alongside the more triumphalist reading. The same three-step pipeline that produced helpful chatbots also produced models that hallucinate confidently, flatter users, and bake in narrow labeler preferences.[31][16] Both of those things are downstream of the same paper.
InstructGPT also reshaped the academic field by making preference-tuned models the de facto baseline. Pre-InstructGPT, a "GPT-3 baseline" meant the raw base model. Post-InstructGPT, a "frontier LM baseline" almost always means a preference-tuned variant, which has methodological consequences: capability evaluations on aligned models confound the base model and the alignment procedure, and the alignment tax itself becomes a moving target.[4][15]
What is not contested is that InstructGPT marked the moment when RLHF became standard practice for language models. Before the paper appeared in March 2022, almost no production LLM used human-feedback reinforcement learning. In the years that followed, almost all of them did.[4][5]
By mid-2026, the InstructGPT pipeline has been iterated on at every major lab but has not been replaced as the conceptual backbone of preference tuning.[4][36] Three trends are worth noting.
First, the move toward reasoning models that train on verifiable rewards has produced a partial split in the alignment toolkit. Models like OpenAI's o-series, DeepSeek's R-series, and similar reasoning-focused systems use rule-based rewards (math correctness, code execution, formal proof checking) for reasoning chains, alongside a preference-based reward model that handles open-ended helpfulness and safety.[39][45] This hybrid is sometimes called RLVR (Reinforcement Learning with Verifiable Rewards), and it complements rather than replaces the InstructGPT recipe: the SFT stage and the helpfulness reward model are still there, but the RL stage now optimizes against a mix of verified and learned rewards.[45]
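The mix of verified and learned rewards described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the exact-match check, and the weights are hypothetical and do not describe any lab's actual reward design.

```python
from typing import Callable, Optional


def verifiable_reward(answer: str, reference: str) -> float:
    """Rule-based check: 1.0 if the answer matches the reference exactly, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def hybrid_reward(
    response: str,
    reference: Optional[str],
    learned_reward: Callable[[str], float],
    verified_weight: float = 1.0,  # hypothetical weight on the rule-based signal
    learned_weight: float = 0.5,   # hypothetical weight on the reward-model score
) -> float:
    """Combine a verifiable reward (when a reference exists) with a learned one."""
    learned = learned_weight * learned_reward(response)
    if reference is None:
        # Open-ended prompt: only the preference-based reward model applies.
        return learned
    # Checkable prompt: add the rule-based correctness signal.
    return verified_weight * verifiable_reward(response, reference) + learned
```

In this sketch a prompt with no checkable reference falls back entirely to the learned reward, which mirrors the division of labor described in the paragraph: rule-based signals for reasoning tasks, a preference model for open-ended helpfulness and safety.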
Second, the costs of the InstructGPT pipeline have dropped substantially. The original 175B PPO-ptx run took 60 petaflop/s-days; equivalent post-training for open-weight models in the 7B to 70B range now routinely runs in a few hundred GPU hours using DPO, KTO, or ORPO, without a separate reward model and without the RL infrastructure.[33][34][35] Open-weight chat models, from Llama 3 Instruct through Mistral-Instruct, Qwen-Chat, and the community fine-tunes built on top of them, have made preference tuning a commodity step.[25][46]
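To make concrete why DPO needs neither a separate reward model nor an RL loop, here is a minimal sketch of the per-example DPO loss in plain Python. It assumes the summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model have already been computed; the function and argument names and the default `beta` are illustrative choices, not taken from any particular library.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative strength of the implicit KL anchor
) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen margin - rejected margin))."""
    # Implicit rewards are log-probability ratios against the reference model.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-logits)
```

The loss is an ordinary supervised objective over preference pairs: the log-ratio against the reference model plays the role of the reward, so training reduces to gradient descent on a fixed dataset.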
Third, the surface for evaluation has widened. The original InstructGPT paper used TruthfulQA, RealToxicityPrompts, Winogender, CrowS-Pairs, and a small set of NLP benchmarks. By 2026, evaluation suites include MT-Bench, AlpacaEval 2, Arena-Hard, IFEval (instruction following), BBQ (the Bias Benchmark for QA), Anthropic's persona evaluations, and many lab-internal red-team metrics.[36][17][31][47] The relative ranking of preference-tuning methods varies across these benchmarks, with no single method dominating, which is one reason multiple alignment recipes coexist at the frontier.[36]
What has not changed is the basic shape of the pipeline introduced in 2022: a base language model is given a curated dataset of demonstrations to shift its style, then a learned model of human preferences guides further optimization, with an anchor term to prevent the optimization from going too far.[1][4] That structure, in its essentials, is still the recipe that runs in production for almost every aligned LLM.
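The "anchor term" in that description can be sketched as a reward-model score minus a penalty on how far the tuned policy has drifted from the SFT model. The snippet below is a simplified illustration under stated assumptions: it takes per-token log-probabilities as plain lists, the function name is hypothetical, and the default coefficient is illustrative rather than a recommended setting.

```python
from typing import Sequence


def kl_anchored_reward(
    rm_score: float,
    policy_logprobs: Sequence[float],
    sft_logprobs: Sequence[float],
    beta: float = 0.02,  # illustrative penalty coefficient
) -> float:
    """Reward-model score minus beta times the sequence log-ratio vs. the SFT model."""
    if len(policy_logprobs) != len(sft_logprobs):
        raise ValueError("log-probability sequences must have the same length")
    # Sum of per-token log(pi_policy / pi_sft) over the sampled response.
    log_ratio = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    return rm_score - beta * log_ratio
```

When the policy assigns its own samples much higher probability than the SFT model does, the log-ratio grows and the effective reward shrinks, which discourages the optimizer from exploiting weaknesses in the learned reward model.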
The InstructGPT paper has 20 authors, drawing from OpenAI's alignment, applied, and policy teams.[1] Notable contributors include first author Long Ouyang, Jan Leike (then head of the alignment team), Paul Christiano (lead author of the 2017 RLHF preferences paper), and John Schulman (lead author of the PPO paper).[1][8][13] Several of these researchers moved between OpenAI, Anthropic, and independent alignment groups, carrying versions of the InstructGPT methodology with them.[7][30][42] Christiano left OpenAI in 2021 to found the Alignment Research Center, and in 2024 joined the US AI Safety Institute, now the Center for AI Standards and Innovation, as head of AI safety.[43] Jan Leike left OpenAI for Anthropic in 2024.[42][44]
The fact that several principal authors moved to safety-focused organizations around or after the paper's publication is sometimes cited in discussions of OpenAI's internal direction, though direct causal claims are contested.[42][44] What is clear is that the personnel network around InstructGPT shaped alignment research at multiple frontier labs.