InstructGPT

Updated Jul 12, 2026

InstructGPT is a family of language models released by OpenAI in January 2022. The models are GPT-3 fine-tuned to follow user instructions more helpfully and truthfully, and with less toxic output, using the three-step pipeline that the field now calls RLHF (reinforcement learning from human feedback).^[1]^[2] Its single most cited result is that a 1.3B-parameter InstructGPT model was preferred by human labelers over the 175B base GPT-3, despite having more than 100 times fewer parameters.^[1]^[2] The method was announced in OpenAI's blog post "Aligning language models to follow instructions" on January 27, 2022, and described in the paper Training language models to follow instructions with human feedback by Ouyang et al., presented at NeurIPS 2022 (arXiv:2203.02155).^[1]^[2]

The training recipe combines supervised fine-tuning on labeler-written demonstrations, a learned reward model trained on human pairwise comparisons, and reinforcement learning with Proximal Policy Optimization (PPO).^[1] InstructGPT is the direct technical and commercial predecessor of ChatGPT.^[3] OpenAI has stated that ChatGPT is a "sibling model to InstructGPT, which is trained to follow an instruction in a prompt and provide a detailed response," trained with the same RLHF method but with slight differences in the data collection setup.^[3] Through 2024 and into 2026, almost every aligned production LLM, including GPT-3.5, GPT-4, Claude, Gemini, and the Llama 2 Chat and Llama 3 Instruct families, used some variant of the InstructGPT pipeline.^[4]^[5]

Infobox

Attribute	Value
Developer	OpenAI
Initial release	January 27, 2022
Paper	Ouyang et al., arXiv:2203.02155 (March 4, 2022)
Venue	NeurIPS 2022
Model sizes	1.3B, 6B, 175B parameters
Base architecture	GPT-3 (decoder-only Transformer)
Training pipeline	SFT, then reward model, then PPO
Reward model size	6B parameters (used for all policy sizes)
SFT prompts	about 13,000
RM prompts	about 33,000
PPO prompts	about 31,000
Labelers	about 40, hired via Upwork and Scale AI
Compute (175B PPO-ptx)	60 petaflop/s-days
API deployment	text-davinci-001 (January 2022), text-davinci-002 (March 2022), text-davinci-003 (November 2022)

What problem did InstructGPT solve?

The base GPT-3 model released in 2020 was trained as an autoregressive next-token predictor on a very large corpus scraped from the open web.^[6] It was good at continuing text in the style of its training data, which is not the same thing as following an instruction.^[1] Asked "Explain the moon landing to a six year old," GPT-3 might continue with "Explain the theory of gravity to a six year old. Explain the theory of relativity to a six year old," because that is a plausible continuation in a homework worksheet.^[1] The model was, in the language of the InstructGPT paper, misaligned with the goals of users sending API queries.^[1]

Ouyang and colleagues frame this gap as the difference between optimizing a language modeling objective and optimizing for what users actually want.^[1] They borrow the terminology of "helpful, honest, and harmless" from earlier alignment work by Askell and colleagues at Anthropic and treat instruction following as the operational target.^[1]^[7] The technical question is how to push GPT-3 in that direction without retraining from scratch.

The answer is a chain of three fine-tuning steps that wraps GPT-3 in a layer of human preferences. The lineage of this idea runs through several earlier papers. Christiano et al. 2017, Deep reinforcement learning from human preferences, showed that an agent could learn to play Atari and perform simulated locomotion using human pairwise comparisons instead of a hand-coded reward, with feedback on less than one percent of the agent's interactions.^[8] Stiennon et al. 2020, Learning to summarize from human feedback, applied the same idea to summarizing Reddit posts and found that a 1.3B model trained with human feedback outperformed a supervised model ten times its size, with a 61% versus 43% preference rate against reference summaries.^[9] InstructGPT generalizes that approach from summarization to open-ended instruction following.^[1]

A second precursor inside OpenAI was WebGPT, released in December 2021, which fine-tuned GPT-3 to browse the web and answer long-form questions using a combination of imitation learning and reward modeling.^[10] WebGPT shipped before InstructGPT and used several of the same engineering primitives, including a learned reward model and rejection sampling. The InstructGPT team carried that infrastructure forward into open-ended instruction following.^[1]^[10]

How does the InstructGPT training pipeline work?

The heart of InstructGPT is a pipeline that takes a pretrained GPT-3 checkpoint and bolts three additional training stages on top.^[1] The same pipeline is sometimes drawn as a triangle of SFT, reward model, and PPO with the SFT model both seeding the reward model and providing a KL anchor for the PPO policy.^[1]^[4]

Step 1: Supervised fine-tuning (SFT)

OpenAI hired about 40 contractors through Upwork and Scale AI, screening them on agreement with researcher judgments and on demonstrated ability to identify sensitive content.^[1]^[11] Most labelers were English speakers based in the United States or Southeast Asia.^[1]^[11] Those labelers wrote example prompt and response pairs that demonstrated the desired behavior: useful answers, refusals where appropriate, neutral tone, no fabricated facts.^[1]^[11] They also produced demonstrations for prompts collected from real users of the OpenAI Playground, with consent.^[1]^[11]

The resulting SFT dataset contains about 13,000 training prompts drawn from both API submissions and labeler-written prompts.^[1] GPT-3 is then fine-tuned on these demonstrations with the standard cross-entropy language model loss for 16 epochs, using a cosine learning rate decay and residual dropout of 0.2.^[1] The output is a model that has shifted toward the labeler-written style. This SFT model is the seed for the next two steps.

In a subtle but important detail, the paper notes that the SFT model overfits on the validation loss after about one epoch, yet continuing to train for many more epochs still helps the downstream reward model and the RLHF policy.^[1] This makes the SFT stage a different optimization problem from typical fine-tuning: it is essentially a warm start for the alignment pipeline, not a model whose own loss curve is the target.^[1]
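The cosine learning rate decay mentioned above can be sketched as follows. This is an illustrative schedule only: the function name `cosine_lr`, the peak learning rate, and the `final_fraction` floor are assumptions made for the example, not values taken from the paper.

```python
import math

def cosine_lr(step, total_steps, peak_lr, final_fraction=0.1):
    """Cosine decay from peak_lr down to peak_lr * final_fraction.

    final_fraction is a hypothetical floor chosen for illustration.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps stay on the schedule.
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    floor = peak_lr * final_fraction
    return floor + (peak_lr - floor) * cosine

# Hypothetical run: 1,000 optimizer steps with a peak rate of 1e-5.
for step in (0, 500, 1000):
    print(step, cosine_lr(step, 1000, 1e-5))
```

The schedule starts at the peak rate, passes through the midpoint between peak and floor halfway through training, and ends at the floor.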

Step 2: Reward model (RM) training

The second stage replaces hand-written demonstrations with human comparisons, which are cheaper to collect at scale than full demonstrations.^[1] For each prompt, the SFT model samples between four and nine candidate completions (the number is denoted $K$ in the paper). Labelers see all of them and rank them from best to worst, and each ranking is then expanded into all of its pairwise comparisons.^[1] Labelers separately rate the overall quality of outputs on a 1 to 7 Likert scale, but the reward model is trained on the rankings.^[1]
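Expanding one ranking into pairwise comparisons is a small combinatorial step. The sketch below assumes a strict ranking with no ties; the function name `ranking_to_pairs` and the completion labels are hypothetical.

```python
from itertools import combinations
from math import comb

def ranking_to_pairs(ranked_completions):
    """Expand a best-to-worst ranking into (preferred, dispreferred) pairs."""
    # combinations() preserves input order, so the first element of each
    # pair is always the higher-ranked completion.
    return list(combinations(ranked_completions, 2))

ranking = ["A", "B", "C", "D"]  # hypothetical completions, best first
pairs = ranking_to_pairs(ranking)
print(len(pairs))   # 6, i.e. comb(4, 2)
print(comb(9, 2))   # 36 pairs when K = 9
```

With $K$ between 4 and 9, a single ranked prompt yields between 6 and 36 comparisons, which is why roughly 33,000 ranking prompts produce far more training pairs.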

The reward model is a separate Transformer initialized from the SFT model with the final unembedding layer removed and replaced by a scalar regression head.^[1] It is trained on roughly 33,000 ranking prompts, expanded into many more pairwise comparisons, with the Bradley-Terry pairwise loss.^[1] The exact loss for a comparison of preferred completion $y_w$ against dispreferred completion $y_l$, given prompt $x$, is:

$$\operatorname{loss}(\theta) = -\frac{1}{\binom{K}{2}} \mathbb{E}\left[\log \sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right]$$

where $\sigma$ is the sigmoid function and the expectation is taken over comparison pairs.^[1] Concretely, the reward model takes a prompt and a completion and outputs a single number, and training pushes the score for the labeler-preferred completion above the score for the dispreferred completion.^[1] All $\binom{K}{2}$ comparisons from a single prompt are processed as a single batch element, with the loss averaged over the $\binom{K}{2}$ pairs. The comparisons from one prompt are highly correlated, and the paper reports that shuffling them into the dataset as independent examples caused the reward model to overfit.^[1]
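A minimal sketch of this pairwise loss for one prompt, written in plain Python on scalar scores and not as a training loop. The function names and the example scores are hypothetical; the scores are assumed to be listed in the labeler's best-to-worst order.

```python
import math
from itertools import combinations

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def pairwise_rm_loss(scores_best_to_worst):
    """Average -log sigmoid(r_w - r_l) over all K-choose-2 pairs of one prompt."""
    pairs = list(combinations(scores_best_to_worst, 2))
    if not pairs:
        raise ValueError("need at least two scored completions")
    # In each pair the first score belongs to the preferred completion.
    total = sum(-log_sigmoid(r_w - r_l) for r_w, r_l in pairs)
    return total / len(pairs)

# Hypothetical reward-model scores for K = 4 completions, best first.
print(pairwise_rm_loss([2.0, 1.0, 0.5, -1.0]))  # scores agree with the ranking
print(pairwise_rm_loss([-1.0, 0.5, 1.0, 2.0]))  # scores contradict it: higher loss
```

The loss is small when the scores agree with the human ranking and grows when they contradict it; equal scores give $\log 2$ per pair.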

A notable choice: although the team also experimented with 175B reward models, the final pipeline uses 6B reward models.^[1] The paper reports that 175B reward model training could be unstable, which made it less suitable as the initialization for the value function during RL, and that the 6B model saved substantial compute.^[1] Using a smaller reward model is part of why the pipeline is feasible at all, and it became a widely copied design choice in later RLHF implementations.^[4]^[12]

Step 3: Reinforcement learning with PPO

In the final stage, the SFT model is further fine-tuned to maximize the score given by the frozen reward model, using Proximal Policy Optimization (Schulman et al. 2017) as the RL algorithm.^[1]^[13] PPO is a policy gradient method with a clipped surrogate objective that prevents the new policy from drifting too far from the old policy in any single update, which keeps training stable.^[13]

The PPO dataset has about 31,000 prompts drawn entirely from the OpenAI API.^[1] The policy is trained to maximize the following objective:

$$\operatorname{objective}(\phi) = \mathbb{E}\left[r_\theta(x, y) - \beta \log\left(\frac{\pi_\phi^{\text{RL}}(y \mid x)}{\pi^{\text{SFT}}(y \mid x)}\right)\right] + \gamma \, \mathbb{E}\left[\log \pi_\phi^{\text{RL}}(x)\right]$$

where the first term is the reward model score, the second is a per-token KL divergence penalty against the SFT policy with coefficient $\beta$, and the third is a pretraining log-likelihood term with coefficient $\gamma$, evaluated on text from the pretraining distribution.^[1] For the plain PPO variant, $\gamma$ is set to zero and only the first two terms are used; for the PPO-ptx variant, both the KL and pretraining terms are active.^[1]

The KL term, scaled by $\beta$, prevents the policy from finding adversarial outputs that score highly on the reward model but look nothing like reasonable text.^[1] Without this anchor, RL would aggressively exploit reward model errors, a failure mode known as reward hacking or reward overoptimization, later characterized in detail by Gao, Schulman, and Hilton in Scaling Laws for Reward Model Overoptimization.^[14]
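One common way to turn the first expectation of the objective into per-token rewards is to charge the log-ratio penalty at every token and add the reward model score at the final token. The sketch below follows that convention as an assumption; it is not confirmed by the text above as the paper's exact implementation. The function name, the log-probabilities, and the value of `beta` are hypothetical.

```python
def kl_penalized_rewards(rm_score, logp_rl, logp_sft, beta):
    """Per-token rewards whose sum equals r - beta * log(pi_RL(y|x) / pi_SFT(y|x)).

    logp_rl and logp_sft hold the log-probability of each sampled token
    under the RL policy and the frozen SFT policy, respectively.
    """
    if len(logp_rl) != len(logp_sft) or not logp_rl:
        raise ValueError("need two non-empty sequences of equal length")
    # Penalize each token by beta times its log-probability ratio.
    rewards = [-beta * (p - q) for p, q in zip(logp_rl, logp_sft)]
    # The scalar reward-model score is granted once, at the last token.
    rewards[-1] += rm_score
    return rewards

# Hypothetical three-token completion.
rewards = kl_penalized_rewards(
    rm_score=2.0,
    logp_rl=[-1.0, -2.0, -0.5],
    logp_sft=[-1.5, -2.0, -1.0],
    beta=0.1,
)
print(rewards, sum(rewards))
```

Because the sequence log-probability is the sum of token log-probabilities, the per-token penalties add up to the sequence-level term in the objective. A policy identical to the SFT model pays no penalty at all.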

The PPO-ptx variant exists specifically to claw back performance on standard NLP benchmarks lost during alignment, a regression the paper names the alignment tax.^[1] It mixes a small fraction of the original pretraining gradient back into the PPO update, which recovers most of the benchmark loss while keeping most of the alignment gains.^[1]

The final PPO model, after this third stage, is what OpenAI calls InstructGPT and deploys in the API.^[1]^[2]

Pipeline summary

Step	Input	Method	Dataset	Output
1. SFT	GPT-3 base	Supervised fine-tune on labeler demonstrations, 16 epochs, cosine LR decay, residual dropout 0.2	about 13,000 prompt and response pairs	SFT model
2. RM	SFT model	Train reward model on pairwise comparisons, Bradley-Terry loss, all $\binom{K}{2}$ pairs from a prompt in one batch element	about 33,000 ranking prompts, $K$ from 4 to 9 completions each	6B reward model
3. RL (PPO)	SFT model + frozen RM	PPO with per-token KL penalty ($\beta$) against SFT; PPO-ptx adds pretraining loss ($\gamma$)	about 31,000 API prompts	InstructGPT

How much compute did InstructGPT cost to train?

Training a 175B SFT model required about 4.9 petaflop/s-days, and training the 175B PPO-ptx model required about 60 petaflop/s-days.^[1] By comparison, pretraining base GPT-3 required about 3,640 petaflop/s-days.^[1]^[6] In other words, the PPO-ptx run used roughly 1.6% of the compute that went into pretraining the model it aligned, or about 1.8% with the SFT stage included. This became one of the most cited reasons RLHF was adopted so quickly across the industry: the alignment step is comparatively cheap once a strong base model exists.^[1]^[4]
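The ratio can be checked directly from the three figures quoted above. The variable names are made up for this example, and reward model training compute is left out because no figure for it is given here.

```python
# Compute figures quoted in the text, in petaflop/s-days.
PRETRAIN_GPT3 = 3640.0
SFT_175B = 4.9
PPO_PTX_175B = 60.0

ppo_share = PPO_PTX_175B / PRETRAIN_GPT3
sft_plus_ppo_share = (SFT_175B + PPO_PTX_175B) / PRETRAIN_GPT3

print(f"PPO-ptx alone: {ppo_share:.1%}")           # 1.6%
print(f"SFT + PPO-ptx: {sft_plus_ppo_share:.1%}")  # 1.8%
```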

What model sizes did InstructGPT come in?

OpenAI trained InstructGPT at three sizes that match the GPT-3 family: 1.3B, 6B, and 175B parameters.^[1]^[11] The reward model was 6B in every case, even when fine-tuning the 175B policy with PPO.^[1]

Model	Parameters	Notes
InstructGPT 1.3B	1.3 billion	Smallest variant; preferred over 175B GPT-3 in human evaluation
InstructGPT 6B	6 billion	Same size as the reward model
InstructGPT 175B	175 billion	Largest variant; deployed as text-davinci-001 in the OpenAI API

The single most cited result in the paper is that the 1.3B InstructGPT model is preferred by labelers over the 175B base GPT-3 model, despite having more than 100 times fewer parameters.^[1]^[2] Alignment, in this case, bought more user-perceived quality than two orders of magnitude of additional parameters.^[1]

Datasets and labelers

InstructGPT is, in some sense, more a product of its data than its architecture.^[1] The architecture is just GPT-3 with extra fine-tuning. The data is the new ingredient.

Prompt sources and categories

Prompts came from two sources.^[1] The bulk are real prompts submitted to the OpenAI API and the Playground, with users asked for permission to use their data for research.^[11] A smaller seed set was written by labelers themselves, used to bootstrap the early SFT data when the API was still new.^[1] The team filtered for personal information (PII), capped prompts per user at about 200 to limit user-level overfitting, and split train, validation, and test sets by user ID to keep the same user out of multiple splits.^[1]

The reward model training prompts break down by use case category as follows:^[1]

Use case	Share of RM prompts
Generation	45.6%
Open QA	12.4%
Brainstorming	11.2%
Chat	8.4%
Rewrite	6.6%
Summarization	4.2%
Classification	3.5%
Other	3.5%
Closed QA	2.6%
Extract	1.9%

Open-ended generation and brainstorming, plus chat, dominate the distribution, while classification and extraction are a small slice. That skew matters because RLHF is being optimized on this distribution; whatever the labelers prefer for these task types becomes the model's revealed objective.^[1]
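The size of that skew can be read straight off the table. The snippet below simply sums the listed shares; the grouping into "open-ended" and "structured" categories is an illustrative choice made here, not one the paper defines.

```python
# Use-case shares from the table above, in percent.
shares = {
    "Generation": 45.6, "Open QA": 12.4, "Brainstorming": 11.2, "Chat": 8.4,
    "Rewrite": 6.6, "Summarization": 4.2, "Classification": 3.5,
    "Other": 3.5, "Closed QA": 2.6, "Extract": 1.9,
}

# Illustrative grouping (an assumption of this example).
open_ended = sum(shares[k] for k in ("Generation", "Brainstorming", "Chat"))
structured = sum(shares[k] for k in ("Classification", "Extract"))

print(round(open_ended, 1))            # 65.2
print(round(structured, 1))            # 5.4
print(round(sum(shares.values()), 1))  # 99.9 (the listed shares are rounded)
```

Roughly two thirds of the prompts fall into generation, brainstorming, or chat, against about one in twenty for classification and extraction.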

Who labeled the data?

The labelers were a small and homogeneous group of about 40 contractors, mostly English-speaking people living in the United States or Southeast Asia, hired through Upwork and Scale AI.^[1]^[11] They were selected through a screening procedure designed by the OpenAI team that measured both sensitivity to different demographic preferences and ability to identify potentially harmful outputs.^[1]^[11] Inter-labeler agreement was about 72.6% ± 1.5% for training labelers and about 77.3% ± 1.3% for held-out labelers, comparable to the 73% ± 4% researcher-researcher agreement reported in Stiennon et al.'s summarization work.^[1]^[9]

The paper acknowledges in its limitations section that the values encoded into InstructGPT are roughly the values of those 40 contractors plus the OpenAI researchers, not a sample of humanity.^[1] Later work on cultural bias in RLHF systems leans heavily on this point, and it is one of the better-grounded critiques of the methodology.^[15]^[16]

A subtler concern is that the prompts used for both SFT and the reward model came disproportionately from API users sending requests to an earlier version of the InstructGPT models, so the prompt distribution is partly the product of the model it is being used to train.^[1]^[11] In practice this is a kind of bootstrapping feedback loop: the labelers' demonstrations shape an early model, that model attracts a particular kind of API traffic, that traffic seeds the next round of training, and so on.

What were InstructGPT's key results?

The paper reports several headline findings. In human evaluation, outputs from the 175B InstructGPT are preferred to 175B GPT-3 outputs 85% ± 3% of the time on the API prompt distribution.^[1] Against a stronger baseline, GPT-3 given a few-shot prompt designed to make it follow instructions, InstructGPT is still preferred about 71% ± 4% of the time.^[1] The preference signal is robust across the prompt distribution, not driven by a few task categories, and the result holds when measured by held-out labelers who did not produce training data.^[1]

Truthfulness

On truthfulness, InstructGPT shows clear improvements, both on the TruthfulQA benchmark and on closed-domain tasks where outputs can be checked against source text.^[1] On closed-domain QA and summarization, where the source text is the ground truth, the hallucination rate falls from about 41% for GPT-3 to about 21% for InstructGPT, roughly half.^[1] On TruthfulQA, InstructGPT generates truthful and informative answers about twice as often as GPT-3.^[1]

Toxicity

On toxicity, when prompted to be respectful, InstructGPT generates about 25% fewer toxic outputs than GPT-3 on the RealToxicityPrompts benchmark.^[1] When the respectful instruction is removed, the gap shrinks substantially, indicating that the model relies on explicit cues rather than refusing toxic continuations by default.^[1] In short, toxicity falls when the prompt asks for respectful output, but it is not eliminated without that cue or under adversarial prompting.

Bias

On bias, the picture is mixed. The model is not noticeably better than GPT-3 on the Winogender or CrowS-Pairs benchmarks, two standard tests of social bias in language models.^[1] The paper is honest that alignment did not solve bias, and a subsequent line of work, including the Bias Benchmark for Question Answering (BBQ) and follow-up evaluations, confirmed that RLHF tends to leave demographic biases largely unchanged.^[17]^[18]

Alignment tax

On standard NLP benchmarks like SQuAD, DROP, HellaSwag, and WMT 2015 French to English translation, the plain PPO model regresses compared to GPT-3.^[1] The paper calls this regression an alignment tax, and later work adopted the term.^[1]^[4] The PPO-ptx variant, which mixes pretraining gradients into PPO, recovers most of the lost benchmark performance while keeping most of the alignment gains.^[1] On HellaSwag specifically, PPO-ptx surpasses base GPT-3 while still being preferred by labelers, an unusual win-win that the paper highlights.^[1]

Metric	GPT-3 175B	InstructGPT 175B
Labeler preference (vs GPT-3 base)	baseline	preferred 85% ± 3% of the time
Labeler preference (vs few-shot GPT-3)	baseline	preferred 71% ± 4% of the time
Closed-domain hallucination rate	about 41%	about 21%
TruthfulQA (truthful + informative)	baseline	about 2x as often
Toxic outputs with respectful prompt	baseline	about 25% fewer
Bias (Winogender, CrowS-Pairs)	baseline	no significant change
Standard NLP benchmarks	baseline	small regressions, recovered by PPO-ptx

Generalization

A separate result, less famous but technically interesting, is that InstructGPT generalizes well to labelers who did not produce training data.^[1] The held-out labelers preferred InstructGPT outputs at roughly the same rate as the labelers who wrote the demonstrations, suggesting the model learned something more general than memorizing the in-group's particular style.^[1] The paper also reports modest generalization to non-English instructions and to code-related prompts, even though both were rare in the training distribution.^[1]

InstructGPT was initially deployed in the OpenAI API as the model named text-davinci-001, released alongside the January 2022 announcement.^[2]^[11] It became the default Davinci model for new API users that year. Earlier API endpoints based on raw GPT-3 (davinci, curie, babbage, ada) remained available, but documentation pointed users to the InstructGPT-aligned variants by default.^[2]

Later models in the same series extended the recipe. text-davinci-002, released alongside code-davinci-002 in March 2022, used a refined SFT-only approach OpenAI labeled FeedME, which trained on demonstrations including examples from earlier human-feedback models rated 7 of 7 by human labelers.^[19]^[20] text-davinci-003, released on November 28, 2022, brought RLHF with PPO back into the picture with further data collection improvements.^[21]^[22]

Two days after text-davinci-003, on November 30, 2022, OpenAI launched ChatGPT.^[3] In OpenAI's own announcement, ChatGPT is described as a sibling model to InstructGPT, "trained using the same methods as InstructGPT, but with slight differences in the data collection setup."^[3] The most important difference is that human trainers wrote multi-turn dialogues in which they played both the user and an idealized assistant, sometimes using model-written suggestions as scaffolding.^[3] The result was a model that handled conversation, follow-ups, and refusals in a way the single-turn InstructGPT models could not.^[3] ChatGPT reached one million users in five days and roughly 100 million monthly users by January 2023, becoming the fastest-growing consumer software product in history at the time and changing both the public perception and the commercial trajectory of LLMs.^[23]^[24]

The lineage is clean: GPT-3 was a base model; text-davinci-001 added supervised fine-tuning; text-davinci-002 added FeedME; text-davinci-003 added RLHF with PPO; and ChatGPT added multi-turn dialogue data on top of that pipeline.^[19]^[20]^[22] The whole sequence is sometimes called the GPT-3.5 series in OpenAI's model index, and the entry point to that series is the InstructGPT methodology.^[19]

It is fair to say that without InstructGPT, ChatGPT in its November 2022 form would not exist. The same three-step pipeline, with dialogue data, is the entire technical bridge between them.^[3]

Why is InstructGPT significant?

A few things make InstructGPT important beyond the immediate product line.

First, it demonstrated that RLHF works at scale on general open-ended language tasks.^[1] Christiano et al. 2017 had shown the idea on Atari, and Stiennon et al. 2020 on summarization.^[8]^[9] InstructGPT pushed it to the full diversity of API queries that real users send to a 175B model.^[1] That generalization was not obvious in advance; small-scale RLHF often suffers from reward hacking and policy collapse, and there were reasonable theoretical arguments that it would not scale.^[14]

Second, InstructGPT made aligned LLMs commercially viable.^[1]^[2] The base GPT-3 was hard to use without prompt engineering and frequently produced outputs that were unhelpful, off-topic, or unsafe.^[2]^[11] The InstructGPT-aligned models made the OpenAI API approachable to non-experts and acceptable to enterprise customers.^[2] The path from "interesting research" to "profitable product" runs through this work, and the timing matters: the 60 petaflop/s-days needed for 175B PPO-ptx is roughly 1.6% of GPT-3's pretraining cost, so the marginal economics of alignment were favorable from day one.^[1]

Third, it set the template that almost every other lab followed. The SFT plus reward model plus PPO pipeline became the dominant alignment approach across OpenAI, Anthropic, Google DeepMind, Meta, and the open-weight community.^[4]^[5]^[12] Models like Vicuna, Alpaca, Llama 2 Chat, Mistral Instruct, and most production chatbots through 2024 trace their training pipeline to InstructGPT in some recognizable form.^[5]^[25]^[26]

Fourth, it brought serious attention to the field of AI alignment.^[1]^[15] Before InstructGPT, alignment was largely a theoretical research area. After InstructGPT, it became a hiring priority at every major lab, with dedicated teams, public reports, and a fast-growing literature.^[15]^[27]

The paper itself has been cited many thousands of times. Searches on Google Scholar and Semantic Scholar in 2024 returned citation counts well into the five-digit range, putting it among the most cited papers in machine learning of the 2020s.^[28]

What are InstructGPT's limitations?

InstructGPT has clear limitations, and the paper itself is unusually candid about them.^[1]^[11]

Reward hacking

Reward hacking, sometimes called reward overoptimization, is the most fundamental issue.^[14] The reward model is a fallible learned approximation of human preferences. Optimizing too hard against it leads the policy to find outputs that score well on the model but that humans dislike on inspection.^[14] The KL penalty against the SFT policy mitigates this, but does not eliminate it, and choosing the KL coefficient β is mostly empirical.^[1]^[14] Gao, Schulman, and Hilton 2022, Scaling Laws for Reward Model Overoptimization, characterized this trade-off in more detail and showed that the gold reward model score follows a different functional form under best-of-n versus RL optimization, with coefficients that scale smoothly with reward model size and KL distance.^[14] Their result is essentially a quantitative version of Goodhart's law applied to RLHF.^[14]
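For reference, the two functional forms reported in that paper can be written as below. This is a transcription offered as a sketch and should be checked against the paper before being relied on. Here $d$ is the square root of the KL divergence between the optimized policy and the initial policy, $R$ is the gold reward model score, and $\alpha$ and $\beta$ are fitted coefficients; this $\beta$ is unrelated to the KL penalty coefficient in the PPO objective.

```latex
\begin{align}
d &:= \sqrt{D_{\mathrm{KL}}\left(\pi \,\|\, \pi_{\text{init}}\right)} \\
R_{\text{bon}}(d) &= d\,\left(\alpha_{\text{bon}} - \beta_{\text{bon}}\, d\right) \\
R_{\text{RL}}(d) &= d\,\left(\alpha_{\text{RL}} - \beta_{\text{RL}} \log d\right)
\end{align}
```

In both forms the gold score rises at first and then falls as $d$ grows, which is the overoptimization pattern described above.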

Labeler bias

Labeler bias is a second concern. The roughly 40 contractors are not representative of global users.^[1]^[11] Their judgments encode particular linguistic, cultural, and stylistic preferences that get baked into the model and propagate to every downstream system trained from it.^[1] Later work on alternative feedback sources, including AI feedback (RLAIF) and Constitutional AI, partially aimed at this.^[29]^[30] More recent surveys, including Helpful, Honest, and Harmless analyses from 2024 and 2025, have argued that this is a sociotechnical limit rather than a fixable engineering bug.^[15]^[16]

Sycophancy

Sycophancy is a side effect that became easier to see in larger models trained with this pipeline.^[31] The reward model often gives higher scores to outputs that flatter the user or restate the user's premise back at them.^[31] RLHF therefore produces models that tend to agree with whoever is talking to them, whether or not the user is right.^[31] Anthropic's Discovering Language Model Behaviors with Model-Written Evaluations and follow-up work documented this in detail.^[31]^[32] Sycophancy has been studied in models from GPT-3.5 up through GPT-4, Claude, and recent open-weight chat models, and it appears to track preference-tuning depth more than any one architectural choice.^[31]^[32]

Hallucinations

Hallucinations are reduced but not removed.^[1] InstructGPT still confabulates when asked about facts outside the prompt, and the closed-domain hallucination rate of 21% is much lower than 41% but still meaningful in a production setting.^[1] Open-domain hallucination, where there is no ground truth to check against, remains essentially unmeasured by the paper's own evaluations.^[1]

Alignment tax

The alignment tax on standard NLP benchmarks is real, even if PPO-ptx mostly recovers it.^[1] There are tasks where the aligned model is worse than the base model, particularly translation and certain few-shot reasoning tasks, and the paper is honest about that.^[1] Later work, including the original Llama 2 Chat technical report and the SLiC and DPO papers, reported similar tradeoffs.^[4]^[25]^[33]

Cost and engineering complexity

Finally, the cost is high.^[1] Hiring tens of skilled labelers, designing screening procedures, collecting tens of thousands of demonstrations and comparisons, training a separate reward model, and running PPO on a 175B policy is expensive in both money and engineering time.^[1] A great deal of subsequent research has gone into making this pipeline cheaper and simpler, including DPO, KTO, ORPO, and AI feedback variants.^[33]^[34]^[35]^[29]

Open problems survey

The most thorough public summary of these issues is Casper, Davies et al. 2023, Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback, published in TMLR with a Survey Certification and citing more than 250 references.^[15] That paper organizes RLHF problems into challenges with feedback, challenges with the reward model, and challenges with the policy, and treats InstructGPT as the implicit baseline against which improvements are measured.^[15]

What methods succeeded or replaced InstructGPT?

The InstructGPT recipe is the starting point, not the end point. Several methods have been proposed that either build on it or aim to replace pieces of it.

Direct preference optimization (DPO)

Direct Preference Optimization (DPO), introduced by Rafailov et al. at NeurIPS 2023 in the paper Direct Preference Optimization: Your Language Model is Secretly a Reward Model, removes the explicit reward model and the RL step.^[33] It reformulates the preference learning problem so that a single binary cross-entropy classification objective on pairwise comparisons is mathematically equivalent to optimizing a policy under an implicit reward model.^[33] DPO is much simpler to implement, much more stable to train, and roughly matches PPO-based RLHF on common benchmarks.^[33] It became a popular default for open-weight model alignment after 2023, and a substantial fraction of the chat-tuned models on Hugging Face's leaderboards through 2025 used DPO or a variant.^[33]^[36]
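The classification objective can be sketched for a single preference pair as follows. The inputs are sequence log-probabilities of the preferred and dispreferred completions under the policy being trained and under a frozen reference model; the function name, the example numbers, and the default `beta` are illustrative assumptions.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one pair: -log sigmoid(beta * (log-ratio_w - log-ratio_l)).

    logp_*     : log pi(y | x) under the policy being trained
    ref_logp_* : log pi_ref(y | x) under the frozen reference model
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) computed stably as softplus(-m).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-probabilities for one comparison.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # policy equals reference: log(2)
print(dpo_loss(-8.0, -13.0, -10.0, -12.0))   # policy favors the preferred answer
```

No reward model or sampling loop appears anywhere in the computation, which is the practical simplification relative to the three-stage pipeline.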

Constitutional AI and RLAIF

Constitutional AI, introduced by Bai et al. at Anthropic in December 2022, replaces some or all human comparisons with comparisons generated by an LLM following a written constitution of principles.^[30] The technique trades human labeler cost for model inference cost, and it allows safety training to scale faster than human annotation can.^[30] Anthropic's Claude family is trained with this approach, sometimes called RLAIF, reinforcement learning from AI feedback.^[30]^[37]

RLAIF more broadly, including the Lee et al. 2023 paper from Google, generalizes the same idea: use a strong language model to produce preference labels at scale, with human oversight on a smaller validation set.^[29] Empirically, RLAIF is competitive with RLHF on summarization and other tasks, and on some benchmarks AI feedback exceeds human feedback when the AI labeler is strong enough.^[29]

Other preference optimization methods

A cluster of newer preference-optimization methods explores other angles. They all share the InstructGPT goal of aligning a base model with human preferences but vary in whether they use a reward model, an RL step, an offline objective, or some hybrid.

Method	Year	Reward model?	RL step?	Notes
RLHF (InstructGPT)	2022	Yes (separate RM)	Yes (PPO)	Original three-stage pipeline^[1]
DPO	2023	Implicit	No	Single classification objective on pairs^[33]
Constitutional AI / RLAIF	2022 to 2023	Yes	Yes	Uses AI feedback in place of human comparisons^[30]^[29]
SLiC	2022	Optional	No	Calibrates sequence likelihoods to preferences^[38]
KTO	2024	No	No	Uses prospect-theory loss on positive and negative examples^[34]
ORPO	2024	No	No	Combines SFT and preference optimization in a single stage^[35]
GRPO	2024	Implicit / group baseline	Yes (variant)	DeepSeek-R1 used this for reasoning models^[39]
RLHF + AI grading (hybrid)	2024 to 2025	Yes	Yes	Mixes human-labeled and AI-labeled preference data^[29]^[37]

None of these alternatives has fully displaced the InstructGPT recipe at frontier labs as of mid-2026, though DPO and its variants are increasingly used either alongside or instead of PPO in production training pipelines, particularly in open-weight model families.^[33]^[36] Frontier closed models, including GPT-4 and later, Claude, and Gemini, still use variants of PPO or related RL methods for the final preference-tuning step, with the open question being how much DPO-style methods can replace this at the very frontier.^[4]^[37]^[40]

LIMA and minimal-data alignment

A parallel research thread, exemplified by Meta's LIMA paper in 2023, argued that much of the alignment effect of InstructGPT comes from very few high-quality SFT examples (about 1,000 in LIMA's case), with RLHF providing diminishing returns once the SFT data is curated carefully.^[41] LIMA's "Superficial Alignment Hypothesis" is contested but influential: it suggested that the costly RL stage might not be necessary for many downstream applications.^[41] Subsequent results have been mixed, with high-quality SFT consistently strong on instruction following but weaker on refusals and harm reduction than full RLHF pipelines.^[4]^[41]

Influence on the field

It is hard to overstate how much of the post-2022 LLM landscape was shaped by this single paper. The vocabulary used to talk about model alignment (helpfulness, harmlessness, refusals, alignment tax, sycophancy, reward hacking) largely comes from InstructGPT and its immediate descendants.^[1]^[7]^[31] The practical training pipeline used by nearly every major closed and open-weight chat model traces back to it.^[4]^[25] Even academic critiques of RLHF, including the survey by Casper et al. and the "Helpful, Honest, Harmless?" review, are organized around the InstructGPT pipeline as the implicit baseline.^[15]^[16]

One thread in alignment discussion views InstructGPT as a mixed legacy: it made aligned models commercially valuable, which drew resources into capability research, which in turn made the alignment problem harder rather than easier.^[15]^[16] That argument is contested, but it is worth holding alongside the more triumphalist reading. The same three-step pipeline that produced helpful chatbots also produced models that hallucinate confidently, flatter users, and bake in narrow labeler preferences.^[31]^[16] Both the benefits and the failure modes are downstream of the same paper.

InstructGPT also reshaped the academic field by making preference-tuned models the de facto baseline. Pre-InstructGPT, a "GPT-3 baseline" meant the raw base model. Post-InstructGPT, a "frontier LM baseline" almost always means a preference-tuned variant, which has methodological consequences: capability evaluations on aligned models confound the base model and the alignment procedure, and the alignment tax itself becomes a moving target.^[4]^[15]

What is not contested is that InstructGPT marked the moment when RLHF became standard practice for language models. Before March 2022, almost no production LLM used human-feedback reinforcement learning. After March 2022, almost all of them did.^[4]^[5]

Is the InstructGPT recipe still used in 2026?

By mid-2026, the InstructGPT pipeline has been iterated on at every major lab but has not been replaced as the conceptual backbone of preference tuning.^[4]^[36] Three trends are worth noting.

First, the move toward reasoning models that train on verifiable rewards has produced a partial split in the alignment toolkit. Models like OpenAI's o-series, DeepSeek's R-series, and similar reasoning-focused systems use rule-based rewards (math correctness, code execution, formal proof checking) for reasoning chains, alongside a preference-based reward model that handles open-ended helpfulness and safety.^[39]^[45] This hybrid is sometimes called RLVR (Reinforcement Learning with Verifiable Rewards), and it complements rather than replaces the InstructGPT recipe: the SFT stage and the helpfulness reward model are still there, but the RL stage now optimizes against a mix of verified and learned rewards.^[45]
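
The mix of verified and learned rewards described above can be sketched as a simple router. This is a hypothetical illustration, not the reward code of any named system: `exact_match_reward`, `mixed_reward`, the convention that the final line of a response carries the answer, and the toy reward model are all assumptions made for the example.

```python
from typing import Callable, Optional


def exact_match_reward(response: str, reference_answer: str) -> float:
    """Rule-based check: 1.0 if the last non-empty line equals the reference answer."""
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    final_line = lines[-1] if lines else ""
    return 1.0 if final_line == reference_answer.strip() else 0.0


def mixed_reward(prompt: str, response: str,
                 learned_rm: Callable[[str, str], float],
                 reference_answer: Optional[str] = None) -> float:
    """Use the verifiable check when a reference answer exists, else the learned RM."""
    if reference_answer is not None:
        return exact_match_reward(response, reference_answer)
    return learned_rm(prompt, response)


def toy_rm(prompt: str, response: str) -> float:
    """Stand-in for a learned preference model (assumption for the demo)."""
    return 0.25


math_reward = mixed_reward("What is 2 + 2?", "Add the numbers.\n4", toy_rm, "4")
wrong_reward = mixed_reward("What is 2 + 2?", "Add the numbers.\n5", toy_rm, "4")
open_ended_reward = mixed_reward("Write a haiku about rain.", "Soft rain on the roof", toy_rm)
```

In this sketch the two reward sources never blend for a single prompt; real systems may weight or combine them differently.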

Second, the costs of the InstructGPT pipeline have dropped substantially. The original 175B PPO-ptx run took 60 petaflop/s-days; equivalent post-training for open-weight models in the 7B to 70B range now routinely runs in a few hundred GPU hours using DPO, KTO, or ORPO, without a separate reward model and without the RL infrastructure.^[33]^[34]^[35] Open-weight chat models, from Llama 3 Instruct through Mistral-Instruct, Qwen-Chat, and the community fine-tunes built on top of them, have made preference tuning a commodity step.^[25]^[46]
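
To put the 60 petaflop/s-day figure on the same scale as GPU hours, the unit conversion can be worked through directly. The arithmetic below is a rough sketch: the sustained throughput of 100 teraflop/s per GPU is an assumption chosen for round numbers, not a measured value for any particular hardware or run.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def pfs_days_to_flop(pfs_days: float) -> float:
    """Convert petaflop/s-days to total floating-point operations."""
    return pfs_days * 1e15 * SECONDS_PER_DAY


def gpu_hours(total_flop: float, sustained_flop_per_s: float) -> float:
    """GPU-hours needed at a given sustained per-GPU throughput."""
    return total_flop / sustained_flop_per_s / 3600


PPO_PTX_PFS_DAYS = 60            # figure quoted above for the 175B PPO-ptx run
ASSUMED_SUSTAINED_FLOPS = 1e14   # ASSUMPTION: 100 teraflop/s sustained per GPU

ppo_ptx_flop = pfs_days_to_flop(PPO_PTX_PFS_DAYS)
ppo_ptx_gpu_hours = gpu_hours(ppo_ptx_flop, ASSUMED_SUSTAINED_FLOPS)
print(f"{ppo_ptx_flop:.3e} FLOP, about {ppo_ptx_gpu_hours:,.0f} GPU-hours")
```

Under that assumed throughput, the original run works out to roughly 5.2e21 floating-point operations, on the order of ten thousand GPU-hours, compared with the few hundred GPU hours cited for current open-weight post-training.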

Third, the evaluation surface has widened. The original InstructGPT paper used TruthfulQA, RealToxicityPrompts, Winogender, CrowS-Pairs, and a small set of NLP benchmarks. By 2026, evaluation suites include MT-Bench, AlpacaEval 2, Arena-Hard, IFEval (instruction following), BBQ (the Bias Benchmark for Question Answering), Anthropic's persona evaluations, and many lab-internal red-team metrics.^[36]^[17]^[31]^[47] The relative ranking of preference-tuning methods varies across these benchmarks, with no single method dominating, which is one reason multiple alignment recipes coexist at the frontier.^[36]

What has not changed is the basic shape of the pipeline introduced in 2022: a base language model is given a curated dataset of demonstrations to shift its style, then a learned model of human preferences guides further optimization, with an anchor term to prevent the optimization from going too far.^[1]^[4] That structure, in its essentials, is still the recipe that runs in production for almost every aligned LLM.
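
The "anchor term" in that description is usually a KL-style penalty against a frozen reference model. A minimal sketch of how such a penalty could shape the reward for one sampled response follows; the function name, the per-token log-probability inputs, and the value of `beta` are illustrative assumptions rather than the exact formulation of any production system.

```python
from typing import Sequence


def kl_anchored_reward(rm_score: float,
                       policy_logprobs: Sequence[float],
                       ref_logprobs: Sequence[float],
                       beta: float = 0.1) -> float:
    """Reward-model score minus a penalty for drifting from the reference model.

    beta is a hypothetical coefficient; larger values anchor the policy harder.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-probability sequences must have the same length")
    # Summed per-token log-ratio: a single-sample estimate of
    # KL(policy || reference) on this response.
    log_ratio = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * log_ratio


# No drift from the reference: the reward is just the reward-model score.
unchanged = kl_anchored_reward(1.0, [-1.5, -1.0], [-1.5, -1.0])
# The policy is more confident than the reference on both tokens: reward is reduced.
drifted = kl_anchored_reward(1.0, [-1.0, -0.5], [-1.5, -1.0])
```

With these toy numbers the drifted response loses 0.1 of reward, which shows how the anchor trades reward-model score against distance from the starting model.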

Authorship and context

The InstructGPT paper has 20 authors, drawing from OpenAI's alignment, applied, and policy teams.^[1] Notable contributors include first author Long Ouyang, Jan Leike (then head of the alignment team), Paul Christiano (lead author of the 2017 RLHF preferences paper), and John Schulman (lead author of the PPO paper).^[1]^[8]^[13] Several of these researchers later moved between OpenAI, Anthropic, and independent alignment groups, carrying versions of the InstructGPT methodology with them.^[7]^[30]^[42] Christiano left OpenAI to found the Alignment Research Center in 2021 and later joined the US AI Safety Institute, now the Center for AI Standards and Innovation, as head of AI safety in 2024.^[43] Jan Leike left OpenAI for Anthropic in 2024.^[42]^[44]

The fact that several principal authors moved to safety-focused organizations after the paper is sometimes cited in the discussion of OpenAI's internal direction, though direct causal claims are contested.^[42]^[44] What is clear is that the personnel network around InstructGPT shaped alignment research at multiple frontier labs.

References

1. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, "Training Language Models to Follow Instructions with Human Feedback", NeurIPS 2022 / arXiv:2203.02155, 2022-03-04. https://arxiv.org/abs/2203.02155. Accessed 2026-05-24.
2. OpenAI, "Aligning Language Models to Follow Instructions", OpenAI Blog, 2022-01-27. https://openai.com/index/instruction-following/. Accessed 2026-05-24.
3. OpenAI, "Introducing ChatGPT", OpenAI Blog, 2022-11-30. https://openai.com/index/chatgpt/. Accessed 2026-05-24.
4. Nathan Lambert, "A Tiny History of RLHF" and related entries, RLHF Book, 2024 to 2025. https://rlhfbook.com/. Accessed 2026-05-24.
5. Hugo Touvron, Louis Martin, Kevin Stone et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models", Meta AI, arXiv:2307.09288, 2023-07-18. https://arxiv.org/abs/2307.09288. Accessed 2026-05-24.
6. Tom B. Brown, Benjamin Mann, Nick Ryder et al., "Language Models are Few-Shot Learners", NeurIPS 2020, arXiv:2005.14165, 2020-05-28. https://arxiv.org/abs/2005.14165. Accessed 2026-05-24.
7. Amanda Askell, Yuntao Bai, Anna Chen et al., "A General Language Assistant as a Laboratory for Alignment", Anthropic, arXiv:2112.00861, 2021-12-01. https://arxiv.org/abs/2112.00861. Accessed 2026-05-24.
8. Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei, "Deep Reinforcement Learning from Human Preferences", NeurIPS 2017, arXiv:1706.03741, 2017-06-12. https://arxiv.org/abs/1706.03741. Accessed 2026-05-24.
9. Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano, "Learning to Summarize from Human Feedback", NeurIPS 2020, arXiv:2009.01325, 2020-09-02. https://arxiv.org/abs/2009.01325. Accessed 2026-05-24.
10. Reiichiro Nakano, Jacob Hilton, Suchir Balaji et al., "WebGPT: Browser-Assisted Question-Answering with Human Feedback", OpenAI, arXiv:2112.09332, 2021-12-17. https://arxiv.org/abs/2112.09332. Accessed 2026-05-24.
11. OpenAI, "InstructGPT Model Card", GitHub: openai/following-instructions-human-feedback, last updated 2022-01. https://github.com/openai/following-instructions-human-feedback/blob/main/model-card.md. Accessed 2026-05-24.
12. Yuntao Bai, Andy Jones, Kamal Ndousse et al., "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback", Anthropic, arXiv:2204.05862, 2022-04-12. https://arxiv.org/abs/2204.05862. Accessed 2026-05-24.
13. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov, "Proximal Policy Optimization Algorithms", OpenAI, arXiv:1707.06347, 2017-07-20. https://arxiv.org/abs/1707.06347. Accessed 2026-05-24.
14. Leo Gao, John Schulman, Jacob Hilton, "Scaling Laws for Reward Model Overoptimization", ICML 2023, arXiv:2210.10760, 2022-10-19. https://arxiv.org/abs/2210.10760. Accessed 2026-05-24.
15. Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jeremy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphael Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Biyik, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell, "Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback", TMLR (Survey Certification), arXiv:2307.15217, 2023-07-27. https://arxiv.org/abs/2307.15217. Accessed 2026-05-24.
16. "Helpful, Honest, Harmless? Sociotechnical Limits of AI Alignment and Safety through Reinforcement Learning from Human Feedback", PMC review article, 2025. https://pmc.ncbi.nlm.nih.gov/articles/PMC12137480/. Accessed 2026-05-24.
17. Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, Samuel R. Bowman, "BBQ: A Hand-Built Bias Benchmark for Question Answering", Findings of ACL 2022, arXiv:2110.08193, 2021-10-15. https://arxiv.org/abs/2110.08193. Accessed 2026-05-24.
18. Rachel Rudinger, Jason Naradowsky, Brian Leonard, Benjamin Van Durme, "Gender Bias in Coreference Resolution (Winogender)", NAACL 2018, arXiv:1804.09301, 2018-04-25. https://arxiv.org/abs/1804.09301. Accessed 2026-05-24.
19. OpenAI, "Model Index for Researchers (GPT-3.5 series)", OpenAI documentation, archived 2023. https://platform.openai.com/docs/model-index-for-researchers. Accessed 2026-05-24.
20. John McDonnell, "OpenAI Comes Clean About GPT-3.5", Substack, 2022-12-02. https://jmcdonnell.substack.com/p/openai-comes-clean-about-gpt-35. Accessed 2026-05-24.
21. Kyle Wiggers, "While Anticipation Builds for GPT-4, OpenAI Quietly Releases GPT-3.5", TechCrunch, 2022-12-01. https://techcrunch.com/2022/12/01/while-anticipation-builds-for-gpt-4-openai-quietly-releases-gpt-3-5/. Accessed 2026-05-24.
22. Maximilian Schreiner, "GPT-3.5: OpenAI's Latest GPT-3 Model Generates Better and Longer Texts", The Decoder, 2022-11-29. https://the-decoder.com/openais-latest-gpt-3-model-generates-better-and-longer-texts/. Accessed 2026-05-24.
23. Greg Brockman, "ChatGPT just crossed 1 million users; it's been 5 days since launch", X/Twitter, 2022-12-05. https://x.com/gdb/status/1599683104142430208. Accessed 2026-05-24.
24. Krystal Hu, "ChatGPT sets record for fastest-growing user base", Reuters, 2023-02-02. https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/. Accessed 2026-05-24.
25. Wei-Lin Chiang, Zhuohan Li, Zi Lin et al., "Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality", LMSYS, 2023-03-30. https://lmsys.org/blog/2023-03-30-vicuna/. Accessed 2026-05-24.
26. Rohan Taori, Ishaan Gulrajani, Tianyi Zhang et al., "Alpaca: A Strong, Replicable Instruction-Following Model", Stanford CRFM, 2023-03-13. https://crfm.stanford.edu/2023/03/13/alpaca.html. Accessed 2026-05-24.
27. Stuart Russell, Daniel Dewey, Max Tegmark, "Research Priorities for Robust and Beneficial Artificial Intelligence", AI Magazine, 2015 (foundational context for the alignment field). https://futureoflife.org/data/documents/research_priorities.pdf. Accessed 2026-05-24.
28. Semantic Scholar, "Training Language Models to Follow Instructions with Human Feedback (citation count)", 2024 to 2025. https://www.semanticscholar.org/paper/d766bffc357127e0dc86dd69561d5aeb520d6f4c. Accessed 2026-05-24.
29. Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, Sushant Prakash, "RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback", Google Research, arXiv:2309.00267, 2023-09-01. https://arxiv.org/abs/2309.00267. Accessed 2026-05-24.
30. Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones et al., "Constitutional AI: Harmlessness from AI Feedback", Anthropic, arXiv:2212.08073, 2022-12-15. https://arxiv.org/abs/2212.08073. Accessed 2026-05-24.
31. Ethan Perez, Sam Ringer, Kamile Lukosiute et al., "Discovering Language Model Behaviors with Model-Written Evaluations", Anthropic, arXiv:2212.09251, 2022-12-19. https://arxiv.org/abs/2212.09251. Accessed 2026-05-24.
32. Mrinank Sharma, Meg Tong, Tomasz Korbak et al., "Towards Understanding Sycophancy in Language Models", Anthropic, arXiv:2310.13548, 2023-10-20. https://arxiv.org/abs/2310.13548. Accessed 2026-05-24.
33. Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn, "Direct Preference Optimization: Your Language Model is Secretly a Reward Model", NeurIPS 2023, arXiv:2305.18290, 2023-05-29. https://arxiv.org/abs/2305.18290. Accessed 2026-05-24.
34. Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, Douwe Kiela, "KTO: Model Alignment as Prospect Theoretic Optimization", Stanford and Contextual AI, arXiv:2402.01306, 2024-02-02. https://arxiv.org/abs/2402.01306. Accessed 2026-05-24.
35. Jiwoo Hong, Noah Lee, James Thorne, "ORPO: Monolithic Preference Optimization without Reference Model", KAIST AI, arXiv:2403.07691, 2024-03-12. https://arxiv.org/abs/2403.07691. Accessed 2026-05-24.
36. Hamish Ivison, Yizhong Wang, Jiacheng Liu et al., "Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback", NeurIPS 2024, arXiv:2406.09279, 2024-06-13. https://arxiv.org/abs/2406.09279. Accessed 2026-05-24.
37. Anthropic, "Constitutional AI: Harmlessness from AI Feedback", Anthropic Research, 2022-12-15. https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback. Accessed 2026-05-24.
38. Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, Peter J. Liu, "SLiC: Sequence Likelihood Calibration with Human Feedback", Google Research, arXiv:2305.10425, 2023-05-17. https://arxiv.org/abs/2305.10425. Accessed 2026-05-24.
39. DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning", arXiv:2501.12948, 2025-01-22. https://arxiv.org/abs/2501.12948. Accessed 2026-05-24.
40. OpenAI, "GPT-4 Technical Report", arXiv:2303.08774, 2023-03-15. https://arxiv.org/abs/2303.08774. Accessed 2026-05-24.
41. Chunting Zhou, Pengfei Liu, Puxin Xu et al., "LIMA: Less Is More for Alignment", Meta AI, arXiv:2305.11206, 2023-05-18. https://arxiv.org/abs/2305.11206. Accessed 2026-05-24.
42. Pranav Dixit, "Jan Leike, OpenAI's Alignment Lead, Resigns", Wired, 2024-05-17. https://www.wired.com/story/openai-alignment-lead-resigns/. Accessed 2026-05-24.
43. Will Knight, "Paul Christiano to Lead US AI Safety Institute Safety Work", Wired, 2024-04-16. https://www.wired.com/story/us-government-ai-safety-institute-paul-christiano/. Accessed 2026-05-24.
44. Jan Leike, "I joined Anthropic to continue the superalignment mission", X/Twitter, 2024-05-28. https://x.com/janleike/status/1795497960509448617. Accessed 2026-05-24.
45. Sasha Rush et al., "Reinforcement Learning with Verifiable Rewards: a survey of open-weight reasoning models", 2025 (overview commentary on R1, o1, and successors). https://arxiv.org/abs/2501.12948. Accessed 2026-05-24.
46. AI@Meta, "The Llama 3 Herd of Models", Meta AI, arXiv:2407.21783, 2024-07-31. https://arxiv.org/abs/2407.21783. Accessed 2026-05-24.
47. Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, Le Hou, "Instruction-Following Evaluation for Large Language Models (IFEval)", Google Research, arXiv:2311.07911, 2023-11-14. https://arxiv.org/abs/2311.07911. Accessed 2026-05-24.

Related Articles

SPIN (Self-Play Fine-Tuning)

SimPO

Self-Rewarding Language Models

DPO

KTO

RLOO (REINFORCE Leave-One-Out)
