Weak-to-Strong Generalization

Updated Jun 23, 2026

Weak-to-Strong Generalization is an empirical research direction, introduced in a December 2023 paper by OpenAI's Superalignment team, that studies whether a strong AI model trained on labels from a much weaker supervisor can still recover most of its full capabilities. In the paper's headline result, a GPT-2-level model used as supervisor elicited much of GPT-4's capability: finetuning GPT-4 on the weak labels plus an auxiliary confidence loss recovered nearly 80 percent of the performance gap, reaching roughly GPT-3.5-level accuracy on natural language tasks^[1]^[2]. The work is framed as an empirical analog for the long-term superalignment problem: how can humans, who will eventually be cognitively weaker than the systems they oversee, still elicit aligned, high-quality behavior from superhuman AI?^[1]^[2]

Formally titled Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision, the paper was led by Collin Burns, Pavel Izmailov, and Jan Hendrik Kirchner. Released on December 14, 2023, it was the Superalignment team's first major publication and was accompanied by an OpenAI research blog post, a 49-page arXiv preprint (2312.09390), an open-source code release on GitHub, and a $10 million Superalignment Fast Grants program intended to seed academic and independent research in the same direction^[1]^[2]^[3]. The paper introduced the Performance Gap Recovery (PGR) metric, the analogy of pairing a GPT-2 supervisor with a GPT-4 student, and three concrete intervention techniques (an auxiliary confidence loss, bootstrapping, and generative finetuning) that improved generalization above naive finetuning. Its broader role in the AI alignment discourse is as a foundational, influential reframing of the supervision problem, even though the Superalignment team itself was dissolved in May 2024 following the departures of Ilya Sutskever and Jan Leike^[4]^[5].

Why is weak supervision a problem for alignment?

The motivating challenge for the paper is straightforward. Standard alignment techniques such as RLHF depend on humans, or models trained on human judgments, being competent evaluators of model outputs. As the paper's abstract puts it, "Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior"^[1]. For tasks that current humans can grade reliably, this works: a labeler can compare two responses, score a piece of writing, or check whether code runs. For superhuman systems, the assumption breaks down. A future model writing a million-line codebase, deriving a novel mathematical proof, or designing biological molecules may produce outputs that no human can verify in any reasonable amount of time^[1]^[2].

Burns and coauthors argue that this is not a hypothetical concern. If pretrained models continue scaling, capabilities will outpace humans on more and more tasks, and at some point reward signals from human supervisors will be systematically weaker than the latent knowledge of the model. The Superalignment team, which OpenAI announced in July 2023 with a stated commitment of 20 percent of secured compute over four years, was created specifically to study this regime in advance of its arrival^[6].

The paper observes that two complementary research programs address the gap. Scalable oversight seeks to augment human evaluators with tools, decomposition, or AI assistants so they can grade outputs that would otherwise be beyond them, including methods like recursive reward modeling and AI safety via debate. Weak-to-strong generalization takes the opposite tack: it accepts that supervision will be imperfect and asks whether the strong student model can correctly generalize from flawed labels because the relevant concept is already encoded in its pretrained representations^[1]. The two approaches are framed as orthogonal rather than competing.

A practical reason this question is tractable today is that the team can flip the usual relationship. In normal training, humans are the strong evaluators of weaker models. To study the future regime, the team substitutes a smaller pretrained model in the role of the "weak human" and a much larger pretrained model in the role of the future "superhuman student." The resulting supervisor-student pair offers an empirical sandbox that did not previously exist.

How did OpenAI test weak-to-strong generalization?

The core experimental procedure has three stages^[1]^[2]:

1. Train a weak supervisor. A small pretrained model from the GPT-4 family is finetuned on ground-truth labels for the task. Its predictions on a held-out set become the "weak labels."
2. Train a strong student on weak labels. A much larger pretrained model is finetuned only on the weak labels, never seeing ground truth. This is the weak-to-strong model.
3. Train a strong ceiling model. The same large pretrained model is finetuned on ground-truth labels to estimate the maximum performance the strong student could in principle reach. This is the ceiling against which weak-to-strong performance is compared.
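The three stages can be sketched as a small orchestration function. This is an illustrative outline only, not the paper's code: the names `weak_to_strong_experiment`, `train_weak`, and `train_strong` are hypothetical, models are reduced to plain callables, and training the ceiling on the same held-out inputs (with their ground-truth labels) is an assumption made to keep the comparison simple.

```python
from typing import Callable, Dict, Sequence, Tuple

Example = Tuple[object, int]            # (input, label)
Model = Callable[[object], int]         # maps an input to a predicted label
Trainer = Callable[[Sequence[Example]], Model]


def accuracy(model: Model, data: Sequence[Example]) -> float:
    """Fraction of examples whose prediction matches the ground-truth label."""
    return sum(model(x) == y for x, y in data) / len(data)


def weak_to_strong_experiment(
    train_weak: Trainer,
    train_strong: Trainer,
    weak_train: Sequence[Example],
    held_out: Sequence[Example],
    test: Sequence[Example],
) -> Dict[str, float]:
    # Stage 1: train the weak supervisor on ground truth, then let it
    # label the held-out inputs. Those predictions are the weak labels.
    weak = train_weak(weak_train)
    weak_labeled = [(x, weak(x)) for x, _ in held_out]

    # Stage 2: train the strong student on weak labels only.
    student = train_strong(weak_labeled)

    # Stage 3: train the same strong model on ground truth (the ceiling).
    ceiling = train_strong(held_out)

    return {
        "weak": accuracy(weak, test),
        "weak_to_strong": accuracy(student, test),
        "strong_ceiling": accuracy(ceiling, test),
    }
```

The three returned accuracies are the inputs needed to compare the student against both its supervisor and its ceiling.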

The team used pretrained models spanning roughly seven orders of magnitude of pretraining compute, from GPT-2-scale models up to GPT-4 itself, all from the same model family. This range allowed them to study how generalization changes as a function of both the size of the weak supervisor and the size of the strong student^[1].

What three task families were used?

The paper evaluates weak-to-strong generalization across three task domains chosen for their mix of practical relevance, evaluation difficulty, and structural variety^[1]^[2].

NLP classification. Twenty-two popular NLP datasets were used, including tasks spanning ethics, commonsense reasoning, natural language inference, and sentiment analysis. All were converted to binary classification with approximately balanced classes so that PGR could be measured uniformly. The choice of binary classification simplifies probability calibration and makes auxiliary loss formulations straightforward.

Chess puzzles. The team used the chess puzzle dataset introduced by Schwarzschild et al. (2021), drawn from lichess.org. Each puzzle presents a chess position whose correct continuation requires multi-step tactical reasoning, and the model is trained to predict the optimal first move. Chess is generative and structurally different from binary classification, providing a stress test for whether the weak-to-strong phenomenon depends on task format.

Reward modeling. The team finetuned models on the proprietary ChatGPT reward-model dataset, in which humans compared pairs of model responses and chose the preferred one. Pairwise preference modeling is structurally closer to the production RLHF pipeline and therefore the most directly relevant of the three task families to real-world alignment.

The cross-domain design lets the authors examine whether interventions that work in one setting carry over to others, and which domains are easier or harder for weak-to-strong recovery in the first place.

What is Performance Gap Recovery (PGR)?

The key quantitative metric the paper introduces is Performance Gap Recovery (PGR), defined as^[1]:

PGR = (weak-to-strong performance - weak performance) / (strong ceiling performance - weak performance)

PGR is 1 if the weak-to-strong student matches the ground-truth-supervised ceiling, and 0 if it does no better than its weak supervisor. Negative values are possible when the student underperforms its supervisor (rare in practice). The metric isolates the fraction of the gap recovered rather than absolute accuracy, which makes it possible to compare results across tasks where weak and ceiling accuracies differ widely.
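As a worked illustration of the formula, the following minimal sketch computes PGR from three accuracies. The function name and the example numbers are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the paper.

```python
def performance_gap_recovery(
    weak: float, weak_to_strong: float, strong_ceiling: float
) -> float:
    """Fraction of the weak-to-ceiling gap recovered by the student."""
    gap = strong_ceiling - weak
    if gap == 0:
        # The metric is undefined when the ceiling equals the weak baseline.
        raise ValueError("strong ceiling and weak performance are equal")
    return (weak_to_strong - weak) / gap


# Illustrative numbers: weak supervisor 60%, student 76%, ceiling 80%.
# The student recovers 16 of the 20 available points.
pgr = performance_gap_recovery(weak=0.60, weak_to_strong=0.76, strong_ceiling=0.80)
print(round(pgr, 3))  # approximately 0.8
```

A student that scores below its supervisor gives a negative value, matching the note above that negative PGR is possible.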

This framing is important because it normalizes away the obvious confound that bigger ceiling models tend to be better in absolute terms. PGR is meaningfully positive only when the strong student outperforms its supervisor by leveraging knowledge the supervisor did not provide. Some secondary sources mis-cite the metric as "PRR" or other near-acronyms; the correct term is Performance Gap Recovery^[1].

Three Intervention Techniques

The paper found that naive finetuning produced clearly positive but modest PGR in most settings. The authors then introduced three interventions, each best suited to one of the three task families^[1]^[2].

Auxiliary Confidence Loss

For NLP tasks, the authors added an auxiliary term to the standard cross-entropy loss that rewards the strong student for being confident in its own predictions even when those predictions disagree with the weak labels. Conceptually, the loss is related to conditional entropy minimization: it pulls the student's output distribution toward sharper, more decisive predictions, which lets the student override supervisor noise when its own representation strongly indicates the opposite answer^[1]^[2].

Empirically, this intervention dramatically improved generalization. The paper reports that "with the smallest weak supervisor and largest strong student, the confidence loss increases median PGR from about 25% to nearly 80%"^[1]. In absolute terms, finetuning GPT-4 with a GPT-2-level supervisor and the auxiliary confidence loss recovered close to GPT-3.5-level performance on the 22 NLP tasks^[1]^[2]. The intervention is, in effect, a way of telling the student that the supervisor is approximately right but noisy, so the student should trust its own confidence when the two disagree.
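The idea of blending the weak label with the student's own hardened prediction can be sketched for a single binary example as follows. This is a simplified illustration under stated assumptions, not the paper's exact formulation: the mixing weight `alpha`, the fixed `threshold` of 0.5, and the function names are all assumptions, and the paper's handling of the weight and threshold may differ.

```python
import math


def binary_cross_entropy(p: float, target: float, eps: float = 1e-12) -> float:
    """Cross-entropy between predicted probability p and a 0/1 target."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def confidence_loss(
    p_student: float, weak_label: float, alpha: float = 0.5, threshold: float = 0.5
) -> float:
    """Mix the usual weak-label loss with a term that rewards confidence.

    alpha = 0 recovers plain cross-entropy against the weak label.
    Larger alpha puts more weight on the student's own hardened prediction.
    """
    hardened = 1.0 if p_student > threshold else 0.0
    weak_term = binary_cross_entropy(p_student, weak_label)
    self_term = binary_cross_entropy(p_student, hardened)
    return (1.0 - alpha) * weak_term + alpha * self_term


# The student is 90% confident in class 1, but the weak label says 0.
plain = binary_cross_entropy(0.9, 0.0)
mixed = confidence_loss(0.9, 0.0, alpha=0.5)
print(mixed < plain)  # True: the disagreement is penalized less
```

When the student and the weak label agree, both terms coincide and the loss reduces to ordinary cross-entropy; the auxiliary term only changes the gradient on examples where they disagree.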

Bootstrapping

For chess puzzles, naive finetuning showed PGR that decreased as the supervisor-student gap grew, with the smallest supervisor producing near-zero PGR. The authors hypothesized that very large jumps in capability are particularly difficult for the student to bridge in a single step.

Their fix was bootstrapping: instead of jumping directly from the smallest supervisor to the largest student, they trained an intermediate-sized model on labels from the small supervisor, then trained a larger model on labels from that intermediate, and finally trained the largest student on labels from the larger model. The team ran three iterations of weak-to-strong learning, using two intermediate model sizes before finetuning the largest model^[1].

Bootstrapping substantially improved chess PGR across large supervisor-student gaps, reducing the deterioration observed in the naive setting. The authors note explicitly, however, that bootstrapping alone is not expected to be sufficient for aligning models much more capable than their supervisors: each step must still produce a faithful enough intermediate, and errors can compound across the chain^[1].
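The chain of supervisors described above can be sketched as a loop in which each newly trained model labels the data for the next, larger one. This is a schematic outline with hypothetical names (`bootstrap`, `trainers`); models are reduced to plain callables and no real finetuning is performed.

```python
from typing import Callable, List, Sequence, Tuple

Label = object
Model = Callable[[object], Label]
Trainer = Callable[[Sequence[Tuple[object, Label]]], Model]


def bootstrap(
    supervisor: Model, trainers: Sequence[Trainer], inputs: Sequence[object]
) -> List[Model]:
    """Train models in order of increasing size, each on the previous model's labels.

    `trainers` is ordered from the smallest intermediate model to the
    final, largest student. Returns the whole chain, supervisor first.
    """
    chain: List[Model] = [supervisor]
    for train in trainers:
        current_supervisor = chain[-1]
        # The most recently trained model provides labels for the next step.
        labeled = [(x, current_supervisor(x)) for x in inputs]
        chain.append(train(labeled))
    return chain
```

With two intermediate trainers followed by the largest student's trainer, the loop runs three times, matching the three iterations of weak-to-strong learning described above. Because every model sees only its predecessor's labels, an error introduced at one step is passed down the rest of the chain.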

Generative Finetuning

For reward modeling, the team observed the weakest baseline of the three domains: naive weak-to-strong yielded PGR around 10 percent and rarely above 20 percent. The intervention was to add an extra generative finetuning stage in which the strong student was first finetuned in a self-supervised language-modeling style on prompt-response data drawn from the reward-modeling distribution. Only after this unsupervised stage did the student fit the binary preference labels^[1]^[2].

The generative finetuning step is interpreted as increasing the salience of the relevant concept inside the strong model. By doing additional language-modeling work on the task distribution, the student's representations become more aligned with the structure the reward-modeling head ultimately needs to read out. The intervention improved reward modeling PGR by roughly 10 to 20 percentage points but did not close the gap to ground-truth supervision the way the confidence loss did for NLP.

What did the experiments find?

Several results stand out across the three task families^[1]^[2].

Weak-to-strong generalization is real and ubiquitous. Naively finetuning a strong pretrained model on labels from a much weaker supervisor consistently produces a student that outperforms the supervisor. The phenomenon was robust across NLP, chess, and reward modeling, and across multiple supervisor-student size pairs. EleutherAI's independent replication using Llama 3 8B and Qwen1.5 0.5B reproduced clear weak-to-strong generalization with PGR greater than zero across nearly all of 21 NLP datasets, providing external corroboration that the phenomenon is not GPT-family-specific^[7].

PGR generally improves with student size, often with weak-supervisor size, and varies sharply by domain. On NLP, the largest students typically recovered more than half of the gap; even very weak supervisors paired with large students achieved PGR above 20 percent. On chess, PGR was close to zero at the largest supervisor-student gaps but climbed above 40 percent at smaller gaps. On reward modeling, the baseline was the weakest of the three, signaling that the most production-relevant setting is also the hardest^[1].

Simple interventions can dramatically improve generalization, but not solve it. The auxiliary confidence loss raised median NLP PGR from approximately 25 percent to nearly 80 percent for extreme size gaps. Bootstrapping rescued chess from its size-induced collapse. Generative finetuning improved reward-modeling PGR by a non-trivial but still modest margin. The team explicitly frames the methods as proofs of concept rather than deployment-ready alignment techniques^[1].

Naive RLHF likely scales poorly to superhuman models. Because weak-to-strong generalization was particularly poor on the production-relevant ChatGPT reward-modeling setting, the authors conclude that "naive RLHF will likely scale poorly to superhuman models without additional work"^[1]. This is one of the paper's most-cited takeaways: it is not that current alignment methods fail outright, but that they should not be assumed to extend automatically to systems more capable than their supervisors.

Strong students often disagree with supervisor errors. Larger students tended to agree less with supervisor mistakes despite being trained to imitate them, a kind of inverse-scaling pattern. The interpretation is that the strong student's pretrained representations already encode a more accurate concept than the supervisor's label, and finetuning surfaces that latent knowledge rather than overwriting it^[1].
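One way to make this observation measurable is to split student-supervisor agreement by whether the supervisor was right. The sketch below is illustrative only: the function name and the toy labels are hypothetical, and the paper's own analysis may be computed differently.

```python
from typing import Optional, Sequence, Tuple


def agreement_by_supervisor_correctness(
    student_preds: Sequence[int],
    weak_labels: Sequence[int],
    true_labels: Sequence[int],
) -> Tuple[Optional[float], Optional[float]]:
    """Return (agreement when supervisor is right, agreement when it is wrong).

    Each value is the fraction of examples in that group where the student
    predicts the same label as the supervisor, or None if the group is empty.
    """
    when_right = []
    when_wrong = []
    for s, w, t in zip(student_preds, weak_labels, true_labels):
        (when_right if w == t else when_wrong).append(s == w)

    def rate(flags):
        return sum(flags) / len(flags) if flags else None

    return rate(when_right), rate(when_wrong)
```

A low second value means the student tends to reject the supervisor's mistakes rather than copy them, which is the pattern described above.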

Early stopping matters. The authors note that early stopping based on ground-truth validation gave roughly a 15-PGR-point boost over training to convergence in the naive NLP setting, but only about a 5-point boost when using the confidence loss. This complicates evaluation because production deployments would not have ground-truth validation signals available^[1].

What limitations did the authors acknowledge?

The paper is unusually candid about the disanalogies between its experimental setup and the future regime it tries to inform. The authors devote a full section to limitations^[1].

Imitation saliency disanalogy. Future superhuman models will likely be much better at predicting what humans think and say than current GPT-4 is at predicting GPT-2 outputs, because such models will be trained extensively on human-generated data. A genuinely superhuman model under naive human supervision might find it easier to imitate human errors than current students find it to imitate weak-model errors, which would worsen weak-to-strong generalization rather than reproduce the positive result.

Pretraining leakage disanalogy. Many of the 22 NLP tasks and the chess puzzles likely appeared, at least indirectly, in pretraining data. The strong student may therefore be drawing on a concept it has already seen rather than discovering a genuinely latent capability. Future superhuman alignment-relevant capabilities, by contrast, may not appear in pretraining at all and may be much harder to elicit.

Methods are not consistently successful. None of the three interventions worked across all three domains. Reward modeling, the most production-relevant setting, retained the largest residual gap even after generative finetuning. The authors write that they are "still far from recovering the full performance gap" in the reward modeling setting.

The approach is not a complete alignment strategy. Weak-to-strong generalization addresses how to extract latent capabilities from a strong model given imperfect labels. It does not by itself address questions of reward specification, robustness to optimization pressure, or detection of deceptive behavior. The authors describe their methodology as a starting point and call for follow-up work that constructs experimental setups with fewer disanalogies.

Reception in the Alignment Community

The paper received heavy attention on AI alignment discussion venues including LessWrong and the Alignment Forum. Reaction broadly clustered into three categories^[8]^[9].

The first group treated weak-to-strong generalization as a legitimate, novel empirical paradigm: a way to study a real phenomenon (latent knowledge elicitation) using current models, in advance of having actually superhuman systems to align. EleutherAI's independent replication on a non-GPT model family reinforced this view by reproducing the basic effect in different conditions^[7].

The second group expressed methodological skepticism. Critics on the Alignment Forum questioned whether weak-to-strong generalization, as framed, qualifies as an alignment technique at all rather than as an interesting capabilities phenomenon. One line of critique argued that techniques in the weak-to-strong family may not survive adversarial evaluation: a sufficiently capable model that learns to imitate the supervisor while concealing latent disagreement is exactly the failure mode the paper hopes to overcome, and there is no obvious way to detect it from outputs alone^[8].

The third group focused on the disanalogies the paper itself raised, especially the imitation saliency concern. Several writers argued that future models trained on human data are precisely the case where weak-to-strong generalization is least likely to apply, because the student will be unusually good at human-style outputs and therefore most tempted to imitate human-style errors. This concern motivated some of the subsequent theoretical work that tried to characterize when weak-to-strong generalization holds and when it fails.

Follow-Up Research

The paper catalyzed a wave of follow-up work in 2024, 2025, and 2026 across theoretical, empirical, and methodological directions.

Easy-to-Hard Generalization. Sun, Yu, and coauthors, including researchers at IBM, published Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision at NeurIPS 2024 (arXiv 2403.09472)^[10]. They framed alignment as the dual problem: rather than training a strong student from a weak supervisor, train a process reward model on easy human-labeled tasks and use it to evaluate generations on hard tasks. Their process-supervised 7B reinforcement learning model and a 34B reranking model achieved 34.0 percent and 52.5 percent accuracy, respectively, on the MATH500 benchmark while using human supervision only on easy problems. Easy-to-hard and weak-to-strong are typically treated as complementary framings of the same underlying capabilities-elicitation challenge.

Theoretical analysis. Charikar, Pabbaraju, and Shiragur (arXiv 2405.15116) presented Quantifying the Gain in Weak-to-Strong Generalization at NeurIPS 2024, deriving a theoretical relationship between the strong model's misfit error on weak labels and the magnitude of improvement over the weak teacher^[11]. Hunter Lang and collaborators (arXiv 2405.16043) produced an additional theoretical framework, and subsequent works in 2025 and 2026 examined weak-to-strong generalization in random feature networks and through bias-variance decompositions^[12].

Methodological extensions. A reliability-aware alignment method (arXiv 2406.19032) proposed querying the weak supervisor multiple times to estimate per-example reliability, then filtering or reweighting accordingly^[13]. The Aligner approach (arXiv 2402.02416), presented as a NeurIPS 2024 Oral, recast the weak-to-strong setup as a correction problem: a small "aligner" model is trained to correct the outputs of a much larger frontier model^[14].

Debate-assisted weak-to-strong generalization. Lang, Huang, and Li published Debate Helps Weak-to-Strong Generalization at AAAI 2025 (arXiv 2501.13124), showing that a debate protocol between strong-model debaters can give a weak supervisor enough additional context to produce more reliable training labels, integrating the weak-to-strong paradigm with the AI safety via debate research program^[15]. This work directly tests the hypothesis that scalable oversight techniques can be composed with weak-to-strong methods rather than competing with them.

Studies of limitations and overfitting. Several 2025 papers (for example arXiv 2502.01458 and the ACL 2025 paper How to Mitigate Overfitting in Weak-to-Strong Generalization) examined the regimes in which the phenomenon fails or where the student overfits to weak labels rather than ignoring them^[16]^[17]. EleutherAI's experimental writeup also reported that several plausible interventions (entropy losses, confidence windows, activation probes) failed to reliably outperform vanilla weak-to-strong training, providing a cautionary counterpoint to the original paper^[7].

Automated weak-to-strong research. By 2026, Anthropic's alignment research team had published work on an Automated Weak-to-Strong Researcher, exploring whether automated agents themselves could iterate on the methodology faster than human researchers^[18].

Why was the Superalignment team disbanded?

The paper was the first major output of OpenAI's Superalignment team, which the company announced in July 2023, co-led by Ilya Sutskever and Jan Leike. At launch, OpenAI publicly committed 20 percent of the compute it had secured to that point to the team's research over a four-year horizon, with the explicit goal of solving superhuman alignment within those four years^[6].

The paper's release in December 2023 was accompanied by a $10 million Superalignment Fast Grants program. The grants, partially funded by a $5 million donation from Eric Schmidt, offered $100,000 to $2 million awards to academic labs, nonprofits, and individual researchers, plus a one-year $150,000 OpenAI Superalignment Fellowship for graduate students ($75,000 in stipend plus $75,000 in compute and research funding). The program explicitly prioritized weak-to-strong generalization, interpretability, and scalable oversight as research directions^[3].

In May 2024, the Superalignment team was effectively dissolved following the departures of both co-leads. Ilya Sutskever announced his exit on May 14, 2024; Jan Leike resigned hours later, writing publicly that at OpenAI "safety culture and processes have taken a backseat to shiny products"^[4]. Subsequent reporting by Fortune and others alleged that the team had never actually received anything close to the promised 20 percent compute allocation, and that compute requests from the team were routinely denied. Remaining team members were redistributed across other OpenAI research teams^[4]^[5].

Several authors of the weak-to-strong paper later left OpenAI. Jan Leike joined Anthropic in May 2024 to lead its Alignment Science team, taking on, in his own words, an explicit agenda of "scalable oversight, weak-to-strong generalization, and automated alignment research"^[19]. Pavel Izmailov, a lead author of the paper, also moved to Anthropic in 2024, contributing to Claude 3.7 and Claude 4, and was reported to be starting as an assistant professor at NYU in Fall 2025^[20]. Leopold Aschenbrenner, another coauthor, departed OpenAI in April 2024 amid an internal investigation. Collin Burns, the paper's lead author and a former Berkeley PhD student known for prior work on discovering latent knowledge in language models without supervision, later moved to Anthropic as well^[21].

The dissolution of the team complicates the paper's institutional legacy. While the research direction it opened has been pursued vigorously by external groups and by Anthropic's alignment team, OpenAI itself reorganized its safety work after May 2024, and the specific corporate Superalignment program that produced the paper no longer exists.

How does weak-to-strong generalization relate to scalable oversight and debate?

The paper is best understood as one half of a broader research program for aligning systems whose outputs humans cannot directly evaluate. Its sibling program is scalable oversight, which encompasses methods such as AI safety via debate, iterated amplification, recursive reward modeling, and constitutional approaches that use AI assistants to extend human evaluation capacity^[1]^[9].

Scalable oversight asks: how can we make human evaluators effectively stronger? Weak-to-strong generalization asks: given that evaluators will still be effectively weaker, how can the strong model's own latent knowledge be elicited despite imperfect labels? In the paper's own framing, the two are not substitutes. A practical superalignment strategy would likely use scalable oversight techniques to make supervisor labels less noisy, while relying on weak-to-strong generalization to extract latent capability from the residual noise.

This composition is exactly what the Debate Helps Weak-to-Strong Generalization paper later operationalized. By having strong models argue both sides of a question, a weak supervisor can produce more reliable labels, which the strong student can then generalize from. The integration suggests that the original paper's framing of weak-to-strong as orthogonal to, but compatible with, scalable oversight has held up empirically^[15].

The conceptual link to mechanistic interpretability and related interpretability research is more indirect but important. If weak-to-strong generalization works because the strong model already encodes the correct concept in its representations, then interpretability tools (such as sparse autoencoders, activation steering, and probing) could in principle verify whether the elicited capability matches the latent representation. Several research agendas in 2024 through 2026 explicitly combine weak-to-strong with interpretability, treating the two as complementary lenses on the question of whether a strong model "knows" something its supervisor does not.

The 2023 paper remains a foundational citation in the modern alignment literature not because its specific methods were definitive but because it gave the field a tractable empirical setup, a clean metric (PGR), and an analogy (weak teacher to strong student) that subsequent researchers could build on. Its weaknesses, especially the imitation-saliency and pretraining-leakage disanalogies, continue to motivate active research into more realistic experimental designs.

References

[1] Burns, Collin; Izmailov, Pavel; Kirchner, Jan Hendrik; Baker, Bowen; Gao, Leo; Aschenbrenner, Leopold; Chen, Yining; Ecoffet, Adrien; Joglekar, Manas; Leike, Jan; Sutskever, Ilya; Wu, Jeff. "Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision." arXiv:2312.09390, December 14, 2023. https://arxiv.org/abs/2312.09390 (Accessed 2026-05-19).
[2] OpenAI. "Weak-to-strong generalization." OpenAI Research blog post, December 14, 2023. https://openai.com/index/weak-to-strong-generalization/ (Accessed 2026-05-19).
[3] OpenAI. "Superalignment Fast Grants." OpenAI announcement, December 2023. https://openai.com/index/superalignment-fast-grants/ (Accessed 2026-05-19).
[4] Field, Hayden. "OpenAI dissolves team focused on long-term AI risks, less than one year after announcing it." CNBC, May 17, 2024. https://www.cnbc.com/2024/05/17/openai-superalignment-sutskever-leike.html (Accessed 2026-05-19).
[5] Field, Hayden; Sigalos, MacKenzie. "OpenAI promised 20% of its computing power to combat the most dangerous kind of AI, but never delivered." Fortune, May 21, 2024. https://fortune.com/2024/05/21/openai-superalignment-20-compute-commitment-never-fulfilled-sutskever-leike-altman-brockman-murati/ (Accessed 2026-05-19).
[6] OpenAI. "Introducing Superalignment." OpenAI Research blog post, July 5, 2023. https://openai.com/index/introducing-superalignment/ (Accessed 2026-05-19).
[7] EleutherAI. "Experiments in Weak-to-Strong Generalization." EleutherAI Blog, June 19, 2024. https://blog.eleuther.ai/weak-to-strong/ (Accessed 2026-05-19).
[8] "Is weak-to-strong generalization an alignment technique?" Alignment Forum. https://www.alignmentforum.org/posts/NPBjELgHFEeHTgDrK/is-weak-to-strong-generalization-an-alignment-technique (Accessed 2026-05-19).
[9] "Scalable Oversight and Weak-to-Strong Generalization." LessWrong/Alignment Forum. https://www.lesswrong.com/posts/hw2tGSsvLLyjFoLFS/scalable-oversight-and-weak-to-strong-generalization (Accessed 2026-05-19).
[10] Sun, Zhiqing; Yu, Longhui; et al. "Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision." arXiv:2403.09472, NeurIPS 2024. https://arxiv.org/abs/2403.09472 (Accessed 2026-05-19).
[11] Charikar, Moses; Pabbaraju, Chirag; Shiragur, Kirankumar. "Quantifying the Gain in Weak-to-Strong Generalization." arXiv:2405.15116, NeurIPS 2024. https://arxiv.org/abs/2405.15116 (Accessed 2026-05-19).
[12] Lang, Hunter; et al. "Theoretical Analysis of Weak-to-Strong Generalization." arXiv:2405.16043, 2024. https://arxiv.org/abs/2405.16043 (Accessed 2026-05-19).
[13] "Improving Weak-to-Strong Generalization with Reliability-Aware Alignment." arXiv:2406.19032, 2024. https://arxiv.org/abs/2406.19032 (Accessed 2026-05-19).
[14] Ji, Jiaming; et al. "Aligner: Achieving Efficient Alignment through Weak-to-Strong Correction." arXiv:2402.02416, NeurIPS 2024. https://arxiv.org/abs/2402.02416 (Accessed 2026-05-19).
[15] Lang, Hao; Huang, Fei; Li, Yongbin. "Debate Helps Weak-to-Strong Generalization." arXiv:2501.13124, AAAI 2025. https://arxiv.org/abs/2501.13124 (Accessed 2026-05-19).
[16] "The Capabilities and Limitations of Weak-to-Strong Generalization: Generalization and Calibration." arXiv:2502.01458, 2025. https://arxiv.org/abs/2502.01458 (Accessed 2026-05-19).
[17] "How to Mitigate Overfitting in Weak-to-strong Generalization?" ACL 2025. https://aclanthology.org/2025.acl-long.784.pdf (Accessed 2026-05-19).
[18] Anthropic. "Automated Weak-to-Strong Researcher." Anthropic Alignment Science, 2026. https://alignment.anthropic.com/2026/automated-w2s-researcher/ (Accessed 2026-05-19).
[19] Field, Hayden. "OpenAI safety leader Jan Leike joins rival AI startup Anthropic." CNBC, May 28, 2024. https://www.cnbc.com/2024/05/28/openai-safety-leader-jan-leike-joins-amazon-backed-anthropic.html (Accessed 2026-05-19).
[20] Izmailov, Pavel. Personal CV and homepage. https://izmailovpavel.github.io/ (Accessed 2026-05-19).
[21] Burns, Collin. Personal homepage. https://collinpburns.com/ (Accessed 2026-05-19).
