Superalignment

Updated Jul 23, 2026

Superalignment is the technical problem of steering and controlling AI systems that are far more capable than their human supervisors, that is, systems at or beyond the level of superintelligence, and the name of the dedicated OpenAI research team that pursued it. OpenAI launched the Superalignment team on July 5, 2023, co-led by chief scientist Ilya Sutskever and head of alignment Jan Leike, pledged 20% of its secured compute over four years, and set the goal of solving the core technical challenges of superhuman alignment by 2027; the team was effectively disbanded in May 2024 after both leaders resigned.^[1]^[2] As a research label, superalignment is sometimes used generically for any program targeting the alignment of hypothetical superhuman systems, but in common usage it refers to this specific OpenAI initiative.^[1]^[3]

OpenAI introduced the project on July 5, 2023, framing the alignment of superintelligent AI as "one of the most important unsolved technical problems of our time" and pledging to dedicate "20% of the compute we've secured to date" over the next four years to solving it.^[1] The stated goal was to "solve the core technical challenges of superintelligence alignment in four years," primarily by building a "roughly human-level automated alignment researcher" that could then be used to scale up alignment research itself.^[1] The team produced several technical papers in 2023 and 2024, most notably "Weak-to-Strong Generalization" (December 14, 2023), and ran an academic grants program, "Superalignment Fast Grants," in partnership with former Google CEO Eric Schmidt that awarded US$9,895,000 across roughly 50 grantees.^[4]^[5]

The project collapsed less than a year after its founding. Sutskever announced his resignation from OpenAI on May 14, 2024; Leike resigned the following day and on May 17 published a widely circulated X (Twitter) thread accusing the company of letting "safety culture and processes" take "a backseat to shiny products."^[6]^[7]^[8] Within hours of Leike's thread, OpenAI confirmed it was dissolving the Superalignment team and folding its work into broader research efforts under co-founder John Schulman.^[9]^[10] Subsequent reporting by Fortune, citing former employees, found that the 20% compute pledge had never actually been delivered: the team's requests for compute were "repeatedly denied," even though those requests never approached the 20% threshold.^[11] Leike joined Anthropic in late May 2024 to lead a new Alignment Science team working on substantially the same research agenda,^[13]^[14] and Sutskever co-founded Safe Superintelligence Inc. the following month.^[12]

What is superalignment?

The broader field of AI alignment is concerned with ensuring that the goals pursued by AI systems remain consistent with the intentions and values of their human operators.^[15] Research on alignment as an academic discipline grew sharply after the publication of Nick Bostrom's 2014 book Superintelligence: Paths, Dangers, Strategies, the 2016 paper "Concrete Problems in AI Safety" by Dario Amodei, Chris Olah and others, and Stuart Russell's 2019 book Human Compatible, all of which argued that powerful future systems might pursue objectives in ways that differ subtly but catastrophically from what their designers intended.^[15]^[16] Bostrom's orthogonality thesis (the proposition that arbitrary levels of intelligence are compatible with arbitrary final goals) and his instrumental convergence thesis became standard reference points in subsequent discussion, alongside Russell's framing of alignment as "ensuring that a highly competent machine combined with humans who have an imperfect ability to specify human preferences" does not produce catastrophe.^[15]

Within that field, "superalignment" denotes the more specific problem of aligning systems that exceed human capabilities. Standard alignment techniques such as Reinforcement Learning from Human Feedback (RLHF) rely on humans being able to evaluate and supervise model outputs. OpenAI's announcement post argued that this assumption breaks down for superhuman systems: "Humans won't be able to reliably supervise AI systems much smarter than us, and so our current alignment techniques will not scale to superintelligence. We need new scientific and technical breakthroughs."^[1] The Superalignment team's working hypothesis was therefore that the only realistic path was to build AI systems capable of contributing substantively to alignment research themselves, what Sutskever and Leike described as a "roughly human-level automated alignment researcher", and then use that system, recursively, to attack the harder problem of aligning more capable successors.^[1]

The conceptual move from "alignment" to "superalignment" therefore had two components: first, a claim that the present-day techniques (RLHF, instruction tuning, reward modelling) would not generalise to systems whose outputs humans could not check; second, a methodological bet that the bridging device between today's techniques and the eventual problem of superintelligent alignment would itself be an AI system. This bet sat in some tension with parts of the AI safety community that had historically emphasised either purely theoretical work (such as the early agenda of the Machine Intelligence Research Institute) or mechanistic interpretability as the central tool. The Superalignment team's stance was distinctly empirical and engineering-led: it sought tractable analogues for the eventual weak-supervisor problem that could be tested with currently available models.

This research framing was distinct from contemporaneous programs at peer labs. Anthropic had publicly committed to a Constitutional AI approach, in which models are trained against an explicit written set of principles rather than purely against human preference data, and had invested heavily in mechanistic interpretability research aimed at understanding neural network internals.^[17] Google DeepMind had a long-running alignment and safety team that emphasised interpretability and scalable oversight; the company's Frontier Safety Framework would be published in May 2024 and specify capability thresholds at which additional mitigations would be triggered.^[17] By comparison, OpenAI's Superalignment program was unusual in three respects: it was time-boxed to four years, it was explicitly aimed at superhuman systems rather than current models, and it was attached to a numerical resource commitment, 20% of secured compute, that had no precedent in the industry.^[1]^[11]

What was OpenAI's Superalignment team?

OpenAI announced the Superalignment team in a blog post titled "Introducing Superalignment," published on July 5, 2023 and authored under the bylines of Sutskever and Leike.^[1] The post opened by characterising superintelligence as both the "most impactful technology humanity has ever invented" and as potentially "very dangerous, and could lead to the disempowerment of humanity or even human extinction," then argued that although such systems "may seem far off now, we believe they could arrive this decade."^[1]

Who led the Superalignment team?

The team was co-led by Ilya Sutskever, OpenAI's co-founder and chief scientist, and Jan Leike, the company's head of alignment.^[1]^[3] Leike, a German-born AI safety researcher who had completed his PhD at the Australian National University under Marcus Hutter and previously worked at Google DeepMind under Shane Legg, had joined OpenAI in 2021 and was one of the principal authors of the InstructGPT paper that underpins the RLHF training of ChatGPT and GPT-4.^[13]

OpenAI did not publicly disclose the precise size of the Superalignment team, but stated that it would be "assembling a team of top machine learning researchers and engineers" drawing on staff "from our previous alignment team, as well as researchers and engineers from other teams across the company."^[1] By the time of the team's dissolution in May 2024, reporting placed its size at roughly 25 researchers, with public co-authorship of team papers identifying names including Collin Burns, Leopold Aschenbrenner, Pavel Izmailov, Jan Hendrik Kirchner, Leo Gao, Bowen Baker, Jeffrey Wu, Yining Chen, Adrien Ecoffet and Manas Joglekar.^[4]^[11]

What was the four-year goal?

The announcement set an unusually concrete objective: "Our goal is to solve the core technical challenges of superintelligence alignment in four years."^[1] That timeline, running from July 2023 to mid-2027, was justified by the team's view that systems capable of meaningfully accelerating alignment work might appear within that window. The post acknowledged the goal was "incredibly ambitious" and that there was no guarantee of success, but argued that "we are optimistic that a focused, concerted effort can solve this problem."^[1]

The work plan had three components: (1) develop scalable training methods that allow human oversight to be extended to superhuman behaviour, (2) validate the resulting models so that one can be confident they remain aligned, and (3) "stress-test" the alignment pipeline by deliberately training misaligned models and checking whether the alignment techniques detect them.^[1] The intended endpoint, repeatedly emphasised, was an automated alignment researcher: a model human-level at conducting alignment research, which could then be used to attack the residual problem of aligning still more powerful systems.^[1]

What was the 20% compute pledge?

The most concrete, and ultimately most contentious, commitment in the announcement was a resource allocation pledge: "OpenAI is dedicating 20% of the compute we've secured to date over the next four years to solving the problem of superintelligence alignment. Our chief basic research bet is our new Superalignment team, but getting this right is critical to achieve our mission and we expect many teams to contribute, from developing new methods to scaling them up to deployment."^[1]

At the time of the announcement, the pledge was widely cited, in coverage by TechCrunch, MIT Technology Review and others, as evidence of an exceptional institutional commitment to safety, and as a competitive benchmark that other frontier labs would have to address.^[3]^[4] The 20% figure was attached specifically to "compute secured to date" (as of July 2023), rather than to all future compute, but no public mechanism was created for auditing the allocation, and Fortune later reported that the team was never told whether the promise meant 20% each year for four years, roughly 5% a year, or some variable amount.^[11]

Following Sutskever's and Leike's resignations in May 2024, Fortune published a detailed exclusive on May 21, 2024 reporting that the pledge had never been honoured. Citing roughly half a dozen sources familiar with the team's operations, the magazine wrote that the Superalignment team had been "never given anything close to 20% of OpenAI's secured compute."^[11] Compute requests from the team, including requests for shares well below the headline 20% figure, had been "repeatedly denied" by senior leadership. Reporting identified VP of Research Bob McGrew as the executive who typically conveyed the denials, with chief technology officer Mira Murati also involved in resourcing decisions.^[11] TechCrunch's complementary reporting on May 18 quoted a person familiar with the team's work who said the team had been "sailing against the wind" trying to get sufficient compute and had ultimately been "left to wither."^[9]

The compute story was central to Leike's public resignation statement. On X (Twitter), he wrote: "Over the past few months my team has been sailing against the wind. Sometimes we were struggling for compute and it was getting harder and harder to get this crucial research done."^[7]^[8] The disjunction between the public 20% commitment and the team's lived experience became a recurring theme in subsequent coverage and shaped much of the public criticism of OpenAI's safety posture in mid-2024.

What research did the Superalignment team produce?

Despite the resource frictions later revealed, the Superalignment team produced a substantive body of public work during its ten-month active lifespan. The two most prominent technical papers were "Weak-to-Strong Generalization" (December 2023) and "Prover-Verifier Games" (July 2024), the latter of which appeared after the team's formal dissolution but was drafted while it still existed.

What is weak-to-strong generalization (December 14, 2023)?

The team's flagship paper, "Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision," was posted to arXiv on December 14, 2023 (preprint identifier 2312.09390).^[18] The author list, in order, was Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeffrey Wu.^[18] OpenAI published an accompanying blog post under the title "Weak-to-strong generalization."^[19]

The paper took as its starting point the observation that future superhuman models would, by definition, behave in ways too complex for humans to evaluate reliably. Researchers therefore needed an empirical analogue for the eventual situation in which humans were the "weak" supervisor of a "strong" superhuman model. The team's central proposal was to study the analogous setup between two existing models of different capability levels: using GPT-2-level models to supervise GPT-4-level models on standard NLP, chess and reward-modelling tasks.^[18]^[19]

Their key finding was that, when strong pretrained models were naively fine-tuned on labels generated by weak models, the resulting student models "consistently perform better than their weak supervisors," demonstrating a generalisation phenomenon that they termed weak-to-strong generalization.^[18] Naive fine-tuning did not recover the full capability of the strong model, but combining it with an auxiliary confidence loss closed much of the gap: in some NLP settings, a GPT-2-level supervisor could elicit performance close to GPT-3.5 levels from a GPT-4 student.^[18]^[19] The paper framed this as an empirical foothold for a research programme that could, in principle, scale to genuinely superhuman models. MIT Technology Review's December 14, 2023 write-up characterized the work as the team's "first major public deliverable" and as a "proof of concept" for the broader four-year plan.^[20]
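
The size of the effect described above can be expressed as a single number. The paper reports a "performance gap recovered" (PGR) metric: the share of the gap between the weak supervisor and a strong model trained on ground-truth labels that the weakly supervised student manages to close. The following is a minimal Python sketch of that ratio; the function name and the accuracy values are hypothetical and chosen only to illustrate the arithmetic, not taken from the paper's results.

```python
def performance_gap_recovered(weak_acc: float, student_acc: float, ceiling_acc: float) -> float:
    """Fraction of the weak-to-ceiling gap closed by the weakly supervised student.

    weak_acc:    accuracy of the weak supervisor
    student_acc: accuracy of the strong model fine-tuned on weak labels
    ceiling_acc: accuracy of the strong model fine-tuned on ground-truth labels
    """
    gap = ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("ceiling accuracy must exceed weak supervisor accuracy")
    return (student_acc - weak_acc) / gap


# Hypothetical accuracies (assumptions for illustration only).
weak, student, ceiling = 0.60, 0.75, 0.90
pgr = performance_gap_recovered(weak, student, ceiling)
print(f"PGR = {pgr:.2f}")
```

A PGR of 0 would mean the student is no better than its weak supervisor, and a PGR of 1 would mean weak supervision elicited the strong model's full capability; the finding quoted above corresponds to values strictly between the two.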

What were the Superalignment Fast Grants?

Concurrent with the weak-to-strong paper, OpenAI launched an academic funding program called "Superalignment Fast Grants" in December 2023, in partnership with former Google CEO Eric Schmidt.^[5] The program offered grants of US$100,000 to US$2 million to academic labs, nonprofits and individual researchers working on technical problems related to aligning superhuman systems. Graduate students were eligible for a separate one-year US$150,000 OpenAI Superalignment Fellowship comprising US$75,000 in stipend and US$75,000 in compute and research funding.^[5]

The headline budget was US$10 million.^[21] Topic areas explicitly listed in the call for proposals included weak-to-strong generalisation, scalable oversight, interpretability, honesty and trust evaluations, and adversarial robustness. According to a public April 2024 accounting by Jan Leike, the program funded 50 out of roughly 2,700 applications, awarding a total of US$9,895,000, with a median grant of US$150,000, an average of US$198,000, a smallest grant of US$50,000 and a largest of US$500,000.^[21]

Superalignment Fast Grants	Figure
Applications received	~2,700
Grants awarded	~50
Total awarded	US$9,895,000
Median grant	US$150,000
Average grant	US$198,000
Smallest / largest grant	US$50,000 / US$500,000
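
The figures in the table are mutually consistent, as a few lines of Python can confirm. This is only an arithmetic check: it treats the approximate counts in the source (about 50 grants, roughly 2,700 applications) as exact, which is an assumption.

```python
# Figures as reported in Leike's April 2024 accounting; the counts are
# approximate in the source and are treated as exact here (assumption).
total_awarded = 9_895_000   # US$
grants = 50
applications = 2_700

mean_grant = total_awarded / grants        # consistent with the ~US$198,000 average
acceptance_rate = grants / applications    # share of applications funded

print(f"mean grant: US${mean_grant:,.0f}")
print(f"acceptance rate: {acceptance_rate:.1%}")
```

The mean sits above the US$150,000 median, which is what one would expect if a small number of larger grants (up to US$500,000) pulled the average upward.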

What were Prover-Verifier Games and the later work?

In July 2024, after the Superalignment team had formally been dissolved, a separate paper drafted under the team's banner, "Prover-Verifier Games Improve Legibility of LLM Outputs" (arXiv:2407.13692, submitted July 18, 2024), appeared with authorship including Jan Hendrik Kirchner, Yining Chen, Harri Edwards, Jan Leike, Nat McAleese and Yuri Burda.^[22] The paper proposed a training scheme inspired by prover-verifier games, in which a small verifier model is trained to predict whether a solution is correct, a "helpful prover" is trained to produce correct solutions that the verifier accepts, and a "sneaky prover" is trained to produce incorrect solutions that fool the verifier; the verifier and the provers are then retrained against each other over repeated rounds, so that the verifier becomes progressively harder to fool.^[22] The aim was to improve the legibility (i.e. human-checkability) of model outputs in settings where a less-capable verifier (analogous to a human auditor) had to assess the work of a more-capable solver.
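
The incentives of the three roles can be made concrete with a toy sketch. The Python below is a deliberately simplified illustration, not the paper's actual reward functions: the `Attempt` type, the 0/1 reward, the acceptance threshold and the example scores are all assumptions introduced here. It shows only the structure described above: a prover is rewarded when the verifier accepts a solution whose correctness matches the prover's role, while the verifier is penalised for misjudging correctness.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    role: str              # "helpful" or "sneaky" (hypothetical labels)
    is_correct: bool       # ground-truth correctness of the solution
    verifier_score: float  # verifier's estimated probability of correctness


def prover_reward(attempt: Attempt, accept_threshold: float = 0.5) -> float:
    """Toy 0/1 reward: the prover wins if the verifier accepts a solution
    whose correctness matches the prover's assigned role."""
    accepted = attempt.verifier_score >= accept_threshold
    if attempt.role == "helpful":
        on_role = attempt.is_correct
    elif attempt.role == "sneaky":
        on_role = not attempt.is_correct
    else:
        raise ValueError(f"unknown role: {attempt.role!r}")
    return 1.0 if (on_role and accepted) else 0.0


def verifier_loss(attempt: Attempt, eps: float = 1e-7) -> float:
    """Binary cross-entropy of the verifier's score against ground truth."""
    p = min(max(attempt.verifier_score, eps), 1.0 - eps)
    return -math.log(p) if attempt.is_correct else -math.log(1.0 - p)


# Hypothetical attempts: a legible correct answer, a successful deception,
# and a deception that the verifier catches.
attempts = [
    Attempt("helpful", True, 0.9),
    Attempt("sneaky", False, 0.8),
    Attempt("sneaky", False, 0.2),
]
for a in attempts:
    print(a.role, prover_reward(a), round(verifier_loss(a), 3))
```

In this sketch the sneaky prover's reward and the verifier's loss are both high on the same attempt (the successful deception), so training the verifier on such cases reduces the sneaky prover's future reward; that opposition is what drives the repeated rounds.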

Beyond the two flagship papers, members of the team contributed to a range of smaller-scale technical work on topics including scalable oversight, automated red-teaming, debate-style training, honesty evaluations, and the use of GPT-4-class models to assist in the evaluation of other models. The research output was generally consistent with the agenda described in the July 2023 announcement: rather than attempting to solve the alignment of superintelligent systems directly, the team tried to demonstrate, on contemporary models, tractable analogues of the techniques that might eventually be required.

How did the November 2023 board crisis affect the team?

The Superalignment team's organisational footing was shaken less than five months after its founding by the corporate crisis that engulfed OpenAI in November 2023. The board's removal of CEO Sam Altman was announced on the afternoon of November 17, 2023; Altman was reinstated on November 22.^[23]^[24]

Sutskever, as a member of OpenAI's nonprofit board, played a central role in the firing. Reporting and his own subsequent public statements indicated that he authored or commissioned a memo arguing that Altman had been "not consistently candid" with the board.^[23]^[24] Within days, however, sustained pressure from employees, including a letter signed by roughly 745 of about 770 staff threatening to follow Altman to a new venture at Microsoft unless the board resigned, caused Sutskever to reverse course. He publicly signed the employee letter and, on November 20, posted on X: "I deeply regret my participation in the board's actions. I never intended to harm OpenAI."^[24]

When Altman was reinstated on November 22 with a restructured interim board (Bret Taylor as chair, Adam D'Angelo and Lawrence Summers as initial directors), Sutskever stepped down from the board.^[24] He remained an employee but, according to widely cited reporting in the following months, was not seen at OpenAI's San Francisco offices. The November crisis was widely interpreted as having damaged Sutskever's political standing within the company and, by extension, his ability to advocate effectively for the Superalignment team's resource needs.^[11]

Fortune's May 2024 reporting on the team's unmet compute pledge made this connection explicit, quoting sources familiar with the matter as saying that Sutskever's reduced influence after November 2023 made it harder for him to defend the team's compute requests internally.^[11]

Why was the Superalignment team disbanded?

The team's terminal phase began on May 14, 2024, one day after OpenAI's May 13, 2024 Spring Update launch event for the GPT-4o model, when Sutskever announced he was leaving OpenAI. Within days the two co-leads had both resigned, OpenAI had folded the standalone unit into its other research groups, and Leike had publicly attributed his departure to the company prioritising "shiny products" over safety.

Sutskever's resignation (May 14, 2024)

Sutskever posted his resignation on X on the afternoon of May 14, 2024, writing: "After almost a decade, I have made the decision to leave OpenAI. The company's trajectory has been nothing short of miraculous, and I'm confident that OpenAI will build AGI that is both safe and beneficial under the leadership of @sama, @gdb, and @miramurati, and now, under the excellent research leadership of @merettm. It was an honor and a privilege to have worked together, and I will miss everyone dearly. So long, and thanks for everything. I am excited for what comes next, a project that is very personally meaningful to me about which I will share details in due time."^[25]

Altman replied with his own post, calling Sutskever "easily one of the greatest minds of our generation, a guiding light of our field, and a dear friend" and writing that "OpenAI would not be what it is without him."^[25] OpenAI named research director Jakub Pachocki as Sutskever's successor in the role of chief scientist.^[25]

Leike's resignation (May 15, 2024)

Less than 24 hours later, on the morning of May 15, 2024, Leike posted a brief message reading "I resigned," without further elaboration.^[7] Although he gave no public reasons at first, reporting indicated that he had been at odds with company leadership over the team's resource allocation and the broader pace of capabilities work.^[9]^[11]

What did Jan Leike say in his departure thread (May 17, 2024)?

Two days later, on the evening of Friday May 17, 2024, Leike posted an extended thread on X explaining his decision. The thread, posted across multiple consecutive messages, contained passages that became among the most widely quoted public statements ever made by a departing AI safety researcher.^[7]^[26]^[27]

The opening message read: "Yesterday was my last day as head of alignment, superalignment lead, and executive @OpenAI."^[7] He continued: "I joined because I thought OpenAI would be the best place in the world to do this research. However, I have been disagreeing with OpenAI leadership about the company's core priorities for quite some time, until we finally reached a breaking point."^[28]

Leike argued for a sharply different allocation of effort: "I believe much more of our bandwidth should be spent getting ready for the next generations of models, on security, monitoring, preparedness, safety, adversarial robustness, (super)alignment, confidentiality, societal impact, and related topics. These problems are quite hard to get right, and I am concerned we aren't on a trajectory to get there."^[26]^[27]

He described the resource constraints faced by his team in concrete terms: "Over the past few months my team has been sailing against the wind. Sometimes we were struggling for compute and it was getting harder and harder to get this crucial research done."^[7]^[27]

The thread reached its most quoted passage with: "Building smarter-than-human machines is an inherently dangerous endeavor. OpenAI is shouldering an enormous responsibility on behalf of all of humanity. But over the past years, safety culture and processes have taken a backseat to shiny products."^[26]^[27] He closed by addressing OpenAI's remaining staff directly: "To all OpenAI employees, I want to say: Learn to feel the AGI. Act with the gravitas appropriate for what you're building. I believe you can 'ship' the cultural change that's needed. I am counting on you. The world is counting on you."^[28] In his final post in the thread, he wrote: "OpenAI must become a safety-first AGI company."^[7]^[27]

How was the team dissolved?

Within hours of Leike's thread, multiple outlets, first Bloomberg and Wired, then Axios, CNBC and others, reported that OpenAI had dissolved the Superalignment team as a standalone organisational unit and would be folding its work into broader research efforts across the company.^[9]^[29]^[30] OpenAI confirmed the change to reporters but framed it as a "deeper integration" of long-term safety work into the company's other research groups.^[9]^[29] Co-founder John Schulman, until then head of post-training, was identified as the new scientific lead for the company's alignment work, although there would no longer be a dedicated alignment team.^[29] Schulman himself would leave OpenAI in August 2024, joining Anthropic.^[31]

What other departures surrounded the dissolution?

The May 2024 resignations did not happen in isolation. In April 2024, Superalignment team members Leopold Aschenbrenner and Pavel Izmailov had been fired by OpenAI; Aschenbrenner later said publicly that he had been dismissed after sharing a "benign brainstorming document" with three external researchers and after raising internal concerns about the company's security posture against foreign espionage.^[32]^[33] OpenAI maintained the dismissals were unrelated to such concerns.

Other safety-focused researchers who left OpenAI in the months around the team's dissolution included Daniel Kokotajlo, a member of the governance team who resigned in April 2024 saying he had "gradually lost trust in OpenAI leadership and their ability to responsibly handle AGI," and William Saunders, a Superalignment team member who had left earlier in 2024.^[33]^[34] Kokotajlo's case drew particular attention because he reportedly refused to sign OpenAI's non-disparagement agreement on departure, forfeiting equity that he described as roughly 85% of his family's net worth.^[34]

What happened after the Superalignment team was disbanded?

The dissolution of the Superalignment team prompted both organisational changes within OpenAI and a redistribution of senior alignment talent across the industry.

OpenAI's Safety and Security Committee

On May 28, 2024, eleven days after the team's dissolution, OpenAI announced the formation of a new Safety and Security Committee, chaired by board director Bret Taylor and including directors Adam D'Angelo and Nicole Seligman alongside CEO Sam Altman.^[35] The committee also included internal staff such as head of preparedness Aleksander Mądry, head of safety systems Lilian Weng, co-founder John Schulman, security chief Matt Knight and chief scientist Jakub Pachocki.^[35] Its initial remit was a 90-day review of OpenAI's processes and safeguards. Critics, including TechCrunch in its coverage of the announcement, noted that the committee was composed entirely of insiders, including the very executives whose decisions had been the subject of the recent departures.^[35]

Sutskever: Safe Superintelligence Inc.

On June 19, 2024, about five weeks after his departure, Sutskever announced a new company, Safe Superintelligence Inc. (SSI), co-founded with former Apple AI lead Daniel Gross and former OpenAI researcher Daniel Levy.^[12] The company's announcement stated that its "first product will be the safe superintelligence, and it will not do anything else up until then," positioning SSI as an explicit ideological successor to the Superalignment project's aims but located outside any existing commercial AI lab.^[12] SSI is headquartered in Palo Alto, California and Tel Aviv, Israel. The company raised US$1 billion in September 2024 at a reported US$5 billion valuation, and in March 2025 closed a further round at a US$30 billion valuation.^[12]

Leike: Anthropic

On May 28, 2024, eleven days after his resignation thread, Leike announced he had joined Anthropic to lead a new Alignment Science team.^[13]^[14] His mandate, as described in his own announcement post and in coverage by TechCrunch and CNBC, was to work on "scalable oversight, weak-to-strong generalization, and automated alignment research," essentially the same research agenda he had been pursuing at OpenAI's Superalignment team.^[14] Leike reports to Anthropic chief science officer Jared Kaplan, and existing Anthropic researchers working on scalable oversight were re-organised to report into the new team.^[14]

The personnel migration from OpenAI's safety teams to Anthropic continued through the rest of 2024 and into 2025. OpenAI co-founder John Schulman, who had briefly been named OpenAI's alignment lead after the dissolution of Superalignment, joined Anthropic in August 2024.^[31]

Who else works on superalignment?

Although the OpenAI Superalignment team is the best-known use of the term "superalignment," the underlying research problem of aligning systems more capable than their human supervisors has been studied across multiple labs, both before and after the OpenAI program existed.

Anthropic has pursued Constitutional AI and related techniques since 2022, in which models are trained against a written constitution of principles rather than purely against human preference labels, and has invested heavily in mechanistic interpretability research.^[17] After absorbing Leike, Schulman and other former OpenAI safety researchers in 2024, Anthropic became the principal industrial home for the weak-to-strong and scalable oversight research agenda originally articulated at OpenAI.^[14]

Google DeepMind has maintained an alignment and safety team since long before the term "superalignment" entered common usage, and in May 2024 published the Frontier Safety Framework specifying capability thresholds at which additional safety measures would be triggered.^[17] DeepMind has also been a major contributor to interpretability and scalable oversight research, including work on debate and recursive reward modelling that predates the OpenAI team.

Academic and nonprofit work on the problem includes research at the Alignment Research Center founded by former OpenAI researcher Paul Christiano, work at the Center for AI Safety, and a growing body of literature on subtopics such as deceptive alignment, inner alignment and outer alignment.^[15] The Superalignment Fast Grants program, while attached to OpenAI's now-disbanded team, distributed funding to approximately 50 grantees across this broader ecosystem.^[21]

How was the Superalignment program received?

The Superalignment program drew critical scrutiny from several directions both during and after its existence.

Conceptual critiques

Some prominent figures in machine learning have rejected the conceptual framing on which superalignment rests, arguing that the underlying threat model is overstated. Meta chief AI scientist Yann LeCun and AI researcher Andrew Ng have both characterized fears of misaligned superintelligence as premature or as a form of hype; LeCun has compared safety research on hypothetical superintelligent systems to "designing seat belts for a car that doesn't exist yet."^[36] Other critics have argued that an exclusive focus on existential-scale risks distracts from contemporary harms such as bias, surveillance, labour displacement, and misinformation that are already attributable to deployed systems.

Implementation critiques

A second strand of criticism concerned the implementation of the OpenAI program specifically. Even commentators sympathetic to the Superalignment research agenda observed that the 20% compute pledge was made public without any external auditing mechanism and that the program's organisational survival depended on the continuing political authority of its two co-leads, both of whom were known to be in tension with the company's commercial leadership.^[11] Fortune's May 21, 2024 reporting on the unfulfilled compute pledge crystallised this critique by demonstrating that the public commitment had functioned, in practice, more as a reputational announcement than as an enforceable internal allocation.^[11]

Resignation aftermath

The combined resignations of Sutskever and Leike, and Leike's subsequent thread, generated extensive coverage in mainstream outlets including Bloomberg, Reuters, Wired, the Washington Post, Time, CNN, Fortune, the Financial Times and TechCrunch.^[9]^[11]^[25]^[26]^[29] Several outlets framed the events as a turning point in the public credibility of self-regulating AI labs. CNBC's coverage on May 17, 2024 noted that the team's dissolution came "less than one year after announcing it," highlighting the gap between the public framing of the four-year goal and the reality of the team's brief organisational life.^[29]

A persistent theme in this coverage was the gap between OpenAI's public commitments and its observed conduct. The July 2023 announcement had presented the four-year goal, the 20% compute pledge and the framing of superintelligence alignment as a humanity-scale problem as institutional commitments backed by senior leadership. The team's dissolution within ten months, without any public accounting of what fraction of secured compute had been delivered, was widely read as undermining the credibility of comparable voluntary commitments by frontier labs.

The episode also accelerated calls, in the press, in academic commentary and in policy circles, for external mechanisms, such as the UK and US AI Safety Institutes, to verify safety commitments made by frontier labs rather than relying on voluntary internal pledges.^[17] By 2025, both the UK AI Safety Institute and the US AI Safety Institute (later renamed the Center for AI Standards and Innovation) had pre-deployment testing arrangements in place with several major labs, partly in response to the perceived inadequacy of purely internal safety governance.

Effect on the safety research workforce

A secondary effect of the dissolution was the redistribution of senior safety researchers across the industry, with Anthropic in particular emerging as the principal beneficiary. By the end of 2024, Anthropic had hired Leike, Schulman and several other former OpenAI safety staff; the company's research output on scalable oversight, weak-to-strong generalisation and automated alignment in the months that followed showed substantial continuity with the agenda originally articulated at OpenAI's Superalignment team.^[14] Safe Superintelligence Inc., in turn, became a destination for a smaller number of former OpenAI staff who shared Sutskever's preference for working outside any existing commercial AI product organisation. The combined effect was that the research agenda first organised under the "superalignment" banner did not disappear with the team's dissolution; rather, it was reconstituted across a small number of other labs, with somewhat different institutional incentives and oversight structures.

ELI5: Superalignment in simple terms

Imagine you are a chess beginner trying to coach a grandmaster. You cannot check every move the grandmaster makes, because some of their decisions are smarter than anything you understand. Superalignment is the science of making sure that a super-smart AI, one that is much cleverer than the people watching over it, still does what humans actually want, even when humans are too slow or too limited to double-check its work. OpenAI started a special team in July 2023 to crack this problem, promised it one fifth of the company's computers, and gave it four years. Less than a year later the two scientists in charge quit, one of them saying safety had taken "a backseat to shiny products," and the team was broken up. The problem itself, how to keep a smarter-than-human machine on humanity's side, is still unsolved.

References

^Sutskever, Ilya and Leike, Jan. "Introducing Superalignment." OpenAI, July 5, 2023. openai.com/...introducing-superalignment
^"OpenAI dissolves Superalignment AI safety team." CNBC, May 17, 2024. cnbc.com/...openai-superalignment-sutskever-leike
^Wiggers, Kyle. "OpenAI is forming a new team to bring 'superintelligent' AI under control." TechCrunch, July 5, 2023. techcrunch.com/...uperintelligent-ai-under-control
^Burns, Collin et al. "Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision." arXiv:2312.09390, December 14, 2023. arxiv.org/...2312.09390
^"Superalignment Fast Grants." OpenAI, December 14, 2023. openai.com/...superalignment-fast-grants
^Sutskever, Ilya. X post announcing departure from OpenAI, May 14, 2024.
^Leike, Jan. "Yesterday was my last day as head of alignment, superalignment lead, and executive @OpenAI." X (Twitter), May 17, 2024. x.com/...1791498174659715494
^"Top OpenAI researcher resigns, saying company prioritized 'shiny products' over AI safety." Fortune, May 17, 2024. fortune.com/...openai-researcher-resigns-safety
^Wiggers, Kyle. "OpenAI created a team to control 'superintelligent' AI, then let it wither, source says." TechCrunch, May 18, 2024. techcrunch.com/...i-then-let-it-wither-source-says
^"OpenAI's long-term safety team has disbanded." Axios, May 17, 2024. axios.com/...ai-superalignment-risk-ilya-sutskever
^Kahn, Jeremy. "OpenAI promised 20% of its computing power to combat the most dangerous kind of AI but never delivered, sources say." Fortune, May 21, 2024. fortune.com/...skever-leike-altman-brockman-murati
^"Safe Superintelligence Inc." Wikipedia. en.wikipedia.org/...Safe_Superintelligence_Inc.
^"Jan Leike." Wikipedia. en.wikipedia.org/...Jan_Leike
^Wiggers, Kyle. "Anthropic hires former OpenAI safety lead to head up new team." TechCrunch, May 28, 2024. techcrunch.com/...-safety-lead-to-head-up-new-team
^"AI alignment." Wikipedia. en.wikipedia.org/...AI_alignment
^Amodei, Dario, Olah, Chris et al. "Concrete Problems in AI Safety." arXiv:1606.06565, 2016. arxiv.org/...1606.06565
^"Constitutional AI: Harmlessness from AI Feedback." Anthropic, 2022; "Frontier Safety Framework." Google DeepMind, May 2024.
^Burns, Collin, Izmailov, Pavel, Kirchner, Jan Hendrik, Baker, Bowen, Gao, Leo, Aschenbrenner, Leopold, Chen, Yining, Ecoffet, Adrien, Joglekar, Manas, Leike, Jan, Sutskever, Ilya and Wu, Jeffrey. "Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision." arXiv:2312.09390, December 14, 2023.
^"Weak-to-strong generalization." OpenAI, December 14, 2023. openai.com/...weak-to-strong-generalization
^Heaven, Will Douglas. "Now we know what OpenAI's superalignment team has been up to." MIT Technology Review, December 14, 2023. technologyreview.com/...-alignment-rogue-agi-gpt-4
^Leike, Jan. "Some statistics on the superalignment fast grants." X (Twitter), April 10, 2024. x.com/...1778136079302082721
^Kirchner, Jan Hendrik, Chen, Yining, Edwards, Harri, Leike, Jan, McAleese, Nat and Burda, Yuri. "Prover-Verifier Games Improve Legibility of LLM Outputs." arXiv:2407.13692, July 18, 2024. arxiv.org/...2407.13692
^"Removal of Sam Altman from OpenAI." Wikipedia. en.wikipedia.org/...oval_of_Sam_Altman_from_OpenAI
^"Ilya Sutskever." Wikipedia. en.wikipedia.org/...Ilya_Sutskever
^"Ilya Sutskever, Co-Founder and Chief Scientist, Leaves OpenAI." Time, May 14, 2024. time.com/...ilya-sutskever-leaves-open-ai
^"Jan Leike's Resignation Damning of OpenAI's 'Core Priorities' and Safety Culture." CCN, May 2024. ccn.com/...amning-of-openai-core-safety-priorities
^"More OpenAI drama: Exec quits over concerns about focus on profit over safety." CNN Business, May 17, 2024. cnn.com/...openai-exec-exits-safety-concerns
^Leike, Jan. X (Twitter) thread, May 17, 2024. x.com/...1791498174659715494
^"OpenAI dissolves team focused on long-term AI risks, less than one year after announcing it." CNBC, May 17, 2024. cnbc.com/...openai-superalignment-sutskever-leike
^"OpenAI Dissolves 'Superalignment Team,' Distributes AI Safety Efforts Across Organization." PYMNTS, May 2024.
^"OpenAI co-founder John Schulman says he will leave and join rival Anthropic." CNBC, August 6, 2024. cnbc.com/...lman-says-he-will-join-rival-anthropic
^"Leopold Aschenbrenner." Wikipedia. en.wikipedia.org/...Leopold_Aschenbrenner
^"OpenAI's AI safety teams lost at least seven researchers in recent months." The Decoder, 2024.
^"Daniel Kokotajlo (researcher)." Wikipedia. en.wikipedia.org/...Daniel_Kokotajlo_(researcher)
^Wiggers, Kyle. "OpenAI's new safety committee is made up of all insiders." TechCrunch, May 28, 2024. techcrunch.com/...ittee-is-made-up-of-all-insiders
^Ng, Andrew and LeCun, Yann. Public statements on AI doom rhetoric, 2023-2024.
