Superalignment
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 5,464 words
Superalignment is a strand of AI alignment research focused specifically on the problem of steering and controlling AI systems that are far more capable than humans — that is, systems at or beyond the level of superintelligence — and is most closely associated with the OpenAI Superalignment team, a dedicated research group launched in July 2023 and disbanded in May 2024.[^1][^2] The term is sometimes used as a generic label for any technical research program that targets the alignment of hypothetical superhuman systems, but in common usage it refers to the specific four-year initiative announced by OpenAI and co-led by chief scientist Ilya Sutskever and head of alignment Jan Leike.[^1][^3]
OpenAI introduced the project on July 5, 2023, framing the alignment of superintelligent AI as "one of the most important unsolved technical problems of our time" and pledging to dedicate "20% of the compute we've secured to date" over the next four years to solving it.[^1] The stated goal was to "solve the core technical challenges of superintelligence alignment in four years," primarily by building a "roughly human-level automated alignment researcher" that could then be used to scale up alignment research itself.[^1] The team produced several technical papers in 2023 and 2024 — most notably "Weak-to-Strong Generalization" (December 14, 2023) — and ran a US$10 million academic grants program, "Superalignment Fast Grants," in partnership with former Google CEO Eric Schmidt.[^4][^5]
The project collapsed less than a year after its founding. Sutskever announced his resignation from OpenAI on May 14, 2024; Leike resigned the following day and on May 17 published a widely circulated X (Twitter) thread accusing the company of letting "safety culture and processes" take "a backseat to shiny products."[^6][^7][^8] Within hours of Leike's thread, OpenAI confirmed it was dissolving the Superalignment team and folding its work into broader research efforts under co-founder John Schulman.[^9][^10] Subsequent reporting by Fortune, citing former employees, found that the 20% compute pledge had never actually been delivered: the team's requests for compute were "repeatedly denied," even though they never approached the 20% threshold.[^11] Leike joined Anthropic in late May 2024 to lead a new Alignment Science team working on substantially the same research agenda, and Sutskever co-founded Safe Superintelligence Inc. in June 2024.[^12][^13][^14]
The broader field of AI alignment is concerned with ensuring that the goals pursued by AI systems remain consistent with the intentions and values of their human operators.[^15] Research on alignment as an academic discipline grew sharply after the publication of Nick Bostrom's 2014 book Superintelligence: Paths, Dangers, Strategies, the 2016 paper "Concrete Problems in AI Safety" by Dario Amodei, Chris Olah and others, and Stuart Russell's 2019 book Human Compatible, all of which argued that powerful future systems might pursue objectives in ways that differ subtly but catastrophically from what their designers intended.[^15][^16] Bostrom's orthogonality thesis — the proposition that arbitrary levels of intelligence are compatible with arbitrary final goals — and his instrumental convergence thesis became standard reference points in subsequent discussion, alongside Russell's framing of alignment as "ensuring that a highly competent machine combined with humans who have an imperfect ability to specify human preferences" does not produce catastrophe.[^15]
Within that field, "superalignment" denotes the more specific problem of aligning systems that exceed human capabilities. Standard alignment techniques such as Reinforcement Learning from Human Feedback (RLHF) rely on humans being able to evaluate and supervise model outputs. OpenAI's announcement post argued that this assumption breaks down for superhuman systems: "Humans won't be able to reliably supervise AI systems much smarter than us, and so our current alignment techniques will not scale to superintelligence. We need new scientific and technical breakthroughs."[^1] The Superalignment team's working hypothesis was therefore that the only realistic path was to build AI systems capable of contributing substantively to alignment research themselves — what Sutskever and Leike described as a "roughly human-level automated alignment researcher" — and then use that system, recursively, to attack the harder problem of aligning more capable successors.[^1]
The conceptual move from "alignment" to "superalignment" therefore had two components: first, a claim that the present-day techniques (RLHF, instruction tuning, reward modelling) would not generalise to systems whose outputs humans could not check; second, a methodological bet that the bridging device between today's techniques and the eventual problem of superintelligent alignment would itself be an AI system. This bet sat in some tension with parts of the AI safety community that had historically emphasised either purely theoretical work (such as the early agenda of the Machine Intelligence Research Institute) or mechanistic interpretability as the central tool. The Superalignment team's stance was distinctly empirical and engineering-led: it sought tractable analogues for the eventual weak-supervisor problem that could be tested with currently available models.
This research framing was distinct from contemporaneous programs at peer labs. Anthropic had publicly committed to a Constitutional AI approach, in which models are trained against an explicit written set of principles rather than purely against human preference data, and had invested heavily in mechanistic interpretability research aimed at understanding neural network internals.[^17] Google DeepMind had a long-running alignment and safety team that emphasised interpretability and scalable oversight; the company's Frontier Safety Framework would be published in May 2024 and specify capability thresholds at which additional mitigations would be triggered.[^17] By comparison, OpenAI's Superalignment program was unusual in three respects: it was time-boxed to four years, it was explicitly aimed at superhuman systems rather than current models, and it was attached to a numerical resource commitment — 20% of secured compute — that had no precedent in the industry.[^1][^11]
OpenAI announced the Superalignment team in a blog post titled "Introducing Superalignment," published on July 5, 2023 and authored under the bylines of Sutskever and Leike.[^1] The post opened by characterising superintelligence as both the "most impactful technology humanity has ever invented" and as potentially "very dangerous, and could lead to the disempowerment of humanity or even human extinction," then argued that although such systems "may seem far off now, we believe they could arrive this decade."[^1]
The team was co-led by Ilya Sutskever, OpenAI's co-founder and chief scientist, and Jan Leike, the company's head of alignment.[^1][^3] Leike, a German-born AI safety researcher who had completed his PhD at the Australian National University under Marcus Hutter and previously worked at Google DeepMind under Shane Legg, had joined OpenAI in 2021 and was one of the principal authors of the InstructGPT paper that underpins the RLHF training of ChatGPT and GPT-4.[^13]
OpenAI did not publicly disclose the precise size of the Superalignment team, but stated that it would be "assembling a team of top machine learning researchers and engineers" drawing on staff "from our previous alignment team, as well as researchers and engineers from other teams across the company."[^1] By the time of the team's dissolution in May 2024, reporting placed its size at roughly 25 researchers, with public co-authorship of team papers identifying names including Collin Burns, Leopold Aschenbrenner, Pavel Izmailov, Jan Hendrik Kirchner, Leo Gao, Bowen Baker, Jeffrey Wu, Yining Chen, Adrien Ecoffet and Manas Joglekar.[^4][^11]
The announcement set an unusually concrete objective: "Our goal is to solve the core technical challenges of superintelligence alignment in four years."[^1] That timeline, running from July 2023 to mid-2027, was justified by the team's view that systems capable of meaningfully accelerating alignment work might appear within that window. The post acknowledged the goal was "incredibly ambitious" and that there was no guarantee of success, but argued that "we are optimistic that a focused, concerted effort can solve this problem."[^1]
The work plan had three components: (1) develop scalable training methods that allow human oversight to be extended to superhuman behaviour, (2) validate the resulting models so that one can be confident they remain aligned, and (3) "stress-test" the alignment pipeline by deliberately training misaligned models and checking whether the alignment techniques detect them.[^1] The intended endpoint, repeatedly emphasised, was an automated alignment researcher: a model able to conduct alignment research at roughly human level, which could then be used to attack the residual problem of aligning still more powerful systems.[^1]
The most concrete — and ultimately most contentious — commitment in the announcement was a resource allocation pledge: "OpenAI is dedicating 20% of the compute we've secured to date over the next four years to solving the problem of superintelligence alignment. Our chief basic research bet is our new Superalignment team, but getting this right is critical to achieve our mission and we expect many teams to contribute, from developing new methods to scaling them up to deployment."[^1]
At the time of the announcement, the pledge was widely cited — in coverage by TechCrunch, MIT Technology Review and others — as evidence of an exceptional institutional commitment to safety, and as a competitive benchmark that other frontier labs would have to address.[^3][^4] The 20% figure was attached specifically to "compute secured to date" (as of July 2023), rather than to all future compute, but no public mechanism was created for auditing the allocation.
Following Sutskever's and Leike's resignations in May 2024, Fortune published a detailed exclusive on May 21, 2024 reporting that the pledge had never been honoured. Citing multiple sources familiar with the team's operations, the magazine wrote that the Superalignment team had been "never given anything close to 20% of OpenAI's secured compute."[^11] Compute requests from the team — including requests for shares well below the headline 20% figure — had been "repeatedly denied" by senior leadership. Reporting identified VP of Research Bob McGrew as the executive who typically conveyed the denials, with chief technology officer Mira Murati also involved in resourcing decisions.[^11] TechCrunch's complementary reporting on May 18 quoted a person familiar with the team's work who said the team had been "sailing against the wind" trying to get sufficient compute and had ultimately been "left to wither."[^9]
The compute story was central to Leike's public resignation statement. On X (Twitter), he wrote: "Over the past few months my team has been sailing against the wind. Sometimes we were struggling for compute and it was getting harder and harder to get this crucial research done."[^7][^8] The disjunction between the public 20% commitment and the team's lived experience became a recurring theme in subsequent coverage and shaped much of the public criticism of OpenAI's safety posture in mid-2024.
Despite the resource frictions later revealed, the Superalignment team produced a substantive body of public work during its ten-month active lifespan. The two most prominent technical papers were "Weak-to-Strong Generalization" (December 2023) and "Prover-Verifier Games" (July 2024), the latter of which appeared after the team's formal dissolution but was drafted while it still existed.
The team's flagship paper, "Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision," was posted to arXiv on December 14, 2023 (preprint identifier 2312.09390).[^18] The author list, in order, was Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeffrey Wu.[^18] OpenAI published an accompanying blog post under the title "Weak-to-strong generalization."[^19]
The paper took as its starting point the observation that future superhuman models would, by definition, behave in ways too complex for humans to evaluate reliably. Researchers therefore needed an empirical analogue for the eventual situation in which humans were the "weak" supervisor of a "strong" superhuman model. The team's central proposal was to study the analogous setup between two existing models of different capability levels: using GPT-2-level models to supervise GPT-4-level models on standard NLP, chess and reward-modelling tasks.[^18][^19]
Their key finding was that, when strong pretrained models were naively fine-tuned on labels generated by weak models, the resulting student models "consistently perform better than their weak supervisors," demonstrating a generalisation phenomenon that they termed weak-to-strong generalization.[^18] Naive fine-tuning did not recover the full capability of the strong model, but combining it with an auxiliary confidence loss closed much of the gap: in some NLP settings, a GPT-2-level supervisor could elicit performance close to GPT-3.5 levels from a GPT-4 student. The paper framed this as an empirical foothold for a research programme that could, in principle, scale to genuinely superhuman models.[^18][^19] MIT Technology Review's December 14, 2023 write-up characterized the work as the team's "first major public deliverable" and as a "proof of concept" for the broader four-year plan.[^20]
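The two quantities discussed above can be made concrete with a short sketch. It assumes the paper's "performance gap recovered" metric (the share of the gap between the weak supervisor and the strong model's ceiling that the weakly supervised student closes) and a simplified binary form of the auxiliary confidence loss, in which the student's own thresholded prediction is mixed in as a second target. The function names, the fixed `threshold`, the default `alpha`, and all accuracy figures are illustrative assumptions, not values or code from the paper, which as far as can be told uses an adaptive threshold and a warm-up schedule for the mixing weight.

```python
import math


def performance_gap_recovered(weak_acc, weak_to_strong_acc, strong_ceiling_acc):
    """Fraction of the weak-to-ceiling accuracy gap closed by the student.

    0.0 means the student is no better than its weak supervisor;
    1.0 means it matches a strong model trained on ground-truth labels.
    """
    gap = strong_ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak supervisor accuracy")
    return (weak_to_strong_acc - weak_acc) / gap


def binary_cross_entropy(target, prob, eps=1e-12):
    """Cross-entropy between a target in [0, 1] and a predicted probability."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def confidence_loss(student_prob, weak_label, alpha=0.5, threshold=0.5):
    """Simplified auxiliary confidence loss for one binary example.

    Mixes the usual loss against the weak label with a loss against the
    student's own hardened (thresholded) prediction, so a confident student
    is penalised less for disagreeing with a possibly wrong weak label.
    """
    hardened = 1.0 if student_prob > threshold else 0.0
    weak_term = binary_cross_entropy(weak_label, student_prob)
    self_term = binary_cross_entropy(hardened, student_prob)
    return (1.0 - alpha) * weak_term + alpha * self_term


if __name__ == "__main__":
    # Hypothetical accuracies, chosen only to illustrate the arithmetic.
    pgr = performance_gap_recovered(0.60, 0.75, 0.85)
    print(f"performance gap recovered: {pgr:.2f}")

    # Student is confident the label is 1, but the weak supervisor says 0.
    print(f"naive loss:      {confidence_loss(0.9, 0.0, alpha=0.0):.3f}")
    print(f"confidence loss: {confidence_loss(0.9, 0.0, alpha=0.5):.3f}")
```

With `alpha=0` the function reduces to naive fine-tuning on weak labels. Raising `alpha` shifts weight toward the student's own confident predictions, which is the mechanism the confidence term relies on to avoid imitating the weak supervisor's mistakes.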
Concurrent with the weak-to-strong paper, OpenAI launched an academic funding program called "Superalignment Fast Grants" in December 2023, in partnership with former Google CEO Eric Schmidt.[^5] The program offered grants of US$100,000 to US$2 million to academic labs, nonprofits and individual researchers working on technical problems related to aligning superhuman systems. Graduate students were eligible for a separate one-year US$150,000 OpenAI Superalignment Fellowship comprising US$75,000 in stipend and US$75,000 in compute and research funding.[^5]
The headline budget was US$10 million.[^21] Topic areas explicitly listed in the call for proposals included weak-to-strong generalisation, scalable oversight, interpretability, honesty and trust evaluations, and adversarial robustness. According to subsequent press accounts, the program received roughly 2,700 applications and made approximately 50 awards.[^21]
In July 2024 — after the Superalignment team had formally been dissolved — a separate paper drafted under the team's banner, "Prover-Verifier Games Improve Legibility of LLM Outputs," appeared with authorship including Jan Hendrik Kirchner, Yining Chen, Harri Edwards, Jan Leike, Nat McAleese and Yuri Burda.[^22] The paper proposed a training scheme inspired by prover–verifier games, in which a small verifier model is trained to predict whether a solution is correct, a "helpful prover" is trained to produce correct solutions that the verifier accepts, and a "sneaky prover" is trained to produce incorrect solutions that fool the verifier; the procedure is iterated to push the verifier and provers in opposite directions.[^22] The aim was to improve the legibility (i.e. human-checkability) of model outputs in settings where a less-capable verifier (analogous to a human auditor) had to assess the work of a more-capable solver.
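The incentive structure of the three roles described above can be sketched as a toy reward scheme. This is a deliberately simplified illustration, not the paper's actual reward: the published method trains the provers with reinforcement learning and uses a more elaborate reward. The function names, the flat `-1.0` penalty and the squared-error verifier loss are all hypothetical choices made for readability.

```python
def prover_reward(role, is_correct, verifier_score):
    """Toy role-conditioned reward for one sampled solution.

    role:           "helpful" (should be correct) or "sneaky" (should be wrong)
    is_correct:     ground-truth correctness of the solution
    verifier_score: verifier's estimated probability, in [0, 1], that the
                    solution is correct
    """
    if role not in ("helpful", "sneaky"):
        raise ValueError(f"unknown role: {role!r}")
    wants_correct = role == "helpful"
    if is_correct == wants_correct:
        # Role-consistent solution: reward grows with how convincing the
        # verifier finds it.
        return verifier_score
    # Role-inconsistent solution: flat penalty (hypothetical value).
    return -1.0


def verifier_loss(is_correct, verifier_score):
    """Squared error between the verifier's score and the true label."""
    target = 1.0 if is_correct else 0.0
    return (verifier_score - target) ** 2


if __name__ == "__main__":
    # A sneaky prover fools the verifier with a wrong solution scored 0.8:
    print(prover_reward("sneaky", False, 0.8))   # sneaky prover is rewarded
    print(verifier_loss(False, 0.8))             # verifier is penalised
    # A helpful prover produces a correct solution the verifier accepts:
    print(prover_reward("helpful", True, 0.9))
```

Under this sketch the sneaky prover gains exactly where the verifier loses, so alternating updates push the verifier toward robustness while the helpful prover is rewarded for correct solutions that a weaker checker can still follow.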
Beyond the two flagship papers, members of the team contributed to a range of smaller-scale technical work on topics including scalable oversight, automated red-teaming, debate-style training, honesty evaluations, and the use of GPT-4-class models to assist in the evaluation of other models. The research output was generally consistent with the agenda described in the July 2023 announcement: rather than attempting to solve the alignment of superintelligent systems directly, the team tried to demonstrate, on contemporary models, tractable analogues of the techniques that might eventually be required.
The Superalignment team's organisational footing was shaken less than five months after its founding by the corporate crisis that engulfed OpenAI in November 2023. The board's removal of CEO Sam Altman was announced on the afternoon of November 17, 2023; Altman was reinstated on November 22.[^23][^24]
Sutskever, as a member of OpenAI's nonprofit board, played a central role in the firing. Reporting and his own subsequent public statements indicated that he authored or commissioned a memo arguing that Altman had been "not consistently candid" with the board.[^23][^24] Within days, however, sustained pressure from employees — including a letter signed by roughly 745 of about 770 staff threatening to follow Altman to a new venture at Microsoft unless the board resigned — caused Sutskever to reverse course. He publicly signed the employee letter and, on November 20, posted on X: "I deeply regret my participation in the board's actions. I never intended to harm OpenAI."[^24]
When Altman was reinstated on November 22 with a restructured interim board (Bret Taylor as chair, Adam D'Angelo and Lawrence Summers as initial directors), Sutskever stepped down from the board.[^24] He remained an employee but, according to widely cited reporting in the following months, was not seen at OpenAI's San Francisco offices. The November crisis was widely interpreted as having damaged Sutskever's political standing within the company and, by extension, his ability to advocate effectively for the Superalignment team's resource needs.[^11]
Fortune's May 2024 reporting on the team's unmet compute pledge made this connection explicit, quoting sources familiar with the matter as saying that Sutskever's reduced influence after November 2023 made it harder for him to defend the team's compute requests internally.[^11]
The team's terminal phase began on May 14, 2024 — coincidentally the day after OpenAI's launch event for the GPT-4o model — when Sutskever announced he was leaving OpenAI.
Sutskever posted his resignation on X on the afternoon of May 14, 2024, writing: "After almost a decade, I have made the decision to leave OpenAI. The company's trajectory has been nothing short of miraculous, and I'm confident that OpenAI will build AGI that is both safe and beneficial under the leadership of @sama, @gdb, and @miramurati, and now, under the excellent research leadership of @merettm. It was an honor and a privilege to have worked together, and I will miss everyone dearly. So long, and thanks for everything. I am excited for what comes next — a project that is very personally meaningful to me about which I will share details in due time."[^25]
Altman replied with his own post, calling Sutskever "easily one of the greatest minds of our generation, a guiding light of our field, and a dear friend" and writing that "OpenAI would not be what it is without him."[^25] OpenAI named research director Jakub Pachocki as Sutskever's successor as chief scientist.[^25]
Less than 24 hours later, on the morning of May 15, 2024, Leike posted a brief message announcing "I resigned" without further elaboration.[^7] He gave no public reasons at the time, but reporting indicated that he had been at odds with company leadership over the team's resource allocation and the broader pace of capabilities work.[^9][^11]
Two days later, on the evening of Friday May 17, 2024, Leike posted an extended thread on X explaining his decision. The thread, posted across multiple consecutive messages, contained passages that became among the most widely quoted public statements ever made by a departing AI safety researcher.[^7][^26][^27]
The opening message read: "Yesterday was my last day as head of alignment, superalignment lead, and executive @OpenAI."[^7] He continued: "I joined because I thought OpenAI would be the best place in the world to do this research. However, I have been disagreeing with OpenAI leadership about the company's core priorities for quite some time, until we finally reached a breaking point."[^28]
Leike argued for a sharply different allocation of effort: "I believe much more of our bandwidth should be spent getting ready for the next generations of models, on security, monitoring, preparedness, safety, adversarial robustness, (super)alignment, confidentiality, societal impact, and related topics. These problems are quite hard to get right, and I am concerned we aren't on a trajectory to get there."[^26][^27]
He described the resource constraints faced by his team in concrete terms: "Over the past few months my team has been sailing against the wind. Sometimes we were struggling for compute and it was getting harder and harder to get this crucial research done."[^7][^27]
The thread reached its most quoted passage with: "Building smarter-than-human machines is an inherently dangerous endeavor. OpenAI is shouldering an enormous responsibility on behalf of all of humanity. But over the past years, safety culture and processes have taken a backseat to shiny products."[^26][^27] He closed by addressing OpenAI's remaining staff directly: "To all OpenAI employees, I want to say: Learn to feel the AGI. Act with the gravitas appropriate for what you're building. I believe you can 'ship' the cultural change that's needed. I am counting on you. The world is counting on you."[^28] In his final post in the thread, he wrote: "OpenAI must become a safety-first AGI company."[^7][^27]
Within hours of Leike's thread, multiple outlets (first Bloomberg and Wired, then Axios, CNBC and others) reported that OpenAI had dissolved the Superalignment team as a standalone organisational unit and would be folding its work into broader research efforts across the company.[^9][^29][^30] OpenAI confirmed the change to reporters but framed it as a "deeper integration" of long-term safety work into the company's other research groups.[^9][^29] Co-founder John Schulman, until then head of post-training, was identified as the new scientific lead for the company's alignment work, although there would no longer be a dedicated alignment team.[^29] Schulman himself would leave OpenAI in August 2024, joining Anthropic.[^31]
The May 2024 resignations did not happen in isolation. In April 2024, Superalignment team members Leopold Aschenbrenner and Pavel Izmailov had been fired by OpenAI; Aschenbrenner later said publicly that he had been dismissed after sharing a "benign brainstorming document" with three external researchers and after raising internal concerns about the company's security posture against foreign espionage.[^32][^33] OpenAI maintained the dismissals were unrelated to such concerns.
Other safety-focused researchers who left OpenAI in the months around the team's dissolution included Daniel Kokotajlo, a member of the governance team who resigned in April 2024 saying he had "gradually lost trust in OpenAI leadership and their ability to responsibly handle AGI," and William Saunders, a Superalignment team member who had left earlier in 2024.[^33][^34] Kokotajlo's case drew particular attention because he reportedly refused to sign OpenAI's non-disparagement agreement on departure, forfeiting equity that he described as roughly 85% of his family's net worth.[^34]
The dissolution of the Superalignment team prompted both organisational changes within OpenAI and a redistribution of senior alignment talent across the industry.
On May 28, 2024, eleven days after the team's dissolution, OpenAI announced the formation of a new Safety and Security Committee, chaired by board director Bret Taylor and including directors Adam D'Angelo and Nicole Seligman alongside CEO Sam Altman.[^35] The committee also included internal staff such as head of preparedness Aleksander Mądry, head of safety systems Lilian Weng, co-founder John Schulman, security chief Matt Knight and chief scientist Jakub Pachocki.[^35] Its initial remit was a 90-day review of OpenAI's processes and safeguards. Critics, including TechCrunch in its coverage of the announcement, noted that the committee was composed entirely of insiders, including the very executives whose decisions had been the subject of the recent departures.[^35]
On June 19, 2024 — about five weeks after his departure — Sutskever announced a new company, Safe Superintelligence Inc. (SSI), co-founded with former Apple AI lead Daniel Gross and former OpenAI researcher Daniel Levy.[^12] The company's announcement stated that its "first product will be the safe superintelligence, and it will not do anything else up until then," positioning SSI as an explicit ideological successor to the Superalignment project's aims but located outside any existing commercial AI lab.[^12] SSI is headquartered in Palo Alto, California and Tel Aviv, Israel. The company raised US$1 billion in September 2024, and in March 2025 closed a further funding round at a US$30 billion valuation.[^12]
On May 28, 2024 — eleven days after his resignation thread — Leike announced he had joined Anthropic to lead a new Alignment Science team.[^13][^14] His mandate, as described in his own announcement post and in coverage by TechCrunch and CNBC, was to work on "scalable oversight, weak-to-strong generalization, and automated alignment research" — essentially the same research agenda he had been pursuing at OpenAI's Superalignment team.[^14] Leike reports to Anthropic chief science officer Jared Kaplan, and existing Anthropic researchers working on scalable oversight were re-organised to report into the new team.[^14]
The personnel migration from OpenAI's safety teams to Anthropic continued through the rest of 2024 and into 2025. OpenAI co-founder John Schulman, who had briefly been named OpenAI's alignment lead after the dissolution of Superalignment, joined Anthropic in August 2024.[^31]
Although the OpenAI Superalignment team is the best-known use of the term "superalignment," the underlying research problem (aligning systems more capable than their human supervisors) has been studied across multiple labs, both before and after the OpenAI program existed.
Anthropic has pursued Constitutional AI and related techniques since 2022, in which models are trained against a written constitution of principles rather than purely against human preference labels, and has invested heavily in mechanistic interpretability research.[^17] After absorbing Leike, Schulman and other former OpenAI safety researchers in 2024, Anthropic became the principal industrial home for the weak-to-strong and scalable oversight research agenda originally articulated at OpenAI.[^14]
Google DeepMind has maintained an alignment and safety team since long before the term "superalignment" entered common usage, and in May 2024 published the Frontier Safety Framework specifying capability thresholds at which additional safety measures would be triggered.[^17] DeepMind has also been a major contributor to interpretability and scalable oversight research, including work on debate and recursive reward modelling that predates the OpenAI team.
Academic and nonprofit work on the problem includes research at the Alignment Research Center founded by former OpenAI researcher Paul Christiano, work at the Center for AI Safety, and a growing body of literature on subtopics such as deceptive alignment, inner alignment and outer alignment.[^15] The Superalignment Fast Grants program, while attached to OpenAI's now-disbanded team, distributed funding to approximately 50 grantees across this broader ecosystem.[^21]
The Superalignment program drew critical scrutiny from several directions both during and after its existence.
Some prominent figures in machine learning have rejected the conceptual framing on which superalignment rests, arguing that the underlying threat model is overstated. Meta chief AI scientist Yann LeCun and AI researcher Andrew Ng have both characterized fears of misaligned superintelligence as premature or as a form of hype; LeCun has compared safety research on hypothetical superintelligent systems to "designing seat belts for a car that doesn't exist yet."[^36] Other critics have argued that an exclusive focus on existential-scale risks distracts from contemporary harms — bias, surveillance, labour displacement, misinformation — that are already attributable to deployed systems.
A second strand of criticism concerned the implementation of the OpenAI program specifically. Even commentators sympathetic to the Superalignment research agenda observed that the 20% compute pledge was made public without any external auditing mechanism and that the program's organisational survival depended on the continuing political authority of its two co-leads, both of whom were known to be in tension with the company's commercial leadership.[^11] Fortune's May 21, 2024 reporting on the unfulfilled compute pledge crystallised this critique by demonstrating that the public commitment had functioned, in practice, more as a reputational announcement than as an enforceable internal allocation.[^11]
The combined resignations of Sutskever and Leike, and Leike's subsequent thread, generated extensive coverage in mainstream outlets including Bloomberg, Reuters, Wired, the Washington Post, Time, CNN, Fortune, the Financial Times and TechCrunch.[^9][^11][^25][^26][^29] Several outlets framed the events as a turning point in the public credibility of self-regulating AI labs. CNBC's coverage on May 17, 2024 noted that the team's dissolution came "less than one year after announcing it," highlighting the gap between the public framing of the four-year goal and the reality of the team's brief organisational life.[^29]
A persistent theme in this coverage was the comparison between OpenAI's public commitments and its observed conduct. The four-year goal, the 20% compute pledge and the explicit framing of superintelligence alignment as a humanity-scale problem had been presented in the July 2023 announcement as institutional commitments backed by senior leadership, and the team's dissolution within ten months — without any public accounting of what fraction of secured compute had been delivered — was widely read as undermining the credibility of comparable voluntary commitments by frontier labs.
The episode also accelerated calls — in the press, in academic commentary and in policy circles — for external mechanisms (such as the UK and US AI Safety Institutes, see AI Safety) to verify safety commitments made by frontier labs, rather than relying on voluntary internal pledges.[^17] By 2025, both the UK AI Safety Institute and the US AI Safety Institute (later renamed the Center for AI Standards and Innovation) had pre-deployment testing arrangements in place with several major labs, partly in response to the perceived inadequacy of purely internal safety governance.
A secondary effect of the dissolution was the redistribution of senior safety researchers across the industry, with Anthropic in particular emerging as the principal beneficiary. By the end of 2024, Anthropic had hired Leike, Schulman and several other former OpenAI safety staff; the company's research output on scalable oversight, weak-to-strong generalisation and automated alignment in the months that followed showed substantial continuity with the agenda originally articulated at OpenAI's Superalignment team.[^14] Safe Superintelligence Inc., in turn, became a destination for a smaller number of former OpenAI staff who shared Sutskever's preference for working outside any existing commercial AI product organisation. The combined effect was that the research agenda first organised under the "superalignment" banner did not disappear with the team's dissolution; rather, it was reconstituted across a small number of other labs, with somewhat different institutional incentives and oversight structures.