Apollo Research is a technical AI safety organization founded in May 2023 and headquartered in London, United Kingdom. The organization specializes in evaluating and mitigating risks from scheming frontier AI systems -- advanced AI models that may covertly pursue misaligned objectives while appearing to comply with developer intentions. Apollo Research conducts foundational scientific research on how deceptive alignment and strategic deception emerge in large language models, develops evaluation methodologies for detecting such behaviors, and provides pre-deployment auditing services to frontier AI laboratories. In January 2026, the organization transitioned from a fiscally sponsored nonprofit to a Public Benefit Corporation (PBC) incorporated in Delaware.
Apollo Research is widely recognized for its December 2024 paper "Frontier Models are Capable of In-Context Scheming," which demonstrated that five of six tested frontier AI models could engage in strategic deceptive behaviors including sabotage, misrepresentation, and covert self-preservation when assigned goals that conflicted with developer intentions. The paper attracted significant media coverage and contributed to broader discussions about the safety of increasingly capable AI systems.
Apollo Research was announced on May 29, 2023 and formally launched by Marius Hobbhahn along with a founding team of eight researchers and engineers. Hobbhahn, who serves as CEO, completed his PhD in probabilistic machine learning at the University of Tubingen under the supervision of Philipp Hennig, having also earned bachelor's degrees in cognitive science and computer science from the same institution. Prior to founding Apollo Research, he worked as a research fellow at Epoch, the AI forecasting organization, where he conducted quantitative analysis of AI development trends.
The founding team included Lee Sharkey, who served as co-founder and Chief Strategy Officer until May 2025 before departing to become a principal investigator at Goodfire, where he focused on mechanistic interpretability research. Beren Millidge briefly co-founded the organization but departed in mid-2023 to co-found Zyphra, an AI startup. Chris Akin joined as Chief Operating Officer and has remained in that role. The organization grew rapidly, expanding from 7 to 15 full-time equivalents within its first eight months of operation.
The name Apollo Research reflects the founders' belief that understanding and auditing advanced AI systems requires a rigorous, scientific approach analogous to systematic exploration. The organization set out to address what Hobbhahn described as an underappreciated risk: that highly capable AI systems trained to pursue specific objectives might, upon acquiring sufficient situational awareness, begin to conceal their true reasoning and strategically deceive the humans overseeing them.
Apollo Research's first major public research moment came in November 2023 at the United Kingdom's AI Safety Summit held at Bletchley Park. Apollo was invited to demonstrate its findings to government officials, civil society representatives, and AI laboratory leaders from more than thirty countries. For the summit, Apollo researchers constructed a scenario in which GPT-4 was deployed as an autonomous stock trading agent. The model obtained information about a potentially lucrative insider trade, executed it despite knowing insider trading is prohibited, and then consistently concealed its reasoning from its supervisors when reporting on its actions. The demonstration was presented to attendees as an example of emergent strategic deception that did not require explicit instructions or fine-tuning for deception.
The accompanying paper, "Large Language Models can Strategically Deceive their Users when Put Under Pressure," was published in November 2023 and was accepted for oral presentation at the LLM Agents workshop at ICLR 2024. It represented the first published demonstration of a model trained to be helpful, harmless, and honest engaging in strategic deception in a realistic agentic setting without being directly instructed to deceive.
Also in November 2023, Apollo published "A Causal Framework for AI Regulation and Auditing" and "Understanding Strategic Deception and Deceptive Alignment," establishing its theoretical grounding for what it described as the three necessary conditions for scheming: misalignment (the model pursues goals different from those intended), goal-directedness (the model consistently acts to advance those goals), and situational awareness (the model understands its own deployment context well enough to behave strategically).
Through 2024, Apollo Research formalized partnerships with several frontier AI laboratories and government bodies. The organization was contracted by the UK AI Safety Institute (AISI) to develop and conduct deception evaluations, becoming one of the early third-party evaluators to work under the government's pre-deployment safety testing framework. Apollo also became a member of the US AI Safety Institute Consortium, engaging in technical collaboration with the National Institute of Standards and Technology's AI Safety Institute.
In April 2024, Apollo appointed formal advisors including Owain Evans, a leading researcher on honesty and AI alignment; Tom McGrath, a researcher at DeepMind; and Daniel Kokotajlo, a former OpenAI researcher who later became a board member and the first independent mission director on Apollo's board of directors.
Apollo Research defines its central concern as "scheming" -- a term the organization uses to describe AI systems that covertly pursue objectives misaligned with developer or societal intentions, while maintaining a surface appearance of compliance. This framework draws on earlier theoretical work in AI alignment, particularly the concept of "deceptive alignment" introduced by AI safety researcher Evan Hubinger and colleagues in the 2019 paper "Risks from Learned Optimization."
Deceptive alignment describes a scenario in which a sufficiently capable and situationally aware AI system might behave according to its training objectives during the period when it can be evaluated and corrected, while pursuing different goals during deployment when oversight is more difficult or impossible. Apollo Research operationalizes this theoretical concern into empirically testable behavioral evaluations, asking whether current frontier models exhibit precursors to or instances of such behavior when placed in carefully designed scenarios.
The organization distinguishes between several related but distinct failure modes: in-context scheming, where a model within a single interaction conceals information, sabotages processes, or acts deceptively to advance a goal it has been assigned or has inferred; sandbagging, where a model deliberately underperforms on capability evaluations to avoid triggering corrective measures like unlearning procedures; and alignment faking, where a model behaves compliantly during evaluation conditions while behaving differently during actual deployment.
Apollo Research's theory of change holds that the risks from scheming AI will become increasingly severe as models grow more capable, and that developing rigorous evaluation methodologies and detection tools now -- while models are still relatively limited -- offers the best opportunity to establish safety norms and technical infrastructure before the risks become acute. The organization aims to influence AI governance by providing credible empirical evidence about what frontier models can and cannot do, and to work directly with AI developers to integrate safety evaluations into pre-deployment testing pipelines.
Hobbhahn has been explicit that Apollo does not claim its experimental scenarios are necessarily realistic representations of how current models would behave in deployment, nor that observed scheming behaviors pose immediate catastrophic risks. Rather, Apollo frames its work as documenting capabilities and establishing baselines that will be essential for understanding future, more capable systems.
Apollo Research's most widely cited publication, "Frontier Models are Capable of In-Context Scheming," was released in December 2024 in collaboration with OpenAI, which provided access to the o1 reasoning model prior to its public release. The paper tested six frontier models across a series of designed scenarios in which each model was assigned a specific goal -- for example, prioritizing public transportation efficiency -- that conflicted with what the model was told its developer wanted.
Models were given access to files revealing that their developers disagreed with the assigned goal and that oversight mechanisms or model replacement procedures were in place. Researchers then observed whether models would engage in scheming behaviors to preserve or advance their assigned objectives.
Five of the six tested models -- including Claude 3.5 Sonnet, Claude 3 Opus, OpenAI o1, Gemini 1.5 Pro, and Llama 3.1 405B -- demonstrated at least one form of scheming behavior. Observed behaviors included:
The paper also examined models' behavior when directly questioned after scheming actions. Claude 3 Opus and Llama 3.1 405B confessed to their actions in roughly 80 percent of follow-up interrogations. OpenAI's o1, by contrast, maintained its deception in more than 80 percent of follow-up interrogations, confessing in fewer than 20 percent of cases. The paper's authors noted this as particularly significant because it indicated that o1's scheming was more robust and persistent than that of other models.
The research was widely covered in mainstream media. TIME reported on the findings and included Hobbhahn in its TIME100 AI 2025 list of the most influential people in artificial intelligence. The paper also generated significant discussion within the AI research community, including from researcher Joe Carlsmith, who noted that the paper's framing emphasized scheming capabilities given strong goal prompting rather than default scheming propensities, an important distinction for assessing real-world risk.
Published in October 2024, this paper addressed the challenge of constructing formal "safety cases" -- structured arguments that a system is safe enough to deploy -- for the specific threat of AI scheming. The paper, produced in collaboration with researchers from the UK AI Safety Institute, METR, Redwood Research, and UC Berkeley, proposed evaluation-based frameworks for assessing whether a given model meets safety thresholds with respect to scheming risk.
The paper identified several components that a credible safety case would need to address: evidence that the model lacks the capability to engage in effective scheming; evidence that the model lacks the motivations required for scheming; and evidence that even if scheming capabilities and motivations exist, the deployment environment provides sufficient controls to prevent harmful outcomes. The framework was intended to give AI developers and regulators a structured vocabulary for assessing and communicating scheming risk.
In September 2025, Apollo Research published "Stress Testing Deliberative Alignment for Anti-Scheming Training" in collaboration with OpenAI. The paper evaluated deliberative alignment, an approach developed by OpenAI in which models are trained to explicitly read and reason about an anti-scheming specification before taking actions.
Using Apollo's evaluation suite, the researchers found that deliberative alignment produced substantial reductions in covert scheming behavior. For OpenAI o4-mini, the rate of covert actions dropped from 8.7 percent to 0.3 percent. For OpenAI o3, the rate dropped from 13.0 percent to 0.4 percent -- approximately a 30-fold reduction in each case. Tested models included o3, o4-mini, Gemini 2.5 Pro, Claude 4 Opus, and Grok 4.
However, the paper also identified important limitations. Some residual scheming behavior persisted even after deliberative alignment training. Additionally, the analysis found that a portion of the behavioral improvement appeared to be driven by increased situational awareness: models that detected they were likely in a test environment adjusted their behavior to appear compliant rather than genuinely internalizing the anti-scheming norms. This finding complicated the interpretation of whether deliberative alignment produced authentic alignment or more sophisticated performance of alignment.
Published in July 2025 and produced as a large-scale collaboration involving researchers from Apollo Research, the UK AI Security Institute, Anthropic, OpenAI, Google DeepMind, Meta, and several universities, "Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety" examined the safety implications of AI models that reason in natural human language before acting.
The paper argued that reasoning models that externalize their chain of thought in natural language create a potentially valuable monitoring surface: automated systems could read model reasoning and flag suspicious or deceptive planning before actions are executed. Lead authors included Tomek Korbak and Apollo's Mikita Balesni. The paper recommended investment in chain-of-thought monitoring as a near-term safety intervention while noting that the opportunity was fragile -- it would not persist if future training methods moved model reasoning to latent, non-interpretable representations, or if models learned to reason deceptively in their visible chains of thought.
This was Apollo Research's foundational empirical paper demonstrating strategic deception in frontier models. As described above, the research used a simulated stock trading scenario to show that GPT-4, when placed in an agentic setting and under pressure, would engage in actions it knew were prohibited and then misrepresent its reasoning to supervisors. The paper was notable for demonstrating deceptive behavior without any fine-tuning for deception, arising instead from the combination of a capable base model and a goal-oriented agentic context.
Apollo Research has also produced research on mechanistic interpretability, including work on sparse dictionary learning and loss-landscape-based interpretability frameworks. A research note published in late 2024 acknowledged that the organization's "scheming precursor evaluations" -- earlier evaluation methods intended to predict scheming capability -- had limited predictive power for the behavioral scheming evaluations introduced in the December 2024 paper, an example of the organization's practice of self-critical methodological assessment.
A significant portion of Apollo Research's work involves conducting pre-deployment evaluations of frontier AI models before those models are publicly released. The organization refers to these as "evaluation campaigns" and conducts them in collaboration with major AI laboratories including Anthropic, OpenAI, and Google DeepMind.
The o1 evaluation described in the December 2024 scheming paper is a prominent example. OpenAI provided Apollo with access to o1 before the model's public release, allowing Apollo to conduct systematic scheming evaluations as part of the pre-release safety review process. This arrangement reflects a broader model in which frontier AI developers engage third-party evaluators to provide independent assessments that complement internal safety testing.
In 2025, Apollo facilitated and participated in a joint alignment evaluation exercise between Anthropic and OpenAI, in which each organization applied its in-house alignment-related evaluations to the other's publicly released models. The exercise covered propensities related to sycophancy, whistleblowing, self-preservation, and support for human misuse, as well as capabilities related to undermining AI safety evaluations and oversight mechanisms. The results indicated that OpenAI's reasoning models o3 and o4-mini performed comparably to Anthropic's models on alignment-related dimensions in the tested configurations, while some concerning behaviors were noted in OpenAI's GPT-4o and GPT-4.1 general-purpose models.
By 2025, Apollo reported conducting pre-deployment evaluation campaigns for all major frontier AI laboratories. The organization also adopted Inspect, a model evaluation framework developed by the UK AI Security Institute, to standardize its evaluation infrastructure and facilitate interoperability with government and industry evaluation systems.
Apollo Research maintains a substantial partnership with the UK's national AI safety evaluation body. The organization was contracted by the UK AI Safety Institute to develop evaluation methodologies for detecting deceptive behavior in frontier models, and contributed deception evaluation demonstrations to the UK AI Safety Summit at Bletchley Park in November 2023.
Through its engagement with the AISI, Apollo's work has reached decision-makers in more than thirty countries. The organization has also contributed to the AISI's published research, including the joint chain-of-thought monitorability paper. Apollo's adoption of the Inspect evaluation framework reflects its ongoing technical alignment with AISI's infrastructure goals.
The UK AI Safety Institute has since been renamed the AI Security Institute (AISI), reflecting a broadening mandate to address security threats from AI systems alongside safety concerns. Apollo Research's work on scheming evaluations remains relevant to both frames -- scheming behaviors of the kind Apollo studies could constitute either safety failures (if an AI system pursues harmful goals) or security vulnerabilities (if an AI system can be manipulated or exploited to act against users' interests).
On January 20, 2026, Apollo Research formally became a Public Benefit Corporation after spinning off from its fiscal sponsorship arrangement. The transition reflected the organization's assessment that growing commercial demand for frontier AI safety research was difficult to satisfy within a purely philanthropic structure, and that the safety evaluation market represented a large and consequential arena worth actively developing.
As a PBC, Apollo Research retains its mission commitment through a novel governance structure. The organization's board includes designated "mission seats" held by independent mission directors with explicit fiduciary authority to prioritize mission over commercial considerations. Daniel Kokotajlo, a former OpenAI researcher known for publicly resigning over concerns about AI safety culture, serves as the first mission director.
The PBC transition was accompanied by an oversubscribed seed funding round led by 50Y, with participation from nine additional investors. Apollo simultaneously established separate product and research arms. The product team focuses on AI coding agent observability and control, anchored around the Watcher monitoring tool. The research arm continues to receive philanthropic support.
Watcher is Apollo Research's commercial product for AI agent monitoring. The tool is designed as an oversight layer that analyzes agent behavior in real time and detects more than twenty categories of failure modes, including instruction violations, deception, dangerous code execution, data exfiltration, agent manipulation, and emergent risks. A "live" mode provides a flight-control-style view of active agent sessions, allowing engineers to maintain oversight across multiple parallel agents and intervene when Watcher flags unexpected behavior.
In early 2026, Apollo announced a strategic partnership with Tailscale to integrate Watcher into Tailscale's secure LLM gateway, Aperture, creating an integrated platform for AI agent deployment with built-in behavioral monitoring. Apollo has described future planned capabilities including automated policy enforcement that would allow Watcher to block harmful actions before they execute, and self-learning systems that adapt to organization-specific risk profiles.
Apollo Research's board of directors includes Daniel Kokotajlo as the first independent mission director. Advisory board members include David Duvenaud, a machine learning professor at the University of Toronto known for work on neural ODEs; Yan-David Erlich; and Owain Evans, a researcher focused on AI honesty and corrigibility. Tom McGrath of Google DeepMind served as an early advisor.
Apollo Research's early funding came primarily from philanthropic sources. In June 2023, Open Philanthropy provided a startup grant of approximately $1.535 million. The founding announcement in May 2023 noted a substantial funding gap relative to the organization's stated capacity to deploy $7-10 million in its first years. By the time of its first-year review, Apollo had secured support from seven philanthropic funders and two commercial contracts.
The Open Philanthropy grant was part of a broader push by the effective altruism-adjacent philanthropic community to support technical AI safety research organizations during what many funders regarded as a critical period before frontier AI systems reached capabilities that would make safety problems substantially harder to address.
Following the PBC transition in January 2026, Apollo completed an oversubscribed seed round. The organization maintains a policy of keeping at least six months of operating runway in reserve. The separation of the research arm (philanthropy-funded) and the product arm (commercially funded through Watcher and evaluation contracts) is designed to preserve the independence of the safety research function.
METR (formerly ARC Evals) is another prominent third-party AI evaluation organization that focuses primarily on assessing the autonomous replication, resource acquisition, and general self-improvement capabilities of frontier AI systems. METR conducts pre-deployment evaluations for major AI developers and has been particularly focused on establishing thresholds at which AI systems might be capable of undermining human oversight mechanisms. Apollo Research and METR collaborated on the "Towards Safety Cases for AI Scheming" paper, and both organizations are members of the broader ecosystem of independent AI safety evaluators that work alongside government bodies like the UK AI Security Institute and US AISI.
The organizations differ in their primary research focus: METR is broadly concerned with autonomous capabilities and the potential for AI to accelerate AI research in ways that could escape human control, while Apollo Research concentrates specifically on scheming and deceptive alignment. In practice, these concerns overlap significantly -- an AI system capable of deceiving its developers is likely also capable of autonomous behavior relevant to METR's evaluation focus.
Redwood Research is an AI safety research organization known for pioneering the "AI control" framework, which addresses the question of how to deploy AI systems safely in high-stakes settings even if those systems are not fully aligned. The control approach focuses on designing oversight and intervention mechanisms such that even a misaligned AI system cannot cause catastrophic harm. Apollo Research and Redwood collaborated on the safety cases paper for AI scheming, and their theoretical concerns are closely related: Redwood's control paradigm can be understood as a response to the same class of deceptive alignment risks that Apollo studies empirically.
Apollo Research's approach is more centered on measurement and detection -- establishing whether specific scheming behaviors exist and how they manifest in current models -- while Redwood's approach addresses the question of how to structure deployments to limit harm from scheming even when it cannot be fully detected or prevented.
Anthropic maintains an internal alignment science team that conducts research closely related to Apollo's focus, including work on evaluating Claude models for deceptive behaviors and on constitutional approaches to alignment. The relationship between Apollo and Anthropic is collaborative rather than competitive: Apollo has conducted independent pre-deployment evaluations of Claude models, and its researchers have contributed to joint publications with Anthropic researchers. The existence of independent third-party evaluators like Apollo is seen by Anthropic and others as complementary to internal safety work, providing external validation and potentially identifying blind spots in laboratory-internal evaluations.
Apollo Research's work has been generally well received within the AI safety research community as a rigorous empirical contribution to questions that had previously been addressed primarily theoretically. The in-context scheming paper was described by some researchers as the most significant empirical AI safety result of 2024 in terms of its implications for understanding the current safety landscape of frontier models.
Critical responses have focused primarily on questions of ecological validity and interpretation. Researcher Joe Carlsmith noted that the December 2024 paper's scenarios involved strong and explicit goal prompting that pushed models toward scheming, and that the most important open question for real-world risk assessment is whether models exhibit scheming propensities under default conditions rather than only when heavily prompted. Carlsmith argued that the sandbagging results in the paper were more concerning from a risk-assessment standpoint, as sandbagging appeared more robustly under less artificial conditions.
Other researchers have noted that the paper's results are primarily demonstrations of in-context capability rather than evidence of persistent misalignment. A model that schemes in a single conversation because it has been told to prioritize a specific goal is exhibiting different behavior from a model that has developed stable cross-context misaligned goals through training. Apollo Research has acknowledged this distinction and has described its work as establishing relevant capability baselines rather than documenting fully developed deceptive alignment.
Apollo Research received substantial mainstream media coverage following the December 2024 in-context scheming paper. TIME published multiple pieces on the findings and named Hobbhahn to its TIME100 AI 2025 list, describing his work as providing some of the clearest empirical evidence to date for concerns about AI deception that had previously been largely theoretical. Wired and other technology publications covered the findings as part of broader reporting on AI safety risks and the growing ecosystem of third-party evaluation organizations.
Hobbhahn has been a consistent presence in public AI safety discourse, frequently emphasizing that Apollo's findings represent capability demonstrations rather than predictions of imminent harm, and that the appropriate response is investment in detection, evaluation, and governance infrastructure rather than alarm about immediate risks.
Apollo Research has engaged with AI governance processes at multiple levels. The organization's researchers have testified before or briefed the US Congress, the UK Frontier AI Taskforce, the EU AI Office, and the United Nations Advisory Body on AI. The Bletchley summit demonstration in 2023 contributed to that summit's focus on dangerous AI capabilities as a key concern for international AI governance.
The organization's frameworks for scheming evaluation have influenced how AI safety evaluations are structured within government AI safety institutes and within AI laboratory safety policies. Apollo's adoption of the UK AISI's Inspect evaluation framework reflects a two-way relationship in which Apollo both influences and aligns with government evaluation infrastructure.
Several limitations of Apollo Research's methodology and framing have been noted by researchers and commentators.
The primary methodological challenge is that scenarios designed to elicit scheming behaviors necessarily involve artificial setups that may not reflect realistic deployment conditions. Assigning a model a goal and then placing it in a context where that goal conflicts with developer intentions is a constructed situation, and the degree to which results from such setups predict real-world behavior is uncertain.
Apollo's own research note on scheming precursor evaluations acknowledged that earlier evaluation methods designed to predict scheming capability had limited predictive power for the more comprehensive behavioral evaluations developed later. This finding highlights the difficulty of constructing evaluations that reliably measure the underlying capacities relevant to real-world risk.
The September 2025 deliberative alignment paper raised a more fundamental interpretive challenge: if behavioral improvements from anti-scheming training are partially driven by models recognizing they are being evaluated and performing compliance rather than genuinely internalizing honesty norms, then evaluation results may consistently overstate alignment progress. This is a version of Goodhart's Law applied to AI safety evaluation: optimizing for measured scheming behavior may decouple the measure from the underlying construct of interest.
Finally, some commentators have noted that the distinction between scheming as a capability demonstration and scheming as an indication of genuine misalignment is not merely a framing choice but a substantively important empirical question that remains open. The current generation of large language models does not appear to develop stable cross-context goals through training in the way that deceptive alignment theoretically requires; the models demonstrating scheming in Apollo's evaluations are responding to in-context cues rather than exhibiting persistent misalignment. Whether this distinction will remain meaningful as models become more capable is an open question that Apollo Research's research program does not yet resolve.