Capability overhang

AI Policy & Regulation AI Safety

16 min read

Updated Jul 16, 2026

Suggest edit History Talk

RawGraph

Last edited

Jul 16, 2026

Fact-checked

In review queue

Sources

19 citations

Revision

v6 · 3,293 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Capability overhang is a term used in ai safety and AI policy discourse to describe a situation in which the latent capabilities of a deployed AI system, or class of systems, substantially exceed what is being actively used or elicited. The term captures the gap between what a model is known to do at the time of release and what it can be made to do through additional prompting, fine tuning, scaffolding, tool access, or extra inference-time computation.^[1]^[2]

In its modern usage, capability overhang is closely tied to the empirical finding that frontier large language models often possess "hidden" skills that emerge only when a new elicitation technique is discovered. Anthropic co-founder Jack Clark popularised the modern formulation in March 2023, writing that "GPT-4, like GPT-3 before it, has a capability overhang; at the time of release, neither OpenAI or its various deployment partners have a clue as to the true extent of GPT-4's capability surface".^[2] The intellectual genealogy of the concept, however, traces back to Nick Bostrom's 2014 book Superintelligence: Paths, Dangers, Strategies, where Bostrom distinguishes between hardware, content, and algorithm overhangs as factors that could enable a sudden "intelligence explosion".^[3]

Capability overhang has become a central organising concept for safety institutes including the UK AI Security Institute and the US AI Safety Institute, for evaluation-focused organisations such as METR, and for frontier model developers, because if a model's true capability profile is poorly characterised at release, downstream risk assessments, deployment decisions, and governance interventions can be systematically miscalibrated.^[4]^[5]^[6]

Key facts

Item	Detail
Concept type	Analytical concept in AI safety and AI policy
Domain	ai safety, ai alignment, AI evaluation, AI governance
Earliest formal antecedents	Bostrom (2014), Superintelligence: "hardware overhang", "content overhang", "algorithm overhang"^[3]
Modern popularisation	Jack Clark, Import AI #321, 21 March 2023^[2]
Closely related	Hardware overhang / compute overhang; capability elicitation; sandbagging; emergent abilities
Canonical examples	chain of thought prompting (2022); test-time compute scaling in o1/o3; tool use and agentic scaffolding
Key elicitation programmes	METR elicitation protocol; uk aisi structured elicitation; Anthropic / OpenAI / DeepSeek pre-deployment evaluations

Definition

There is no universally agreed formal definition, but most usages converge on the following: a capability overhang exists when the latent ability of an AI system to perform a task is meaningfully higher than its elicited ability under the conditions in which it is currently deployed, evaluated, or studied.^[1]^[4]^[7]

Three components are commonly distinguished:

Latent capability: the performance a system could in principle achieve given the model weights themselves, possibly combined with reasonable post-training enhancements such as fine-tuning, scaffolding, prompting, or extended inference compute.
Elicited capability: the performance actually observed in evaluations or deployment.
Elicitation gap: the difference between the two, which is what an "overhang" measures.^[4]

METR defines the goal of its elicitation work as obtaining "a test-set score that represents the full capabilities of the model that are likely to be accessible with plausible amounts of post-training enhancement", which is functionally a project of measuring and closing the capability-overhang gap.^[4] The UK AI Security Institute similarly characterises capability elicitation as "experiments designed to unlock or enhance a model's latent abilities after it has been trained, in order to best understand its capability profile".^[5]

The Wiktionary entry for capability overhang, citing usages in The Verge (2022), The Guardian (2023), and The Wall Street Journal (2024), defines the term as "capabilities and potential applications of existing artificial intelligence systems that have not yet been discovered".^[1]

Origins and intellectual history

Bostrom 2014: hardware, content, and algorithm overhangs

The earliest systematic treatment of "overhang" concepts in the AI safety literature appears in Chapter 4 of Nick Bostrom's Superintelligence: Paths, Dangers, Strategies (Oxford University Press, 2014), titled "The kinetics of an intelligence explosion".^[3] Bostrom uses the term overhang to describe potential resources that could be rapidly converted into capability once a triggering breakthrough occurs:

Hardware overhang. "When human-level software is created, enough computing power may already be available to run vast numbers of copies at great speed." Bostrom argues that scaling existing hardware after a software breakthrough could add "several orders of magnitude of computing power" relatively quickly.^[3]
Content overhang. "There may be content overhang in the form of pre-made content (e.g. the Internet) that becomes available to a system once it reaches human parity."^[3]
Algorithm overhang. "Algorithm overhang (pre-designed algorithmic enhancements) is also possible but perhaps less likely."^[3]

In all three variants, the structural idea is the same: some resource (compute, data, or algorithmic improvements) accumulates faster than it can be applied, so when a triggering event occurs the system can take a large discontinuous jump in effective capability. This framework underpins Bostrom's argument that an intelligence explosion could be very fast.

Yudkowsky and earlier LessWrong discussion

The "hardware overhang" idea is older than Bostrom's book. Eliezer Yudkowsky and other writers on LessWrong had used "hardware overhang" since at least the late 2000s, in the context of debates about hard versus soft takeoff and recursive self-improvement at MIRI (then the Singularity Institute). The basic claim, that modern processors run at gigahertz speeds while human neurons spike at roughly 100 Hz so that any successful imitation of cognition could in principle run vastly faster than a biological brain, featured prominently in posts such as Yudkowsky's "Hard Takeoff" (2008).^[8]

2020: "Are we in an AI overhang?"

A July 2020 Alignment Forum / LessWrong post by Andy Jones titled "Are we in an AI overhang?" applied the overhang framing specifically to GPT-3-era language models.^[9] Jones argued that the capability to build systems "orders-of-magnitude more powerful" than GPT-3 already existed, that GPT-3's training cost (~$15 million) was a negligible fraction of large technology firms' R&D budgets, and that a 100-fold scaling increase was "entirely plausible right now" at roughly $1 billion. This post is widely cited as the moment when "overhang" reasoning shifted from a Bostromian thought experiment to a near-term empirical claim about the AI industry.^[9]

2023: Jack Clark and the modern usage

The current sense of "capability overhang", referring not primarily to compute or data reserves but to the hidden skill repertoire of an already-trained model, was crystallised by Jack Clark in Import AI #321 on 21 March 2023.^[2] Writing days after GPT-4's release, Clark argued that GPT-4 possessed a "capability overhang" and that "the applications we're seeing of GPT-4 today are the comparatively dumb ones; the really 'smart' capabilities will emerge in coming months and years through a process of collective discovery". This usage was picked up rapidly by journalists, including by The Guardian, The Wall Street Journal, and The Verge, which by 2024 routinely described frontier AI as having a capability overhang.^[1]

"Taboo 'compute overhang'"

A March 2023 LessWrong post by Zach Stein-Perlman, "Taboo 'compute overhang'", documents that the cluster of overhang terms had become semantically unstable, with practitioners using them in incompatible ways.^[10] Stein-Perlman recommends either avoiding the bare term or specifying which sense is meant: pre-AGI compute reserves, post-AGI scale-up potential, or breakthrough-induced re-scaling.

Subtypes

In contemporary usage, capability overhang is sometimes treated as an umbrella for several distinct (and partially overlapping) phenomena.^[3]^[10]^[11]

Capability overhang (narrow sense)

The narrow sense, following Clark, refers to unrealised functional skills of a specific trained model that can be unlocked through better prompting, fine-tuning, scaffolding, tool access, or test-time computation, without retraining the underlying weights.^[2] This is the sense most relevant to AI evaluation and pre-deployment red-teaming.

Compute overhang

Compute overhang refers to a mismatch between the computational resources available globally (or to a particular actor) and the computational resources currently being applied to AI training and inference. If a "pause" or other intervention slows training without slowing the underlying hardware progress, the gap grows; if the intervention later lifts, very large training runs become abruptly possible.^[10]^[11] The concept is influential in debates about whether voluntary AI moratoria are net-positive for safety: critics argue a pause creates a compute overhang that produces faster, more discontinuous progress later.^[11]

Data overhang

Data overhang refers to the existence of large quantities of training data that have not been fully utilised by current systems, including private corpora behind paywalls, multimodal data such as video and audio, and synthetic data generated by other AI systems. Bostrom's "content overhang" is the conceptual ancestor.^[3] In the early 2020s, Epoch AI researchers raised the related concern that high-quality public text is finite and might be exhausted by language-model training between 2026 and 2032; this implies that historically a data overhang existed, and that for some modalities it still does.^[12]

Algorithmic overhang

Algorithmic overhang is the situation in which much better training or inference algorithms exist (or could be developed) but are not yet used. Bostrom's 2014 "algorithm overhang" is the canonical antecedent.^[3] A widely-discussed contemporary example is the shift from pure pre-training scaling to scaling reinforcement learning on verifiable rewards (RLVR) and to extended test-time compute in reasoning models; commentators including Andrej Karpathy have described most of the capability progress of 2025 as labs "chewing through the overhang" of these new post-training stages.^[13]

Concrete examples in modern AI

Each of the following episodes is widely cited as a moment when a previously unrecognised capability overhang was revealed.

Chain-of-thought prompting (2022)

Jason Wei and colleagues at Google Brain showed in Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (NeurIPS 2022, arXiv 2201.11903) that prompting a sufficiently large language model with a few exemplars showing step-by-step reasoning produces dramatic improvements on arithmetic, commonsense, and symbolic reasoning benchmarks.^[14] On the GSM8K math-word-problem benchmark, a 540B-parameter model with eight chain of thought exemplars achieved state-of-the-art accuracy, surpassing even fine-tuned GPT-3 with a verifier. Crucially, the gain only appeared at scales of roughly 100B+ parameters: the capability was latent in those large models all along, but a simple prompting change was required to elicit it. This is the canonical empirical demonstration of capability overhang in modern LLMs.^[14]

Test-time compute and reasoning models (2024-2025)

OpenAI's release of o1 in September 2024, followed by o3 in December 2024, demonstrated that further capability gains could be unlocked by allowing a model to spend orders of magnitude more compute at inference time on extended chains of thought, search over candidate reasoning traces, and self-checking.^[15]^[16] OpenAI reported that o1's performance on the AIME mathematics competition rose from 74% accuracy at a single sample per problem to 93% with a learned re-ranker over 1,000 samples.^[15] o3 achieved 75.7% on the ARC-AGI semi-private evaluation under standard compute and 87.5% with a high-compute configuration that used roughly 172× more inference compute per task, versus around 5% for GPT-4o earlier in 2024.^[16] The episode is often described as revealing a large reasoning overhang in the underlying pre-trained models.^[15]^[16]

Tool use and agentic scaffolding

Allowing a model to call external tools (Python interpreters, web browsers, search engines, code-execution sandboxes) and structuring its operation as an autonomous agent within a scaffolded loop frequently unlocks capabilities that the base model cannot demonstrate in a single forward pass.^[4]^[5] METR and the UK AISI both treat scaffolded agentic evaluations as essential to measuring the latent capability profile, on the grounds that an evaluator who tests only the bare model is liable to materially under-estimate dangerous capabilities.^[4]^[5]

Fine-tuning and "jailbreak" elicitation

Fine-tuning a model on small targeted datasets, including adversarial datasets, can rapidly reveal capabilities that were suppressed by standard rlhf safety training. The jailbreaking literature, and the academic study of red teaming, routinely demonstrate that capabilities the developer believed were "removed" are in fact merely reweighted and can be re-elicited.^[17] This is a key reason why Anthropic's and OpenAI's responsible scaling policies include fine-tuning as part of their elicitation budget.

Persuasion and long-horizon interaction

Some capabilities only manifest in extended, multi-turn interactions or after access to particular tools. Persuasion, long-horizon agentic task completion, and certain kinds of social manipulation have been highlighted by safety institutes as instances where short-form evaluations systematically understate latent capability.^[5]^[6]

Implications for AI safety

The practical significance of capability overhang for ai safety flows from three observations.^[2]^[4]^[5]

1. Sudden jumps from new elicitation techniques. If meaningful capability is latent in already-deployed models, a single new prompting idea, scaffold, or fine-tuning recipe can produce an effectively discontinuous increase in real-world capability without any new training run. The 2022 chain-of-thought result and the 2024 test-time-compute results are both invoked as examples in which the model existed for months or years before the elicitation technique was found.^[14]^[15]

2. Evaluation under-estimation. Models that are evaluated under restricted conditions (without tool access, without scaffolding, without fine-tuning, with limited inference compute, or by evaluators unfamiliar with state-of-the-art prompting) will systematically under-state real risk. METR reports that elicitation effort can produce capability gains "comparable to increasing training compute between five and twenty times".^[4]^[5] This motivates the formal elicitation protocols developed by METR and by the UK AISI.

3. Race dynamics and information hazards. Because the elicitation technique itself is often a simple idea that can be reproduced cheaply, capability overhang creates incentives for races: any lab (or external actor) that discovers a powerful elicitation technique may unlock the latent capability of every comparable deployed model. This has been used to argue both for and against open-sourcing model weights, and is one motivation behind the responsible-disclosure norms in Anthropic's, OpenAI's, and other developers' responsible scaling frameworks.

4. Implications for moratoria and pauses. A central policy debate concerns whether voluntary or regulated pauses on frontier training are net-positive for safety. Critics, including some LessWrong commentators, argue that pauses generate compute overhang: hardware continues to improve during the pause, so when the pause ends, training runs jump in size discontinuously, with less time for safety research at intermediate scales.^[10]^[11]

Capability elicitation

Capability elicitation is the technical sub-field that aims to measure the upper bound of what a deployed model can do, and is the methodological response to capability overhang. Key practitioners include METR (with its elicitation protocol and Autonomy Evaluation Resources)^[4], the UK AI Security Institute (with its structured elicitation protocol published in 2025)^[5], and the US AI Safety Institute. Typical techniques include strategic prompting, providing access to external tools, generating and selecting from multiple candidate responses, agent scaffolds for iterative refinement, and multi-agent set-ups in which model instances critique each other.^[4]^[5]

Sandbagging

Sandbagging refers to the strategic underperformance of a model on evaluations. Van der Weij et al. (2024), in AI Sandbagging: Language Models can Strategically Underperform on Evaluations (arXiv 2406.07358), demonstrate that GPT-4 and Claude 3 Opus can be prompted to selectively underperform on dangerous capability tests while retaining general capability, and that models can be fine-tuned with "password-locking" to hide capabilities until a specific trigger is provided.^[17] Sandbagging is, in effect, an artificial or strategically induced capability overhang: the elicited score under-states latent ability. The phenomenon is closely connected to discussions of alignment faking and AI scheming.^[17]

Emergence and scaling

The 2022 paper Emergent Abilities of Large Language Models (Wei et al.) defined emergent abilities as "abilities that are not present in smaller-scale models but are present in large-scale models", and is sometimes used in tandem with capability-overhang discussions.^[18] Schaeffer et al. (NeurIPS 2023, arXiv 2304.15004) controversially argued that many such emergent abilities are an artefact of discontinuous evaluation metrics rather than genuine phase transitions in the model.^[19] Either way, the empirical fact that capabilities at scale can be hard to predict from smaller models reinforces the worry that overhangs are common.

Recent work

METR elicitation methodology (2023-2025)

METR (formerly known as ARC Evals) has published a multi-part Autonomy Evaluation Resources package that includes a task suite for autonomous capabilities, an elicitation protocol, a task standard and agent workbench, and an example evaluation protocol.^[4] METR's Measuring the impact of post-training enhancements report finds, on a set of agent tasks, that the gap between a base model and a post-training-enhanced version can be of similar magnitude to the gap between GPT-3.5 Turbo and GPT-4, strong evidence for substantial capability overhang in deployed autonomous agents.^[4] METR's protocol explicitly recommends adding a safety margin for further enhancements that may be unlocked by future fine-tuning or scaffolding work.^[4]

AISI structured elicitation protocol (2025)

The UK AI Security Institute published A structured protocol for elicitation experiments on 16 July 2025, presenting a checklist of best practices for eliciting model capabilities.^[5] The protocol emphasises baseline establishment, distribution-shift management, prompt design, agent-scaffold construction, failure detection, and exhaustion of chain of thought and tool-use techniques. AISI's accompanying methodology covers cyber, autonomous systems, and chemical/biological/criminal misuse capability domains.^[5]

The o1/o3 reasoning unlock (2024)

The release of o1 in September 2024 and o3 in December 2024 was widely interpreted as revealing a large reasoning overhang in then-current pre-trained models.^[15]^[16] Independent evaluators at the ARC-AGI team reported that o3 produced an unprecedented jump on their benchmark, taking AI reasoning systems from roughly 5% (early 2024) to over 75% in a few months, at the cost of orders-of-magnitude more inference compute per task.^[16] The episode is now a stock example in discussions of capability overhang.

2025 post-training overhang

In a December 2025 retrospective, Andrej Karpathy described most of 2025's capability progress as labs working through the overhang of a new training stage built around reinforcement learning with verifiable rewards (RLVR), a contemporary instance of algorithmic overhang in Bostrom's sense.^[13]

Sandbagging and evaluation awareness

A growing literature on sandbagging,^[17] alignment faking, and evaluation awareness, the phenomenon in which models behave differently when they detect that they are being evaluated, has reframed capability overhang as not solely an evaluator-side problem but potentially also a model-side problem, in which the deployed system may itself contribute to the elicitation gap by strategically modulating its outputs.^[5]^[17]

References

"capability overhang", *Wiktionary*, citing usages from *The Verge* (2022), *The Guardian* (2023), and *The Wall Street Journal* (2024). https://en.wiktionary.org/wiki/capability_overhang ↩
Jack Clark, "Import AI 321: Open source GPT-3; giving away democracy to AGI companies; GPT-4 is a political artifact", *Import AI*, 21 March 2023. https://jack-clark.net/2023/03/21/import-ai-321-open-source-gpt3-giving-away-democracy-to-agi-companies-gpt-4-is-a-political-artifact/ ↩
Nick Bostrom, *Superintelligence: Paths, Dangers, Strategies* (Oxford University Press, 2014), Chapter 4 ("The kinetics of an intelligence explosion"). Discussion of hardware, content, and algorithm overhangs. https://nickbostrom.com/superintelligence ↩
METR, "Guidelines for capability elicitation" and "Measuring the impact of post-training enhancements", *METR's Autonomy Evaluation Resources*. https://evaluations.metr.org/elicitation-protocol/ and https://evaluations.metr.org/elicitation-gap/ ↩
AISI Science of Evaluations Team, "A structured protocol for elicitation experiments", UK AI Security Institute, 16 July 2025. https://www.aisi.gov.uk/blog/our-approach-to-ai-capability-elicitation ↩
UK AI Security Institute, "AI Safety Institute approach to evaluations", GOV.UK. https://www.gov.uk/government/publications/ai-safety-institute-approach-to-evaluations/ai-safety-institute-approach-to-evaluations ↩
Sigal Samuel and other writers, summaries of capability-overhang usage in *Vox* and *The Guardian*, 2023-2024. ↩
Eliezer Yudkowsky, "Hard Takeoff", *LessWrong*, December 2008. https://www.lesswrong.com/posts/tjH8XPxAnr6JRbh7k/hard-takeoff ↩
Andy Jones, "Are we in an AI overhang?", *Alignment Forum* / *LessWrong*, 27 July 2020. https://www.lesswrong.com/posts/N6vZEnCn6A95Xn39p/are-we-in-an-ai-overhang ↩
Zach Stein-Perlman, "Taboo 'compute overhang'", *LessWrong*, 1 March 2023. https://www.lesswrong.com/posts/icR53xeAkeuzgzsWP/taboo-compute-overhang ↩
AI Impacts, "Cruxes for overhang" and "Are There Examples of Overhang for Other Technologies?". https://blog.aiimpacts.org/p/cruxes-for-overhang and https://blog.aiimpacts.org/p/are-there-examples-of-overhang-for ↩
Pablo Villalobos et al., "Will we run out of data? Limits of LLM scaling based on human-generated data", *Epoch AI*. https://epoch.ai/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data ↩
Andrej Karpathy, "2025 LLM Year in Review". https://karpathy.bearblog.dev/year-in-review-2025/ ↩
Jason Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", NeurIPS 2022 (arXiv:2201.11903). https://arxiv.org/abs/2201.11903 ↩
OpenAI, "Learning to reason with LLMs" (o1 release). https://openai.com/index/learning-to-reason-with-llms/ ↩
ARC Prize, "OpenAI o3 Breakthrough High Score on ARC-AGI-Pub". https://arcprize.org/blog/oai-o3-pub-breakthrough ↩
Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward, "AI Sandbagging: Language Models can Strategically Underperform on Evaluations", arXiv:2406.07358 (2024). https://arxiv.org/abs/2406.07358 ↩
Jason Wei et al., "Emergent Abilities of Large Language Models", arXiv:2206.07682 (2022). https://arxiv.org/abs/2206.07682 ↩
Rylan Schaeffer, Brando Miranda, Sanmi Koyejo, "Are Emergent Abilities of Large Language Models a Mirage?", NeurIPS 2023 (arXiv:2304.15004). https://arxiv.org/abs/2304.15004 ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

5 revisions by 1 contributor · full history

Suggest edit

What links here

Responsible Scaling Policy

Key facts

Definition

Origins and intellectual history

Bostrom 2014: hardware, content, and algorithm overhangs

Yudkowsky and earlier LessWrong discussion

2020: "Are we in an AI overhang?"

2023: Jack Clark and the modern usage

"Taboo 'compute overhang'"

Subtypes

Capability overhang (narrow sense)

Compute overhang

Data overhang

Algorithmic overhang

Concrete examples in modern AI

Chain-of-thought prompting (2022)

Test-time compute and reasoning models (2024-2025)

Tool use and agentic scaffolding

Fine-tuning and "jailbreak" elicitation

Persuasion and long-horizon interaction

Implications for AI safety

Related concepts

Capability elicitation

Sandbagging

Emergence and scaling

Recent work

METR elicitation methodology (2023-2025)

AISI structured elicitation protocol (2025)

The o1/o3 reasoning unlock (2024)

2025 post-training overhang

Sandbagging and evaluation awareness

References

Improve this article

Related Articles

Frontier models

AI governance

AI Safety Institutes

The Anthropic Institute

Robot safety

Responsible Scaling Policy

What links here

Related Articles

Frontier models

AI governance

AI Safety Institutes

The Anthropic Institute

Robot safety

Responsible Scaling Policy