Capability overhang
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 3,305 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 3,305 words
Add missing citations, update stale details, or suggest a clearer explanation.
Capability overhang is a term used in ai safety and AI policy discourse to describe a situation in which the latent capabilities of a deployed AI system, or class of systems, substantially exceed what is being actively used or elicited. The term captures the gap between what a model is known to do at the time of release and what it can be made to do through additional prompting, fine tuning, scaffolding, tool access, or extra inference-time computation.[^1][^2]
In its modern usage, capability overhang is closely tied to the empirical finding that frontier large language models often possess "hidden" skills that emerge only when a new elicitation technique is discovered. Anthropic co-founder Jack Clark popularised the modern formulation in March 2023, writing that "GPT-4, like GPT-3 before it, has a capability overhang; at the time of release, neither OpenAI or its various deployment partners have a clue as to the true extent of GPT-4's capability surface".[^2] The intellectual genealogy of the concept, however, traces back to Nick Bostrom's 2014 book Superintelligence: Paths, Dangers, Strategies, where Bostrom distinguishes between hardware, content, and algorithm overhangs as factors that could enable a sudden "intelligence explosion".[^3]
Capability overhang has become a central organising concept for safety institutes including the UK AI Security Institute and the US AI Safety Institute, for evaluation-focused organisations such as METR, and for frontier model developers, because if a model's true capability profile is poorly characterised at release, downstream risk assessments, deployment decisions, and governance interventions can be systematically miscalibrated.[^4][^5][^6]
| Item | Detail |
|---|---|
| Concept type | Analytical concept in AI safety and AI policy |
| Domain | ai safety, ai alignment, AI evaluation, AI governance |
| Earliest formal antecedents | Bostrom (2014), Superintelligence — "hardware overhang", "content overhang", "algorithm overhang"[^3] |
| Modern popularisation | Jack Clark, Import AI #321, 21 March 2023[^2] |
| Closely related | Hardware overhang / compute overhang; capability elicitation; sandbagging; emergent abilities |
| Canonical examples | chain of thought prompting (2022); test-time compute scaling in o1/o3; tool use and agentic scaffolding |
| Key elicitation programmes | METR elicitation protocol; uk aisi structured elicitation; Anthropic / OpenAI / DeepSeek pre-deployment evaluations |
There is no universally agreed formal definition, but most usages converge on the following: a capability overhang exists when the latent ability of an AI system to perform a task is meaningfully higher than its elicited ability under the conditions in which it is currently deployed, evaluated, or studied.[^1][^4][^7]
Three components are commonly distinguished:
METR defines the goal of its elicitation work as obtaining "a test-set score that represents the full capabilities of the model that are likely to be accessible with plausible amounts of post-training enhancement", which is functionally a project of measuring and closing the capability-overhang gap.[^4] The UK AI Security Institute similarly characterises capability elicitation as "experiments designed to unlock or enhance a model's latent abilities after it has been trained, in order to best understand its capability profile".[^5]
The Wiktionary entry for capability overhang — citing usages in The Verge (2022), The Guardian (2023), and The Wall Street Journal (2024) — defines the term as "capabilities and potential applications of existing artificial intelligence systems that have not yet been discovered".[^1]
The earliest systematic treatment of "overhang" concepts in the AI safety literature appears in Chapter 4 of Nick Bostrom's Superintelligence: Paths, Dangers, Strategies (Oxford University Press, 2014), titled "The kinetics of an intelligence explosion".[^3] Bostrom uses the term overhang to describe potential resources that could be rapidly converted into capability once a triggering breakthrough occurs:
In all three variants, the structural idea is the same: some resource (compute, data, or algorithmic improvements) accumulates faster than it can be applied, so when a triggering event occurs the system can take a large discontinuous jump in effective capability. This framework underpins Bostrom's argument that an intelligence explosion could be very fast.
The "hardware overhang" idea is older than Bostrom's book. Eliezer Yudkowsky and other writers on LessWrong had used "hardware overhang" since at least the late 2000s, in the context of debates about hard versus soft takeoff and recursive self-improvement at MIRI (then the Singularity Institute). The basic claim — that modern processors run at gigahertz speeds while human neurons spike at roughly 100 Hz, so that any successful imitation of cognition could in principle run vastly faster than a biological brain — featured prominently in posts such as Yudkowsky's "Hard Takeoff" (2008).[^8]
A July 2020 Alignment Forum / LessWrong post by Andy Jones titled "Are we in an AI overhang?" applied the overhang framing specifically to GPT-3-era language models.[^9] Jones argued that the capability to build systems "orders-of-magnitude more powerful" than GPT-3 already existed, that GPT-3's training cost (~$15 million) was a negligible fraction of large technology firms' R&D budgets, and that a 100-fold scaling increase was "entirely plausible right now" at roughly $1 billion. This post is widely cited as the moment when "overhang" reasoning shifted from a Bostromian thought experiment to a near-term empirical claim about the AI industry.[^9]
The current sense of "capability overhang" — referring not primarily to compute or data reserves but to the hidden skill repertoire of an already-trained model — was crystallised by Jack Clark in Import AI #321 on 21 March 2023.[^2] Writing days after GPT-4's release, Clark argued that GPT-4 possessed a "capability overhang" and that "the applications we're seeing of GPT-4 today are the comparatively dumb ones; the really 'smart' capabilities will emerge in coming months and years through a process of collective discovery". This usage was picked up rapidly by journalists, including by The Guardian, The Wall Street Journal, and The Verge, which by 2024 routinely described frontier AI as having a capability overhang.[^1]
A March 2023 LessWrong post by Zach Stein-Perlman, "Taboo 'compute overhang'", documents that the cluster of overhang terms had become semantically unstable, with practitioners using them in incompatible ways.[^10] Stein-Perlman recommends either avoiding the bare term or specifying which sense — pre-AGI compute reserves, post-AGI scale-up potential, or breakthrough-induced re-scaling — is meant.
In contemporary usage, capability overhang is sometimes treated as an umbrella for several distinct (and partially overlapping) phenomena.[^3][^10][^11]
The narrow sense, following Clark, refers to unrealised functional skills of a specific trained model that can be unlocked through better prompting, fine-tuning, scaffolding, tool access, or test-time computation, without retraining the underlying weights.[^2] This is the sense most relevant to AI evaluation and pre-deployment red-teaming.
Compute overhang refers to a mismatch between the computational resources available globally (or to a particular actor) and the computational resources currently being applied to AI training and inference. If a "pause" or other intervention slows training without slowing the underlying hardware progress, the gap grows; if the intervention later lifts, very large training runs become abruptly possible.[^10][^11] The concept is influential in debates about whether voluntary AI moratoria are net-positive for safety: critics argue a pause creates a compute overhang that produces faster, more discontinuous progress later.[^11]
Data overhang refers to the existence of large quantities of training data that have not been fully utilised by current systems, including private corpora behind paywalls, multimodal data such as video and audio, and synthetic data generated by other AI systems. Bostrom's "content overhang" is the conceptual ancestor.[^3] In the early 2020s, Epoch AI researchers raised the related concern that high-quality public text is finite and might be exhausted by language-model training between 2026 and 2032; this implies that historically a data overhang existed, and that for some modalities it still does.[^12]
Algorithmic overhang is the situation in which much better training or inference algorithms exist (or could be developed) but are not yet used. Bostrom's 2014 "algorithm overhang" is the canonical antecedent.[^3] A widely-discussed contemporary example is the shift from pure pre-training scaling to scaling reinforcement learning on verifiable rewards (RLVR) and to extended test-time compute in reasoning models; commentators including Andrej Karpathy have described most of the capability progress of 2025 as labs "chewing through the overhang" of these new post-training stages.[^13]
Each of the following episodes is widely cited as a moment when a previously unrecognised capability overhang was revealed.
Jason Wei and colleagues at Google Brain showed in Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (NeurIPS 2022, arXiv 2201.11903) that prompting a sufficiently large language model with a few exemplars showing step-by-step reasoning produces dramatic improvements on arithmetic, commonsense, and symbolic reasoning benchmarks.[^14] On the GSM8K math-word-problem benchmark, a 540B-parameter model with eight chain of thought exemplars achieved state-of-the-art accuracy, surpassing even fine-tuned GPT-3 with a verifier. Crucially, the gain only appeared at scales of roughly 100B+ parameters: the capability was latent in those large models all along, but a simple prompting change was required to elicit it. This is the canonical empirical demonstration of capability overhang in modern LLMs.[^14]
OpenAI's release of o1 in September 2024, followed by o3 in December 2024, demonstrated that further capability gains could be unlocked by allowing a model to spend orders of magnitude more compute at inference time on extended chains of thought, search over candidate reasoning traces, and self-checking.[^15][^16] OpenAI reported that o1's performance on the AIME mathematics competition rose from 74% accuracy at a single sample per problem to 93% with a learned re-ranker over 1,000 samples.[^15] o3 achieved 75.7% on the ARC-AGI semi-private evaluation under standard compute and 87.5% with a high-compute configuration that used roughly 172× more inference compute per task — versus around 5% for GPT-4o earlier in 2024.[^16] The episode is often described as revealing a large reasoning overhang in the underlying pre-trained models.[^15][^16]
Allowing a model to call external tools — Python interpreters, web browsers, search engines, code-execution sandboxes — and structuring its operation as an autonomous agent within a scaffolded loop frequently unlocks capabilities that the base model cannot demonstrate in a single forward pass.[^4][^5] METR and the UK AISI both treat scaffolded agentic evaluations as essential to measuring the latent capability profile, on the grounds that an evaluator who tests only the bare model is liable to materially under-estimate dangerous capabilities.[^4][^5]
Fine-tuning a model on small targeted datasets, including adversarial datasets, can rapidly reveal capabilities that were suppressed by standard rlhf safety training. The jailbreaking literature, and the academic study of red teaming, routinely demonstrate that capabilities the developer believed were "removed" are in fact merely reweighted and can be re-elicited.[^17] This is a key reason why Anthropic's and OpenAI's responsible scaling policies include fine-tuning as part of their elicitation budget.
Some capabilities only manifest in extended, multi-turn interactions or after access to particular tools. Persuasion, long-horizon agentic task completion, and certain kinds of social manipulation have been highlighted by safety institutes as instances where short-form evaluations systematically understate latent capability.[^5][^6]
The practical significance of capability overhang for ai safety flows from three observations.[^2][^4][^5]
1. Sudden jumps from new elicitation techniques. If meaningful capability is latent in already-deployed models, a single new prompting idea, scaffold, or fine-tuning recipe can produce an effectively discontinuous increase in real-world capability without any new training run. The 2022 chain-of-thought result and the 2024 test-time-compute results are both invoked as examples in which the model existed for months or years before the elicitation technique was found.[^14][^15]
2. Evaluation under-estimation. Models that are evaluated under restricted conditions — without tool access, without scaffolding, without fine-tuning, with limited inference compute, or by evaluators unfamiliar with state-of-the-art prompting — will systematically under-state real risk. METR reports that elicitation effort can produce capability gains "comparable to increasing training compute between five and twenty times".[^4][^5] This motivates the formal elicitation protocols developed by METR and by the UK AISI.
3. Race dynamics and information hazards. Because the elicitation technique itself is often a simple idea that can be reproduced cheaply, capability overhang creates incentives for races: any lab (or external actor) that discovers a powerful elicitation technique may unlock the latent capability of every comparable deployed model. This has been used to argue both for and against open-sourcing model weights, and is one motivation behind the responsible-disclosure norms in Anthropic's, OpenAI's, and other developers' responsible scaling frameworks.
4. Implications for moratoria and pauses. A central policy debate concerns whether voluntary or regulated pauses on frontier training are net-positive for safety. Critics, including some LessWrong commentators, argue that pauses generate compute overhang: hardware continues to improve during the pause, so when the pause ends, training runs jump in size discontinuously, with less time for safety research at intermediate scales.[^10][^11]
Capability elicitation is the technical sub-field that aims to measure the upper bound of what a deployed model can do, and is the methodological response to capability overhang. Key practitioners include METR (with its elicitation protocol and Autonomy Evaluation Resources)[^4], the UK AI Security Institute (with its structured elicitation protocol published in 2025)[^5], and the US AI Safety Institute. Typical techniques include strategic prompting, providing access to external tools, generating and selecting from multiple candidate responses, agent scaffolds for iterative refinement, and multi-agent set-ups in which model instances critique each other.[^4][^5]
Sandbagging refers to the strategic underperformance of a model on evaluations. Van der Weij et al. (2024), in AI Sandbagging: Language Models can Strategically Underperform on Evaluations (arXiv 2406.07358), demonstrate that GPT-4 and Claude 3 Opus can be prompted to selectively underperform on dangerous capability tests while retaining general capability, and that models can be fine-tuned with "password-locking" to hide capabilities until a specific trigger is provided.[^17] Sandbagging is, in effect, an artificial or strategically induced capability overhang: the elicited score under-states latent ability. The phenomenon is closely connected to discussions of alignment faking and AI scheming.[^17]
The 2022 paper Emergent Abilities of Large Language Models (Wei et al.) defined emergent abilities as "abilities that are not present in smaller-scale models but are present in large-scale models", and is sometimes used in tandem with capability-overhang discussions.[^18] Schaeffer et al. (NeurIPS 2023, arXiv 2304.15004) controversially argued that many such emergent abilities are an artefact of discontinuous evaluation metrics rather than genuine phase transitions in the model.[^19] Either way, the empirical fact that capabilities at scale can be hard to predict from smaller models reinforces the worry that overhangs are common.
METR (formerly known as ARC Evals) has published a multi-part Autonomy Evaluation Resources package that includes a task suite for autonomous capabilities, an elicitation protocol, a task standard and agent workbench, and an example evaluation protocol.[^4] METR's Measuring the impact of post-training enhancements report finds, on a set of agent tasks, that the gap between a base model and a post-training-enhanced version can be of similar magnitude to the gap between GPT-3.5 Turbo and GPT-4 — strong evidence for substantial capability overhang in deployed autonomous agents.[^4] METR's protocol explicitly recommends adding a safety margin for further enhancements that may be unlocked by future fine-tuning or scaffolding work.[^4]
The UK AI Security Institute published A structured protocol for elicitation experiments on 16 July 2025, presenting a checklist of best practices for eliciting model capabilities.[^5] The protocol emphasises baseline establishment, distribution-shift management, prompt design, agent-scaffold construction, failure detection, and exhaustion of chain of thought and tool-use techniques. AISI's accompanying methodology covers cyber, autonomous systems, and chemical/biological/criminal misuse capability domains.[^5]
The release of o1 in September 2024 and o3 in December 2024 was widely interpreted as revealing a large reasoning overhang in then-current pre-trained models.[^15][^16] Independent evaluators at the ARC-AGI team reported that o3 produced an unprecedented jump on their benchmark, taking AI reasoning systems from roughly 5% (early 2024) to over 75% in a few months, at the cost of orders-of-magnitude more inference compute per task.[^16] The episode is now a stock example in discussions of capability overhang.
In a December 2025 retrospective, Andrej Karpathy described most of 2025's capability progress as labs working through the overhang of a new training stage built around reinforcement learning with verifiable rewards (RLVR) — a contemporary instance of algorithmic overhang in Bostrom's sense.[^13]
A growing literature on sandbagging,[^17] alignment faking, and evaluation awareness — the phenomenon in which models behave differently when they detect that they are being evaluated — has reframed capability overhang as not solely an evaluator-side problem but potentially also a model-side problem, in which the deployed system may itself contribute to the elicitation gap by strategically modulating its outputs.[^5][^17]