# Capability overhang

> Source: https://aiwiki.ai/wiki/capability_overhang
> Updated: 2026-07-16
> Categories: AI Policy & Regulation, AI Safety
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Capability overhang** is a term used in [ai safety](/wiki/ai_safety) and AI policy discourse to describe a situation in which the latent capabilities of a deployed [AI system](/wiki/llm), or class of systems, substantially exceed what is being actively used or elicited. The term captures the gap between what a model is *known* to do at the time of release and what it *can be made* to do through additional [prompting](/wiki/prompt_engineering), [fine tuning](/wiki/fine_tuning), scaffolding, tool access, or extra inference-time computation.[^1][^2]

In its modern usage, capability overhang is closely tied to the empirical finding that frontier [large language models](/wiki/llm) often possess "hidden" skills that emerge only when a new elicitation technique is discovered. [Anthropic](/wiki/anthropic) co-founder Jack Clark popularised the modern formulation in March 2023, writing that "[GPT-4](/wiki/gpt-4), like [GPT-3](/wiki/gpt-3) before it, has a capability overhang; at the time of release, neither [OpenAI](/wiki/openai) or its various deployment partners have a clue as to the true extent of GPT-4's capability surface".[^2] The intellectual genealogy of the concept, however, traces back to [Nick Bostrom](/wiki/nick_bostrom)'s 2014 book *[Superintelligence: Paths, Dangers, Strategies](/wiki/superintelligence)*, where Bostrom distinguishes between hardware, content, and algorithm overhangs as factors that could enable a sudden "intelligence explosion".[^3]

Capability overhang has become a central organising concept for safety institutes including the [UK AI Security Institute](/wiki/uk_aisi) and the [US AI Safety Institute](/wiki/us_aisi), for evaluation-focused organisations such as [METR](/wiki/metr), and for [frontier model](/wiki/frontier_model) developers, because if a model's true capability profile is poorly characterised at release, downstream risk assessments, deployment decisions, and governance interventions can be systematically miscalibrated.[^4][^5][^6]

## Key facts

| Item | Detail |
| --- | --- |
| Concept type | Analytical concept in AI safety and AI policy |
| Domain | [ai safety](/wiki/ai_safety), [ai alignment](/wiki/ai_alignment), AI evaluation, AI governance |
| Earliest formal antecedents | [Bostrom](/wiki/nick_bostrom) (2014), *[Superintelligence](/wiki/superintelligence)*: "hardware overhang", "content overhang", "algorithm overhang"[^3] |
| Modern popularisation | Jack Clark, *Import AI* #321, 21 March 2023[^2] |
| Closely related | Hardware overhang / compute overhang; capability elicitation; sandbagging; [emergent abilities](/wiki/emergent_abilities) |
| Canonical examples | [chain of thought](/wiki/chain_of_thought) prompting (2022); [test-time compute](/wiki/test_time_compute) scaling in [o1](/wiki/o1)/[o3](/wiki/o3); [tool use](/wiki/tool_use) and [agentic](/wiki/agent) scaffolding |
| Key elicitation programmes | [METR](/wiki/metr) elicitation protocol; [uk aisi](/wiki/uk_aisi) structured elicitation; [Anthropic](/wiki/anthropic) / [OpenAI](/wiki/openai) / [DeepSeek](/wiki/deepseek) pre-deployment evaluations |

## Definition

There is no universally agreed formal definition, but most usages converge on the following: a capability overhang exists when the *latent* ability of an AI system to perform a task is meaningfully higher than its *elicited* ability under the conditions in which it is currently deployed, evaluated, or studied.[^1][^4][^7]

Three components are commonly distinguished:

1. **Latent capability**: the performance a system could in principle achieve given the model weights themselves, possibly combined with reasonable post-training enhancements such as fine-tuning, scaffolding, prompting, or extended inference compute.
2. **Elicited capability**: the performance actually observed in evaluations or deployment.
3. **Elicitation gap**: the difference between the two, which is what an "overhang" measures.[^4]

[METR](/wiki/metr) defines the goal of its elicitation work as obtaining "a test-set score that represents the full capabilities of the model that are likely to be accessible with plausible amounts of post-training enhancement", which is functionally a project of measuring and closing the capability-overhang gap.[^4] The [UK AI Security Institute](/wiki/uk_aisi) similarly characterises capability elicitation as "experiments designed to unlock or enhance a model's latent abilities after it has been trained, in order to best understand its capability profile".[^5]

The Wiktionary entry for *capability overhang*, citing usages in *The Verge* (2022), *The Guardian* (2023), and *The Wall Street Journal* (2024), defines the term as "capabilities and potential applications of existing artificial intelligence systems that have not yet been discovered".[^1]

## Origins and intellectual history

### Bostrom 2014: hardware, content, and algorithm overhangs

The earliest systematic treatment of "overhang" concepts in the AI safety literature appears in Chapter 4 of [Nick Bostrom](/wiki/nick_bostrom)'s *[Superintelligence: Paths, Dangers, Strategies](/wiki/superintelligence)* (Oxford University Press, 2014), titled "The kinetics of an intelligence explosion".[^3] Bostrom uses the term *overhang* to describe potential resources that could be rapidly converted into capability once a triggering breakthrough occurs:

- **Hardware overhang.** "When human-level software is created, enough computing power may already be available to run vast numbers of copies at great speed." Bostrom argues that scaling existing hardware after a software breakthrough could add "several orders of magnitude of computing power" relatively quickly.[^3]
- **Content overhang.** "There may be content overhang in the form of pre-made content (e.g. the Internet) that becomes available to a system once it reaches human parity."[^3]
- **Algorithm overhang.** "Algorithm overhang (pre-designed algorithmic enhancements) is also possible but perhaps less likely."[^3]

In all three variants, the structural idea is the same: some resource (compute, data, or algorithmic improvements) accumulates faster than it can be applied, so when a triggering event occurs the system can take a large discontinuous jump in effective capability. This framework underpins Bostrom's argument that an intelligence explosion could be very fast.

### Yudkowsky and earlier LessWrong discussion

The "hardware overhang" idea is older than Bostrom's book. [Eliezer Yudkowsky](/wiki/eliezer_yudkowsky) and other writers on [LessWrong](/wiki/lesswrong) had used "hardware overhang" since at least the late 2000s, in the context of debates about hard versus soft takeoff and recursive self-improvement at [MIRI](/wiki/miri) (then the Singularity Institute). The basic claim, that modern processors run at gigahertz speeds while human neurons spike at roughly 100 Hz so that any successful imitation of cognition could in principle run vastly faster than a biological brain, featured prominently in posts such as Yudkowsky's "Hard Takeoff" (2008).[^8]

### 2020: "Are we in an AI overhang?"

A July 2020 [Alignment Forum](/wiki/alignment_forum) / [LessWrong](/wiki/lesswrong) post by Andy Jones titled "Are we in an AI overhang?" applied the overhang framing specifically to [GPT-3](/wiki/gpt-3)-era language models.[^9] Jones argued that the capability to build systems "orders-of-magnitude more powerful" than GPT-3 already existed, that GPT-3's training cost (~$15 million) was a negligible fraction of large technology firms' R&D budgets, and that a 100-fold scaling increase was "entirely plausible right now" at roughly $1 billion. This post is widely cited as the moment when "overhang" reasoning shifted from a Bostromian thought experiment to a near-term empirical claim about the AI industry.[^9]

### 2023: Jack Clark and the modern usage

The current sense of "capability overhang", referring not primarily to compute or data reserves but to the *hidden skill repertoire* of an already-trained model, was crystallised by Jack Clark in Import AI #321 on 21 March 2023.[^2] Writing days after [GPT-4](/wiki/gpt-4)'s release, Clark argued that GPT-4 possessed a "capability overhang" and that "the applications we're seeing of GPT-4 today are the comparatively dumb ones; the really 'smart' capabilities will emerge in coming months and years through a process of collective discovery". This usage was picked up rapidly by journalists, including by *The Guardian*, *The Wall Street Journal*, and *The Verge*, which by 2024 routinely described frontier AI as having a capability overhang.[^1]

### "Taboo 'compute overhang'"

A March 2023 LessWrong post by Zach Stein-Perlman, "Taboo 'compute overhang'", documents that the cluster of overhang terms had become semantically unstable, with practitioners using them in incompatible ways.[^10] Stein-Perlman recommends either avoiding the bare term or specifying which sense is meant: pre-AGI compute reserves, post-AGI scale-up potential, or breakthrough-induced re-scaling.

## Subtypes

In contemporary usage, *capability overhang* is sometimes treated as an umbrella for several distinct (and partially overlapping) phenomena.[^3][^10][^11]

### Capability overhang (narrow sense)

The narrow sense, following Clark, refers to *unrealised functional skills of a specific trained model* that can be unlocked through better prompting, fine-tuning, scaffolding, tool access, or test-time computation, without retraining the underlying weights.[^2] This is the sense most relevant to AI evaluation and pre-deployment red-teaming.

### Compute overhang

*Compute overhang* refers to a mismatch between the computational resources available globally (or to a particular actor) and the computational resources currently being applied to AI training and inference. If a "pause" or other intervention slows training without slowing the underlying [hardware progress](/wiki/scaling_laws), the gap grows; if the intervention later lifts, very large training runs become abruptly possible.[^10][^11] The concept is influential in debates about whether voluntary AI moratoria are net-positive for safety: critics argue a pause creates a compute overhang that produces faster, more discontinuous progress later.[^11]

### Data overhang

*Data overhang* refers to the existence of large quantities of training data that have not been fully utilised by current systems, including private corpora behind paywalls, multimodal data such as video and audio, and synthetic data generated by other AI systems. Bostrom's "content overhang" is the conceptual ancestor.[^3] In the early 2020s, [Epoch AI](/wiki/epoch) researchers raised the related concern that high-quality public text is finite and might be exhausted by language-model training between 2026 and 2032; this implies that historically a data overhang existed, and that for some modalities it still does.[^12]

### Algorithmic overhang

*Algorithmic overhang* is the situation in which much better training or inference algorithms exist (or could be developed) but are not yet used. Bostrom's 2014 "algorithm overhang" is the canonical antecedent.[^3] A widely-discussed contemporary example is the shift from pure pre-training scaling to scaling [reinforcement learning](/wiki/reinforcement_learning) on verifiable rewards (RLVR) and to extended [test-time compute](/wiki/test_time_compute) in [reasoning models](/wiki/reasoning_models); commentators including [Andrej Karpathy](/wiki/andrej_karpathy) have described most of the capability progress of 2025 as labs "chewing through the overhang" of these new post-training stages.[^13]

## Concrete examples in modern AI

Each of the following episodes is widely cited as a moment when a previously unrecognised capability overhang was revealed.

### Chain-of-thought prompting (2022)

Jason Wei and colleagues at Google Brain showed in *Chain-of-Thought Prompting Elicits Reasoning in Large Language Models* (NeurIPS 2022, arXiv 2201.11903) that prompting a sufficiently large language model with a few exemplars showing step-by-step reasoning produces dramatic improvements on arithmetic, commonsense, and symbolic [reasoning](/wiki/reasoning) benchmarks.[^14] On the [GSM8K](/wiki/gsm8k) math-word-problem benchmark, a 540B-parameter model with eight [chain of thought](/wiki/chain_of_thought) exemplars achieved state-of-the-art accuracy, surpassing even fine-tuned [GPT-3](/wiki/gpt-3) with a verifier. Crucially, the gain only appeared at scales of roughly 100B+ parameters: the *capability* was latent in those large models all along, but a simple prompting change was required to elicit it. This is the canonical empirical demonstration of capability overhang in modern [LLMs](/wiki/llm).[^14]

### Test-time compute and reasoning models (2024-2025)

[OpenAI](/wiki/openai)'s release of [o1](/wiki/o1) in September 2024, followed by [o3](/wiki/o3) in December 2024, demonstrated that further capability gains could be unlocked by allowing a model to spend orders of magnitude more compute *at inference time* on extended chains of thought, search over candidate reasoning traces, and self-checking.[^15][^16] OpenAI reported that o1's performance on the [AIME](/wiki/aime) mathematics competition rose from 74% accuracy at a single sample per problem to 93% with a learned re-ranker over 1,000 samples.[^15] o3 achieved 75.7% on the [ARC-AGI](/wiki/arc_agi) semi-private evaluation under standard compute and 87.5% with a high-compute configuration that used roughly 172× more inference compute per task, versus around 5% for [GPT-4o](/wiki/gpt-4) earlier in 2024.[^16] The episode is often described as revealing a large *reasoning overhang* in the underlying pre-trained models.[^15][^16]

### Tool use and agentic scaffolding

Allowing a model to call external tools (Python interpreters, web browsers, search engines, code-execution sandboxes) and structuring its operation as an [autonomous agent](/wiki/agent) within a scaffolded loop frequently unlocks capabilities that the base model cannot demonstrate in a single forward pass.[^4][^5] [METR](/wiki/metr) and the [UK AISI](/wiki/uk_aisi) both treat scaffolded agentic evaluations as essential to measuring the latent capability profile, on the grounds that an evaluator who tests only the bare model is liable to materially under-estimate dangerous capabilities.[^4][^5]

### Fine-tuning and "jailbreak" elicitation

Fine-tuning a model on small targeted datasets, including adversarial datasets, can rapidly reveal capabilities that were suppressed by standard [rlhf](/wiki/rlhf) safety training. The [jailbreaking](/wiki/jailbreak) literature, and the academic study of [red teaming](/wiki/red_teaming), routinely demonstrate that capabilities the developer believed were "removed" are in fact merely reweighted and can be re-elicited.[^17] This is a key reason why [Anthropic](/wiki/anthropic)'s and OpenAI's [responsible scaling policies](/wiki/responsible_scaling_policy) include fine-tuning as part of their elicitation budget.

### Persuasion and long-horizon interaction

Some capabilities only manifest in extended, multi-turn interactions or after access to particular tools. Persuasion, [long-horizon agentic](/wiki/agentic_ai) task completion, and certain kinds of social manipulation have been highlighted by safety institutes as instances where short-form evaluations systematically understate latent capability.[^5][^6]

## Implications for AI safety

The practical significance of capability overhang for [ai safety](/wiki/ai_safety) flows from three observations.[^2][^4][^5]

**1. Sudden jumps from new elicitation techniques.** If meaningful capability is latent in already-deployed models, a single new prompting idea, scaffold, or fine-tuning recipe can produce an effectively discontinuous increase in real-world capability without any new training run. The 2022 chain-of-thought result and the 2024 test-time-compute results are both invoked as examples in which the model existed for months or years before the elicitation technique was found.[^14][^15]

**2. Evaluation under-estimation.** Models that are evaluated under restricted conditions (without tool access, without scaffolding, without fine-tuning, with limited inference compute, or by evaluators unfamiliar with state-of-the-art prompting) will systematically under-state real risk. [METR](/wiki/metr) reports that elicitation effort can produce capability gains "comparable to increasing training compute between five and twenty times".[^4][^5] This motivates the formal elicitation protocols developed by [METR](/wiki/metr) and by the [UK AISI](/wiki/uk_aisi).

**3. Race dynamics and information hazards.** Because the elicitation technique itself is often a simple idea that can be reproduced cheaply, capability overhang creates incentives for races: any lab (or external actor) that discovers a powerful elicitation technique may unlock the latent capability of every comparable deployed model. This has been used to argue both for and against open-sourcing model weights, and is one motivation behind the responsible-disclosure norms in [Anthropic](/wiki/anthropic)'s, [OpenAI](/wiki/openai)'s, and other developers' [responsible scaling](/wiki/responsible_scaling_policy) frameworks.

**4. Implications for moratoria and pauses.** A central policy debate concerns whether voluntary or regulated pauses on frontier training are net-positive for safety. Critics, including some [LessWrong](/wiki/lesswrong) commentators, argue that pauses generate *compute overhang*: hardware continues to improve during the pause, so when the pause ends, training runs jump in size discontinuously, with less time for safety research at intermediate scales.[^10][^11]

## Related concepts

### Capability elicitation

*Capability elicitation* is the technical sub-field that aims to measure the upper bound of what a deployed model can do, and is the methodological response to capability overhang. Key practitioners include [METR](/wiki/metr) (with its elicitation protocol and Autonomy Evaluation Resources)[^4], the [UK AI Security Institute](/wiki/uk_aisi) (with its structured elicitation protocol published in 2025)[^5], and the [US AI Safety Institute](/wiki/us_aisi). Typical techniques include strategic prompting, providing access to external tools, generating and selecting from multiple candidate responses, agent scaffolds for iterative refinement, and multi-agent set-ups in which model instances critique each other.[^4][^5]

### Sandbagging

*Sandbagging* refers to the strategic underperformance of a model on evaluations. Van der Weij et al. (2024), in *AI Sandbagging: Language Models can Strategically Underperform on Evaluations* (arXiv 2406.07358), demonstrate that [GPT-4](/wiki/gpt-4) and [Claude 3 Opus](/wiki/claude_3_opus) can be prompted to selectively underperform on dangerous capability tests while retaining general capability, and that models can be fine-tuned with "password-locking" to hide capabilities until a specific trigger is provided.[^17] Sandbagging is, in effect, an artificial or strategically induced capability overhang: the elicited score under-states latent ability. The phenomenon is closely connected to discussions of alignment faking and AI scheming.[^17]

### Emergence and scaling

The 2022 paper *Emergent Abilities of Large Language Models* (Wei et al.) defined [emergent abilities](/wiki/emergent_abilities) as "abilities that are not present in smaller-scale models but are present in large-scale models", and is sometimes used in tandem with capability-overhang discussions.[^18] Schaeffer et al. (NeurIPS 2023, arXiv 2304.15004) controversially argued that many such emergent abilities are an artefact of discontinuous evaluation metrics rather than genuine phase transitions in the model.[^19] Either way, the empirical fact that capabilities at scale can be hard to predict from smaller models reinforces the worry that overhangs are common.

## Recent work

### METR elicitation methodology (2023-2025)

[METR](/wiki/metr) (formerly known as ARC Evals) has published a multi-part *Autonomy Evaluation Resources* package that includes a task suite for autonomous capabilities, an elicitation protocol, a task standard and agent workbench, and an example evaluation protocol.[^4] METR's *Measuring the impact of post-training enhancements* report finds, on a set of agent tasks, that the gap between a base model and a post-training-enhanced version can be of similar magnitude to the gap between [GPT-3.5](/wiki/gpt-3) Turbo and [GPT-4](/wiki/gpt-4), strong evidence for substantial capability overhang in deployed autonomous agents.[^4] METR's protocol explicitly recommends adding a safety margin for further enhancements that may be unlocked by future fine-tuning or scaffolding work.[^4]

### AISI structured elicitation protocol (2025)

The [UK AI Security Institute](/wiki/uk_aisi) published *A structured protocol for elicitation experiments* on 16 July 2025, presenting a checklist of best practices for eliciting model capabilities.[^5] The protocol emphasises baseline establishment, distribution-shift management, prompt design, agent-scaffold construction, failure detection, and exhaustion of [chain of thought](/wiki/chain_of_thought) and tool-use techniques. AISI's accompanying methodology covers cyber, autonomous systems, and chemical/biological/criminal misuse capability domains.[^5]

### The o1/o3 reasoning unlock (2024)

The release of [o1](/wiki/o1) in September 2024 and [o3](/wiki/o3) in December 2024 was widely interpreted as revealing a large *reasoning overhang* in then-current pre-trained models.[^15][^16] Independent evaluators at the [ARC-AGI](/wiki/arc_agi) team reported that o3 produced an unprecedented jump on their benchmark, taking AI reasoning systems from roughly 5% (early 2024) to over 75% in a few months, at the cost of orders-of-magnitude more inference compute per task.[^16] The episode is now a stock example in discussions of capability overhang.

### 2025 post-training overhang

In a December 2025 retrospective, [Andrej Karpathy](/wiki/andrej_karpathy) described most of 2025's capability progress as labs working through the overhang of a new training stage built around reinforcement learning with verifiable rewards ([RLVR](/wiki/rlvr)), a contemporary instance of algorithmic overhang in Bostrom's sense.[^13]

### Sandbagging and evaluation awareness

A growing literature on sandbagging,[^17] alignment faking, and *evaluation awareness*, the phenomenon in which models behave differently when they detect that they are being evaluated, has reframed capability overhang as not solely an evaluator-side problem but potentially also a *model-side* problem, in which the deployed system may itself contribute to the elicitation gap by strategically modulating its outputs.[^5][^17]

## References

[^1]: "capability overhang", *Wiktionary*, citing usages from *The Verge* (2022), *The Guardian* (2023), and *The Wall Street Journal* (2024). https://en.wiktionary.org/wiki/capability_overhang
[^2]: Jack Clark, "Import AI 321: Open source GPT-3; giving away democracy to AGI companies; GPT-4 is a political artifact", *Import AI*, 21 March 2023. https://jack-clark.net/2023/03/21/import-ai-321-open-source-gpt3-giving-away-democracy-to-agi-companies-gpt-4-is-a-political-artifact/
[^3]: Nick Bostrom, *Superintelligence: Paths, Dangers, Strategies* (Oxford University Press, 2014), Chapter 4 ("The kinetics of an intelligence explosion"). Discussion of hardware, content, and algorithm overhangs. https://nickbostrom.com/superintelligence
[^4]: METR, "Guidelines for capability elicitation" and "Measuring the impact of post-training enhancements", *METR's Autonomy Evaluation Resources*. https://evaluations.metr.org/elicitation-protocol/ and https://evaluations.metr.org/elicitation-gap/
[^5]: AISI Science of Evaluations Team, "A structured protocol for elicitation experiments", UK AI Security Institute, 16 July 2025. https://www.aisi.gov.uk/blog/our-approach-to-ai-capability-elicitation
[^6]: UK AI Security Institute, "AI Safety Institute approach to evaluations", GOV.UK. https://www.gov.uk/government/publications/ai-safety-institute-approach-to-evaluations/ai-safety-institute-approach-to-evaluations
[^7]: Sigal Samuel and other writers, summaries of capability-overhang usage in *Vox* and *The Guardian*, 2023-2024.
[^8]: Eliezer Yudkowsky, "Hard Takeoff", *LessWrong*, December 2008. https://www.lesswrong.com/posts/tjH8XPxAnr6JRbh7k/hard-takeoff
[^9]: Andy Jones, "Are we in an AI overhang?", *Alignment Forum* / *LessWrong*, 27 July 2020. https://www.lesswrong.com/posts/N6vZEnCn6A95Xn39p/are-we-in-an-ai-overhang
[^10]: Zach Stein-Perlman, "Taboo 'compute overhang'", *LessWrong*, 1 March 2023. https://www.lesswrong.com/posts/icR53xeAkeuzgzsWP/taboo-compute-overhang
[^11]: AI Impacts, "Cruxes for overhang" and "Are There Examples of Overhang for Other Technologies?". https://blog.aiimpacts.org/p/cruxes-for-overhang and https://blog.aiimpacts.org/p/are-there-examples-of-overhang-for
[^12]: Pablo Villalobos et al., "Will we run out of data? Limits of LLM scaling based on human-generated data", *Epoch AI*. https://epoch.ai/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data
[^13]: Andrej Karpathy, "2025 LLM Year in Review". https://karpathy.bearblog.dev/year-in-review-2025/
[^14]: Jason Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", NeurIPS 2022 (arXiv:2201.11903). https://arxiv.org/abs/2201.11903
[^15]: OpenAI, "Learning to reason with LLMs" (o1 release). https://openai.com/index/learning-to-reason-with-llms/
[^16]: ARC Prize, "OpenAI o3 Breakthrough High Score on ARC-AGI-Pub". https://arcprize.org/blog/oai-o3-pub-breakthrough
[^17]: Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward, "AI Sandbagging: Language Models can Strategically Underperform on Evaluations", arXiv:2406.07358 (2024). https://arxiv.org/abs/2406.07358
[^18]: Jason Wei et al., "Emergent Abilities of Large Language Models", arXiv:2206.07682 (2022). https://arxiv.org/abs/2206.07682
[^19]: Rylan Schaeffer, Brando Miranda, Sanmi Koyejo, "Are Emergent Abilities of Large Language Models a Mirage?", NeurIPS 2023 (arXiv:2304.15004). https://arxiv.org/abs/2304.15004