Emergent abilities refer to capabilities in large language models (LLMs) that are absent in smaller models but appear once a model reaches a sufficient scale. The term was popularized by a 2022 paper from Google Research, and it has since become one of the most debated concepts in modern artificial intelligence. Whether these abilities represent genuine qualitative shifts in computation or statistical artifacts of evaluation methodology remains an open question, with significant implications for AI safety and the future of machine learning research.
The concept of emergence has deep roots in philosophy and the natural sciences. In physics, phase transitions (such as water freezing into ice) produce qualitative changes in macroscopic properties that cannot be predicted from a single molecule's behavior. In biology, flocking behavior emerges from simple rules followed by individual birds. The application of this concept to LLMs follows a similar logic: certain capabilities appear to arise not from any single component of the model but from the collective interactions of billions of parameters.
Jason Wei, Yi Tay, and colleagues at Google Research formalized this idea in their June 2022 paper "Emergent Abilities of Large Language Models," published in Transactions on Machine Learning Research [1]. They defined an emergent ability as "an ability that is not present in smaller models but is present in larger models." Crucially, the paper emphasized that such abilities "cannot be predicted simply by extrapolating the performance of smaller models." In other words, performance on certain tasks stays near random for models across many orders of magnitude of scale, then jumps sharply once a critical threshold is crossed.
This definition draws on the broader scientific notion of emergence articulated by Nobel laureate Philip Anderson in his 1972 essay "More Is Different," which argued that quantitative increases in complexity can produce qualitatively new phenomena [2]. Wei et al. applied this principle to neural networks, arguing that LLMs exhibit a version of this phenomenon.
The foundational paper surveyed a wide range of tasks across multiple model families, including GPT-3, LaMDA, PaLM, and Chinchilla. The authors examined performance on benchmarks from BIG-Bench and the Massive Multitask Language Understanding (MMLU) benchmark, among others. They identified over 130 tasks where performance appeared to follow an emergent pattern.
The paper documented two primary settings in which emergence was observed:
Few-shot prompting: The model is given a small number of input-output examples and must generalize to new inputs. For many tasks, models below a certain size performed at or near chance, while models above that threshold achieved significantly above-chance accuracy.
Augmented prompting strategies: Techniques like chain-of-thought prompting, where the model is asked to show its reasoning step by step, unlocked abilities that were not present even in large models under standard prompting. These augmented strategies themselves appeared to be emergent, since they harmed performance in smaller models.
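The difference between these two settings can be sketched in code. The helper names, example problems, and prompt format below are illustrative assumptions, not the benchmark-specific formats used in the original evaluations:

```python
# Sketch: standard few-shot vs. chain-of-thought prompt construction.
# The exemplars and the Q/A format are hypothetical illustrations.

def few_shot_prompt(examples, question):
    """Standard prompting: bare input-output pairs, then the new question."""
    lines = [f"Q: {q}\nA: {a}" for q, a in examples]
    lines.append(f"Q: {question}\nA:")
    return "\n\n".join(lines)

def chain_of_thought_prompt(examples, question):
    """CoT prompting: each exemplar includes intermediate reasoning steps."""
    lines = [f"Q: {q}\nA: {steps} The answer is {a}." for q, steps, a in examples]
    lines.append(f"Q: {question}\nA:")
    return "\n\n".join(lines)

standard = few_shot_prompt(
    [("What is 12 + 7?", "19")],
    "What is 15 + 23?",
)
cot = chain_of_thought_prompt(
    [("What is 12 + 7?", "12 + 7 = 19.", "19")],
    "What is 15 + 23?",
)
```

Under standard prompting the model must emit the answer directly after "A:"; under CoT prompting the exemplars license it to generate reasoning steps first, which is the behavior that helps only above the emergence threshold.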
The paper's central finding was that the relationship between model scale and task performance was not smooth and predictable. Instead, it exhibited what looked like sharp, sudden transitions from near-random to above-random performance.
Several specific tasks have become canonical examples of emergence in the literature. The table below summarizes some of the most frequently cited cases.
| Task | Description | Model family | Approximate emergence threshold | Source |
|---|---|---|---|---|
| Multi-digit arithmetic (3-digit addition) | Adding three-digit numbers correctly | GPT-3 / LaMDA | ~13B parameters (GPT-3); ~68B parameters (LaMDA) | Wei et al., 2022 [1] |
| Word unscrambling | Rearranging shuffled letters to form a valid word | GPT-3 | ~13B parameters | Wei et al., 2022 [1] |
| Chain-of-thought reasoning (GSM8K) | Solving grade-school math word problems using step-by-step reasoning | PaLM | ~62B parameters (~10^23 FLOPs) | Wei et al., 2022 [1] |
| Multi-task language understanding (MMLU) | Answering college-level exam questions across 57 subjects | GPT-3 / Chinchilla | ~70B parameters | Hendrycks et al., 2021 [3] |
| International phonetic alphabet transliteration | Converting text to IPA notation | BIG-Bench models | ~10^23 training FLOPs | BIG-Bench, 2022 [4] |
| Persian QA | Answering questions in Persian | GPT-3 / PaLM | ~62B parameters | Wei et al., 2022 [1] |
Arithmetic provides one of the clearest illustrations of the phenomenon. When evaluated on two-digit addition, GPT-3 at 350 million parameters and 1.3 billion parameters both performed at roughly chance levels. At 6.7 billion parameters, performance improved only marginally. But at 175 billion parameters, the model could solve two-digit addition reliably. Three-digit addition required even larger scales. This pattern, where performance is flat for several orders of magnitude and then jumps, is the signature of what Wei et al. called emergence [1].
Chain-of-thought (CoT) prompting is a technique in which the model is prompted to produce intermediate reasoning steps before arriving at a final answer. Wei et al. (2022) showed that on GSM8K, a benchmark of grade-school math word problems, CoT prompting actually performed worse than standard prompting for models below approximately 10^23 FLOPs of training compute. Above that threshold, CoT dramatically outperformed direct answer prompting, eventually reaching a 57% solve rate on GSM8K for PaLM 540B [5]. This dual pattern, where the technique hurts small models and helps large ones, is itself considered emergent.
In word unscrambling tasks, models must rearrange shuffled letters to reconstruct a valid English word. Small and medium-sized models perform at near-zero accuracy, as the task requires a form of combinatorial search that simple pattern matching cannot solve. Performance jumps sharply in the 13B to 175B parameter range for GPT-3 family models [1].
Beyond arithmetic, multi-step reasoning tasks (such as logical deduction, tracking state changes in narratives, or navigating causal chains) show similar patterns. The common thread is that these tasks require more complex internal processing than straightforward pattern retrieval or next-token prediction. Models appear to develop the capacity for this deeper processing only at sufficient scale.
In April 2023, Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo of Stanford University published "Are Emergent Abilities of Large Language Models a Mirage?" [6]. The paper won an Outstanding Paper Award at NeurIPS 2023, signaling the research community's recognition of its importance.
Schaeffer et al. advanced a provocative thesis: the appearance of emergent abilities is primarily an artifact of the metrics researchers use to evaluate model performance, not a fundamental property of the models themselves. Their argument rested on several key observations.
Many of the benchmarks used to demonstrate emergence rely on nonlinear or discontinuous metrics, particularly exact-match accuracy. In exact-match scoring, a response is either perfectly correct (score of 1) or completely wrong (score of 0). There is no partial credit. The authors argued that this creates a misleading picture. A model might be gradually improving its internal representation of a task, producing outputs that are increasingly close to correct, but exact-match accuracy will register zero until the model crosses the threshold of producing a perfectly correct answer.
To test this, Schaeffer et al. re-evaluated tasks previously claimed to be emergent using continuous or linear metrics such as Token Edit Distance (which measures how many edits are needed to transform the model's output into the correct answer) and Brier Score (a probabilistic scoring measure). Under these alternative metrics, the apparent sharp transitions disappeared, replaced by smooth, continuous improvements across model scales [6].
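The core of this argument can be reproduced in a toy model. The sketch below is not Schaeffer et al.'s actual experiment; it simply assumes that per-token accuracy improves smoothly (here, linearly in log-scale, with made-up endpoints) and that the target answer is K tokens long. Exact match then behaves like p^K, which looks like a sharp transition, while the expected number of wrong tokens (an edit-distance-style metric) declines smoothly:

```python
# Toy model of the metric-artifact argument (a sketch, not Schaeffer et
# al.'s actual experiments). Assumed: per-token accuracy rises smoothly
# from ~0.10 at 10^8 parameters to ~0.99 at 10^12.

K = 10  # assumed answer length in tokens

def per_token_accuracy(log10_params):
    """Smooth, assumed improvement in per-token accuracy with log-scale."""
    t = (log10_params - 8.0) / 4.0
    t = min(max(t, 0.0), 1.0)
    return 0.10 + 0.89 * t

for log_n in [8, 9, 10, 11, 12]:
    p = per_token_accuracy(log_n)
    exact_match = p ** K            # all-or-nothing: looks like emergence
    expected_edits = K * (1 - p)    # linear metric: declines smoothly
    print(f"10^{log_n} params: p={p:.2f}  exact-match={exact_match:.4f}  "
          f"expected edits={expected_edits:.1f}")
```

Even though the underlying quantity improves at a constant rate per order of magnitude, exact match stays indistinguishable from zero for most of the range and then shoots up near the top, which is exactly the "emergent" signature.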
The paper made and tested three specific predictions: first, that replacing a nonlinear or discontinuous metric such as exact match with a linear or continuous one would reveal smooth, continuous, predictable improvement with scale; second, that for nonlinear metrics, enlarging the test set would uncover smooth above-chance performance in smaller models that coarser evaluations register as zero; and third, that apparent emergence could be deliberately induced in deep networks on tasks where none had been claimed (the authors demonstrated this with vision models) simply by choosing a discontinuous metric.
All three predictions were confirmed empirically using the InstructGPT/GPT-3 model family and a meta-analysis of BIG-Bench tasks [6].
If Schaeffer et al. are correct, the practical consequences are significant. It would mean that model capabilities are in principle more predictable than the emergence narrative suggests. It would also mean that the AI safety concerns arising from unpredictable capability jumps are less pressing, since improvements would follow smooth, forecastable curves when measured appropriately.
The mirage hypothesis did not go unchallenged. Several counter-arguments have been raised, and the debate continues to shape research in the field.
Jason Wei, the lead author of the original 2022 paper, responded publicly in a blog post titled "Common Arguments Regarding Emergent Abilities" [7]. He acknowledged that some tasks showing emergence under exact match do exhibit smooth improvement under alternative metrics. However, he argued that this observation misses the point in practical terms. For many real-world applications, exact-match accuracy is the metric that matters. If you ask a model "What is 15 + 23?" the answer is either 38 or it is wrong; partial credit for producing "37" is not useful for evaluating whether the model can actually do arithmetic.
Wei also noted that finding a surrogate metric that improves smoothly is significant only if it enables prediction of when the target metric (e.g., exact match) will cross a useful threshold. As of his writing, he had not seen substantial evidence that smooth surrogate metrics could reliably predict the onset of exact-match competence.
Additionally, Wei acknowledged a legitimate limitation of the original emergence claims: the models studied were available in only a few discrete sizes (e.g., 350M, 1.3B, 6.7B, 175B for GPT-3). With more intermediate sizes, the performance curve might appear smoother than the published data suggested. But this is a limitation of available data, not necessarily evidence against the underlying phenomenon.
A separate line of research has attempted to move the debate forward by shifting the independent variable. Rather than measuring emergence as a function of model size (parameter count) or training compute (FLOPs), researchers have examined performance as a function of pre-training loss.
Du et al. (2024) trained three LLMs of different sizes (1.5B, 6B, and 32B parameters) and evaluated their performance on twelve diverse downstream tasks at multiple checkpoints throughout training [8]. They found that certain tasks, including MMLU, C-Eval, and GSM8K, exhibited a distinct threshold in pre-training loss. Once the loss dropped below a critical value, performance improved abruptly. This pattern held across different model sizes, suggesting that the threshold is tied to the training process itself rather than to parameter count alone.
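The shape of this finding can be sketched with an illustrative model. The numbers below (chance level, threshold value, slope) are assumptions chosen for clarity, not Du et al.'s data; the point is only that accuracy is indexed to pre-training loss, with a single threshold shared across model sizes:

```python
# Illustrative sketch (assumed numbers, not Du et al.'s data): downstream
# accuracy modeled as flat at chance until pre-training loss crosses a
# critical value, then rising as loss keeps falling -- with the same
# threshold regardless of model size.

CHANCE = 0.25          # assumed 4-way multiple-choice baseline
LOSS_THRESHOLD = 2.2   # assumed critical pre-training loss

def downstream_accuracy(pretrain_loss):
    """Chance-level until the loss threshold is crossed, then improving."""
    if pretrain_loss >= LOSS_THRESHOLD:
        return CHANCE
    return min(1.0, CHANCE + 0.8 * (LOSS_THRESHOLD - pretrain_loss))

for loss in [3.0, 2.6, 2.2, 2.0, 1.8]:
    print(f"pre-training loss {loss:.1f} -> task accuracy "
          f"{downstream_accuracy(loss):.2f}")
```

Under this picture, a 1.5B model and a 32B model that reach the same pre-training loss would show the same downstream accuracy, which is why loss is the more reliable predictor.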
This finding partially reconciles the two perspectives. It suggests that emergence may be real in the sense that there are genuine thresholds in learning, but these thresholds may be more predictable when indexed to pre-training loss rather than to model size. Pre-training loss acts as a more reliable predictor of downstream task performance, potentially independent of model architecture or size.
Another counter-argument focuses on statistical power. When smaller models are evaluated on difficult tasks, they may perform slightly above chance, but the signal is drowned out by noise because evaluations typically use limited numbers of test examples. Increasing the number of evaluation samples can reveal above-chance performance in smaller models that was previously invisible. This suggests that some apparent emergence is a consequence of inadequate evaluation sample sizes rather than a true discontinuity in capability.
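A small simulation makes the statistical-power point concrete. The accuracies below are assumed for illustration: a model that is truly five points above chance on a four-way multiple-choice task looks indistinguishable from chance on a small test set, but its edge becomes visible as the number of evaluation items grows:

```python
import random

# Toy illustration of the statistical-power argument (assumed numbers):
# TRUE_ACC is slightly above CHANCE, but small test sets cannot tell
# the two apart.

CHANCE, TRUE_ACC = 0.25, 0.30   # assumed: five points above chance

def measured_accuracy(n_items, rng):
    """Simulate grading n_items independent test questions."""
    correct = sum(rng.random() < TRUE_ACC for _ in range(n_items))
    return correct / n_items

rng = random.Random(0)
for n in [50, 500, 50_000]:
    acc = measured_accuracy(n, rng)
    # standard error of a chance-level model's measured accuracy
    se = (CHANCE * (1 - CHANCE) / n) ** 0.5
    print(f"n={n:>6}: measured={acc:.3f}  "
          f"chance band=({CHANCE - 2*se:.3f}, {CHANCE + 2*se:.3f})")
```

With 50 items the chance band is wide enough to swallow the model's true edge; with 50,000 items the same underlying accuracy sits far outside it, so the "invisible" capability of the smaller model becomes measurable.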
The debate over emergence connects to a deeper question in deep learning theory: do neural networks undergo phase transitions during training and scaling, or do all improvements happen smoothly?
Proponents of genuine emergence argue that LLMs undergo something analogous to phase transitions in physical systems. In this view, certain internal representations or computational circuits form only when the network is large enough to support them. Below a critical scale, the model lacks the capacity to represent the solution to a particular task, and performance is random. Above that scale, the necessary representations "crystallize," and performance jumps.
This view is supported by mechanistic interpretability research, which has found that specific circuits within transformer networks appear to form at particular scales. For example, induction heads (attention patterns that implement simple copying and in-context learning) appear to form during a relatively narrow phase of training [9]. These findings suggest that at least some capability gains are genuinely discontinuous at the level of internal mechanism, even if they appear smooth under certain external metrics.
Skeptics counter that all neural network improvements are fundamentally smooth at a fine-grained level. What appears to be a phase transition is an artifact of coarse measurement. If you measured the relevant internal computations with sufficient granularity, you would see gradual improvement. The "jump" in external metrics like exact-match accuracy is simply the point at which a gradually improving internal capability crosses the threshold needed to produce correct outputs consistently.
This view is not necessarily in conflict with the practical significance of emergence. Even if the underlying improvement is smooth, the fact that useful capabilities appear to switch on suddenly, as measured by practically relevant metrics, matters for applications and for safety.
The concept of emergent abilities has a complex relationship with scaling laws, the empirical regularities governing how model performance improves with increased compute, data, and parameters.
The scaling laws discovered by Kaplan et al. (2020) at OpenAI and later refined by Hoffmann et al. (2022) in the Chinchilla paper describe smooth, predictable relationships between model scale and aggregate performance metrics like cross-entropy loss [10][11]. These laws accurately predict that a model trained with twice the compute will achieve a specific, calculable reduction in loss.
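The Chinchilla paper's parametric fit has the form L(N, D) = E + A/N^alpha + B/D^beta, where N is the parameter count and D the number of training tokens. The sketch below uses the fitted constants as I recall them from Hoffmann et al.; treat the exact values as approximate:

```python
# Sketch of the Chinchilla parametric loss fit, L(N, D) = E + A/N^a + B/D^b
# (Hoffmann et al., 2022). Constants quoted from memory; approximate.

E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def chinchilla_loss(n_params, n_tokens):
    """Predicted pre-training cross-entropy loss at a given scale."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Chinchilla itself: ~70B parameters on ~1.4T tokens
print(chinchilla_loss(70e9, 1.4e12))
# A larger model trained on far fewer tokens is predicted to do worse:
print(chinchilla_loss(280e9, 300e9))
```

Note what the formula predicts and what it does not: it gives a smooth, calculable loss for any (N, D) pair, but says nothing about which downstream tasks become solvable at that loss, which is precisely the gap the emergence debate occupies.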
Emergent abilities appear to violate this smooth picture. While aggregate loss decreases smoothly with scale, performance on specific downstream tasks can jump sharply. This creates an apparent paradox: how can the overall loss curve be smooth if individual task performance curves are discontinuous?
One resolution is that aggregate loss averages over thousands of sub-tasks and capabilities. Each individual sub-task might have its own threshold, but because these thresholds are distributed across different scales, their average appears smooth. The loss curve is like a smoothed staircase: each step is an individual emergence event, but when averaged together, the steps blend into a ramp.
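The smoothed-staircase picture is easy to simulate. In the toy model below (all numbers assumed), each sub-task switches on sharply at its own threshold scale, but averaging a thousand such tasks, with thresholds spread over four orders of magnitude, produces a smooth aggregate curve:

```python
import math
import random

# Toy "smoothed staircase" (illustrative only): each sub-task has a steep
# sigmoidal performance curve with its own threshold; the mean over many
# sub-tasks is smooth.

random.seed(0)
N_TASKS = 1000
# Assumed: per-task emergence thresholds uniform in log10(params) on [8, 12]
thresholds = [random.uniform(8, 12) for _ in range(N_TASKS)]

def task_score(log_scale, threshold, sharpness=25.0):
    """A steep sigmoid: near 0 below the threshold, near 1 above it."""
    return 1.0 / (1.0 + math.exp(-sharpness * (log_scale - threshold)))

for log_n in [8.5, 9.5, 10.5, 11.5]:
    avg = sum(task_score(log_n, t) for t in thresholds) / N_TASKS
    print(f"10^{log_n} params: mean score across tasks = {avg:.3f}")
```

Each individual `task_score` curve is nearly a step function, yet the printed means climb steadily, because at any given scale roughly the fraction of tasks whose thresholds have been crossed contributes to the average.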
Another resolution, consistent with the mirage hypothesis, is that there is no paradox because emergence itself is an artifact. Under this view, both loss and task performance improve smoothly. The apparent disconnect arises only because researchers use discontinuous metrics for task evaluation while using continuous metrics (loss) for aggregate performance.
Recent work in 2025 has proposed a middle ground. A comprehensive survey by Berti et al. (2025) suggests that emergence is better understood as a property of the interaction between model capability, task difficulty, and evaluation methodology [12]. When all three factors are properly accounted for, the behavior of LLMs at scale becomes more predictable, though not entirely so.
The following table summarizes the approximate scale at which various model families have been observed to exhibit particular abilities. These figures are drawn from the original Wei et al. paper and subsequent replications.
| Model family | Parameters | Training compute (FLOPs) | Notable emergent abilities observed |
|---|---|---|---|
| GPT-3 (350M) | 350M | ~10^20 | None observed |
| GPT-3 (1.3B) | 1.3B | ~10^21 | None observed |
| GPT-3 (6.7B) | 6.7B | ~3x10^21 | Marginal improvements on some BIG-Bench tasks |
| GPT-3 (175B) | 175B | ~3.6x10^23 | Multi-digit arithmetic, word unscrambling, analogical reasoning |
| LaMDA (68B) | 68B | ~10^23 | Three-digit arithmetic, word unscrambling |
| PaLM (62B) | 62B | ~10^23 | Chain-of-thought reasoning on GSM8K |
| PaLM (540B) | 540B | ~2.5x10^24 | Strong CoT performance (57% on GSM8K), multi-step logical reasoning |
| Chinchilla (70B) | 70B | ~5x10^23 | MMLU above chance, competitive with much larger models |
Note that Chinchilla, despite being smaller than GPT-3 175B, demonstrates strong performance on MMLU because it was trained on substantially more data relative to its size, in accordance with the Chinchilla scaling laws [11]. This illustrates that parameter count alone is an imperfect proxy for the relevant scale variable; training compute and data quantity also play critical roles.
The debate over emergent abilities has profound implications for AI alignment and safety research.
If emergent abilities are genuine, they pose a fundamental challenge to AI safety. A model that appears harmless at one scale might develop dangerous capabilities at a slightly larger scale, with no warning from the scaling curve. This unpredictability makes it difficult to conduct safety evaluations, since testing a smaller model may not reveal what a larger version will be capable of.
Concrete concerns include the sudden onset of deceptive or manipulative behavior that smaller models did not exhibit, dangerous capabilities (such as reward hacking or assistance with misuse) appearing with no warning from the scaling curve, and the risk that safety evaluations performed on smaller models systematically underestimate what larger successors can do.
If Schaeffer et al. are correct that emergence is primarily a measurement artifact, the safety picture is somewhat more reassuring. Smooth, predictable capability improvements are easier to monitor and control. Safety researchers could, in principle, track continuous metrics to forecast when a model will cross practically important thresholds.
However, even under the mirage view, safety concerns do not disappear. Even if the underlying improvement is smooth, the practical onset of dangerous capabilities (as measured by real-world benchmarks) could still be abrupt. A model that gets 0% on a deception benchmark and then 50% at slightly larger scale presents the same real-world risk regardless of whether the improvement was "truly" discontinuous or merely appears so under the chosen metric.
The emergence debate has directly influenced AI governance discussions. Several proposed regulatory frameworks, including those discussed by the European Union AI Act drafters and U.S. executive orders on AI, reference the concept of unpredictable capabilities as a justification for mandatory pre-deployment testing of large models. If capabilities are unpredictable, regulators argue, developers cannot simply assert that their models are safe based on smaller-scale testing [13].
As of early 2026, the research community has not reached a definitive consensus, but the debate has matured considerably.
Most researchers now agree on several points:
Metric choice matters significantly. The appearance of sharp emergence is heavily influenced by whether discrete or continuous evaluation metrics are used. This is no longer controversial.
Pre-training loss is a better predictor than parameter count. The scale at which abilities appear is more reliably predicted by the model's pre-training loss than by its raw parameter count or training FLOPs [8][12].
Some form of nonlinear improvement is real. Even skeptics of the strongest emergence claims acknowledge that the rate of improvement on certain tasks accelerates in a way that is not easily explained by purely linear scaling.
The phenomenon depends on task complexity. Simple tasks (like sentiment classification) do not exhibit emergence. The effect is concentrated in tasks that require multi-step reasoning, compositional generalization, or other forms of complex computation.
Several important questions remain unresolved:
Mechanistic understanding: What happens inside the model at the point of apparent emergence? Mechanistic interpretability research is making progress, but a complete understanding of why specific capabilities appear at specific scales remains elusive.
Prediction: Can we reliably predict, before training a large model, which new capabilities it will exhibit? Current methods offer only rough estimates.
Large reasoning models: The emergence of large reasoning models (LRMs) that use reinforcement learning and inference-time search to enhance reasoning capabilities introduces new dimensions to the debate. These models amplify reasoning and self-reflection through compute applied at inference time rather than at training time, raising questions about whether emergence can occur along the inference-compute axis as well [12].
Emergent risks: As models grow more capable, they may develop harmful behaviors, including deception, manipulation, and reward hacking. The ability to forecast and mitigate these emergent risks is one of the most pressing challenges in AI safety research.
A comprehensive survey published in March 2025 by Berti et al. examined the phenomenon across multiple dimensions [12]. The survey concluded that emergent abilities arise from the complex dynamics of highly sensitive nonlinear systems rather than simply from parameter scaling alone. It proposed that the "sharpness" of performance improvement is linked to the model crossing a critical learning threshold, as indicated by its pre-training loss, rather than being solely an artifact of the metric applied. The survey also extended the analysis beyond traditional LLMs to include large reasoning models, noting that reinforcement learning and inference-time computation introduce additional axes along which emergence might occur.
A separate 2025 statistical analysis published in Neural Networks challenged several common assumptions about scaling and emergence, arguing that careful decomposition of task difficulty and model capability reveals more structure in the scaling curves than previously appreciated [14].
Emergent abilities in large language models remain one of the most consequential and contested topics in AI research. The original observation by Wei et al. (2022), that models exhibit sudden jumps in capability at critical scales, captured something real and important about how LLMs behave. The challenge from Schaeffer et al. (2023) sharpened the field's understanding by demonstrating that metric choice can create misleading impressions of discontinuity. Subsequent work on pre-training loss thresholds, mechanistic interpretability, and statistical methodology has moved the conversation toward a more nuanced view.
The practical stakes are high. Whether capabilities emerge suddenly or improve smoothly determines how we should approach safety testing, governance, and the development of increasingly powerful AI systems. The research community continues to investigate these questions with urgency, knowing that the answers will shape the trajectory of artificial intelligence for years to come.