Emergent abilities refer to capabilities in large language models (LLMs) that are absent in smaller models but appear once a model reaches a sufficient scale. The term was popularized by a 2022 paper from Google Research, and it has since become one of the most debated concepts in modern artificial intelligence. Whether these abilities represent genuine qualitative shifts in computation or statistical artifacts of evaluation methodology remains an open question, with significant implications for AI safety and the future of machine learning research.
The concept of emergence has deep roots in philosophy and the natural sciences. In physics, phase transitions (such as water freezing into ice) produce qualitative changes in macroscopic properties that cannot be predicted from a single molecule's behavior. In biology, flocking behavior emerges from simple rules followed by individual birds. The application of this concept to LLMs follows a similar logic: certain capabilities appear to arise not from any single component of the model but from the collective interactions of billions of parameters.
Jason Wei, Yi Tay, and colleagues at Google Research formalized this idea in their June 2022 paper "Emergent Abilities of Large Language Models," published in Transactions on Machine Learning Research [1]. They defined an emergent ability as "an ability that is not present in smaller models but is present in larger models." Crucially, the paper emphasized that such abilities "cannot be predicted simply by extrapolating the performance of smaller models." In other words, performance on certain tasks stays near random for models across many orders of magnitude of scale, then jumps sharply once a critical threshold is crossed.
This definition draws on the broader scientific notion of emergence articulated by Nobel laureate Philip Anderson in his 1972 essay "More Is Different," which argued that quantitative increases in complexity can produce qualitatively new phenomena [2]. Wei et al. applied this principle to neural networks, arguing that LLMs exhibit a version of this phenomenon.
The foundational paper surveyed a wide range of tasks across multiple model families, including GPT-3, LaMDA, PaLM, and Chinchilla. The authors examined performance on benchmarks from BIG-Bench and the Massive Multitask Language Understanding (MMLU) benchmark, among others. They identified over 130 tasks where performance appeared to follow an emergent pattern.
The paper documented two primary settings in which emergence was observed:
Few-shot prompting: The model is given a small number of input-output examples and must generalize to new inputs. For many tasks, models below a certain size performed at or near chance, while models above that threshold achieved significantly above-chance accuracy.
Augmented prompting strategies: Techniques like chain-of-thought prompting, where the model is asked to show its reasoning step by step, unlocked abilities that were not present even in large models under standard prompting. These augmented strategies themselves appeared to be emergent, since they harmed performance in smaller models.
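The difference between these two settings can be sketched in code. The helper names, example problems, and prompt format below are illustrative assumptions, not the benchmark-specific formats used in the original evaluations:

```python
# Sketch: standard few-shot vs. chain-of-thought prompt construction.
# The exemplars and the Q/A format are hypothetical illustrations.

def few_shot_prompt(examples, question):
    """Standard prompting: bare input-output pairs, then the new question."""
    lines = [f"Q: {q}\nA: {a}" for q, a in examples]
    lines.append(f"Q: {question}\nA:")
    return "\n\n".join(lines)

def chain_of_thought_prompt(examples, question):
    """CoT prompting: each exemplar includes intermediate reasoning steps."""
    lines = [f"Q: {q}\nA: {steps} The answer is {a}." for q, steps, a in examples]
    lines.append(f"Q: {question}\nA:")
    return "\n\n".join(lines)

standard = few_shot_prompt(
    [("What is 12 + 7?", "19")],
    "What is 15 + 23?",
)
cot = chain_of_thought_prompt(
    [("What is 12 + 7?", "12 + 7 = 19.", "19")],
    "What is 15 + 23?",
)
```

Under standard prompting the model must emit the answer directly after "A:"; under CoT prompting the exemplars license it to generate reasoning steps first, which is the behavior that helps only above the emergence threshold.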
The paper's central finding was that the relationship between model scale and task performance was not smooth and predictable. Instead, it exhibited what looked like sharp, sudden transitions from near-random to above-random performance.
Several specific tasks have become canonical examples of emergence in the literature. The table below summarizes some of the most frequently cited cases.
| Task | Description | Model family | Approximate emergence threshold | Source |
|---|---|---|---|---|
| Multi-digit arithmetic (3-digit addition) | Adding three-digit numbers correctly | GPT-3 / LaMDA | ~13B parameters (GPT-3); ~68B parameters (LaMDA) | Wei et al., 2022 [1] |
| Word unscrambling | Rearranging shuffled letters to form a valid word | GPT-3 | ~13B parameters | Wei et al., 2022 [1] |
| Chain-of-thought reasoning (GSM8K) | Solving grade-school math word problems using step-by-step reasoning | PaLM | ~62B parameters (~10^23 FLOPs) | Wei et al., 2022 [1] |
| Multi-task language understanding (MMLU) | Answering college-level exam questions across 57 subjects | GPT-3 / Chinchilla | ~70B parameters | Hendrycks et al., 2021 [3] |
| International phonetic alphabet transliteration | Converting text to IPA notation | BIG-Bench models | ~10^23 training FLOPs | BIG-Bench, 2022 [4] |
| Persian QA | Answering questions in Persian | GPT-3 / PaLM | ~62B parameters | Wei et al., 2022 [1] |
Arithmetic provides one of the clearest illustrations of the phenomenon. When evaluated on two-digit addition, GPT-3 at 350 million parameters and 1.3 billion parameters both performed at roughly chance levels. At 6.7 billion parameters, performance improved only marginally. But at 175 billion parameters, the model could solve two-digit addition reliably. Three-digit addition required even larger scales. This pattern, where performance is flat for several orders of magnitude and then jumps, is the signature of what Wei et al. called emergence [1].
Chain-of-thought (CoT) prompting is a technique in which the model is prompted to produce intermediate reasoning steps before arriving at a final answer. Wei et al. (2022) showed that on GSM8K, a benchmark of grade-school math word problems, CoT prompting actually performed worse than standard prompting for models below approximately 10^23 FLOPs of training compute. Above that threshold, CoT dramatically outperformed direct answer prompting, eventually reaching a 57% solve rate on GSM8K for PaLM 540B [5]. This dual pattern, where the technique hurts small models and helps large ones, is itself considered emergent.
In word unscrambling tasks, models must rearrange shuffled letters to reconstruct a valid English word. Small and medium-sized models perform at near-zero accuracy, as the task requires a form of combinatorial search that simple pattern matching cannot solve. Performance jumps sharply in the 13B to 175B parameter range for GPT-3 family models [1].
Beyond arithmetic, multi-step reasoning tasks (such as logical deduction, tracking state changes in narratives, or navigating causal chains) show similar patterns. The common thread is that these tasks require more complex internal processing than straightforward pattern retrieval or next-token prediction. Models appear to develop the capacity for this deeper processing only at sufficient scale.
In April 2023, Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo of Stanford University published "Are Emergent Abilities of Large Language Models a Mirage?" [6]. The paper won an Outstanding Paper Award at NeurIPS 2023, signaling the research community's recognition of its importance.
Schaeffer et al. advanced a provocative thesis: the appearance of emergent abilities is primarily an artifact of the metrics researchers use to evaluate model performance, not a fundamental property of the models themselves. Their argument rested on several key observations.
Many of the benchmarks used to demonstrate emergence rely on nonlinear or discontinuous metrics, particularly exact-match accuracy. In exact-match scoring, a response is either perfectly correct (score of 1) or completely wrong (score of 0). There is no partial credit. The authors argued that this creates a misleading picture. A model might be gradually improving its internal representation of a task, producing outputs that are increasingly close to correct, but exact-match accuracy will register zero until the model crosses the threshold of producing a perfectly correct answer.
To test this, Schaeffer et al. re-evaluated tasks previously claimed to be emergent using continuous or linear metrics such as Token Edit Distance (which measures how many edits are needed to transform the model's output into the correct answer) and Brier Score (a probabilistic scoring measure). Under these alternative metrics, the apparent sharp transitions disappeared, replaced by smooth, continuous improvements across model scales [6].
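The core of this argument can be reproduced in a toy model. The sketch below is not Schaeffer et al.'s actual experiment; it simply assumes that per-token accuracy improves smoothly (here, linearly in log-scale, with made-up endpoints) and that the target answer is K tokens long. Exact match then behaves like p^K, which looks like a sharp transition, while the expected number of wrong tokens (an edit-distance-style metric) declines smoothly:

```python
# Toy model of the metric-artifact argument (a sketch, not Schaeffer et
# al.'s actual experiments). Assumed: per-token accuracy rises smoothly
# from ~0.10 at 10^8 parameters to ~0.99 at 10^12.

K = 10  # assumed answer length in tokens

def per_token_accuracy(log10_params):
    """Smooth, assumed improvement in per-token accuracy with log-scale."""
    t = (log10_params - 8.0) / 4.0
    t = min(max(t, 0.0), 1.0)
    return 0.10 + 0.89 * t

for log_n in [8, 9, 10, 11, 12]:
    p = per_token_accuracy(log_n)
    exact_match = p ** K            # all-or-nothing: looks like emergence
    expected_edits = K * (1 - p)    # linear metric: declines smoothly
    print(f"10^{log_n} params: p={p:.2f}  exact-match={exact_match:.4f}  "
          f"expected edits={expected_edits:.1f}")
```

Even though the underlying quantity improves at a constant rate per order of magnitude, exact match stays indistinguishable from zero for most of the range and then shoots up near the top, which is exactly the "emergent" signature.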
The paper made and tested three specific predictions: first, that replacing a nonlinear or discontinuous metric such as exact match with a linear or continuous one would reveal smooth, continuous, predictable improvement with scale; second, that for nonlinear metrics, enlarging the test set would uncover smooth above-chance performance in smaller models that coarser evaluations register as zero; and third, that apparent emergence could be deliberately induced in deep networks on tasks where none had been claimed (the authors demonstrated this with vision models) simply by choosing a discontinuous metric.
All three predictions were confirmed empirically using the InstructGPT/GPT-3 model family and a meta-analysis of BIG-Bench tasks [6].
If Schaeffer et al. are correct, the practical consequences are significant. It would mean that model capabilities are in principle more predictable than the emergence narrative suggests. It would also mean that the AI safety concerns arising from unpredictable capability jumps are less pressing, since improvements would follow smooth, forecastable curves when measured appropriately.
The mirage hypothesis did not go unchallenged. Several counter-arguments have been raised, and the debate continues to shape research in the field.
Jason Wei, the lead author of the original 2022 paper, responded publicly in a blog post titled "Common Arguments Regarding Emergent Abilities" [7]. He acknowledged that some tasks showing emergence under exact match do exhibit smooth improvement under alternative metrics. However, he argued that this observation misses the point in practical terms. For many real-world applications, exact-match accuracy is the metric that matters. If you ask a model "What is 15 + 23?" the answer is either 38 or it is wrong; partial credit for producing "37" is not useful for evaluating whether the model can actually do arithmetic.
Wei also noted that finding a surrogate metric that improves smoothly is significant only if it enables prediction of when the target metric (e.g., exact match) will cross a useful threshold. As of his writing, he had not seen substantial evidence that smooth surrogate metrics could reliably predict the onset of exact-match competence.
Additionally, Wei acknowledged a legitimate limitation of the original emergence claims: the models studied were available in only a few discrete sizes (e.g., 350M, 1.3B, 6.7B, 175B for GPT-3). With more intermediate sizes, the performance curve might appear smoother than the published data suggested. But this is a limitation of available data, not necessarily evidence against the underlying phenomenon.
A separate line of research has attempted to move the debate forward by shifting the independent variable. Rather than measuring emergence as a function of model size (parameter count) or training compute (FLOPs), researchers have examined performance as a function of pre-training loss.
Du et al. (2024) trained three LLMs of different sizes (1.5B, 6B, and 32B parameters) and evaluated their performance on twelve diverse downstream tasks at multiple checkpoints throughout training [8]. They found that certain tasks, including MMLU, C-Eval, and GSM8K, exhibited a distinct threshold in pre-training loss. Once the loss dropped below a critical value, performance improved abruptly. This pattern held across different model sizes, suggesting that the threshold is tied to the training process itself rather than to parameter count alone.
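The shape of this finding can be sketched with an illustrative model. The numbers below (chance level, threshold value, slope) are assumptions chosen for clarity, not Du et al.'s data; the point is only that accuracy is indexed to pre-training loss, with a single threshold shared across model sizes:

```python
# Illustrative sketch (assumed numbers, not Du et al.'s data): downstream
# accuracy modeled as flat at chance until pre-training loss crosses a
# critical value, then rising as loss keeps falling -- with the same
# threshold regardless of model size.

CHANCE = 0.25          # assumed 4-way multiple-choice baseline
LOSS_THRESHOLD = 2.2   # assumed critical pre-training loss

def downstream_accuracy(pretrain_loss):
    """Chance-level until the loss threshold is crossed, then improving."""
    if pretrain_loss >= LOSS_THRESHOLD:
        return CHANCE
    return min(1.0, CHANCE + 0.8 * (LOSS_THRESHOLD - pretrain_loss))

for loss in [3.0, 2.6, 2.2, 2.0, 1.8]:
    print(f"pre-training loss {loss:.1f} -> task accuracy "
          f"{downstream_accuracy(loss):.2f}")
```

Under this picture, a 1.5B model and a 32B model that reach the same pre-training loss would show the same downstream accuracy, which is why loss is the more reliable predictor.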
This finding partially reconciles the two perspectives. It suggests that emergence may be real in the sense that there are genuine thresholds in learning, but these thresholds may be more predictable when indexed to pre-training loss rather than to model size. Pre-training loss acts as a more reliable predictor of downstream task performance, potentially independent of model architecture or size.
Another counter-argument focuses on statistical power. When smaller models are evaluated on difficult tasks, they may perform slightly above chance, but the signal is drowned out by noise because evaluations typically use limited numbers of test examples. Increasing the number of evaluation samples can reveal above-chance performance in smaller models that was previously invisible. This suggests that some apparent emergence is a consequence of inadequate evaluation sample sizes rather than a true discontinuity in capability.
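A small simulation makes the statistical-power point concrete. The accuracies below are assumed for illustration: a model that is truly five points above chance on a four-way multiple-choice task looks indistinguishable from chance on a small test set, but its edge becomes visible as the number of evaluation items grows:

```python
import random

# Toy illustration of the statistical-power argument (assumed numbers):
# TRUE_ACC is slightly above CHANCE, but small test sets cannot tell
# the two apart.

CHANCE, TRUE_ACC = 0.25, 0.30   # assumed: five points above chance

def measured_accuracy(n_items, rng):
    """Simulate grading n_items independent test questions."""
    correct = sum(rng.random() < TRUE_ACC for _ in range(n_items))
    return correct / n_items

rng = random.Random(0)
for n in [50, 500, 50_000]:
    acc = measured_accuracy(n, rng)
    # standard error of a chance-level model's measured accuracy
    se = (CHANCE * (1 - CHANCE) / n) ** 0.5
    print(f"n={n:>6}: measured={acc:.3f}  "
          f"chance band=({CHANCE - 2*se:.3f}, {CHANCE + 2*se:.3f})")
```

With 50 items the chance band is wide enough to swallow the model's true edge; with 50,000 items the same underlying accuracy sits far outside it, so the "invisible" capability of the smaller model becomes measurable.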
The debate over emergence connects to a deeper question in deep learning theory: do neural networks undergo phase transitions during training and scaling, or do all improvements happen smoothly?
Proponents of genuine emergence argue that LLMs undergo something analogous to phase transitions in physical systems. In this view, certain internal representations or computational circuits form only when the network is large enough to support them. Below a critical scale, the model lacks the capacity to represent the solution to a particular task, and performance is random. Above that scale, the necessary representations "crystallize," and performance jumps.
This view is supported by mechanistic interpretability research, which has found that specific circuits within transformer networks appear to form at particular scales. For example, induction heads (attention patterns that implement simple copying and in-context learning) appear to form during a relatively narrow phase of training [9]. These findings suggest that at least some capability gains are genuinely discontinuous at the level of internal mechanism, even if they appear smooth under certain external metrics.
Skeptics counter that all neural network improvements are fundamentally smooth at a fine-grained level. What appears to be a phase transition is an artifact of coarse measurement. If you measured the relevant internal computations with sufficient granularity, you would see gradual improvement. The "jump" in external metrics like exact-match accuracy is simply the point at which a gradually improving internal capability crosses the threshold needed to produce correct outputs consistently.
This view is not necessarily in conflict with the practical significance of emergence. Even if the underlying improvement is smooth, the fact that useful capabilities appear to switch on suddenly, as measured by practically relevant metrics, matters for applications and for safety.
The concept of emergent abilities has a complex relationship with scaling laws, the empirical regularities governing how model performance improves with increased compute, data, and parameters.
The scaling laws discovered by Kaplan et al. (2020) at OpenAI and later refined by Hoffmann et al. (2022) in the Chinchilla paper describe smooth, predictable relationships between model scale and aggregate performance metrics like cross-entropy loss [10][11]. These laws accurately predict that a model trained with twice the compute will achieve a specific, calculable reduction in loss.
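The Chinchilla paper's parametric fit has the form L(N, D) = E + A/N^alpha + B/D^beta, where N is the parameter count and D the number of training tokens. The sketch below uses the fitted constants as I recall them from Hoffmann et al.; treat the exact values as approximate:

```python
# Sketch of the Chinchilla parametric loss fit, L(N, D) = E + A/N^a + B/D^b
# (Hoffmann et al., 2022). Constants quoted from memory; approximate.

E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def chinchilla_loss(n_params, n_tokens):
    """Predicted pre-training cross-entropy loss at a given scale."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Chinchilla itself: ~70B parameters on ~1.4T tokens
print(chinchilla_loss(70e9, 1.4e12))
# A larger model trained on far fewer tokens is predicted to do worse:
print(chinchilla_loss(280e9, 300e9))
```

Note what the formula predicts and what it does not: it gives a smooth, calculable loss for any (N, D) pair, but says nothing about which downstream tasks become solvable at that loss, which is precisely the gap the emergence debate occupies.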
Emergent abilities appear to violate this smooth picture. While aggregate loss decreases smoothly with scale, performance on specific downstream tasks can jump sharply. This creates an apparent paradox: how can the overall loss curve be smooth if individual task performance curves are discontinuous?
One resolution is that aggregate loss averages over thousands of sub-tasks and capabilities. Each individual sub-task might have its own threshold, but because these thresholds are distributed across different scales, their average appears smooth. The loss curve is like a smoothed staircase: each step is an individual emergence event, but when averaged together, the steps blend into a ramp.
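The smoothed-staircase picture is easy to simulate. In the toy model below (all numbers assumed), each sub-task switches on sharply at its own threshold scale, but averaging a thousand such tasks, with thresholds spread over four orders of magnitude, produces a smooth aggregate curve:

```python
import math
import random

# Toy "smoothed staircase" (illustrative only): each sub-task has a steep
# sigmoidal performance curve with its own threshold; the mean over many
# sub-tasks is smooth.

random.seed(0)
N_TASKS = 1000
# Assumed: per-task emergence thresholds uniform in log10(params) on [8, 12]
thresholds = [random.uniform(8, 12) for _ in range(N_TASKS)]

def task_score(log_scale, threshold, sharpness=25.0):
    """A steep sigmoid: near 0 below the threshold, near 1 above it."""
    return 1.0 / (1.0 + math.exp(-sharpness * (log_scale - threshold)))

for log_n in [8.5, 9.5, 10.5, 11.5]:
    avg = sum(task_score(log_n, t) for t in thresholds) / N_TASKS
    print(f"10^{log_n} params: mean score across tasks = {avg:.3f}")
```

Each individual `task_score` curve is nearly a step function, yet the printed means climb steadily, because at any given scale roughly the fraction of tasks whose thresholds have been crossed contributes to the average.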
Another resolution, consistent with the mirage hypothesis, is that there is no paradox because emergence itself is an artifact. Under this view, both loss and task performance improve smoothly. The apparent disconnect arises only because researchers use discontinuous metrics for task evaluation while using continuous metrics (loss) for aggregate performance.
Recent work in 2025 has proposed a middle ground. A comprehensive survey by Berti et al. (2025) suggests that emergence is better understood as a property of the interaction between model capability, task difficulty, and evaluation methodology [12]. When all three factors are properly accounted for, the behavior of LLMs at scale becomes more predictable, though not entirely so.
The following table summarizes the approximate scale at which various model families have been observed to exhibit particular abilities. These figures are drawn from the original Wei et al. paper and subsequent replications.
| Model family | Parameters | Training compute (FLOPs) | Notable emergent abilities observed |
|---|---|---|---|
| GPT-3 (350M) | 350M | ~10^20 | None observed |
| GPT-3 (1.3B) | 1.3B | ~10^21 | None observed |
| GPT-3 (6.7B) | 6.7B | ~3x10^21 | Marginal improvements on some BIG-Bench tasks |
| GPT-3 (175B) | 175B | ~3.6x10^23 | Multi-digit arithmetic, word unscrambling, analogical reasoning |
| LaMDA (68B) | 68B | ~10^23 | Three-digit arithmetic, word unscrambling |
| PaLM (62B) | 62B | ~10^23 | Chain-of-thought reasoning on GSM8K |
| PaLM (540B) | 540B | ~2.5x10^24 | Strong CoT performance (57% on GSM8K), multi-step logical reasoning |
| Chinchilla (70B) | 70B | ~5x10^23 | MMLU above chance, competitive with much larger models |
Note that Chinchilla, despite being smaller than GPT-3 175B, demonstrates strong performance on MMLU because it was trained on substantially more data relative to its size, in accordance with the Chinchilla scaling laws [11]. This illustrates that parameter count alone is an imperfect proxy for the relevant scale variable; training compute and data quantity also play critical roles.
The debate over emergent abilities has profound implications for AI alignment and safety research.
If emergent abilities are genuine, they pose a fundamental challenge to AI safety. A model that appears harmless at one scale might develop dangerous capabilities at a slightly larger scale, with no warning from the scaling curve. This unpredictability makes it difficult to conduct safety evaluations, since testing a smaller model may not reveal what a larger version will be capable of.
Concrete concerns include the sudden onset of deceptive or manipulative behavior that smaller models did not exhibit, dangerous capabilities (such as reward hacking or assistance with misuse) appearing with no warning from the scaling curve, and the risk that safety evaluations performed on smaller models systematically underestimate what larger successors can do.
If Schaeffer et al. are correct that emergence is primarily a measurement artifact, the safety picture is somewhat more reassuring. Smooth, predictable capability improvements are easier to monitor and control. Safety researchers could, in principle, track continuous metrics to forecast when a model will cross practically important thresholds.
However, even under the mirage view, safety concerns do not disappear. Even if the underlying improvement is smooth, the practical onset of dangerous capabilities (as measured by real-world benchmarks) could still be abrupt. A model that gets 0% on a deception benchmark and then 50% at slightly larger scale presents the same real-world risk regardless of whether the improvement was "truly" discontinuous or merely appears so under the chosen metric.
The emergence debate has directly influenced AI governance discussions. Several proposed regulatory frameworks, including those discussed by the European Union AI Act drafters and U.S. executive orders on AI, reference the concept of unpredictable capabilities as a justification for mandatory pre-deployment testing of large models. If capabilities are unpredictable, regulators argue, developers cannot simply assert that their models are safe based on smaller-scale testing [13].
As of early 2026, the research community has not reached a definitive consensus, but the debate has matured considerably.
Most researchers now agree on several points:
Metric choice matters significantly. The appearance of sharp emergence is heavily influenced by whether discrete or continuous evaluation metrics are used. This is no longer controversial.
Pre-training loss is a better predictor than parameter count. The scale at which abilities appear is more reliably predicted by the model's pre-training loss than by its raw parameter count or training FLOPs [8][12].
Some form of nonlinear improvement is real. Even skeptics of the strongest emergence claims acknowledge that the rate of improvement on certain tasks accelerates in a way that is not easily explained by purely linear scaling.
The phenomenon depends on task complexity. Simple tasks (like sentiment classification) do not exhibit emergence. The effect is concentrated in tasks that require multi-step reasoning, compositional generalization, or other forms of complex computation.
Several important questions remain unresolved:
Mechanistic understanding: What happens inside the model at the point of apparent emergence? Mechanistic interpretability research is making progress, but a complete understanding of why specific capabilities appear at specific scales remains elusive.
Prediction: Can we reliably predict, before training a large model, which new capabilities it will exhibit? Current methods offer only rough estimates.
Large reasoning models: The emergence of large reasoning models (LRMs) that use reinforcement learning and inference-time search to enhance reasoning capabilities introduces new dimensions to the debate. These models amplify reasoning and self-reflection through compute applied at inference time rather than at training time, raising questions about whether emergence can occur along the inference-compute axis as well [12].
Emergent risks: As models grow more capable, they may develop harmful behaviors, including deception, manipulation, and reward hacking. The ability to forecast and mitigate these emergent risks is one of the most pressing challenges in AI safety research.
A comprehensive survey published in March 2025 by Berti et al. examined the phenomenon across multiple dimensions [12]. The survey concluded that emergent abilities arise from the complex dynamics of highly sensitive nonlinear systems rather than simply from parameter scaling alone. It proposed that the "sharpness" of performance improvement is linked to the model crossing a critical learning threshold, as indicated by its pre-training loss, rather than being solely an artifact of the metric applied. The survey also extended the analysis beyond traditional LLMs to include large reasoning models, noting that reinforcement learning and inference-time computation introduce additional axes along which emergence might occur.
A separate 2025 statistical analysis published in Neural Networks challenged several common assumptions about scaling and emergence, arguing that careful decomposition of task difficulty and model capability reveals more structure in the scaling curves than previously appreciated [14].
Emergent abilities in large language models remain one of the most consequential and contested topics in AI research. The original observation by Wei et al. (2022), that models exhibit sudden jumps in capability at critical scales, captured something real and important about how LLMs behave. The challenge from Schaeffer et al. (2023) sharpened the field's understanding by demonstrating that metric choice can create misleading impressions of discontinuity. Subsequent work on pre-training loss thresholds, mechanistic interpretability, and statistical methodology has moved the conversation toward a more nuanced view.
The practical stakes are high. Whether capabilities emerge suddenly or improve smoothly determines how we should approach safety testing, governance, and the development of increasingly powerful AI systems. The research community continues to investigate these questions with urgency, knowing that the answers will shape the trajectory of artificial intelligence for years to come.