Scaling laws in artificial intelligence are empirical relationships that describe how the performance of neural networks improves as key factors, such as model size, training data volume, and computational resources, are increased. These relationships typically follow power-law curves, meaning that performance improves predictably and smoothly as resources grow, often across many orders of magnitude. Scaling laws have become one of the most consequential findings in modern machine learning, guiding decisions worth billions of dollars about how to allocate compute budgets, how large to make models, and how much data to collect for training.
The study of scaling laws rose to prominence with a 2020 paper by Jared Kaplan and colleagues at OpenAI, which demonstrated that language model loss follows simple power-law relationships with model parameters, dataset size, and training compute [1]. Two years later, researchers at DeepMind published the Chinchilla paper, which revised the optimal balance between model size and training data, shifting how the industry trained its largest models [2]. Since then, scaling laws have been extended, debated, revised, and applied across modalities from text to vision to multimodal systems. The concept of scaling has also expanded beyond pre-training to include inference-time compute.
Imagine you are building a sandcastle. You need three things: a bucket (your model), sand (your data), and time to work (your compute). Scaling laws are like noticing a pattern: if you get a bigger bucket, you can build a bigger castle. If you get more sand, you can also build a bigger castle. And if you have more time, same thing.
But here is the interesting part. Scientists found that these patterns are super predictable. If you know how good your sandcastle is with a small bucket and a little sand, you can predict almost exactly how good it will be with a bucket that is ten times bigger and ten times more sand. That predictability is what makes scaling laws so useful, because building a really big sandcastle (training a really big AI model) is very expensive. You want to know in advance whether it is worth it.
There is also a question of balance. Should you spend your money on a bigger bucket or on more sand? Early research said "get a bigger bucket." Later research said "actually, you need to grow both equally." And the newest research says "it depends on how many sandcastles you plan to show to people" (because showing sandcastles to people, which is like running the AI, also costs resources).
A power law is a mathematical relationship of the form y = ax^b, where a change in one quantity produces a proportional change in another according to a fixed exponent. In the context of neural networks, the "quantity" being predicted is typically the model's test loss (cross-entropy loss on held-out data), and the independent variables are the number of trainable parameters (N), the number of training tokens or data points (D), and the total compute budget (C), usually measured in floating-point operations (FLOPs).
The key insight is that when you plot loss against any of these variables on a log-log scale, the result is approximately a straight line over a wide range. This means the relationship is not arbitrary or chaotic; it follows a predictable trajectory. If you know how a model performs at one scale, you can extrapolate to predict how it will perform at a much larger scale with reasonable accuracy.
This predictability is what makes scaling laws so valuable in practice. Training a frontier model can cost tens or hundreds of millions of dollars. Being able to run small-scale experiments and then extrapolate to forecast the performance of a much larger model before committing the full budget reduces risk enormously.
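As a minimal sketch of how such extrapolation works in practice, the snippet below fits a straight line to log-loss versus log-compute using a handful of hypothetical pilot runs and then extrapolates to a much larger budget. The run sizes, losses, and units (PF-days) are illustrative placeholders, not values taken from any of the cited studies.

```python
# Minimal sketch of power-law extrapolation: fit log(loss) as a linear function
# of log(compute) from small pilot runs, then predict the loss of a larger run.
# All numbers below are illustrative, not taken from any published study.
import numpy as np

compute = np.array([1e-3, 1e-2, 1e-1, 1.0])   # hypothetical budgets in PF-days
loss = np.array([4.8, 4.1, 3.5, 3.0])         # measured test loss for each run

# Fit L = a * C^(-alpha)  <=>  log L = log a - alpha * log C.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
alpha = -slope
a = np.exp(intercept)

def predict_loss(c: float) -> float:
    """Extrapolated loss for a compute budget c (same units as the fit)."""
    return a * c ** (-alpha)

print(f"fitted exponent alpha ~= {alpha:.3f}")
print(f"predicted loss at 100 PF-days: {predict_loss(100.0):.2f}")
```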
The first systematic empirical study of neural scaling laws was published by Joel Hestness, Sharan Narang, and colleagues at Baidu Research in 2017 [3]. They trained models across several tasks, including machine translation, language modeling, image classification, and speech recognition, and measured how test error changed as a function of training set size. They found power-law exponents (alpha) that varied substantially by task: approximately 0.13 for machine translation, 0.06 to 0.09 for language modeling, and 0.3 to 0.5 for ImageNet classification. They also identified a relationship between model size and dataset size of the form N proportional to D^0.7 for language modeling. This work established that scaling behavior was not unique to a single task or architecture but appeared to be a general property of deep learning systems.
In January 2020, Jared Kaplan, Sam McCandlish, Tom Henighan, Tom Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, and Dario Amodei at OpenAI published "Scaling Laws for Neural Language Models" [1]. This paper systematically studied how the performance of transformer-based language models changes as a function of model size, dataset size, and compute.
The team trained a large number of language models ranging from approximately 768 parameters to 1.5 billion parameters, varying model width, depth, and the amount of training data. They used the WebText2 dataset and measured performance as cross-entropy loss on a held-out test set. The models were all decoder-only transformers, consistent with the GPT architecture. An important methodological choice was that they counted only non-embedding parameters, excluding vocabulary and positional embeddings. This decision later proved to be a significant source of divergence from subsequent scaling law studies.
The paper established several foundational results:
1. Power-law scaling with parameters, data, and compute. The test loss L scales as a power law with each of the three factors when the other two are not bottlenecked:
| Variable | Scaling relationship | Exponent |
|---|---|---|
| Parameters (N) | L(N) = (N_c / N)^alpha_N | alpha_N = 0.076 |
| Data (D) | L(D) = (D_c / D)^alpha_D | alpha_D = 0.095 |
| Compute (C) | L(C) = (C_c / C)^alpha_C | alpha_C = 0.050 |
These trends spanned more than seven orders of magnitude, a remarkable degree of regularity.
2. Architectural details matter less than scale. Within a broad range, factors like network width, depth, and the number of attention heads had minimal effects on performance relative to the total parameter count. A wide, shallow model performed similarly to a narrow, deep model of the same total size. This finding suggested that the sheer number of parameters mattered more than how they were arranged.
3. Larger models are more sample-efficient. A key and perhaps counterintuitive result was that larger models reach the same level of performance with fewer training examples than smaller models. This means bigger models learn more per data point. The degree of overfitting scaled as N^0.74 / D, so an 8x increase in model size required only about 5x more data to maintain the same overfitting level.
4. Compute-optimal allocation favors large models trained briefly. Given a fixed compute budget, the Kaplan paper concluded that the optimal strategy was to train very large models on a relatively modest amount of data, stopping well before convergence. Specifically, they found that when scaling compute optimally, model size should scale faster than training data. The recommended allocation was roughly N proportional to C^0.73 and D proportional to C^0.27 [1]. In other words, most of the increased compute budget should go toward making the model larger, not toward training it on more data.
When both model size and dataset size are varied simultaneously, the loss follows:
L(N, D) = [(N_c / N)^(alpha_N / alpha_D) + D_c / D]^alpha_D
with fitted parameters alpha_N = 0.076, alpha_D = 0.103, N_c = 6.4 x 10^13, and D_c = 1.8 x 10^13. The paper also derived a critical batch size that scales with loss: B_crit = B_* / L^(1/alpha_B), with B_* = 2 x 10^8 tokens and alpha_B = 0.21.
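The sketch below evaluates the Kaplan joint loss using the fitted constants quoted above; the example model and token counts are illustrative, and the final line simply reproduces the "8x model, about 5x data" arithmetic from the sample-efficiency result.

```python
# Sketch of the Kaplan et al. joint loss L(N, D) with the fitted constants
# quoted above. N counts non-embedding parameters, D counts training tokens.
ALPHA_N, ALPHA_D = 0.076, 0.103
N_C, D_C = 6.4e13, 1.8e13

def kaplan_loss(n_params: float, n_tokens: float) -> float:
    """Predicted cross-entropy test loss for N non-embedding params and D tokens."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

# Illustrative point: a 1.5B-parameter model trained on 40 billion tokens.
print(f"predicted loss: {kaplan_loss(1.5e9, 40e9):.2f} nats/token")

# The overfitting result quoted earlier: data should grow roughly as N^0.74,
# so an 8x larger model needs about 8**0.74 ~ 4.7x ("about 5x") more data.
print(f"data multiplier for an 8x larger model: {8 ** 0.74:.1f}x")
```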
The scaling laws research was conducted concurrently with the development of GPT-3, and the findings directly informed the decision to scale GPT-3 to 175 billion parameters. Following the prescription to prioritize model size over data, GPT-3 was trained on approximately 300 billion tokens, producing a tokens-per-parameter ratio of about 1.7. The laws predicted that this scale would yield substantial improvements over GPT-2's 1.5 billion parameters, and GPT-3's performance validated those predictions. This success cemented scaling laws as a central concept in AI strategy.
In March 2022, Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, and colleagues at DeepMind published "Training Compute-Optimal Large Language Models" [2]. This paper, commonly known as the "Chinchilla paper," fundamentally revised the compute-optimal training strategy proposed by Kaplan et al.
The DeepMind team used three complementary approaches to estimate optimal scaling: (1) fixing a family of model sizes and varying the number of training tokens for each; (2) IsoFLOP profiles, in which model size is varied across a set of fixed compute budgets and the loss minimum is located for each budget; and (3) fitting a parametric loss function L(N, D) to all of the experimental results.
The Chinchilla paper's central conclusion was that Kaplan et al.'s recommended allocation was significantly lopsided. Instead of allocating most additional compute to model size, Hoffmann et al. found that model size and training tokens should be scaled in roughly equal proportion. For every doubling of model size, the number of training tokens should also be doubled.
This implied an optimal ratio of approximately 20 training tokens per model parameter. The table below compares the Kaplan and Chinchilla recommendations:
| Aspect | Kaplan et al. (2020) | Chinchilla (Hoffmann et al., 2022) |
|---|---|---|
| Optimal N scaling | N proportional to C^0.73 | N proportional to C^0.50 |
| Optimal D scaling | D proportional to C^0.27 | D proportional to C^0.50 |
| Tokens per parameter | ~1.7 tokens/param | ~20 tokens/param |
| Strategy summary | Train very large models on modest data | Scale model and data equally |
The fitted parametric loss function from the Chinchilla paper was:
L(N, D) = A/N^alpha + B/D^beta + L_0
with constants A = 406.4, B = 410.7, alpha = 0.34, beta = 0.28, and L_0 = 1.69. The total training compute was approximated as C = 6ND (six FLOPs per parameter per token, accounting for the forward and backward passes). Minimizing L(N, D) subject to 6ND = C gives N_opt proportional to C^(beta/(alpha+beta)) and D_opt proportional to C^(alpha/(alpha+beta)); with the fitted exponents this works out to roughly N proportional to C^0.46 and D proportional to C^0.54, i.e., close to equal scaling of model size and data.
To validate their predictions, the DeepMind team trained a 70-billion-parameter model called Chinchilla on 1.4 trillion tokens (exactly a 20:1 token-to-parameter ratio). This model used the same compute budget as Gopher, DeepMind's earlier 280-billion-parameter model, but redistributed it from parameters to data.
The results were striking. Chinchilla, despite being 4x smaller than Gopher, outperformed it on virtually every benchmark:
| Benchmark | Gopher (280B) | Chinchilla (70B) |
|---|---|---|
| MMLU (average accuracy) | 60.0% | 67.5% |
| HellaSwag | 79.2% | 80.8% |
| PIQA | 81.8% | 83.7% |
| Winogrande | 70.1% | 74.9% |
Chinchilla also outperformed GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on most downstream tasks, despite being far smaller than all of them [2]. The implication was clear: many of the largest models in 2021 and early 2022 had been dramatically undertrained relative to their size.
The discrepancy between the two sets of scaling laws has been attributed to several factors. Kaplan et al. counted only non-embedding parameters, while Hoffmann et al. counted total parameters. At small model scales, embedding parameters constitute a meaningful fraction of the total, creating a systematic bias in the apparent scaling relationship. A 2024 reconciliation study by Porian et al. reexamined the discrepancy using non-embedding parameter counts at the small scales covered by Kaplan's experiments and obtained scaling coefficients of 0.74 to 0.78, closely matching Kaplan's 0.73 [4]. The Kaplan team also used a fixed learning rate schedule that was not tuned to each model size, and their experiments covered a narrower range of data scales.
In April 2024, researchers at Epoch AI published a detailed replication attempt of the Chinchilla paper's results [5]. They found that the specific parameter estimates reported in the original paper contained errors: the optimizer used by Hoffmann et al. had stopped before convergence due to a poor choice of loss scale, and the reported values were rounded in a way that introduced substantial bias.
Epoch's revised parametric fit was:
L(N, D) = 1.8172 + 482.01/N^0.3478 + 2085.43/D^0.3658
Despite the numerical discrepancies, the revised model still implied an optimal ratio of approximately 20 tokens per parameter, consistent with how the actual Chinchilla model was trained and with the other estimation approaches used in the original paper [5]. The core conclusion held: models should be trained on far more data than Kaplan et al. had suggested.
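To make this concrete, the sketch below takes the revised constants quoted above, substitutes D = C/(6N) into the loss, sets the derivative with respect to N to zero, and evaluates the resulting closed form at roughly Chinchilla's own budget (6 x 70e9 x 1.4e12 FLOPs). The closed-form rearrangement and the choice of budget are assumptions of this sketch, not quotations from the replication paper.

```python
# Sketch: compute-optimal allocation implied by a loss of the form
# L(N, D) = E + A/N**ALPHA + B/D**BETA under the constraint C = 6*N*D.
# Substituting D = C/(6N) and setting dL/dN = 0 gives the closed form below.
A, ALPHA = 482.01, 0.3478     # Epoch AI revised constants quoted above
B, BETA = 2085.43, 0.3658
E = 1.8172                    # irreducible loss (not needed for the optimum)

def optimal_allocation(compute_flops: float):
    """Return (N_opt, D_opt) minimising the fitted loss at a fixed FLOP budget."""
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    n_opt = g * (compute_flops / 6.0) ** (BETA / (ALPHA + BETA))
    d_opt = compute_flops / (6.0 * n_opt)
    return n_opt, d_opt

# Roughly Chinchilla's budget: 70B parameters * 1.4T tokens * 6 FLOPs each.
n, d = optimal_allocation(6 * 70e9 * 1.4e12)
print(f"N_opt ~ {n:.1e} params, D_opt ~ {d:.1e} tokens, ratio ~ {d / n:.0f} tokens/param")
```

Evaluated at that budget, the formula returns roughly 70 billion parameters and 1.4 trillion tokens, a ratio of about 18 to 20 tokens per parameter, consistent with how the actual Chinchilla model was trained.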
While the initial scaling laws work focused on language models, researchers have since investigated whether similar relationships hold for other domains. The general answer is yes, but with different exponents and some additional complexity.
In October 2020, Tom Henighan, Jared Kaplan, and colleagues at OpenAI published "Scaling Laws for Autoregressive Generative Modeling," extending scaling laws to four additional domains: generative image modeling, video modeling, multimodal image-text models, and mathematical problem solving [6]. They found that in all cases, autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law. The optimal model size depends on the compute budget through a power law with exponents that are nearly universal across all data domains.
The paper also identified domain-specific phenomena: a scaling relation for mutual information between captions and images in multimodal models, scaling laws for extrapolation beyond the training distribution in mathematical problem solving, and smooth scaling of classification loss when fine-tuning generative image models for ImageNet classification.
Scaling laws for Vision Transformers (ViT) were studied by Zhai et al. in "Scaling Vision Transformers" (2022) [7] and further refined by Alabdulmohsin et al. in "Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design" (2023) [8]. The ViT experiments covered parameter ranges from 5 x 10^6 to 2 x 10^9 and dataset sizes from 3 x 10^7 to 3 x 10^9 images.
The results confirmed that vision models also exhibit power-law scaling behavior, though with some differences from language models. The performance-compute frontier for ViTs follows a saturating power law: gains per unit of compute diminish at the largest scales, and they do so more quickly than in language models. A key finding was that staying on the optimal frontier requires scaling model size and data simultaneously. Google scaled ViT to 22 billion parameters (ViT-22B) in 2023, demonstrating that vision transformers continue to benefit from increased scale, though the gains are more modest than for language models at equivalent compute levels [9].
Fitted scaling exponents differ noticeably across modalities. Empirical studies have reported exponents of approximately 0.06 to 0.09 for language, around 0.11 for images, and up to 0.37 for code and molecular data [10]. A larger exponent means faster improvement per order of magnitude of additional resources, so some domains, such as code and structured molecular data, scale far more favorably than others.
Aidan Clark and colleagues at DeepMind studied the scaling behavior of routing networks, architectures that conditionally use only a subset of their parameters when processing a given input (i.e., Mixture of Experts models) [11]. Standard scaling laws are defined only in terms of total parameter count, but routed models have both a total parameter count and a much smaller active parameter count per input. Clark et al. derived scaling laws defined on both parameter count and computational requirement, generalizing the standard power-law relationships. They introduced an "Effective Parameter Count" that accounts for routing, enabling all models (dense and sparse) to be compared on a single scale. This work was published at ICML 2022.
Scaling laws for native multimodal models (those trained from scratch on multiple modalities rather than composed from separate unimodal components) were studied by Shukor et al. in a 2025 paper from Apple [12]. Their study spanned 457 trained models and revealed several insights: early-fusion architectures performed better than late-fusion at lower parameter counts; incorporating MoE layers allowed models to learn modality-specific weights; and visual data exhibits high redundancy but still produces many more tokens than equivalent text due to higher dimensionality.
A separate line of work by Aghajanyan et al. (2023) proposed scaling law hypotheses for mixed-modal models, predicting performance based on modality-specific compression and tokenization efficiency [13]. For speech and text combinations, they forecast a break-even point at around 28 billion parameters trained on approximately 45 billion tokens, which was validated by a 30B parameter run.
The Chinchilla scaling laws optimize for one objective: achieving the lowest possible loss for a given training compute budget. But this is not the only cost that matters. Once a model is trained, it must be deployed for inference, and inference costs scale with model size. A smaller model is cheaper to run per query, even if it required more training compute to reach a given quality level.
This insight led to the concept of "inference-aware" or "beyond Chinchilla" scaling, first formalized in the paper "Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws" by Sardana and Frankle (2024) [14]. The authors showed that when inference demand is factored in, the optimal strategy shifts: models should be trained smaller and longer than the Chinchilla-optimal point. For a model expected to serve a large number of inference requests (on the order of billions), the training-time cost is amortized over so many queries that it becomes economical to invest extra compute in training a smaller, cheaper-to-serve model.
The key formula accounts for total cost as the sum of training and inference FLOPs: Total = 6N * D_train + 2N * D_inference. The authors trained 47 models to validate the approach and found that model quality continues to improve as the tokens-per-parameter ratio scales to extreme ranges, up to 10,000 tokens per parameter in some experiments [14].
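A short sketch of this accounting is given below. The serving volume of 2 trillion lifetime inference tokens is a hypothetical figure chosen for illustration, and the comparison says nothing about whether the two configurations reach the same quality.

```python
# Minimal sketch of inference-aware accounting (illustrative numbers only):
# total FLOPs ~= 6*N*D_train + 2*N*D_inference for a dense transformer.
def total_flops(n_params: float, train_tokens: float, inference_tokens: float) -> float:
    """Training plus lifetime inference FLOPs."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * inference_tokens

# Hypothetical lifetime serving volume of 2 trillion inference tokens.
serving = 2e12
chinchilla_style = total_flops(70e9, 1.4e12, serving)   # 70B params, ~20 tokens/param
over_trained     = total_flops(8e9, 15e12, serving)     # 8B params, ~1,875 tokens/param

print(f"70B @ 1.4T training tokens: {chinchilla_style:.2e} total FLOPs")
print(f" 8B @  15T training tokens: {over_trained:.2e} total FLOPs")
```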
Meta's LLaMA model family represents the most prominent practical application of beyond-Chinchilla scaling. The original LLaMA (February 2023) trained a 65-billion-parameter model on 1.4 trillion tokens, roughly consistent with the Chinchilla ratio of about 20 tokens per parameter [15]. But subsequent versions departed dramatically from Chinchilla optimality.
Llama 2 (July 2023) trained a 70-billion-parameter model on 2 trillion tokens, a ratio of roughly 29 tokens per parameter. Llama 3 (April 2024) went much further: the 8-billion-parameter model was trained on 15 trillion tokens, a ratio of approximately 1,875 tokens per parameter [16]. By Chinchilla standards, the compute-optimal training set for an 8B model would be around 160 to 200 billion tokens. Meta used 75 times that amount.
| Model | Parameters | Training tokens | Tokens/parameter ratio | Chinchilla-optimal ratio |
|---|---|---|---|---|
| LLaMA 65B | 65B | 1.4T | ~21.5 | ~20 |
| Llama 2 70B | 70B | 2.0T | ~28.6 | ~20 |
| Llama 3 8B | 8B | 15T | ~1,875 | ~20 |
| Llama 3 70B | 70B | 15T | ~214 | ~20 |
| Microsoft Phi-3-mini | 3.8B | 3.3T | ~870 | ~20 |
Meta's rationale was explicitly focused on inference economics. A smaller model that has been over-trained costs less to deploy per query. The extra training compute is a one-time investment that pays off across billions of inference calls. Meta reported that performance on their 8B model continued to improve log-linearly even at 15 trillion tokens, far past the Chinchilla-optimal point [16].
This approach has since become industry standard. Most models released in 2024 and 2025 train well beyond Chinchilla-optimal ratios.
One caveat to extreme over-training has emerged in research: models trained far past Chinchilla optimality can become more sensitive to quantization. Degradation in loss due to weight quantization increases as an approximate power law in the token-to-parameter ratio. This means that aggressively over-trained models may lose more quality when compressed for deployment on resource-constrained hardware, partially offsetting the inference cost savings [14].
Muennighoff, Rush, Barak, and colleagues (2023) addressed a practical question that the original scaling laws left open: what happens when you run out of unique training data? [17] The Chinchilla framework assumes one pass through the data (a single epoch), but many real-world scenarios, particularly for low-resource languages or specialized domains, involve datasets too small to satisfy Chinchilla ratios.
The authors ran extensive experiments with models up to 9 billion parameters and up to 900 billion training tokens, systematically varying the amount of data repetition. Their key finding was that training for up to roughly four epochs on repeated data yields nearly the same improvement in loss as training on an equivalent amount of unique data; beyond that point, the value of additional repetition decays rapidly and eventually approaches zero.
They proposed a modified scaling law that accounts for the decreasing marginal value of repeated tokens, enabling compute-optimal allocation even when unique data is limited. This work is especially relevant for domains where collecting new data is expensive or where privacy constraints limit data availability.
Not all scaling behavior follows a single clean power law. Caballero, Gupta, Rish, and Krueger (2023) introduced the concept of "broken neural scaling laws" (BNSL), showing that many scaling curves are better described by smoothly broken power laws: curves that follow one power-law exponent for a range and then transition to a different exponent [18].
The BNSL functional form is:
L(D) = E + (b * D^(-c_0)) * product_i(1 + (D/d_i)^(1/f_i))^(-c_i * f_i)
where the d_i values represent the "break points" at which the scaling regime changes. This formulation captures transitions that simple power laws miss, such as when a model shifts from learning coarse statistical patterns to learning finer-grained structure.
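A direct transcription of this functional form, with arbitrary placeholder constants and a single break point, might look like the following.

```python
# Sketch of the broken neural scaling law (BNSL) functional form quoted above.
# The constants here are arbitrary placeholders, not fitted values.
import numpy as np

def bnsl(x, E, b, c0, breaks):
    """Smoothly broken power law; `breaks` is a list of (d_i, c_i, f_i) tuples."""
    reducible = b * x ** (-c0)
    for d_i, c_i, f_i in breaks:
        reducible *= (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
    return E + reducible

x = np.logspace(6, 12, 7)                          # e.g. dataset sizes
y = bnsl(x, E=1.7, b=10.0, c0=0.08, breaks=[(1e9, 0.1, 0.3)])
print(np.round(y, 3))                              # slope steepens past the break at 1e9
```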
BNSL behavior has been observed across vision, language, audio, diffusion models, reinforcement learning, and multiple architectures including Transformers, CNNs, RNNs, and MoE models. The implication is that simple power-law extrapolation can be misleading when the model crosses a break point at larger scale, making long-range predictions less reliable than the smooth scaling curves might suggest.
The question of whether new capabilities "emerge" abruptly at certain scales, rather than improving gradually, has been one of the most debated topics in scaling law research.
Jason Wei, Yi Tay, and colleagues at Google Research defined an "emergent ability" as one that is not present in smaller models but appears in larger models, such that the ability cannot be predicted by extrapolating from smaller-scale performance [19]. They documented a range of tasks where models showed near-random performance below a certain scale and then rapidly improved above it. Examples included multi-step arithmetic, word unscrambling, and certain types of logical reasoning.
The emergence pattern stood in apparent tension with the smooth power-law scaling observed in loss metrics. Loss decreases smoothly, yet certain downstream task accuracies seemed to jump discontinuously.
Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo challenged the emergence narrative in a 2024 paper arguing that apparent emergent abilities are largely artifacts of the metrics used to evaluate them [20]. Their key observation was that metrics like exact-match accuracy and multiple-choice grade are discontinuous: a model either gets a question exactly right or it does not, with no partial credit. When a model's per-token probability of generating the correct answer increases smoothly, the probability of getting the entire answer correct can exhibit a sharp threshold.
Schaeffer et al. showed that when discontinuous metrics are replaced with continuous alternatives (such as Brier score or token edit distance), the abrupt transitions largely disappear and performance scales smoothly. They also demonstrated that increasing the number of evaluation examples reveals above-chance improvements in smaller models that are hidden by high-variance evaluations with few examples.
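The mechanism can be illustrated with a toy calculation (not taken from the paper): if per-token accuracy improves smoothly with scale and the task is scored by exact match over a ten-token answer, the exact-match curve looks like a sudden jump even though nothing discontinuous is happening underneath.

```python
# Toy illustration of metric-induced "emergence": per-token accuracy p improves
# smoothly across hypothetical model scales, but exact-match on a k-token answer
# (p ** k) stays near zero for a while and then climbs sharply.
import numpy as np

scales = np.logspace(7, 11, 9)    # hypothetical parameter counts
p = np.array([0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.98, 0.99])
k = 10                            # answer length in tokens

for n, p_tok in zip(scales, p):
    print(f"N={n:.0e}  per-token={p_tok:.2f}  exact-match={p_tok ** k:.4f}")
```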
The debate remains unresolved. Some researchers have countered that certain tasks exhibit genuine emergence even under continuous metrics, pointing to evidence that pre-training loss itself can reach tipping points beyond which new capabilities unlock [21]. Others have found U-shaped and inverted-U scaling patterns that do not fit neatly into either the smooth-scaling or discontinuous-emergence frameworks. The practical upshot is that scaling laws are reliable predictors of aggregate loss, but predicting exactly when a specific capability will appear at a specific scale remains difficult.
Perhaps the most significant evolution of scaling laws since Chinchilla has been the emergence of inference-time scaling, sometimes called test-time compute scaling. Rather than making models larger or training them on more data, inference-time scaling improves performance by spending more compute at the point where the model generates its response.
Traditional scaling laws describe how to exchange training-time compute for better predictions. Inference-time scaling laws describe how to exchange inference-time compute for better decisions. The mechanism is straightforward: let the model "think longer" by generating extended chain-of-thought reasoning, exploring multiple solution paths, verifying its own outputs, and backtracking when it detects errors.
This approach was first demonstrated at scale with OpenAI's o1 model, released in September 2024. o1 maintains an internal reasoning process (a hidden chain of thought) that unfolds during inference, spending variable amounts of compute depending on the difficulty of the problem. For easy questions, the model responds quickly. For hard mathematical or coding problems, it may spend significantly more time reasoning before answering.
Earlier work on inference-time scaling existed in games: AlphaGo Zero showed that either doubling model size and training or doubling test-time search both produce approximately a 120 Elo improvement, establishing a quantitative trade-off between training and inference compute [22].
The release of o1 initiated a wave of reasoning-focused models:
| Model | Developer | Release date | Key feature |
|---|---|---|---|
| o1-preview | OpenAI | September 2024 | First commercial reasoning model with hidden chain-of-thought |
| o1 | OpenAI | December 2024 | Full release with improved reasoning |
| o3-mini | OpenAI | January 2025 | Cost-efficient reasoning with adjustable effort levels (low/medium/high) |
| o3 | OpenAI | April 2025 | State-of-the-art reasoning, 96.7% on AIME 2024 |
| DeepSeek-R1 | DeepSeek | January 2025 | Open-weight reasoning model trained via pure reinforcement learning |
| Claude 3.7 Sonnet (extended thinking) | Anthropic | February 2025 | Visible extended thinking with adjustable budget |
DeepSeek-R1 was particularly notable because it demonstrated that reasoning capabilities matching o1 could be achieved through pure reinforcement learning applied to a base model, without requiring supervised fine-tuning on human-curated reasoning traces. The DeepSeek-R1-Zero variant showed self-reflection and verification behaviors arising spontaneously from RL training. DeepSeek also demonstrated that reasoning patterns from large models could be distilled into much smaller models (as small as 1.5 billion parameters), achieving better performance than training the small models with RL directly.
| Approach | Description | Example |
|---|---|---|
| Chain-of-thought | Model generates intermediate reasoning steps before the final answer | OpenAI o1, o3 |
| Best-of-N sampling | Generate multiple candidate answers and select the best one using a verifier | AlphaCode |
| Tree search | Explore a tree of reasoning paths and select the most promising branches | AlphaProof |
| Budget forcing | Control the length of reasoning chains to trade off between compute cost and accuracy | s1 model (2025) |
| Self-refinement | Model iteratively critiques and improves its own output | Various research papers |
Research from Snell et al. (2024) showed that scaling inference compute with the right strategies can be more effective than scaling model parameters. A smaller model with optimal test-time compute allocation can match or exceed the performance of a model that is 14 times larger [23]. However, a large-scale study spanning over 30 billion generated tokens found that no single test-time scaling strategy universally dominates across tasks and model sizes.
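As an illustration of the control flow behind one of these strategies, here is a hedged sketch of best-of-N sampling; `generate` and `verifier_score` are hypothetical stand-ins for a model's sampling call and a learned or rule-based verifier, not references to any real API.

```python
# Hedged sketch of best-of-N sampling: draw n candidate answers and keep the
# one a verifier scores highest. The stubs below only demonstrate control flow.
import random
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              verifier_score: Callable[[str, str], float],
              n: int = 16) -> str:
    """Sample n candidates and return the one the verifier rates highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: verifier_score(prompt, answer))

# Toy usage with stub functions standing in for a model and a verifier.
stub_generate = lambda prompt: random.choice(["4", "5", "22"])
stub_verifier = lambda prompt, ans: 1.0 if ans == "4" else 0.0
print(best_of_n("What is 2 + 2?", stub_generate, stub_verifier, n=8))
```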
Inference-time scaling has significant cost implications. Reasoning models such as o1 and DeepSeek-R1 generate orders of magnitude more tokens than standard models. OpenAI's 2024 inference spending reportedly reached $2.3 billion, roughly 15 times the training cost for GPT-4.5. This has prompted research into more efficient inference strategies, including distillation of reasoning capabilities into smaller models. Analysts project that inference will claim 75% of total AI compute by 2030.
One of the most pressing challenges to continued scaling is the finite supply of high-quality training data. Scaling laws assume that more data yields better models, but the pool of suitable text is not unlimited.
According to projections from Epoch AI, the supply of high-quality human-generated text on the internet may be exhausted as early as 2026 to 2028 [24]. This does not mean there will be no text available, but that the marginal quality of newly added data will decline, and models may begin encountering the same data repeatedly. Common Crawl and similar large-scale web scrapes have already been extensively mined.
The AI industry has increasingly turned to synthetic data as a way to extend the data supply. Synthetic data is generated by AI models themselves and can be used to train or fine-tune other models. Several major model releases in 2025 incorporated synthetic data in their training pipelines.
However, synthetic data introduces its own challenges. Models trained on synthetic data risk "model collapse," where errors and biases in the generated data compound across training generations. The most capable models are still expected to be anchored in human-generated data, with synthetic data used to expand and augment around that core.
The data wall is most acute for text. Other modalities have larger untapped reservoirs. Video data, in particular, represents an enormous and largely underutilized source of training signal. Multimodal training that combines text, images, audio, and video may help extend the scaling runway by drawing on these richer data sources.
Whether scaling laws will continue to hold, and whether continued scaling will produce commensurate improvements in model capability, is one of the most debated questions in AI.
Ilya Sutskever, co-founder of OpenAI and later Safe Superintelligence Inc., has been the most prominent voice arguing that the era of simple scaling is over. In a November 2025 podcast, Sutskever characterized the period from 2020 to 2025 as the "age of scaling" and argued that this era is ending for three reasons: data scarcity (pre-training data is finite and the major datasets have been exhausted); diminishing returns (advanced reasoning systems no longer deliver proportionate improvements when adding more computational steps); and generalization gaps (despite strong benchmark performance, models "generalize dramatically worse than people" in real-world settings) [25].
The paper "Scaling Laws Do Not Scale" (Liao et al., 2024) formalized additional criticisms, arguing that the scaling law relationship depends on metrics that may not correspond with how different groups of people perceive the quality of model output, and that communities represented in datasets may have values or preferences not reflected in standard evaluation metrics [26].
On the other side, Sam Altman and others at OpenAI have maintained that scaling laws are far from reaching their ceiling. The industry has committed an estimated $7.8 trillion to AI infrastructure through 2030, suggesting strong institutional confidence that scaling will continue to pay off [25].
Several arguments support continued scaling: inference-time scaling opens a new dimension of improvement distinct from pre-training scaling; synthetic data, retrieval-augmented generation, and multimodal training can extend the effective data supply; hardware efficiency improvements (approximately 40% per year for leading AI GPUs) mean that the same dollar buys more compute each year; and new architectures and training techniques may unlock better scaling exponents.
The general form of a neural scaling law can be expressed as a parametric loss function. The table below summarizes the major formulations:
| Formulation | Formula | Source |
|---|---|---|
| Basic power law | L = B * D^(-b) + E | General form |
| Joint error function | L = A * N^(-a) + B * D^(-b) + E | Rosenfeld et al. (2020) |
| Chinchilla | L = 406.4/N^0.34 + 410.7/D^0.28 + 1.69 | Hoffmann et al. (2022) |
| M4 estimator | (L - E) / (I - L)^alpha = A * N^(-a) + B * D^(-b) | Alabdulmohsin et al. (2022) |
| Broken power law (BNSL) | L = E + (b * D^(-c_0)) * prod(1 + (D/d_i)^(1/f_i))^(-c_i * f_i) | Caballero et al. (2023) |
| Epoch AI revised Chinchilla | L = 1.8172 + 482.01/N^0.3478 + 2085.43/D^0.3658 | Besiroglu et al. (2024) |
In all formulations, N denotes the number of model parameters, D the amount of training data (typically in tokens), E the irreducible loss, and the remaining constants (A, B, and the exponents) are fitted empirically to experimental results.
The irreducible loss E represents a fundamental limit: even a perfect model cannot predict truly random aspects of the data. The terms involving N and D represent the reducible loss that can be decreased by increasing model size or data.
For compute-optimal training, the total compute C is proportional to approximately 6ND. The optimization problem is to minimize L(N, D) subject to the constraint 6ND = C, which yields the optimal allocation of compute between N and D.
Why do neural scaling laws follow power laws at all? Several theoretical explanations have been proposed.
One explanation draws on the manifold hypothesis: real-world data lies on low-dimensional manifolds embedded in high-dimensional space. As a model's capacity grows, it can approximate these manifolds with increasing fidelity, and the rate of improvement follows a power law related to the intrinsic dimensionality of the data manifold. Research from Sharma and Kaplan (2022) showed that scaling exponents can be predicted from the data manifold dimension [27].
A theoretical framework published in 2024 in the Proceedings of the National Academy of Sciences (PNAS) by Bahri and colleagues explains neural scaling laws through the lens of statistical mechanics [28]. They showed that power-law exponents can be derived from properties of the data distribution, specifically its intrinsic dimension and spectral characteristics. The theory predicts different exponents for different types of data, consistent with the empirical observation that vision and language models have different scaling behaviors. Four scaling regimes were identified: variance-limited and resolution-limited scaling, for both dataset size and model size.
Another line of reasoning appeals to information theory. The model's loss represents the gap between its predictions and the true data distribution. As more parameters or data are added, the model can capture increasingly fine-grained statistical patterns. The rate at which it captures these patterns follows a power law because the "easy" patterns (high-frequency, high-information) are learned first, and the model progressively learns rarer, more subtle patterns.
The exponential growth in training compute has significant environmental and economic consequences.
The power required to train the largest frontier models is growing by more than 2x per year. Epoch AI projects that the largest individual training runs by 2030 will draw 4 to 16 gigawatts (GW) of power, enough to supply several million US homes [29].
| Year | Estimated training power (frontier models) | Context |
|---|---|---|
| 2020 | ~1 MW | Small data center wing |
| 2023 | ~10-50 MW | Large data center |
| 2025 | ~100-500 MW | Multiple large data centers |
| 2030 (projected) | 4-16 GW | Small city |
The energy efficiency of leading AI accelerators (GPUs and TPUs) has improved by approximately 40% per year, partially offsetting the growth in compute demand [29]. However, the growth in demand has consistently outpaced efficiency gains, meaning total energy consumption continues to rise.
The cost of training frontier models has escalated rapidly. Training GPT-4 was estimated to cost over $100 million. By 2025, training runs costing $500 million to $1 billion are plausible, and projections for the late 2020s suggest individual training runs costing several billion dollars. These costs create significant concentration effects, as only a handful of organizations can afford to train frontier models.
As of early 2026, the field of scaling laws is at an inflection point. Several trends define the current moment.
Pre-training scaling continues with caveats. Training compute for frontier models continues to grow at roughly 4 to 5x per year [29]. However, the composition of that compute is changing. Pure pre-training on next-token prediction is being supplemented with reinforcement learning stages, synthetic data generation, and multi-stage training pipelines. The simple story of "more parameters and more data yields better performance" is giving way to a more nuanced picture where training methodology matters as much as scale.
Inference-time scaling is the new frontier. The biggest shift in the scaling paradigm is the move toward inference-time compute. Reasoning models like o3, DeepSeek-R1, and Claude with extended thinking have demonstrated that spending more compute at inference time can produce capabilities that pre-training alone cannot achieve.
Over-training is now standard practice. Virtually no major model released in 2025 or 2026 adheres to the original Chinchilla-optimal ratio of 20 tokens per parameter. The industry has broadly adopted the practice of training models on far more data than Chinchilla would prescribe, driven by the inference cost advantages of smaller, over-trained models. Token-to-parameter ratios of 100:1 to 2000:1 are now common.
Scaling laws are becoming more complex. The original scaling laws described clean, simple power-law relationships. The reality in 2026 is more complex. Different training stages (pre-training, supervised fine-tuning, RLHF, reinforcement learning for reasoning) may have different scaling behaviors. Multimodal training introduces additional variables. Inference-time scaling adds yet another dimension. The field is moving toward a more comprehensive theory that accounts for all these factors, but such a unified framework does not yet exist.
| Year | Paper | Key contribution |
|---|---|---|
| 2017 | Hestness et al. | First systematic empirical study of neural scaling across tasks |
| 2020 (Jan) | Kaplan et al. | Established power-law relationships for language models (N, D, C) |
| 2020 (Oct) | Henighan et al. | Extended scaling laws to images, video, math, and multimodal models |
| 2021 | Hernandez et al. | Scaling laws for transfer learning and fine-tuning |
| 2022 (Feb) | Clark et al. | Unified scaling laws for routed (MoE) language models |
| 2022 (Mar) | Hoffmann et al. (Chinchilla) | Compute-optimal training; equal scaling of N and D |
| 2022 (Jun) | Wei et al. | Emergent abilities of large language models |
| 2022 (Oct) | Caballero et al. | Broken neural scaling laws (smoothly broken power laws) |
| 2023 (May) | Muennighoff et al. | Scaling laws for data-constrained (multi-epoch) training |
| 2024 (Jan) | Sardana & Frankle | Inference-aware scaling (beyond Chinchilla-optimal) |
| 2024 (Apr) | Besiroglu et al. (Epoch AI) | Chinchilla replication and revised parameter estimates |
| 2024 (Jun) | Porian et al. | Reconciling Kaplan and Chinchilla discrepancies |
| 2024 | Bahri et al. | Theoretical explanation of neural scaling laws (PNAS) |
| 2024 | Snell et al. | Inference-time compute scaling; smaller models can match larger ones |