Scaling laws in artificial intelligence are empirical relationships that describe how the performance of neural networks improves as key factors, such as model size, training data volume, and computational resources, are increased. These relationships typically follow power-law curves, meaning that performance improves predictably and smoothly as resources grow, often across many orders of magnitude. Scaling laws have become one of the most consequential findings in modern machine learning, guiding decisions worth billions of dollars about how to allocate compute budgets, how large to make models, and how much data to collect for training.
The study of scaling laws rose to prominence with a 2020 paper by Jared Kaplan and colleagues at OpenAI, which demonstrated that language model loss follows simple power-law relationships with model parameters, dataset size, and training compute [1]. Two years later, researchers at DeepMind published the Chinchilla paper, which revised the optimal balance between model size and training data, profoundly shifting how the industry trained its largest models [2]. Since then, scaling laws have been extended, debated, revised, and applied across modalities from text to vision to multimodal systems, and the concept of scaling has expanded beyond pre-training to include inference-time compute.
A power law is a mathematical relationship of the form y = ax^b, where a change in one quantity produces a proportional change in another according to a fixed exponent. In the context of neural networks, the "quantity" being predicted is typically the model's test loss (cross-entropy loss on held-out data), and the independent variables are the number of trainable parameters (N), the number of training tokens or data points (D), and the total compute budget (C), usually measured in floating-point operations (FLOPs).
The key insight is that when you plot loss against any of these variables on a log-log scale, the result is approximately a straight line over a wide range. This means the relationship is not arbitrary or chaotic; it follows a predictable trajectory. If you know how a model performs at one scale, you can extrapolate to predict how it will perform at a much larger scale with reasonable accuracy.
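This log-log linearity is easy to see in code. The sketch below fits a straight line in log space to recover a power law's exponent and then extrapolates; the constants are made up for illustration and are not from any published scaling study.

```python
import numpy as np

# Illustrative only: losses that follow an exact power law
# L(C) = 2.57 * C**-0.05 (made-up constants, not from any paper).
compute = np.array([1e15, 1e16, 1e17, 1e18, 1e19])
loss = 2.57 * compute ** -0.05

# On a log-log scale a power law is a straight line, so a degree-1
# polynomial fit in log space recovers the exponent and prefactor.
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), 1)
exponent = -slope            # ~0.05
prefactor = 10 ** intercept  # ~2.57

# Extrapolate four orders of magnitude beyond the largest "experiment".
predicted = prefactor * (1e23) ** slope
```

With real (noisy) measurements the fit is only approximate, but the same two-parameter extrapolation is what makes small-scale pilot runs predictive of frontier-scale performance.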
This predictability is what makes scaling laws so valuable in practice. Training a frontier model can cost tens or hundreds of millions of dollars. Being able to run small-scale experiments and then extrapolate to forecast the performance of a much larger model before committing the full budget reduces risk enormously.
In January 2020, Jared Kaplan, Sam McCandlish, Tom Henighan, Tom Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever at OpenAI published "Scaling Laws for Neural Language Models" [1]. This paper systematically studied how the performance of transformer-based language models changes as a function of model size, dataset size, and compute.
The team trained a large number of language models ranging from 768 to 1.5 billion non-embedding parameters, varying model width, depth, and the amount of training data. They used the WebText2 dataset and measured performance as cross-entropy loss on a held-out test set. The models were all decoder-only transformers, consistent with the GPT architecture.
The paper established several foundational results:
1. Power-law scaling with parameters, data, and compute. The test loss L scales as a power law with each of the three factors when the other two are not bottlenecked:
| Variable | Scaling Relationship | Exponent |
|---|---|---|
| Parameters (N) | L(N) = (N_c / N)^alpha_N | alpha_N ≈ 0.076 |
| Data (D) | L(D) = (D_c / D)^alpha_D | alpha_D ≈ 0.095 |
| Compute (C) | L(C) = (C_c / C)^alpha_C | alpha_C ≈ 0.050 |
These trends spanned more than seven orders of magnitude, a remarkable degree of regularity.
2. Architectural details matter less than scale. Within a broad range, factors like network width, depth, and the number of attention heads had minimal effect on performance relative to the total parameter count. A wide, shallow model performed similarly to a narrow, deep model of the same total size. This finding suggested that the raw number of parameters mattered more than how they were arranged.
3. Larger models are more sample-efficient. A key and perhaps counterintuitive result was that larger models reach the same level of performance with fewer training examples than smaller models. This means bigger models learn more per data point.
4. Compute-optimal allocation favors large models trained briefly. Given a fixed compute budget, the Kaplan paper concluded that the optimal strategy was to train very large models on a relatively modest amount of data, stopping well before convergence. Specifically, they found that when scaling compute optimally, model size should scale faster than training data. The recommended allocation was roughly N proportional to C^0.73 and D proportional to C^0.27 [1]. In other words, most of the increased compute budget should go toward making the model larger, not toward training it on more data.
The scaling laws research was conducted concurrently with the development of GPT-3, and the findings directly informed the decision to scale GPT-3 to 175 billion parameters. The laws predicted that this scale would yield substantial improvements over GPT-2's 1.5 billion parameters, and GPT-3's performance validated those predictions. This success cemented scaling laws as a central concept in AI strategy.
In March 2022, Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre at DeepMind published "Training Compute-Optimal Large Language Models" [2]. This paper, commonly known as the "Chinchilla paper," fundamentally revised the compute-optimal training strategy proposed by Kaplan et al.
The DeepMind team used three complementary approaches to estimate optimal scaling: training models of fixed size on varying numbers of tokens, tracing IsoFLOP profiles in which model size varies under a fixed compute budget, and fitting a parametric loss function to the results of all their experiments.
The Chinchilla paper's central conclusion was that Kaplan et al.'s recommended allocation was significantly lopsided. Instead of allocating most additional compute to model size, Hoffmann et al. found that model size and training tokens should be scaled in roughly equal proportion. For every doubling of model size, the number of training tokens should also be doubled.
This implied an optimal ratio of approximately 20 training tokens per model parameter. The table below compares the Kaplan and Chinchilla recommendations:
| Aspect | Kaplan et al. (2020) | Chinchilla (Hoffmann et al., 2022) |
|---|---|---|
| Optimal N scaling | N ∝ C^0.73 | N ∝ C^0.50 |
| Optimal D scaling | D ∝ C^0.27 | D ∝ C^0.50 |
| Tokens per parameter | ~1.7 tokens/param | ~20 tokens/param |
| Strategy summary | Train very large models on modest data | Scale model and data equally |
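The difference between the two allocations can be made concrete with a short sketch, using the exponents from the table above (the helper function and dictionary names are mine):

```python
# How each recommendation splits a 10x compute increase between
# model size (N) and training data (D), per the exponents above.
KAPLAN = {"N": 0.73, "D": 0.27}
CHINCHILLA = {"N": 0.50, "D": 0.50}

def growth_factors(compute_multiplier, exponents):
    """Growth in N and D for a given compute growth, assuming
    N is proportional to C^a and D is proportional to C^b."""
    return {k: compute_multiplier ** a for k, a in exponents.items()}

kaplan = growth_factors(10, KAPLAN)          # N: ~5.4x, D: ~1.9x
chinchilla = growth_factors(10, CHINCHILLA)  # N: ~3.16x, D: ~3.16x
```

Because the exponents in each row sum to 1, both splits use up exactly the 10x compute increase; they simply divide it differently between parameters and tokens.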
The discrepancy between the two sets of scaling laws has been attributed to several factors. The Kaplan team used a learning rate schedule that was not fully optimized for each model size, and their experiments covered a narrower range of data scales. Kaplan et al. also did not account for the fact that their smaller models may have been undertrained, which biased the estimated optimal allocation toward larger models [3].
To validate their predictions, the DeepMind team trained a 70-billion-parameter model called Chinchilla on 1.4 trillion tokens (exactly a 20:1 token-to-parameter ratio). This model used the same compute budget as Gopher, DeepMind's earlier 280-billion-parameter model, but redistributed it from parameters to data.
The results were striking. Chinchilla, despite being 4x smaller than Gopher, outperformed it on virtually every benchmark:
| Benchmark | Gopher (280B) | Chinchilla (70B) |
|---|---|---|
| MMLU (average accuracy) | 60.0% | 67.5% |
| HellaSwag | 79.2% | 80.8% |
| PIQA | 81.8% | 83.7% |
| Winogrande | 70.1% | 74.9% |
Chinchilla also outperformed GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on most downstream tasks, despite being far smaller than all of them [2]. The implication was clear: many of the largest models in 2021 and early 2022 had been dramatically undertrained relative to their size.
In April 2024, researchers at Epoch AI published a detailed replication attempt of the Chinchilla paper's results [3]. They found that the specific parameter estimates reported in the original paper contained errors: the optimizer Hoffmann et al. used to fit their parametric loss function had stopped before convergence due to a poor choice of loss scale, and the reported values were rounded in a way that introduced substantial bias.
Epoch's revised parametric fit was:
L(N, D) = 1.8172 + 482.01/N^0.3478 + 2085.43/D^0.3658
Despite the numerical discrepancies, the revised model still implied an optimal ratio of approximately 20 tokens per parameter, consistent with how the actual Chinchilla model was trained and with the other estimation approaches used in the original paper [3]. The core conclusion held: models should be trained on far more data than Kaplan et al. had suggested.
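The tokens-per-parameter optimum can be recovered from the revised fit numerically: fix a compute budget using the standard approximation C ≈ 6ND, then scan model sizes for the lowest loss. The constants come from the equation above; the budget, grid bounds, and variable names are my own choices for illustration.

```python
import numpy as np

# Epoch AI's revised parametric fit (constants from the equation above).
E, A, ALPHA, B, BETA = 1.8172, 482.01, 0.3478, 2085.43, 0.3658

def loss(n_params, n_tokens):
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Fix a Chinchilla-scale budget, C ~ 6*N*D FLOPs, and scan model sizes.
C = 6 * 70e9 * 1.4e12               # budget used to train Chinchilla itself
n_grid = np.logspace(9.5, 12, 4000)  # candidate parameter counts
d_grid = C / (6 * n_grid)            # tokens implied by the fixed budget
best = np.argmin(loss(n_grid, d_grid))

ratio = d_grid[best] / n_grid[best]  # lands near 20 tokens per parameter
```

The minimum falls near 70 billion parameters with a token-to-parameter ratio in the high teens, consistent with the ~20:1 rule of thumb despite the corrected constants.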
The Chinchilla scaling laws optimize for one objective: achieving the lowest possible loss for a given training compute budget. But this is not the only cost that matters. Once a model is trained, it must be deployed for inference, and inference costs scale with model size. A smaller model is cheaper to run per query, even if it required more training compute to reach a given quality level.
This insight led to the concept of "inference-aware" or "beyond Chinchilla" scaling, first formalized in the paper "Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws" (2024) [4]. The authors showed that when inference demand is factored in, the optimal strategy shifts: models should be trained smaller and longer than the Chinchilla-optimal point. For a model expected to serve a large number of inference requests (on the order of billions), the training-time cost is amortized over so many queries that it becomes economical to invest extra compute in training a smaller, cheaper-to-serve model.
Research in this direction found that model quality continues to improve as the tokens-per-parameter ratio scales to extreme ranges, up to 10,000 tokens per parameter in some experiments [4].
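The inference-aware trade-off can be sketched quantitatively: hold model quality fixed, charge roughly 2N FLOPs per generated inference token, and ask which model size minimizes lifetime compute. The loss constants below reuse the Epoch-revised fit purely as a stand-in quality model; the target loss, grid bounds, and demand figure are illustrative assumptions, not values from [4].

```python
import numpy as np

# Stand-in quality model (Epoch-revised Chinchilla fit constants).
E, A, ALPHA, B, BETA = 1.8172, 482.01, 0.3478, 2085.43, 0.3658

def tokens_for_target(n_params, target_loss):
    """Training tokens a model of size n_params needs to reach target_loss,
    by inverting L(N, D) = E + A/N^a + B/D^b for D."""
    reducible = target_loss - E - A / n_params**ALPHA
    return (B / reducible) ** (1 / BETA)  # valid only when reducible > 0

def lifetime_flops(n_params, target_loss, inference_tokens):
    """Training cost (~6ND) plus serving cost (~2N per generated token)."""
    d = tokens_for_target(n_params, target_loss)
    return 6 * n_params * d + 2 * n_params * inference_tokens

target = 1.974                        # roughly Chinchilla-70B's loss here
n_grid = np.logspace(10.2, 12, 2000)  # only sizes able to reach the target

no_inference = n_grid[np.argmin(lifetime_flops(n_grid, target, 0))]
heavy_inference = n_grid[np.argmin(lifetime_flops(n_grid, target, 5e12))]
# heavy_inference < no_inference: serving demand favors a smaller,
# longer-trained model at the same quality level.
```

With zero inference demand the cheapest model sits at the Chinchilla-optimal size; once trillions of served tokens are charged against it, the optimum shifts to a smaller, over-trained model.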
Meta's LLaMA model family represents the most prominent practical application of beyond-Chinchilla scaling. The original LLaMA (February 2023) trained a 65-billion-parameter model on 1.4 trillion tokens, roughly consistent with the Chinchilla ratio of about 20 tokens per parameter [5]. But subsequent versions departed dramatically from Chinchilla optimality.
Llama 2 (July 2023) trained a 70-billion-parameter model on 2 trillion tokens, a ratio of roughly 29 tokens per parameter. Llama 3 (April 2024) went much further: the 8-billion-parameter model was trained on 15 trillion tokens, a ratio of approximately 1,875 tokens per parameter [6]. By Chinchilla standards, the compute-optimal training set for an 8B model would be around 160-200 billion tokens; Meta used roughly 75 to 95 times that amount.
| Model | Parameters | Training Tokens | Tokens/Parameter Ratio | Chinchilla-Optimal Ratio |
|---|---|---|---|---|
| LLaMA 65B | 65B | 1.4T | ~21.5 | ~20 |
| Llama 2 70B | 70B | 2.0T | ~28.6 | ~20 |
| Llama 3 8B | 8B | 15T | ~1,875 | ~20 |
| Llama 3 70B | 70B | 15T | ~214 | ~20 |
Meta's rationale was explicitly focused on inference economics. A smaller model that has been over-trained costs less to deploy per query. The extra training compute is a one-time investment that pays off across billions of inference calls. Meta reported that performance of their 8B model continued to improve log-linearly even at 15 trillion tokens, far beyond the compute-optimal stopping point that the Chinchilla analysis would prescribe [6].
This approach has since become industry standard. Most models released in 2024 and 2025, including Mistral's models and various open-weight releases, train well beyond Chinchilla-optimal ratios.
One caveat to extreme over-training has emerged in research: models trained far past Chinchilla optimality can become more sensitive to quantization. Degradation in loss due to weight quantization increases as an approximate power law in the token-to-parameter ratio. This means that aggressively over-trained models may lose more quality when compressed for deployment on resource-constrained hardware, partially offsetting the inference cost savings [4].
While the initial scaling laws work focused on language models, researchers have since investigated whether similar relationships hold for other domains.
Scaling laws for Vision Transformers (ViT) were studied by Zhai et al. in "Scaling Vision Transformers" (2022) [7] and further refined by Alabdulmohsin et al. in "Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design" (2023) [8].
The results confirmed that vision models also exhibit power-law scaling behavior, though with some differences from language models. The performance-compute frontier for ViTs follows a saturating power law: gains per unit of compute diminish at very large scales, and they diminish more quickly than in language models. A key finding was that staying on the optimal frontier requires simultaneously scaling both model size and data; increasing compute without growing the model is suboptimal [7].
Google scaled ViT to 22 billion parameters (ViT-22B) in 2023, demonstrating that vision transformers continue to benefit from increased scale, though the gains are more modest compared to language models at equivalent compute levels [9].
Scaling laws for native multimodal models (those trained from scratch on multiple modalities rather than composed from separate unimodal components) were studied by Shukor et al. in a 2025 paper from Apple [10]. Their study spanned 457 trained models.
A separate line of work on scaling law hypotheses for mixed-modal models proposed that performance can be predicted based on modality-specific compression and tokenization efficiency, extending the text-based scaling laws to systems processing text, audio, images, and video within a shared token space [11].
Perhaps the most significant evolution of scaling laws since Chinchilla has been the emergence of inference-time scaling, sometimes called test-time compute scaling. Rather than making models larger or training them on more data, inference-time scaling improves performance by spending more compute at the point where the model generates its response.
Traditional scaling laws describe how to exchange training-time compute for better predictions. Inference-time scaling laws describe how to exchange inference-time compute for better decisions. The mechanism is straightforward: let the model "think longer" by generating extended chain-of-thought reasoning, exploring multiple solution paths, verifying its own outputs, and backtracking when it detects errors.
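One generic way to spend extra inference compute is best-of-N sampling: draw several candidate answers and let a verifier pick the strongest. The sketch below is a toy illustration of that pattern only; the "model" and "verifier" are hypothetical stand-ins, not any production system's API.

```python
import random

def best_of_n(generate, score, n):
    """Inference-time scaling sketch: sample n candidate answers and keep
    the one the verifier scores highest. More samples means more inference
    compute, which buys a better expected answer."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins (hypothetical): the "model" guesses a value near 42,
# and the "verifier" prefers guesses closer to the true answer.
random.seed(0)
generate = lambda: 42 + random.gauss(0, 5)
score = lambda x: -abs(x - 42)

cheap = best_of_n(generate, score, n=1)       # one sample, high variance
expensive = best_of_n(generate, score, n=64)  # 64x the compute
```

Real reasoning models extend this idea with sequential rather than parallel compute (longer chains of thought, self-verification, backtracking), but the economics are the same: accuracy is bought with extra tokens at inference time.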
This approach was first demonstrated at scale with OpenAI's o1 model, released in September 2024. o1 maintains an internal reasoning process (a hidden chain of thought) that unfolds during inference, spending variable amounts of compute depending on the difficulty of the problem. For easy questions, the model responds quickly. For hard mathematical or coding problems, it may spend significantly more time reasoning before answering.
The release of o1 initiated a wave of reasoning-focused models:
| Model | Developer | Release Date | Key Feature |
|---|---|---|---|
| o1-preview | OpenAI | September 2024 | First commercial reasoning model with hidden chain-of-thought |
| o1 | OpenAI | December 2024 | Full release with improved reasoning |
| o3-mini | OpenAI | January 2025 | Cost-efficient reasoning with adjustable effort levels (low/medium/high) |
| o3 | OpenAI | April 2025 | State-of-the-art reasoning, 96.7% on AIME 2024 |
| DeepSeek-R1 | DeepSeek | January 2025 | Open-weight reasoning model trained via pure reinforcement learning |
| Claude 3.7 Sonnet (extended thinking) | Anthropic | February 2025 | Visible extended thinking with adjustable budget |
DeepSeek-R1 was particularly notable because it demonstrated that reasoning capabilities matching o1 could be achieved through pure reinforcement learning applied to a base model, without requiring supervised fine-tuning on human-curated reasoning traces [12]. The model's DeepSeek-R1-Zero variant showed emergent behaviors like self-reflection and verification arising spontaneously from RL training. DeepSeek also demonstrated that reasoning patterns from large models could be distilled into much smaller models (as small as 1.5 billion parameters), achieving better performance than training the small models with RL directly.
Research on inference-time scaling has established that, for reasoning-heavy tasks, there is a roughly log-linear relationship between the amount of inference compute spent and the accuracy of the model's outputs. More thinking time yields better results, but with diminishing returns.
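A log-linear relationship means accuracy is a straight line against the logarithm of inference compute. The numbers below are fabricated purely to show the shape of such a fit; they are not benchmark results.

```python
import numpy as np

# Illustrative only: accuracies that improve log-linearly with the
# number of reasoning tokens spent (made-up numbers, not benchmarks).
thinking_tokens = np.array([1e3, 4e3, 1.6e4, 6.4e4, 2.56e5])
accuracy = np.array([0.42, 0.51, 0.60, 0.69, 0.78])

# Log-linear scaling: accuracy is linear in log(compute).
slope, intercept = np.polyfit(np.log10(thinking_tokens), accuracy, 1)

# Each 4x increase in thinking tokens buys a fixed accuracy gain...
gain_per_4x = slope * np.log10(4)  # 0.09 on this toy data
# ...until the trend bends as accuracy approaches the task ceiling of 1.0,
# which is where diminishing returns bite.
```

Note the contrast with pre-training scaling: there, loss is a straight line on log-log axes; here, accuracy is a straight line on log-linear axes, so each constant accuracy gain costs a multiplicative increase in tokens.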
The practical economics are shifting accordingly. OpenAI's inference spending in 2024 reached an estimated $2.3 billion, approximately 15 times the training cost for GPT-4.5, driven largely by reasoning models that generate orders of magnitude more tokens than non-reasoning models [13]. Analysts project that inference will claim 75% of total AI compute by 2030.
This shift has deep implications for hardware design, data center architecture, and the economics of AI deployment. Training is a one-time cost; inference is ongoing and scales with users.
One of the most pressing challenges to continued scaling is the finite supply of high-quality training data. Scaling laws assume that more data yields better models, but the pool of suitable text is not unlimited.
According to projections from Epoch AI, the supply of high-quality human-generated text on the internet may be exhausted as early as 2026 to 2028 [14]. This does not mean there will be no text available, but that the marginal quality of newly added data will decline, and models may begin encountering the same data repeatedly. Common Crawl and similar large-scale web scrapes, which have formed the backbone of language model training sets, have already been extensively mined.
The AI industry has increasingly turned to synthetic data as a way to extend the data supply. Synthetic data is generated by AI models themselves and can be used to train or fine-tune other models. Several major model releases in 2025 incorporated synthetic data in their training pipelines, including Minimax, Nemotron-3, and others [15].
Microsoft's SynthLLM framework represents one systematic approach to generating high-quality synthetic data at scale [16]. The approach generates diverse training examples that maintain the statistical properties of real data while avoiding the limitations of finite human-generated text.
However, synthetic data introduces its own challenges. Models trained on synthetic data risk "model collapse," where errors and biases in the generated data compound across training generations. Gartner forecasts that by 2030, synthetic data will be more widely used for AI training than real-world datasets, but the most capable models in 2026 and beyond are still expected to be anchored in human-generated data, with synthetic data used to expand and stress-test around that core [15].
The data wall is most acute for text. Other modalities have larger untapped reservoirs. Video data, in particular, represents an enormous and largely underutilized source of training signal. Multimodal training that combines text, images, audio, and video may help extend the scaling runway by drawing on these richer data sources.
Whether scaling laws will continue to hold, and whether continued scaling will produce commensurate improvements in model capability, is one of the most debated questions in AI.
Ilya Sutskever, co-founder of OpenAI and later of Safe Superintelligence Inc., has been the most prominent voice arguing that the era of simple scaling is over. In a November 2025 podcast, Sutskever characterized the period from 2020 to 2025 as the "age of scaling," when pre-training with ever more compute was a reliable formula for progress, and argued that this era is now ending [17].
Sutskever's view is that AI is transitioning from an "age of scaling" to an "age of research," where novel architectures, training methods, and paradigms will matter more than raw compute.
On the other side, Sam Altman and others at OpenAI have maintained that scaling laws are far from reaching their ceiling. The industry has committed an estimated $7.8 trillion to AI infrastructure through 2030, with OpenAI alone pledging over $1 trillion, suggesting strong institutional confidence that scaling will continue to pay off [17].
Several arguments support continued scaling: the power-law trends have held across many orders of magnitude without breaking, and inference-time compute has opened a second axis along which capabilities can keep improving even if pre-training gains slow.
The next generation of frontier models (expected throughout 2025 and 2026) will provide empirical evidence. If GPT-5, Gemini 4, and their competitors show substantial improvements over their predecessors, it would suggest that scaling continues to deliver value. If improvements plateau or require exponentially more compute for marginal gains, Sutskever's position would be vindicated.
The exponential growth in training compute has significant environmental and economic consequences.
The power required to train the largest frontier models is growing by more than 2x per year. Epoch AI projects that the largest individual training runs by 2030 will draw 4 to 16 gigawatts (GW) of power, enough to power several million US homes [18]. For comparison, the total power capacity of California in 2022 was approximately 86 GW.
| Year | Estimated Training Power (Frontier Models) | Context |
|---|---|---|
| 2020 | ~1 MW | Small data center wing |
| 2023 | ~10-50 MW | Large data center |
| 2025 | ~100-500 MW | Multiple large data centers |
| 2030 (projected) | 4-16 GW | Small city |
The energy efficiency of leading AI accelerators (GPUs and TPUs) has improved by approximately 40% per year, partially offsetting the growth in compute demand [18]. However, the growth in demand has consistently outpaced efficiency gains, meaning total energy consumption continues to rise.
Estimates suggest that AI's annual carbon footprint could reach 32.6 to 79.7 million tons of CO2 by 2025 [19]. The actual figure depends heavily on the energy mix of the data centers performing the computation. Training runs in regions with high renewable energy penetration produce far less carbon than those relying on fossil fuels.
The cost of training frontier models has escalated rapidly. Training GPT-4 was estimated to cost over $100 million. By 2025, training runs costing $500 million to $1 billion are plausible, and projections for the late 2020s suggest individual training runs costing several billion dollars.
These costs create significant concentration effects. Only a handful of organizations, primarily large technology companies and well-funded startups, can afford to train frontier models. This raises questions about access, competition, and the distribution of AI capabilities.
The shift toward inference-time scaling introduces its own cost dynamics. While individual inference calls are cheap, the aggregate cost of serving billions of queries with reasoning models that generate tens of thousands of tokens per response is substantial. The economics of inference are driving massive investment in inference-optimized hardware, including custom chips designed specifically for transformer inference workloads.
The general form of a neural scaling law can be expressed as a parametric loss function. The most common formulation, following the Chinchilla approach, is:
L(N, D) = E + A/N^alpha + B/D^beta
Here L is the model's test loss, N is the number of parameters, D is the number of training tokens, E is the irreducible loss of the data distribution, and A, B, alpha, and beta are constants fitted empirically.
The irreducible loss E represents a fundamental limit: even a perfect model cannot predict truly random aspects of the data. The terms A/N^alpha and B/D^beta represent the reducible loss, the portion of loss that can be decreased by increasing model size or data, respectively.
For compute-optimal training, the total compute C is proportional to approximately 6ND (six times the product of parameters and tokens, accounting for the forward and backward passes). The optimization problem is then to minimize L(N, D) subject to the constraint 6ND = C, which yields the optimal allocation of compute between N and D.
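The constrained minimization described above can be solved in closed form. A sketch of the derivation, using the symbols from the loss formula:

```latex
% Minimize L(N, D) = E + A N^{-\alpha} + B D^{-\beta} subject to 6ND = C.
% Substituting D = C/(6N) and setting dL/dN = 0:
\begin{aligned}
L(N) &= E + A N^{-\alpha} + B \left(\tfrac{6N}{C}\right)^{\beta} \\
0 &= -\alpha A N^{-\alpha-1} + \beta B \left(\tfrac{6}{C}\right)^{\beta} N^{\beta-1} \\
N_{\mathrm{opt}} &= \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}}
  \left(\frac{C}{6}\right)^{\frac{\beta}{\alpha+\beta}}, \qquad
D_{\mathrm{opt}} = \frac{C}{6 N_{\mathrm{opt}}}
  \propto \left(\frac{C}{6}\right)^{\frac{\alpha}{\alpha+\beta}}
\end{aligned}
% When alpha and beta are close, both exponents are near 1/2:
% N and D scale in equal proportion with compute.
```

The exponents beta/(alpha+beta) and alpha/(alpha+beta) always sum to 1, so the derivation also recovers the Kaplan-style split whenever the fitted alpha and beta are unequal.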
Under the Chinchilla parameterization, the optimal scaling is approximately N ∝ C^0.5 and D ∝ C^0.5: model size and training data grow in equal proportion with compute.
This equal scaling means that a 10x increase in compute budget should be split into roughly a 3.16x increase in model size and a 3.16x increase in training data (since 3.16 x 3.16 ≈ 10).
Why do neural scaling laws follow power laws at all? Several theoretical explanations have been proposed.
One explanation draws on the manifold hypothesis: real-world data lies on low-dimensional manifolds embedded in high-dimensional space. As a model's capacity grows, it can approximate these manifolds with increasing fidelity, and the rate of improvement follows a power law related to the intrinsic dimensionality of the data manifold.
A theoretical framework published in the Proceedings of the National Academy of Sciences in 2024 explains neural scaling laws through the lens of statistical mechanics [20]. The authors showed that power-law exponents in scaling laws can be derived from properties of the data distribution, specifically its intrinsic dimension and spectral characteristics. The theory predicts different exponents for different types of data, consistent with the empirical observation that vision and language models have different scaling behaviors.
Another line of reasoning appeals to information theory. The model's loss represents the gap between its predictions and the true data distribution. As more parameters or data are added, the model can capture increasingly fine-grained statistical patterns. The rate at which it captures these patterns follows a power law because the common, high-frequency patterns are learned first, and the model progressively works its way down to rarer, more subtle ones.
As of early 2026, the field of scaling laws is at an inflection point. Several trends define the current moment.
Training compute for frontier models continues to grow at roughly 4 to 5x per year [18]. However, the composition of that compute is changing. Pure pre-training on next-token prediction is being supplemented with reinforcement learning stages, synthetic data generation, and multi-stage training pipelines. The simple story of "more parameters and more data yields better performance" is giving way to a more nuanced picture where training methodology matters as much as scale.
The biggest shift in the scaling paradigm is the move toward inference-time compute. Reasoning models like o3, DeepSeek-R1, and Claude with extended thinking have demonstrated that spending more compute at inference time can produce capabilities that pre-training alone cannot achieve. This has opened a second axis of scaling that may prove more economically efficient than continued pre-training scale-up, especially for tasks requiring complex reasoning.
Virtually no major model released in 2025 or 2026 adheres to the original Chinchilla-optimal ratio of 20 tokens per parameter. The industry has broadly adopted the practice of training models on far more data than Chinchilla would prescribe, driven by the inference cost advantages of smaller, over-trained models. Token-to-parameter ratios of 100:1 to 2000:1 are now common.
Whether the data wall will prove to be a binding constraint depends on the success of several parallel efforts: synthetic data generation, multimodal training, improved data curation, and the development of models that can learn more efficiently from existing data. The next few years will determine whether the internet's text corpus represents a hard ceiling or merely a speed bump.
The original scaling laws described clean, simple power-law relationships. The reality in 2026 is more complex. Different training stages (pre-training, supervised fine-tuning, reinforcement learning from human feedback, reinforcement learning for reasoning) may have different scaling behaviors. Multimodal training introduces additional variables. Inference-time scaling adds yet another dimension. The field is moving toward a more comprehensive theory that accounts for all these factors, but such a unified framework does not yet exist.