The Chinchilla scaling laws are a set of empirical findings published by DeepMind researchers in 2022 that describe how to optimally allocate a fixed computational budget when training large language models. The central result, presented in the paper "Training Compute-Optimal Large Language Models" by Jordan Hoffmann and colleagues, overturned the prevailing wisdom about how to scale language models. Rather than investing most additional compute into larger models with relatively little extra data, Hoffmann et al. demonstrated that model size and training data should be scaled in roughly equal proportions. Their work produced a concrete guideline: the optimal number of training tokens is approximately 20 times the number of model parameters.
To validate these findings, the team trained a 70-billion-parameter model called Chinchilla on 1.4 trillion tokens. Despite being four times smaller than DeepMind's earlier Gopher model (280 billion parameters), Chinchilla outperformed Gopher on a wide range of benchmarks while using the same compute budget. This result sent shockwaves through the AI research community and reshaped how organizations approach the training of foundation models.
Between 2018 and 2022, the dominant strategy in language model development was to make models bigger. GPT-2 had 1.5 billion parameters, GPT-3 jumped to 175 billion, and models like Megatron-Turing NLG reached 530 billion parameters. This trend was supported by influential research from OpenAI, particularly the scaling laws published by Jared Kaplan and collaborators in January 2020.
Kaplan et al. studied the relationship between model performance (measured as cross-entropy loss) and three factors: the number of parameters (N), the size of the training dataset in tokens (D), and the total compute budget (C). They found smooth power-law relationships governing how loss decreases as each factor increases. Critically, their analysis suggested that when allocating additional compute, the vast majority should go toward increasing model size rather than training data.
Specifically, Kaplan et al. reported that the optimal number of parameters scales as N_opt proportional to C^0.73, while the optimal number of training tokens scales as D_opt proportional to C^0.27. This implied that for a tenfold increase in compute, model size should grow roughly 5.4-fold (10^0.73) while training data grows only about 1.9-fold (10^0.27). Their paper also suggested that models should be trained well short of convergence, meaning that large models trained on comparatively little data would be the most compute-efficient approach.
This philosophy directly shaped the development of GPT-3 (175 billion parameters trained on only 300 billion tokens, a ratio of about 1.7 tokens per parameter), Gopher (280 billion parameters trained on 300 billion tokens), Jurassic-1 (178 billion parameters), and Megatron-Turing NLG (530 billion parameters). These models all prioritized massive parameter counts over extensive training data.
Several researchers had begun questioning whether this parameter-heavy approach was truly optimal. The practical consequence of the Kaplan scaling laws was that organizations raced to build ever-larger models while keeping training datasets roughly the same size. Gopher, for instance, was trained on approximately 300 billion tokens from the MassiveText dataset, a 10.5 TB corpus comprising web pages, Wikipedia, GitHub code, books, and news articles. At 280 billion parameters, this gave Gopher a tokens-to-parameters ratio of roughly 1.07, far below what Chinchilla would later show to be optimal.
The DeepMind team suspected that existing models were significantly undertrained. Their hypothesis was straightforward: if you have a fixed compute budget, perhaps it is better to train a smaller model on much more data rather than a gigantic model on relatively little data.
"Training Compute-Optimal Large Language Models" was posted to arXiv on March 29, 2022 (arXiv:2203.15556) and subsequently presented at the 36th Conference on Neural Information Processing Systems (NeurIPS 2022). The paper was authored by Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre.
The researchers conducted a large-scale empirical study, training over 400 language models with parameter counts ranging from 70 million to 16 billion. Each model was a transformer-based autoregressive language model. The models were trained on varying amounts of data, from 5 billion to 500 billion tokens, all drawn from subsets of the MassiveText dataset.
The compute cost for training a model with N parameters on D tokens was approximated as:
C ≈ 6ND
where C is measured in floating-point operations (FLOPs). This approximation (6 FLOPs per parameter per token) accounts for both the forward and backward passes during training and is a standard estimate for transformer models.
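As a quick sanity check, the 6ND rule reproduces the budgets discussed in this article (a sketch; the parameter and token counts are the published figures):

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute via C ~ 6ND: roughly 2 FLOPs per
    parameter per token for the forward pass and 4 for the backward pass."""
    return 6.0 * n_params * n_tokens

gopher = training_flops(280e9, 300e9)      # Gopher: ~5.0e23 FLOPs
chinchilla = training_flops(70e9, 1.4e12)  # Chinchilla: ~5.9e23 FLOPs
# Both land near the ~5.76e23 FLOP budget quoted for Gopher,
# illustrating that the two models cost roughly the same to train.
```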
A distinctive feature of the paper is that the authors used three independent approaches to estimate the relationship between compute budget and the optimal balance of model size and training data. All three approaches yielded broadly consistent results, strengthening the conclusions.
In the first approach, the team trained models at several fixed sizes (70M, 150M, 400M, 1B, 10B, and larger) and varied the number of training tokens for each size. For each model size, they recorded the final training loss at different token counts and identified the minimum loss envelope across all runs at each FLOP budget. By fitting a power law to the set of optimal (N, D) pairs extracted from this envelope, they estimated how model size and data should scale with compute.
For a compute budget equivalent to that used for Gopher (approximately 5.76 x 10^23 FLOPs), this approach predicted an optimal model size of roughly 67 billion parameters trained on approximately 1.5 trillion tokens.
The second approach, called the IsoFLOP method, fixed the total compute budget at nine different levels (ranging from 10^18 to 10^21 FLOPs) and varied the model size at each budget. For each fixed compute budget, the researchers trained multiple models of different sizes, with the number of training tokens adjusted so that each model consumed exactly the specified amount of compute. They then identified which model size produced the lowest loss at each budget level.
This approach directly answers the question: "Given a fixed FLOP budget, what model size minimizes loss?" The resulting IsoFLOP curves showed that for every compute level, there was a clear optimum model size, and that optimum shifted to larger models as compute increased. For the Gopher-equivalent compute budget, Approach 2 predicted an optimal model size of about 63 billion parameters trained on approximately 1.4 trillion tokens.
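The IsoFLOP construction is straightforward to reproduce: for a fixed budget C, each candidate model size N is paired with D = C / (6N) tokens so that every run consumes exactly the same compute. A minimal sketch (the size sweep below is illustrative, not the paper's exact grid):

```python
def isoflop_points(budget_flops, model_sizes):
    """For a fixed compute budget C, pair each model size N with the
    token count D = C / (6N) that exhausts the budget exactly."""
    return [(n, budget_flops / (6.0 * n)) for n in model_sizes]

# One of the paper's budget levels, with an illustrative size sweep.
points = isoflop_points(1e21, [2.5e8, 5e8, 1e9, 2.5e9, 5e9])
for n, d in points:
    print(f"N = {n:9.2e} params -> D = {d:9.2e} tokens")
```

Training each (N, D) pair and plotting loss against N traces out one IsoFLOP curve; the minimum of that curve is the optimal model size for the budget.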
The third approach fitted a parametric model to the final loss values from all experiments conducted in Approaches 1 and 2. The proposed functional form was:
L(N, D) = E + A / N^alpha + B / D^beta
where L is the loss, N is the number of parameters, D is the number of training tokens, and E, A, B, alpha, and beta are fitted constants. The term E represents the irreducible entropy of natural language (the theoretical minimum loss that no model could beat). The terms A / N^alpha and B / D^beta capture the additional loss attributable to having a finite model and finite training data, respectively.
The original fitted values reported by Hoffmann et al. were approximately: E = 1.69, A = 406.4, B = 410.7, alpha = 0.34, and beta = 0.28. By minimizing L(N, D) subject to the constraint C = 6ND, one can solve for the optimal N and D at any given compute budget.
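Plugging the published constants into this functional form shows why Chinchilla's allocation wins: at roughly equal compute, the fitted loss for the 70B/1.4T configuration comes out lower than for the 280B/300B configuration. A sketch using the rounded constants above, so the values are only approximate:

```python
# Approach 3 loss fit with the rounded constants from Hoffmann et al.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def fitted_loss(n_params: float, n_tokens: float) -> float:
    """L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

gopher_loss = fitted_loss(280e9, 300e9)      # 280B params, 300B tokens
chinchilla_loss = fitted_loss(70e9, 1.4e12)  # 70B params, 1.4T tokens
# At similar compute, the fit predicts a lower loss for Chinchilla's
# allocation than for Gopher's.
```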
For the Gopher-equivalent budget, Approach 3 yielded a somewhat different prediction of about 40 billion parameters. This discrepancy between Approach 3 and the other two approaches became a point of later scrutiny (see the Criticisms section below).
The following table summarizes the optimal model sizes and training token counts predicted by each approach for the Gopher-equivalent compute budget (approximately 5.76 x 10^23 FLOPs).
| Approach | Method | Predicted optimal parameters | Predicted optimal tokens |
|---|---|---|---|
| Approach 1 | Fixed model sizes, varying tokens | ~67B | ~1.5T |
| Approach 2 | IsoFLOP profiles | ~63B | ~1.4T |
| Approach 3 | Parametric loss fitting | ~40B | ~2.3T |
Approaches 1 and 2 agreed closely, both suggesting a model in the 63-67 billion parameter range trained on 1.4 to 1.5 trillion tokens. Approach 3 predicted a smaller model with more data, though all three approaches agreed that existing models like Gopher (280B parameters, 300B tokens) were heavily over-parameterized and under-trained.
The central finding across all approaches was that the optimal number of parameters and the optimal number of training tokens should each scale roughly as the square root of the compute budget:
N_opt proportional to C^0.50
D_opt proportional to C^0.50
This means that doubling the compute budget should lead to roughly a 1.4x increase in both model size and training data. Equivalently, for every doubling of model size, the number of training tokens should also be doubled. This stands in stark contrast to the Kaplan et al. prediction, where most additional compute would go toward model size.
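The contrast with the Kaplan et al. allocation can be made concrete by applying the two sets of exponents to the same compute increase (a sketch using the exponents quoted in this article):

```python
def scale_up(compute_multiplier, n_exp, d_exp):
    """Growth in model size and data when compute grows by a given factor,
    assuming power laws N_opt ~ C^n_exp and D_opt ~ C^d_exp."""
    return compute_multiplier**n_exp, compute_multiplier**d_exp

# A 10x compute increase under each scaling law:
kaplan_n, kaplan_d = scale_up(10, 0.73, 0.27)         # ~5.4x model, ~1.9x data
chinchilla_n, chinchilla_d = scale_up(10, 0.50, 0.50) # ~3.2x model, ~3.2x data
```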
The disagreement between the Chinchilla and Kaplan scaling laws represents one of the most consequential debates in modern AI research. The following table highlights the key differences.
| Aspect | Kaplan et al. (2020) | Chinchilla / Hoffmann et al. (2022) |
|---|---|---|
| Affiliation | OpenAI | DeepMind |
| Parameter scaling with compute | N_opt proportional to C^0.73 | N_opt proportional to C^0.50 |
| Token scaling with compute | D_opt proportional to C^0.27 | D_opt proportional to C^0.50 |
| Recommended tokens-to-parameters ratio | ~1-2 tokens per parameter (implied) | ~20 tokens per parameter |
| Compute allocation philosophy | Favor larger models, modest data increase | Scale model size and data equally |
| Training strategy | Stop well before convergence | Train closer to convergence on more data |
| Models influenced | GPT-3, Gopher, Jurassic-1, MT-NLG | Chinchilla, LLaMA, PaLM 2 |
Several factors contributed to the discrepancy. Kaplan et al. used a fixed learning rate schedule that was not adjusted for different training durations, which may have biased their results toward favoring larger models trained for fewer steps. They also used a smaller range of model sizes and did not fully explore the space of possible token-to-parameter ratios. Additionally, the Kaplan team's analysis was conducted using models trained on data that was sometimes reused across epochs, whereas the Chinchilla analysis generally considered single-epoch training on fresh data.
A 2024 paper by Tomer Porian and colleagues, "Resolving Discrepancies in Compute-Optimal Scaling of Language Models," analyzed both sets of findings in detail and showed that methodological differences, chiefly in learning rate warmup and tuning and in how parameters and compute were counted, largely explain the divergent conclusions.
To validate their scaling predictions, the DeepMind team trained a model called Chinchilla. The design was straightforward: take the compute budget used for Gopher (approximately 5.76 x 10^23 FLOPs), but instead of building a 280-billion-parameter model trained on 300 billion tokens, build a 70-billion-parameter model trained on 1.4 trillion tokens.
Chinchilla used the same transformer architecture as Gopher, with appropriate adjustments to layer count, hidden dimension, and attention heads to reach the 70-billion-parameter target. It was trained on data from the MassiveText dataset, the same source used for Gopher, but with roughly four times more data sampled from it.
Chinchilla delivered consistently superior performance compared to Gopher and other contemporary large language models, despite being four times smaller. The following table shows selected benchmark results.
| Benchmark | Chinchilla (70B) | Gopher (280B) | GPT-3 (175B) |
|---|---|---|---|
| MMLU (average accuracy) | 67.5% | 60.0% | 43.9% (few-shot) |
| HellaSwag | 80.8% | 79.2% | 78.9% |
| PIQA | 81.8% | 81.8% | 81.0% |
| WinoGrande | 74.9% | 70.1% | 70.2% |
| BoolQ | 83.7% | 79.3% | 60.5% |
The MMLU result was particularly striking. Chinchilla's 67.5% average accuracy represented a 7.5 percentage point improvement over Gopher, despite using the same training compute. At the time of publication, this was a new state of the art on MMLU.
Chinchilla also outperformed Jurassic-1 (178B parameters, developed by AI21 Labs) and Megatron-Turing NLG (530B parameters, developed jointly by NVIDIA and Microsoft) on the majority of evaluated tasks.
Beyond raw performance, Chinchilla offered significant practical benefits. At 70 billion parameters, the model required roughly one quarter of the memory needed to serve Gopher. This translated directly into lower inference costs, faster response times, and the ability to run the model on fewer GPUs. For organizations deploying language models at scale, these savings were substantial. The result demonstrated that compute-optimal training was not just an academic curiosity but had immediate practical implications for model deployment.
The Chinchilla paper had a profound and lasting effect on how the AI research community approaches language model training. Its influence can be seen across multiple dimensions.
Perhaps the most visible response to the Chinchilla scaling laws came from Meta AI. In February 2023, Hugo Touvron and colleagues released LLaMA (Large Language Model Meta AI), a family of open-weight models ranging from 7 billion to 65 billion parameters. The paper explicitly cited the Chinchilla findings and designed its training regime accordingly.
LLaMA's training token counts were far above what the pre-Chinchilla Kaplan scaling laws would have recommended:
| Model | Parameters | Training tokens | Tokens-per-parameter ratio |
|---|---|---|---|
| LLaMA-7B | 7B | 1.0T | ~143 |
| LLaMA-13B | 13B | 1.0T | ~77 |
| LLaMA-33B | 33B | 1.4T | ~42 |
| LLaMA-65B | 65B | 1.4T | ~22 |
Notably, the LLaMA-65B model closely matched the Chinchilla-optimal ratio of ~20 tokens per parameter, while the smaller models were trained well beyond the Chinchilla-optimal point. The result was remarkable: LLaMA-13B outperformed GPT-3 (175B) on most benchmarks, and LLaMA-65B was competitive with Chinchilla-70B and PaLM-540B. This demonstrated that smaller, well-trained models could rival or exceed much larger ones.
Google's PaLM (Pathways Language Model), released in April 2022 with 540 billion parameters trained on 780 billion tokens (a ratio of about 1.4 tokens per parameter), was arguably designed before the Chinchilla findings became widely known. Later Google models, including PaLM 2 and the Gemini family, incorporated Chinchilla-informed training strategies with significantly more training data relative to model size.
Beyond Meta and Google, the Chinchilla scaling laws influenced model developers across the industry, and the approximately 20-tokens-per-parameter guideline became a standard reference point for planning training runs.
While the Chinchilla scaling laws established the concept of compute-optimal training, a clear trend emerged in 2023 and 2024 where many practitioners deliberately trained models far beyond the Chinchilla-optimal point. This practice, often called "over-training," involves using more training tokens than the scaling laws suggest is compute-optimal.
The key insight motivating over-training is that the Chinchilla scaling laws optimize only for training compute. They do not account for the cost of inference, which for widely deployed models can far exceed the one-time training cost. A smaller model trained on extra data uses its training FLOPs less efficiently, reaching a higher loss than a compute-optimal model at the same training budget, but it is cheaper to serve at inference time because it has fewer parameters.
Sardana and Frankle (2024) formalized this reasoning in their paper "Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws." They showed that when inference demand is factored in, the optimal strategy shifts toward training smaller models on more data. For example, they estimated that a developer expecting 2 trillion tokens of total inference demand could reduce total compute (training plus inference) by 17% by training a 7-billion-parameter model on extra data instead of using a compute-optimal 13-billion-parameter model of equivalent quality.
Their modified objective function accounts for both training and inference compute:
Total cost = 6N * D_train + 2N * D_inference
where D_train is the number of training tokens and D_inference is the total number of tokens processed during inference over the model's deployment lifetime.
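This objective is simple to evaluate. The sketch below compares a compute-optimal 13B model against a smaller 7B model over-trained on extra data, under heavy inference demand; the sizes and token counts are illustrative stand-ins, not Sardana and Frankle's fitted estimates:

```python
def lifetime_flops(n_params, train_tokens, inference_tokens):
    """Total compute: ~6 FLOPs/param/token for training plus
    ~2 FLOPs/param/token for inference (forward pass only)."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * inference_tokens

inference_demand = 2e12  # 2T tokens served over the deployment lifetime

# Illustrative comparison: a Chinchilla-optimal 13B model (20 tok/param)
# vs. a 7B model over-trained to 100 tok/param.
cost_13b = lifetime_flops(13e9, 260e9, inference_demand)
cost_7b = lifetime_flops(7e9, 700e9, inference_demand)
# The smaller model costs more to train but much less to serve,
# so its lifetime compute is lower at this inference volume.
```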
The most dramatic example of over-training is Meta's LLaMA 3, released in April 2024. The 8-billion-parameter model was trained on approximately 15 trillion tokens, giving it a tokens-to-parameters ratio of roughly 1,875. This is nearly 100 times the Chinchilla-optimal ratio of 20. Even the 70-billion-parameter variant was trained on 15 trillion tokens, yielding a ratio of about 214 tokens per parameter, more than 10 times the Chinchilla point.
Meta reported that both the 8B and 70B models continued to show log-linear improvements in performance even at these extreme token counts, with no evidence of a saturation point. The Chinchilla-optimal token count for an 8-billion-parameter model is roughly 160 billion tokens; LLaMA 3 8B was trained on nearly 100 times that amount.
| Model | Parameters | Training tokens | Chinchilla-optimal tokens | Over-training factor |
|---|---|---|---|---|
| Chinchilla | 70B | 1.4T | 1.4T | 1x |
| LLaMA-65B | 65B | 1.4T | 1.3T | ~1.1x |
| LLaMA 2-7B | 7B | 2.0T | 140B | ~14x |
| LLaMA 3-8B | 8B | 15T | 160B | ~94x |
| LLaMA 3-70B | 70B | 15T | 1.4T | ~11x |
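The over-training factors in the table follow directly from the 20-tokens-per-parameter guideline:

```python
def overtraining_factor(n_params, train_tokens, opt_ratio=20):
    """Ratio of actual training tokens to the Chinchilla-optimal
    count of opt_ratio tokens per parameter."""
    return train_tokens / (opt_ratio * n_params)

print(overtraining_factor(8e9, 15e12))   # LLaMA 3 8B  -> ~94x
print(overtraining_factor(70e9, 15e12))  # LLaMA 3 70B -> ~11x
```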
This trend reflects a pragmatic shift in the industry. For models that will serve billions of inference requests, the one-time cost of extended training is a worthwhile investment if it yields a smaller, faster, and cheaper model at deployment time.
The mathematical core of the Chinchilla paper is the parametric loss function from Approach 3. Understanding this function helps clarify what the scaling laws actually predict and where their limitations lie.
The proposed loss function takes the form:
L(N, D) = E + A / N^alpha + B / D^beta
Each component has a specific interpretation: E is the irreducible entropy of natural language, the A / N^alpha term is the additional loss due to finite model capacity, and the B / D^beta term is the additional loss due to finite training data.
Given the constraint C = 6ND, one can use Lagrange multipliers to minimize L(N, D) subject to this constraint. The solution yields the optimal model size and data size as functions of the compute budget:
N_opt proportional to C^(beta / (alpha + beta))
D_opt proportional to C^(alpha / (alpha + beta))
Using the fitted values alpha = 0.34 and beta = 0.28, this gives:
N_opt proportional to C^0.45
D_opt proportional to C^0.55
This is close to, but not exactly, the 50/50 split suggested by Approaches 1 and 2. The slight asymmetry means that data should grow somewhat faster than model size, so under Approach 3 the optimal tokens-to-parameters ratio is not a constant but rises slowly with the compute budget.
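The constrained optimum can also be computed in closed form in a few lines. This sketch minimizes the fitted loss subject to C = 6ND using the rounded published constants, so it only approximates Approach 3's ~40B prediction for the Gopher budget:

```python
# Fitted constants (rounded) from Hoffmann et al., Approach 3.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def compute_optimal(c_flops):
    """Minimize L(N, D) = E + A/N^alpha + B/D^beta subject to C = 6ND.
    Substituting D = C/(6N) and setting the derivative in N to zero gives
    N_opt = G * C^(beta / (alpha + beta)) with G as computed below."""
    g = (ALPHA * A / (BETA * B * 6**BETA)) ** (1 / (ALPHA + BETA))
    n_opt = g * c_flops ** (BETA / (ALPHA + BETA))
    d_opt = c_flops / (6.0 * n_opt)
    return n_opt, d_opt

# Gopher-equivalent budget:
n_opt, d_opt = compute_optimal(5.76e23)
# With these rounded constants the optimum lands in the tens of billions
# of parameters and a few trillion tokens, the same ballpark as the
# ~40B / ~2.3T figures reported for Approach 3.
```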
Despite its enormous influence, the Chinchilla paper has faced several important criticisms.
In April 2024, Besiroglu, Erdil, Barnett, and You from Epoch AI published "Chinchilla Scaling: A Replication Attempt." This paper systematically examined the Chinchilla results and identified several issues, particularly with Approach 3.
First, the reported parameter estimates for the Approach 3 loss function appeared inconsistent with the underlying data. When the Epoch team extracted the data from the original paper's figures and re-fitted the parametric model, they obtained significantly different coefficients. Their revised fit was:
L(N, D) = 1.82 + 514.0 / N^0.35 + 2115.2 / D^0.37
compared to the original:
L(N, D) = 1.69 + 406.4 / N^0.34 + 410.7 / D^0.28
The most notable difference is in the data term: the Epoch team found B = 2115.2 and beta = 0.37, compared to the original B = 410.7 and beta = 0.28. This suggests that data contributes more to reducing loss than the original paper indicated.
Second, the confidence intervals reported by Hoffmann et al. were implausibly narrow. For the exponent governing N_opt (approximately 0.45), they reported a confidence interval of 0.454 to 0.455. The Epoch team estimated that obtaining such tight intervals would require on the order of 600,000 experiments, whereas the actual study likely conducted fewer than 500.
Third, there was an internal inconsistency: the Approach 3 estimates implied an optimal ratio closer to 70 tokens per parameter, contradicting the approximately 20 tokens per parameter derived from Approaches 1 and 2 (and actually used for the Chinchilla model). The Epoch team's revised Approach 3 estimates, however, aligned well with Approaches 1 and 2, suggesting approximately 20 tokens per parameter.
The models used in the scaling study ranged from 70 million to 16 billion parameters, trained on up to 500 billion tokens. The predictions for much larger models (like the 70-billion-parameter Chinchilla itself) were extrapolations beyond the range of the fitted data. While Chinchilla's strong performance validated these extrapolations in one case, the reliability of the scaling laws at much larger compute budgets remains an open question.
The Chinchilla scaling laws treat all training tokens as equivalent. In practice, data quality varies enormously, and a token from a curated textbook is not equivalent to a token from a low-quality web scrape. The scaling laws do not capture the effects of data curation, deduplication, or domain-specific filtering, all of which can significantly affect model quality independent of raw token count.
The scaling laws were derived from experiments with standard transformer architectures. Whether the same relationships hold for alternative architectures (such as mixture-of-experts models, state-space models like Mamba, or retrieval-augmented models) is not established by the original work.
As discussed in the over-training section above, the Chinchilla scaling laws optimize for training compute alone. They do not model the total cost of developing and deploying a model, which includes inference costs, fine-tuning costs, and the engineering overhead of working with larger models. For many real-world applications, a smaller model trained beyond the Chinchilla-optimal point is a better economic choice.
The Epoch AI replication also noted that the definition of "model parameters" in the Chinchilla paper was ambiguous. Three different interpretations of what counts as a parameter (for instance, whether to include embedding parameters) are possible, with relative differences as high as 15.2%. This ambiguity complicates precise replication of the results.
The Chinchilla scaling laws fundamentally altered the economics of large language model development in several ways.
Before Chinchilla, the primary bottleneck for building better language models was compute: organizations needed more GPUs and more training time. The Chinchilla findings shifted attention to data as an equally important bottleneck. Training a 70-billion-parameter model requires 1.4 trillion tokens of high-quality text, and training a 400-billion-parameter model would require roughly 8 trillion tokens under Chinchilla-optimal scaling. This created intense competition for training data and spurred investment in data curation, synthetic data generation, and data licensing.
By demonstrating that a 70-billion-parameter model could outperform a 280-billion-parameter model, the Chinchilla paper showed that raw scale was not the only path to state-of-the-art performance. This lowered the barrier to entry for organizations with moderate compute budgets but access to large, high-quality datasets. Startups like Mistral AI and open-source efforts like LLaMA benefited directly from this insight.
A consequence of Chinchilla-optimal training is that model development quickly runs into the finite supply of high-quality text data on the internet. Estimates by Epoch AI suggest that the stock of publicly available text data suitable for language model training is on the order of a few trillion to tens of trillions of tokens. As models grow, the demand for training data under Chinchilla-optimal scaling grows linearly. This has been called the "data wall" and has motivated research into synthetic data, multi-epoch training, and curriculum learning strategies.
The realization that inference costs often dominate total deployment costs led to a new paradigm of inference-aware scaling. Instead of optimizing purely for training efficiency, developers now consider the total lifecycle cost of a model. This has driven the trend toward smaller, over-trained models (like LLaMA 3) that sacrifice training efficiency for inference efficiency.
The Chinchilla scaling laws remain one of the most cited and discussed results in modern AI research. While the specific numbers (20 tokens per parameter, equal scaling exponents) may be refined by future work, the core insight has proven durable: data matters as much as model size, and the most productive use of additional compute is not always building a bigger model.
The paper also established a methodological template for scaling law research. The use of multiple independent estimation approaches, large sweeps of training runs, and parametric loss fitting has been adopted by subsequent studies on scaling in areas such as reinforcement learning, multimodal models, and mixture-of-experts architectures.
More broadly, the Chinchilla paper demonstrated that careful empirical analysis of scaling behavior can yield actionable insights worth hundreds of millions of dollars in training efficiency. It showed that the AI research community's collective intuitions about scaling can be wrong and that systematic measurement can overturn even widely held beliefs.