Chinchilla (language model and scaling laws)
Last reviewed
Apr 28, 2026
Sources
11 citations
Review status
Source-backed
Revision
v1 · 4,333 words
Chinchilla is both a 70-billion-parameter transformer large language model released by DeepMind in March 2022 and the name commonly attached to the compute-optimal scaling laws introduced in the same paper, Training Compute-Optimal Large Language Models by Jordan Hoffmann and colleagues [1]. The paper challenged the prevailing wisdom on how to allocate compute when training large neural network language models. Until that point, most laboratories had followed the OpenAI scaling laws of Kaplan et al. (2020), which suggested that as compute budgets grow, model size should grow much faster than the size of the training corpus [2]. Chinchilla showed empirically that this advice was wrong for the compute regime that mattered most in 2022. The correct compute-optimal recipe was to scale the model and the dataset roughly equally, which in concrete terms meant training on roughly 20 tokens per parameter, far more data than any of the contemporary frontier models had seen.
The headline experimental result was that Chinchilla 70B, trained on 1.4 trillion tokens, outperformed DeepMind's own Gopher 280B (trained on 300 billion tokens) on essentially every downstream evaluation, while using the same total training compute. Chinchilla also beat GPT-3 175B from OpenAI, Jurassic-1 178B from AI21 Labs, and Megatron-Turing NLG 530B from Microsoft and NVIDIA, despite being a fraction of their size [1]. The implication was profound: nearly every public frontier model trained between 2020 and early 2022 was significantly undertrained, and the field had been spending its compute in the wrong place. The lesson was absorbed quickly. The first generation of LLaMA models from Meta, released less than a year later, explicitly cited Chinchilla as motivation for training small models on enormous corpora, and the same philosophy has informed almost every subsequent open-weight family including Falcon, Mistral, Qwen, and DeepSeek [3].
The Chinchilla paper has since become one of the most cited results in modern AI, with the phrase "Chinchilla optimal" entering everyday usage among practitioners. Later work, including a high-profile 2024 replication study by Besiroglu and colleagues at Epoch AI, found that Hoffmann et al. had made some technical errors in their parameter fits and reported confidence intervals that were implausibly tight, but the qualitative conclusions of the paper survived rederivation [4]. Subsequent research on inference-aware scaling, including the influential 2024 Beyond Chinchilla-Optimal paper by Sardana et al., has further refined the recipe by accounting for the cost of serving a model after training, leading to the modern practice of intentionally over-training small models far past the Chinchilla point because it pays back over many billions of inference requests [5]. The frontier-scale models of 2024 and 2025, including Llama 3 70B, were trained on roughly 200 tokens per parameter, an order of magnitude past the Chinchilla recipe.
The practice of scaling neural language models had been guided since 2020 by a paper from OpenAI titled Scaling Laws for Neural Language Models, written by Jared Kaplan, Sam McCandlish, and colleagues [2]. Kaplan et al. trained a large number of small to medium transformer language models, varying parameter count, dataset size, and training compute over several orders of magnitude. They fit power laws to the resulting test losses and concluded that, given a fixed compute budget, the optimal allocation pushes the great majority of the budget into model size, with only a modest increase in dataset tokens. Translated into a rule of thumb, Kaplan suggested that for every tenfold increase in compute, model size should grow by roughly a factor of 5.5 while data should grow by only about 1.8.
Kaplan's recommendation matched the trajectory the field was on. GPT-3 was 175 billion parameters trained on 300 billion tokens, a ratio of just under two tokens per parameter. Jurassic-1 was 178 billion parameters trained on 300 billion tokens. Megatron-Turing NLG was 530 billion parameters trained on 270 billion tokens. DeepMind's own Gopher, released in late 2021, was 280 billion parameters trained on 300 billion tokens. Each of these models was, on the Kaplan analysis, an aggressive scale-up with relatively modest growth in training data. The unspoken implication was that further progress would come from continuing to grow models even faster, leading to credible roadmaps for trillion-parameter and ten-trillion-parameter models that would still be trained on a few hundred billion tokens.
DeepMind's team was uneasy with this trajectory. The Kaplan analysis had used a relatively narrow range of compute budgets, focused on small models, and did not isolate the effect of training schedule choices on the loss extrapolations. In particular, the Kaplan models were trained with learning-rate schedules tuned for a fixed number of steps, and Kaplan extrapolated the loss-versus-compute frontier without retuning these schedules at each scale. The Chinchilla authors hypothesized that improperly tuned schedules might exaggerate how much benefit comes from increasing model size relative to data. Settling the question required a much larger and more carefully controlled experiment.
Hoffmann et al. trained more than 400 transformer language models, ranging from 70 million to over 16 billion parameters, on subsets of the MassiveText corpus ranging from 5 billion to 500 billion tokens [1]. The training schedule for each model was tuned to that model's exact step count rather than extrapolated from a generic schedule. The authors then fit three independent statistical procedures to the resulting loss surface, each designed to estimate the compute-optimal frontier from a different angle.
The first approach, "fix model sizes and vary training tokens," trained each of a fixed set of model sizes at several different token budgets and, for each level of training compute, identified the run that reached the lowest loss. The second approach, "IsoFLOP profiles," fixed total training compute and varied the model size, identifying the model size that achieved the lowest loss at each compute budget. The third approach, "fitting a parametric loss function," globally fit a closed-form expression for the loss as a function of N (parameters) and D (tokens), and derived the compute-optimal allocation analytically.
The parametric loss function takes the form

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

where $E$ is the irreducible entropy of natural language, $A$ and $B$ are fitted coefficients, and $\alpha$ and $\beta$ are fitted exponents. Hoffmann et al. reported point estimates of $E = 1.69$ nats per token, $A = 406.4$, $B = 410.7$, $\alpha = 0.34$, and $\beta = 0.28$. The exponents were close enough that the implied compute-optimal allocation puts roughly equal exponents on parameters and tokens.
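As a rough illustration, the fitted form can be evaluated directly. The sketch below plugs in the rounded point estimates quoted above and compares a Chinchilla-shaped run with a Gopher-shaped run; the function name `parametric_loss` is chosen here for illustration, and the rounded constants should not be expected to reproduce the paper's measured losses exactly.

```python
# Rounded point estimates as quoted in the text (Hoffmann et al.).
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def parametric_loss(n_params: float, n_tokens: float) -> float:
    """Evaluate L(N, D) = E + A / N^alpha + B / D^beta (nats per token)."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


# Model shapes taken from the article: 70B / 1.4T versus 280B / 300B.
chinchilla_loss = parametric_loss(70e9, 1.4e12)
gopher_loss = parametric_loss(280e9, 300e9)

print(f"Chinchilla-shaped run: {chinchilla_loss:.3f}")
print(f"Gopher-shaped run:     {gopher_loss:.3f}")
```

Under this fit the smaller model trained on more tokens is predicted to reach the lower loss, which is the direction of the paper's headline result.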
All three approaches gave the same qualitative answer: the compute-optimal allocation puts roughly equal weight on growing the model and growing the dataset. The combined estimate suggested that the compute-optimal model for the Gopher compute budget would have about 67 to 70 billion parameters trained on about 1.5 trillion tokens, roughly four times the data Gopher actually saw and roughly a quarter of Gopher's parameter count.
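The IsoFLOP idea can be sketched without any training runs by using the parametric fit as a stand-in for measured losses. This is an assumption made only for illustration: the real profiles come from trained models, and the rounded constants need not reproduce the combined 67 to 70 billion estimate quoted above. The sketch fixes compute, sweeps model size with the token count set by the common approximation C = 6ND, and compares the grid minimum with the closed-form minimizer of the same fit. The function names are hypothetical.

```python
# Rounded point estimates as quoted in the text (Hoffmann et al.).
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def parametric_loss(n_params, n_tokens):
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def isoflop_minimum(compute, n_min=1e8, n_max=1e13, steps=2001):
    """Sweep model size on a log grid at fixed compute; return (N, D, loss) at the minimum."""
    best = None
    for i in range(steps):
        n = n_min * (n_max / n_min) ** (i / (steps - 1))
        d = compute / (6 * n)  # tokens implied by C = 6 * N * D
        loss = parametric_loss(n, d)
        if best is None or loss < best[2]:
            best = (n, d, loss)
    return best


def closed_form_optimum(compute):
    """Analytic minimizer of the parametric loss subject to C = 6 * N * D."""
    g = (ALPHA * A / (BETA * B)) ** (1 / (ALPHA + BETA))
    n = g * (compute / 6) ** (BETA / (ALPHA + BETA))
    return n, compute / (6 * n)


budget = 5.76e23  # Gopher-scale budget quoted in the article
n_grid, d_grid, _ = isoflop_minimum(budget)
n_exact, d_exact = closed_form_optimum(budget)
print(f"grid:   N = {n_grid:.3e}, D = {d_grid:.3e}")
print(f"closed: N = {n_exact:.3e}, D = {d_exact:.3e}")
```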
To test their prediction, the DeepMind team trained a single model called Chinchilla at the recipe their analysis prescribed: 70 billion parameters and 1.4 trillion tokens, using essentially the same training compute as the original 280-billion-parameter Gopher run [1]. The Chinchilla architecture is a decoder-only transformer with 80 layers, an internal model dimension of 8,192, 64 attention heads with key and value dimensions of 128, and the standard residual transformer block layout. The context length used during training was 2,048 tokens, and the batch size was ramped from roughly 1.5 million tokens up to 3 million tokens during training to maintain numerical stability and improve sample efficiency at scale.
Chinchilla keeps the Gopher architecture, including RMSNorm in place of standard LayerNorm and relative rather than absolute positional encodings, so the differences between the two models lie mainly in the training recipe. Chinchilla uses a rebalanced data mixture, AdamW instead of plain Adam, and a slightly modified SentencePiece tokenizer that does not apply NFKC normalization and shares about 94 percent of its vocabulary with Gopher's tokenizer. The peak learning rate was 1e-4, with a cosine decay schedule reducing the rate to 10 percent of the peak.
The training corpus was a recomposed subset of MassiveText, the same proprietary mixture used to train Gopher. MassiveText is a curated blend of web text, books, news articles, GitHub code, scientific papers, dialogue, and Wikipedia, with the web component (called MassiveWeb) making up the largest single fraction. For Chinchilla, the source mixture weights were rebalanced to account for the much larger total token budget. Even at 1.4 trillion tokens, much of the Chinchilla corpus is seen only once during training; only the highest-quality subsets are repeated within the run.
| Property | Value |
|---|---|
| Parameters | 70 billion |
| Layers | 80 |
| Internal dimension | 8,192 |
| Attention heads | 64 |
| Key/value head dimension | 128 |
| Context length | 2,048 |
| Batch size (tokens) | 1.5M ramped to 3M |
| Tokenizer | Modified SentencePiece, vocab approximately 32k |
| Optimizer | AdamW |
| Peak learning rate | 1e-4 |
| Schedule | Linear warmup then cosine decay to 10% of peak |
| Normalization | RMSNorm |
| Positional encoding | Relative |
| Training tokens | 1.4 trillion |
| Training corpus | MassiveText (rebalanced) |
| Total training compute | Approximately 5.76 × 10^23 FLOPs (matched Gopher) |
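The schedule row in the table (linear warmup, then cosine decay to 10 percent of the peak) can be written down in a few lines. The sketch below is a generic implementation of that shape, not DeepMind's training code; the warmup length and total step count are hypothetical values picked for illustration, since the table does not state them.

```python
import math

PEAK_LR = 1e-4          # peak learning rate from the table
FINAL_FRACTION = 0.1    # cosine decay ends at 10% of the peak
WARMUP_STEPS = 1_000    # hypothetical: not given in the article
TOTAL_STEPS = 500_000   # hypothetical: not given in the article


def learning_rate(step: int,
                  total_steps: int = TOTAL_STEPS,
                  warmup_steps: int = WARMUP_STEPS) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to FINAL_FRACTION * PEAK_LR."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    floor = PEAK_LR * FINAL_FRACTION
    return floor + 0.5 * (PEAK_LR - floor) * (1.0 + math.cos(math.pi * progress))


for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{learning_rate(step):.2e}")
```

The point stressed in the paper is that the decay horizon (`total_steps` here) should match the actual length of the run, rather than being inherited from a schedule designed for a different number of steps.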
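The table below summarizes the salient differences between Chinchilla and the most-discussed frontier language models of the period, with later Llama releases included for comparison. Training compute is shown in zettaFLOPs (10^21 FLOPs) and is approximate; the Gopher entry, for example, matches the 6 × parameters × tokens estimate rather than the 5.76 × 10^23 FLOPs budget reported by Hoffmann et al. Chinchilla and Gopher were trained with essentially the same total compute.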
| Model | Lab | Released | Parameters | Training tokens | Tokens per parameter | Training compute (zFLOPs) |
|---|---|---|---|---|---|---|
| GPT-3 | OpenAI | 2020 | 175B | 300B | 1.7 | approximately 314 |
| Jurassic-1 | AI21 Labs | 2021 | 178B | 300B | 1.7 | approximately 320 |
| Gopher | DeepMind | 2021 | 280B | 300B | 1.1 | approximately 504 |
| Megatron-Turing NLG | Microsoft / NVIDIA | 2022 | 530B | 270B | 0.5 | approximately 858 |
| Chinchilla | DeepMind | 2022 | 70B | 1.4T | 20.0 | approximately 504 |
| LLaMA 65B | Meta | 2023 | 65B | 1.4T | 21.5 | approximately 449 |
| Llama 2 70B | Meta | 2023 | 70B | 2.0T | 28.6 | approximately 720 |
| Llama 3 70B | Meta | 2024 | 70B | 15.0T | 214 | approximately 5,400 |
Reading this table from top to bottom shows the dramatic shift in training philosophy that Chinchilla triggered. The 2020 to early 2022 models cluster near one to two tokens per parameter, the original Kaplan recommendation. Chinchilla represents a sudden jump to 20 tokens per parameter. The post-Chinchilla open-weight models trained between 2023 and 2025 push the ratio steadily higher as labs accept the trade-off of additional training compute in exchange for cheaper inference and stronger downstream behavior on the same parameter footprint.
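The tokens-per-parameter column is simple arithmetic and can be recomputed from the parameter and token counts in the table. The short sketch below does that and also expresses each ratio as a multiple of the 20:1 rule of thumb; the dictionary layout and function name are illustrative choices.

```python
# (parameters, training tokens) as listed in the table above.
MODELS = {
    "GPT-3": (175e9, 300e9),
    "Jurassic-1": (178e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Megatron-Turing NLG": (530e9, 270e9),
    "Chinchilla": (70e9, 1.4e12),
    "LLaMA 65B": (65e9, 1.4e12),
    "Llama 2 70B": (70e9, 2.0e12),
    "Llama 3 70B": (70e9, 15.0e12),
}

CHINCHILLA_RATIO = 20.0  # rule-of-thumb tokens per parameter


def tokens_per_parameter(n_params: float, n_tokens: float) -> float:
    return n_tokens / n_params


for name, (n_params, n_tokens) in MODELS.items():
    ratio = tokens_per_parameter(n_params, n_tokens)
    print(f"{name:22s} {ratio:7.1f} tokens/param "
          f"({ratio / CHINCHILLA_RATIO:.2f}x the 20:1 rule)")
```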
Chinchilla outperformed Gopher and the other contemporary frontier models on essentially every benchmark Hoffmann et al. evaluated. Aggregated across over 150 downstream tasks, Chinchilla improved on Gopher in roughly 80 percent of cases, often by margins large enough to be visible without statistical sophistication [1]. The clearest single-number summary is the MMLU benchmark, a multiple-choice test covering 57 academic and professional subjects: Chinchilla scored 67.5 percent on average, against 60.0 percent for Gopher, a 7.5-percentage-point absolute improvement. The same trend appeared on BIG-bench (Chinchilla 65.1 percent versus Gopher 54.4 percent), on TruthfulQA, and on a long list of zero-shot and few-shot tasks covering reading comprehension, common-sense reasoning, question answering, and translation.
Language modeling performance also improved markedly. On Wikitext103, Chinchilla achieved a perplexity of 7.16 versus Gopher's 7.75. On The Pile, a widely used benchmark suite of held-out text from many domains (scored in bits per byte), Chinchilla improved on Gopher on every individual subset rather than on a single domain. The uniformity of the gains matters: it suggests that compute-optimal training does not just trade performance on one type of text against another but produces a better model across essentially everything the authors measured.
The practical implications of these results extend beyond the benchmark numbers. Because Chinchilla is one quarter the size of Gopher, fine-tuning it and serving it at inference time each take roughly one quarter of the compute. For applications where the trained model will be deployed and run many times, the Chinchilla recipe is therefore preferable on both counts: it gives a better model, and that model is also cheaper to use. This combination of higher quality and lower deployment cost is unusual in machine learning research and helps explain why the field accepted the Chinchilla recommendations so quickly.
| Benchmark | Chinchilla 70B | Gopher 280B | GPT-3 175B |
|---|---|---|---|
| MMLU (5-shot, average) | 67.5% | 60.0% | 43.9% |
| BIG-bench (average) | 65.1% | 54.4% | not reported |
| Wikitext103 (perplexity, lower is better) | 7.16 | 7.75 | not directly comparable |
| LAMBADA (zero-shot) | 77.4% | 74.5% | 76.2% |
| TriviaQA (1-shot) | 73.3% | 64.0% | 68.0% |
| HellaSwag (zero-shot) | 80.8% | 79.2% | 78.9% |
The table below contrasts the practical recommendations that fall out of the Kaplan and Chinchilla analyses. The numbers in the rightmost column are the implied tokens-per-parameter ratio that each scaling law would suggest for a model trained at roughly the same total compute as Chinchilla.
| Scaling law | Year | Lab | Parameter exponent (a) | Token exponent (b) | Compute split | Tokens per parameter at Chinchilla scale |
|---|---|---|---|---|---|---|
| Kaplan et al. | 2020 | OpenAI | approximately 0.73 | approximately 0.27 | most compute into parameters | approximately 1.7 |
| Hoffmann et al. (Chinchilla) | 2022 | DeepMind | approximately 0.50 | approximately 0.50 | balanced | approximately 20 |
| Besiroglu et al. (replication) | 2024 | Epoch AI | approximately 0.51 | approximately 0.49 | balanced (slightly favors parameters) | approximately 22 |
| Sardana et al. (inference-aware) | 2024 | MosaicML / Databricks | depends on inference demand | depends on inference demand | favors more tokens than Chinchilla | 50 to 200+ |
The parameter exponent a and the token exponent b describe how to split compute as a fraction: if a equals 0.5 and b equals 0.5, then a doubling of compute should result in a 1.41-fold increase in both parameters and tokens. Kaplan's exponents allocate roughly three-quarters of the additional compute to parameter growth and one-quarter to data growth. Hoffmann's exponents allocate roughly equal amounts to each. The inference-aware scaling laws of Sardana et al. depend on how many inference tokens the model is expected to generate over its lifetime; for models expected to serve heavy inference loads, the optimal training shifts further toward more data on smaller models.
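To make the exponents concrete, the sketch below converts a compute multiplier into the implied growth in parameters and tokens. It uses the approximate exponents from the table, so the outputs are only as precise as those rounded values; the function name is an illustrative choice.

```python
def scale_factors(compute_multiplier: float, a: float, b: float) -> tuple:
    """Growth in (parameters, tokens) when compute grows by compute_multiplier,
    assuming N_opt ~ C^a and D_opt ~ C^b."""
    return compute_multiplier**a, compute_multiplier**b


# Approximate exponents from the table above.
EXPONENTS = {
    "Kaplan et al.": (0.73, 0.27),
    "Hoffmann et al.": (0.50, 0.50),
}

for name, (a, b) in EXPONENTS.items():
    for multiplier in (2, 10):
        n_growth, d_growth = scale_factors(multiplier, a, b)
        print(f"{name:16s} {multiplier:3d}x compute -> "
              f"{n_growth:.2f}x params, {d_growth:.2f}x tokens")
```

Because a and b sum to one in both cases, the two growth factors always multiply back to the compute multiplier, which is what C = 6ND requires.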
The most widely repeated takeaway from the paper is the rule that compute-optimal training requires roughly 20 tokens per parameter. This number is not dogma; it is a single point on a continuum that depends on the compute budget. Hoffmann et al. derived slightly different ratios at different scales, with the optimum drifting upward as compute grows. At the Chinchilla compute budget of about 5.76 × 10^23 FLOPs, the optimum lands very near 20 tokens per parameter. The rule is best understood as the answer at the scale where most public LLM training was happening in 2022, and as a reminder that pre-Chinchilla models were systematically trained on far less data than they should have been.
The rule is easy to apply. Training compute is commonly approximated as $C \approx 6ND$, that is, about 6 floating-point operations per parameter per training token for the combined forward and backward pass. Substituting $D = 20N$ gives $C \approx 120N^2$, so a planner with $C$ FLOPs of compute can approximate the compute-optimal model size as $N \approx \sqrt{C/120}$. Once the parameter count is fixed, the token count is simply 20 times the parameter count. In practice, almost no team trains exactly at this point because inference cost, fine-tuning cost, deployment latency, and engineering complexity all push the optimum away from the strict Chinchilla recommendation. The Chinchilla recipe is the right answer to the question "if I could only minimize pretraining loss for a fixed amount of training compute, what should I do?" but that is rarely the most important question for a production system.
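The rule of thumb above fits in a few lines of code. The sketch assumes the 6-FLOPs-per-parameter-per-token approximation and a flat 20 tokens per parameter, both of which are simplifications; the function name is hypothetical.

```python
import math

FLOPS_PER_PARAM_PER_TOKEN = 6   # forward + backward approximation
TOKENS_PER_PARAM = 20           # Chinchilla rule of thumb


def chinchilla_rule_of_thumb(compute_flops: float) -> tuple:
    """Return (parameters, tokens) with D = 20 * N and 6 * N * D = compute_flops."""
    n_params = math.sqrt(
        compute_flops / (FLOPS_PER_PARAM_PER_TOKEN * TOKENS_PER_PARAM)
    )
    return n_params, TOKENS_PER_PARAM * n_params


n_params, n_tokens = chinchilla_rule_of_thumb(5.76e23)
print(f"N = {n_params / 1e9:.1f}B parameters, D = {n_tokens / 1e12:.2f}T tokens")
```

At the 5.76 × 10^23 FLOPs budget this lands close to the 70 billion parameters and 1.4 trillion tokens used for Chinchilla itself.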
The practical influence of Chinchilla on subsequent LLM training has been enormous. The first generation of LLaMA models from Meta, released in February 2023, was the first widely visible application of Chinchilla-style training in the open-weight world: LLaMA 7B and 13B were trained on 1 trillion tokens, and LLaMA 33B and 65B on 1.4 trillion tokens, all roughly at or beyond the Chinchilla recommendation [3]. Touvron et al. explicitly cited Chinchilla as the rationale and noted that, while the analysis identified the compute-optimal training point, they intentionally trained past it to lower inference cost.
The same logic carried through subsequent open-weight families. Llama 2 70B was trained on 2 trillion tokens, roughly 28 tokens per parameter. Falcon 40B was trained on 1 trillion tokens. Mistral 7B was trained on far more tokens than the Chinchilla recommendation of about 140 billion would suggest. Qwen, DeepSeek, and most other open-weight models have followed a similar template: a parameter count chosen for efficient deployment on commodity GPUs, trained on a token budget several times larger than the strict Chinchilla minimum so that inference cost can be amortized over many billions of requests.
The trend culminated in Llama 3 and Llama 3.1, where the 8B and 70B models were trained on 15 trillion tokens [6]. For the 70B model that is roughly 200 tokens per parameter, an order of magnitude beyond Chinchilla optimal; for the 8B model it is close to 1,900. Meta justified this by pointing to the falling cost of training compute relative to the deployment lifetime of the model. Closed frontier labs are presumed to follow similar reasoning: while OpenAI has never published the exact training data sizes for GPT-4 or its successors, public estimates suggest GPT-4-class models were trained on multiple trillions of tokens, well past the Chinchilla recommendation for their compute budget. Anthropic, Google, and others appear to have followed similar trajectories.
In April 2024, Tamay Besiroglu, Ege Erdil, Matthew Barnett, and Josh You at Epoch AI published a replication study under the title Chinchilla Scaling: A Replication Attempt [4]. The paper refit the parametric loss function using data points reproduced from the figures of the original Hoffmann et al. paper and reported several discrepancies. The fitted exponents and constants from the replication did not match the values Hoffmann et al. reported, and the original confidence intervals were narrower than statistically plausible given the number of training runs the team could have performed.
Besiroglu and colleagues identified two causes. First, the reported confidence intervals would have required hundreds of thousands of independent training runs to be that narrow, while the original paper described running fewer than 500. Second, some parameter values reported in the body of the original paper were rounded in a way that produced biased predictions when used to derive the compute-optimal recommendation. The replication reported revised estimates of A approximately 482, B approximately 2,085, E approximately 1.82, alpha approximately 0.348, and beta approximately 0.366, with substantially wider confidence intervals.
Importantly, the replication did not overturn the qualitative Chinchilla conclusion. The corrected exponents still suggested that compute-optimal allocation puts roughly equal weight on parameter growth and data growth, with parameter growth very slightly favored under the revised values. The 20-tokens-per-parameter rule survived essentially intact, though the precise number drifts toward 22 tokens per parameter under the revised constants. Several DeepMind authors of the original paper publicly acknowledged the corrections and credited Epoch AI with strengthening the analysis. A follow-up study published in 2025, Evaluating the Robustness of Chinchilla Compute-Optimal Scaling, found that the qualitative recommendation of equal scaling is robust, but that the precise ratio depends meaningfully on training-recipe choices that were held fixed in the original experiments [7].
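One way to see how the two fits compare is to convert each pair of loss exponents into allocation exponents. The sketch below assumes the standard result for this parametric form under C = 6ND, namely that the optimal parameter count scales as C^a with a = beta / (alpha + beta) and the optimal token count as C^b with b = alpha / (alpha + beta). The input values are the rounded estimates quoted in this article.

```python
# Rounded loss exponents quoted in the article.
FITS = {
    "Hoffmann et al. (2022)": {"alpha": 0.34, "beta": 0.28},
    "Besiroglu et al. (2024)": {"alpha": 0.348, "beta": 0.366},
}


def allocation_exponents(alpha: float, beta: float) -> tuple:
    """Return (a, b) with N_opt ~ C^a and D_opt ~ C^b for the parametric loss."""
    total = alpha + beta
    return beta / total, alpha / total


for name, fit in FITS.items():
    a, b = allocation_exponents(fit["alpha"], fit["beta"])
    print(f"{name:24s} a = {a:.3f}, b = {b:.3f}")
```

With these inputs the split moves from about 0.45 / 0.55 under the original parametric fit to about 0.51 / 0.49 under the replication, both close to an even allocation.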
The original analysis considers only the cost of training. In production, a model is then run on inference workloads that may consume far more compute than the training run itself. For widely deployed models, lifetime inference compute can exceed training compute by ten to one hundred times. In this regime, choosing a smaller model and training it on more data than the compute-optimal recipe prescribes offers substantial savings even though reaching the same loss costs extra training compute, because every inference request is cheaper forever after.
Nikhil Sardana, Jacob Portes, and colleagues at MosaicML formalized this argument in the 2024 paper Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws [5]. They derived a modified scaling law that adds an inference-cost term to the optimization objective and showed that, for any anticipated inference demand greater than zero, the optimal model is smaller than the Chinchilla recipe would suggest and trained on more tokens. For a model expected to generate a trillion tokens of inference (a heavily used product), the optimum shifts to several hundred tokens per parameter. This framework explained a shift in industry practice that had already happened, and put a quantitative footing under it. The current consensus is that, for most production LLMs, the right training point is somewhere between two and ten times the Chinchilla recommendation, depending on how aggressively the model will be served.
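The trade-off can be sketched numerically. The code below is a simplified illustration, not the authors' implementation: it assumes the parametric loss with the rounded Hoffmann et al. constants quoted earlier in the article, about 6 FLOPs per parameter per training token, and about 2 FLOPs per parameter per inference token. For a target loss it searches over model sizes, computes the training tokens each size needs to hit that loss, and picks the size with the lowest combined training and inference cost. All function names are hypothetical.

```python
# Rounded point estimates quoted earlier in the article (Hoffmann et al.).
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def parametric_loss(n_params, n_tokens):
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def tokens_needed(n_params, target_loss):
    """Training tokens a model of this size needs to reach target_loss,
    or None if the model is too small to reach it at all."""
    gap = target_loss - E - A / n_params**ALPHA
    if gap <= 0:
        return None
    return (B / gap) ** (1 / BETA)


def cheapest_model(target_loss, inference_tokens,
                   n_min=1e9, n_max=1e12, steps=3001):
    """Grid-search model size to minimize 6*N*D_train + 2*N*D_inference."""
    best = None
    for i in range(steps):
        n = n_min * (n_max / n_min) ** (i / (steps - 1))
        d = tokens_needed(n, target_loss)
        if d is None:
            continue
        total_flops = 6 * n * d + 2 * n * inference_tokens
        if best is None or total_flops < best[2]:
            best = (n, d, total_flops)
    return best


target = parametric_loss(70e9, 1.4e12)  # quality of a Chinchilla-shaped run
for demand in (0.0, 1e12, 1e13):
    n, d, _ = cheapest_model(target, demand)
    print(f"inference tokens {demand:.0e}: N = {n:.2e}, "
          f"D = {d:.2e}, tokens/param = {d / n:.0f}")
```

As expected inference demand grows, the cheapest model that reaches the same loss becomes smaller and is trained on more tokens, which is the qualitative behavior the paragraph above describes.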
The Chinchilla paper drew wide attention when it was released as an arXiv preprint in March 2022 and went on to win an Outstanding Paper award at NeurIPS 2022. The paper has been cited thousands of times since and is regularly listed in top-ten compilations of the most influential modern AI papers. The phrase "Chinchilla optimal" entered the standard vocabulary of LLM practitioners almost immediately.
Part of the paper's impact comes from the elegance of its central claim. The conclusion that parameters and tokens should grow together is simple enough to remember and broad enough to apply to almost any LLM training plan. The recommendation is constructive: it does not just identify a problem but tells you how to fix it, and the fix turns out to be cheaper to deploy than what came before. A second source of impact is the empirical confidence the paper provides. Hoffmann et al. did not just propose a theory; they trained Chinchilla itself as a proof of concept and demonstrated that the recipe outperformed the prevailing alternative on essentially every metric.
A third source of impact, sometimes underestimated, is timing. The paper landed exactly when the field was transitioning from a regime where only very large industry labs trained frontier models to a regime where many labs and academic groups were attempting to do so. The Chinchilla recipe gave smaller players a path to compete: rather than scaling parameter counts into the trillions, they could compete by training relatively modest models on massive corpora. This democratization, combined with the open release of LLaMA and its successors, helped fuel the open-weight LLM ecosystem that took shape over 2023 and 2024.
The Chinchilla scaling laws are an empirical fit to a particular set of training experiments and inherit several limitations from the experimental setup. The experiments used MassiveText, a specific corpus composition that may not generalize to all training data, and a fixed transformer architecture; substantially different architectures (mixture-of-experts, state-space models) may have different scaling behavior. The analysis also assumes that pretraining loss is a sufficient summary of model quality, but downstream task performance and emergent abilities depend on the loss in complicated ways.
The scaling-law extrapolation has limits at the extremes. The Chinchilla experiments did not explore models beyond about 16 billion parameters or training runs beyond about 500 billion tokens, and extrapolating to trillion-parameter or trillion-token regimes requires faith in the power-law fit far outside its tested range. Several follow-up studies have found that the simple parametric form starts to break down at very large scales, with the exponents drifting modestly as N and D grow. Finally, the Chinchilla recipe addresses pretraining only. Modern frontier models include extensive post-training (supervised fine-tuning, RLHF, distillation), and the optimal pretraining checkpoint may not be the optimal starting point for these procedures.