Chinchilla
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 4,483 words
Chinchilla is a 70-billion-parameter transformer large language model and an accompanying family of compute-optimal scaling laws, both introduced in the March 2022 paper Training Compute-Optimal Large Language Models by Jordan Hoffmann and colleagues at DeepMind [1]. The paper's central empirical claim was that frontier foundation models of the era, including DeepMind's own Gopher 280B and OpenAI's GPT-3 175B, were substantially undertrained: under a fixed training-compute budget, model size and the number of training tokens should grow together rather than parameters racing ahead of data. Translated into a memorable rule of thumb, the compute-optimal ratio is roughly 20 training tokens per parameter at the Gopher-scale compute budget, far above the one or two tokens per parameter that most contemporaneous models had been trained on.
To validate the recipe empirically, the DeepMind team trained a single model, Chinchilla, with 70 billion parameters on 1.4 trillion tokens, using essentially the same training compute as the 280-billion-parameter Gopher run [1]. Chinchilla outperformed Gopher on the great majority of downstream evaluations, often by large margins, and also beat GPT-3 175B, Jurassic-1 178B, and Megatron-Turing NLG 530B despite being a fraction of their parameter counts. The Chinchilla weights were never publicly released; the artifact that propagated through the field was the paper and the recipe it prescribed [2].
The Chinchilla result reshaped LLM training in 2023 and beyond. Meta's LLaMA family (February 2023) was the first widely visible application of Chinchilla-style training in the open-weight world, and subsequent open-weight families (Llama 2, Llama 3, Llama 3.1, Mistral, Qwen, DeepSeek) followed broadly Chinchilla-influenced compute allocations. In practice most of these models intentionally went beyond Chinchilla optimal (Llama 2 70B at ~28 tokens per parameter, Llama 3 and Llama 3.1 70B at ~214) because, for systems serving billions of inference requests, training a smaller model on more data lowers total lifetime cost [3][4]. A 2024 Epoch AI reproduction by Tamay Besiroglu, Ege Erdil, Matthew Barnett, and Josh You (with informal notes from Robin Pearce) identified arithmetic and confidence-interval errors in the parametric-fit method (Approach 3) of the original paper, but the qualitative recommendation of equal scaling survived rederivation [5].
The dominant guidance on how to scale transformer language models from 2020 to early 2022 came from Scaling Laws for Neural Language Models by Jared Kaplan, Sam McCandlish, and colleagues at OpenAI [6]. Kaplan et al. trained many small to medium transformer language models, varying parameter count, dataset size, and training compute over several orders of magnitude, and fit power laws to the resulting test loss. Their headline conclusion: given a fixed compute budget, the optimal allocation puts the great majority of additional compute into model size, with only a modest increase in training tokens. In their formulation, for every tenfold increase in compute, model size should grow by roughly a factor of 5.5 while training tokens should grow by only about a factor of 1.8.
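The per-decade growth factors quoted above can be converted into power-law exponents, which makes them easier to compare with the Chinchilla exponents discussed later. A minimal sketch; the helper name `growth_exponent` is illustrative and not taken from either paper:

```python
import math

def growth_exponent(factor_per_decade: float) -> float:
    """Exponent p in quantity ∝ C**p, given the quantity's growth per 10x compute."""
    return math.log10(factor_per_decade)

n_exp = growth_exponent(5.5)  # parameters: 5.5x per tenfold compute
d_exp = growth_exponent(1.8)  # training tokens: 1.8x per tenfold compute

print(f"N ∝ C^{n_exp:.2f}, D ∝ C^{d_exp:.2f}")
# The two factors multiply to about 10 because compute is roughly
# proportional to parameters times tokens.
print(f"combined growth per 10x compute: {5.5 * 1.8:.1f}x")
```

The exponents sum to approximately one, so the Kaplan recipe is a statement about how a fixed budget is split, with about three quarters of each additional order of magnitude going to model size.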
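This recommendation lined up with the prevailing experimental trajectory. By late 2021, the most visible frontier models, including GPT-3 175B, Jurassic-1 178B, Gopher 280B, and Megatron-Turing NLG 530B, were running on something close to the Kaplan recipe, each trained on roughly 300 billion tokens or fewer.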
Each of these models was an aggressive scale-up of parameter count with relatively modest growth in training data. The implication of Kaplan's analysis was that further progress would come from continuing to grow models even faster, leading to credible roadmaps for trillion-parameter models that would still be trained on only a few hundred billion tokens.
Gopher, released by DeepMind in December 2021, occupies a special place in the Chinchilla story. Gopher was the largest dense LLM DeepMind had trained, at 280 billion parameters and roughly 300 billion training tokens drawn from MassiveText, an internal curated corpus of web text, books, news, code, and scientific papers [9]. Gopher consumed roughly the same compute the Chinchilla team would later use, making it the natural foil: by holding total training compute fixed and re-allocating it between parameters and tokens, Hoffmann et al. could ask whether DeepMind's flagship had been built at the wrong point on the parameter-versus-data Pareto frontier. The answer turned out to be yes, by a wide margin.
Several authors of the Chinchilla paper were uneasy with the Kaplan extrapolations. The Kaplan analysis had used a relatively narrow range of compute budgets, focused on small models, and had not isolated the effect of training-schedule choices on the loss extrapolations. In particular, the Kaplan models were trained with learning-rate schedules tuned for a fixed number of steps; the loss-versus-compute frontier was then extrapolated without retuning these schedules at each compute level. Hoffmann et al. hypothesized that improperly tuned cosine schedules would underestimate the benefit of extra data and exaggerate the benefit of extra parameters. Settling the question required a larger and more carefully controlled experiment.
The paper, Training Compute-Optimal Large Language Models, was posted to arXiv on March 29, 2022 as arXiv:2203.15556 [1]. The author list runs to twenty-two DeepMind researchers; first author Jordan Hoffmann is joined by Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack Rae, Oriol Vinyals, and Laurent Sifre. The paper went on to win an Outstanding Paper award at NeurIPS 2022.
The paper's contribution has two complementary parts. The first is an empirical scaling-law study that uses three independent statistical procedures to estimate the compute-optimal frontier from different angles, then combines their predictions. The second is the training of Chinchilla itself as a single, expensive proof of concept that the recipe holds at the scale of a frontier model. DeepMind also published a companion blog post summarizing the findings [2].
To support the central claim, Hoffmann et al. trained more than 400 transformer language models, ranging from 70 million to over 16 billion parameters, on subsets of the MassiveText corpus ranging from 5 billion to 500 billion tokens [1]. Crucially, the training schedule for each model was retuned for that model's exact step count rather than extrapolated from a generic schedule, addressing what the team believed was the principal methodological flaw in the Kaplan analysis. The team then fit three independent statistical procedures to the resulting loss surface.
The first approach holds a small set of model sizes fixed and sweeps the training token count over several orders of magnitude for each one. For each model size, the resulting loss-versus-tokens curves can be intersected with curves of constant training compute (an "isoFLOP" intersection). Reading the family of these intersections across model sizes traces out a compute-optimal frontier: for each compute budget, the model size and token count that minimizes loss [1]. Hoffmann et al. fit a power law to the optimal parameter count and the optimal token count as functions of training compute and report exponents of approximately 0.50 for both, implying that compute should be split roughly equally between parameters and tokens at the scale studied.
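Fitting a power law to the frontier amounts to a straight-line regression in log-log space. The sketch below shows the idea on made-up frontier points rather than the paper's data; `fit_power_law`, the sample compute budgets, and the 0.09 prefactor are all hypothetical.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = k * x**a in log-log space; returns (k, a)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    # Slope of the regression line is the power-law exponent.
    a = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum((u - mx) ** 2 for u in lx)
    k = math.exp(my - a * mx)
    return k, a

# Hypothetical frontier points (compute in FLOPs, optimal parameter count),
# generated from N = 0.09 * C**0.5 purely for illustration.
compute = [1e18, 1e19, 1e20, 1e21, 1e22]
n_opt = [0.09 * c ** 0.5 for c in compute]

k, a = fit_power_law(compute, n_opt)
tokens_exponent = 1.0 - a  # follows if compute is proportional to N * D
print(f"N_opt ≈ {k:.3f} * C^{a:.2f}, so D_opt ∝ C^{tokens_exponent:.2f}")
```

On real runs the frontier points are noisy, so the fitted exponent carries uncertainty that this toy example does not show.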
The second approach inverts the first. Rather than varying tokens at a fixed model size, Hoffmann et al. fix nine training compute budgets ranging from 6×10^18 to 3×10^21 FLOPs and, for each, train a family of models of different sizes, with the token count for each model chosen so that every run consumes exactly that budget. Plotted against the logarithm of parameter count, the final losses at each budget trace a roughly parabolic curve whose minimum identifies the compute-optimal model size for that budget. Stitching together these minima across compute budgets yields a second estimate of the compute-optimal frontier [1]. Approach 2 again returns exponents very close to 0.50 for both parameters and tokens, in close agreement with Approach 1.
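The isoFLOP procedure can be sketched as a quadratic fit in log parameter count followed by reading off the vertex. The example below is an illustration under stated assumptions, not the paper's code: losses are synthesized from the parametric form and rounded constants quoted later in this article instead of real training runs, compute is approximated as 6 × N × D, and the size sweep and function names are hypothetical.

```python
import math

# Rounded constants quoted later in this article; used here only to
# synthesize losses in place of real training runs.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n, d):
    return E + A / n ** ALPHA + B / d ** BETA

def fit_quadratic(xs, ys):
    """Least-squares fit y ≈ c2*x^2 + c1*x + c0; returns (c2, c1, c0)."""
    s = [sum(x ** p for x in xs) for p in range(5)]  # power sums of x
    t = [sum(y * x ** p for x, y in zip(xs, ys)) for p in range(3)]
    m = [[s[4], s[3], s[2], t[2]],
         [s[3], s[2], s[1], t[1]],
         [s[2], s[1], s[0], t[0]]]
    # Gauss-Jordan elimination with partial pivoting on the normal equations.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))

def isoflop_optimum(compute, sizes):
    """Model size at the vertex of the parabola fitted to one isoFLOP curve."""
    xs = [math.log10(n) for n in sizes]
    mean = sum(xs) / len(xs)
    us = [x - mean for x in xs]  # centre x for numerical stability
    ys = [loss(n, compute / (6 * n)) for n in sizes]  # tokens set by the budget
    c2, c1, _ = fit_quadratic(us, ys)
    return 10 ** (mean - c1 / (2 * c2))

sizes = [2.5e8, 5e8, 1e9, 2e9, 4e9, 8e9, 1.6e10]  # hypothetical sweep
n_star = isoflop_optimum(1e21, sizes)
print(f"estimated compute-optimal size at 1e21 FLOPs: {n_star:.2e} parameters")
```

Repeating this for each compute budget and fitting a power law through the resulting minima gives the second frontier estimate.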
The third approach fits a closed-form loss function globally to all 400+ training runs. The function takes the form
L(N, D) = E + A / N^alpha + B / D^beta
where N is the parameter count, D is the token count, E is the irreducible entropy of natural language (sometimes called the Bayes error), and A, B, alpha, beta are fitted constants [1]. The exponents alpha and beta describe how loss falls off with parameters and tokens respectively, and the compute-optimal allocation can be derived analytically by minimizing L subject to the constraint that training compute (roughly proportional to N × D) is fixed.
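Hoffmann et al. reported point estimates of E ≈ 1.69 nats per token, A ≈ 406.4, B ≈ 410.7, alpha ≈ 0.34, and beta ≈ 0.28, with confidence intervals on the exponents that were narrow enough to seem statistically very strong [1]. Minimizing the fitted loss under a fixed compute budget gives an optimal parameter count that scales as C^(beta / (alpha + beta)) and an optimal token count that scales as C^(alpha / (alpha + beta)), or roughly C^0.45 and C^0.55 with the rounded constants above. The paper noted that this third approach predicts somewhat smaller optimal models at large compute budgets than Approaches 1 and 2, which for the Gopher compute budget point to a model of roughly 67 billion parameters trained on roughly 1.5 trillion tokens; the three sets of estimates were nonetheless close enough to support the same qualitative recommendation.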
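The analytic minimization can be written out directly. The sketch below assumes training compute C ≈ 6 × N × D and uses the rounded point estimates quoted above; the function names are illustrative. Setting the derivative of L with respect to N to zero along the constraint gives N_opt = G × (C/6)^a and D_opt = (C/6)^b / G, with a = beta / (alpha + beta), b = alpha / (alpha + beta), and G = (alpha × A / (beta × B))^(1 / (alpha + beta)).

```python
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28  # rounded point estimates

def loss(n, d):
    """Parametric loss L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n ** ALPHA + B / d ** BETA

def compute_optimal(compute_flops):
    """Closed-form minimizer of loss(n, d) subject to 6 * n * d == compute_flops."""
    a = BETA / (ALPHA + BETA)   # exponent for parameters
    b = ALPHA / (ALPHA + BETA)  # exponent for tokens
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    k = compute_flops / 6.0     # assumes C ≈ 6 * N * D
    return g * k ** a, k ** b / g, a, b

n_opt, d_opt, a, b = compute_optimal(5.76e23)
print(f"exponents: a = {a:.3f}, b = {b:.3f}")
print(f"N_opt = {n_opt:.3e}, D_opt = {d_opt:.3e}, tokens/param = {d_opt / n_opt:.0f}")
```

The output is highly sensitive to rounding in the constants: with the rounded values used here the optimum comes out noticeably smaller than the 70-billion-parameter model that was actually trained, so the block is best read as an illustration of the method rather than a reproduction of the paper's numbers.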
The Epoch AI reproduction (April 2024) by Tamay Besiroglu, Ege Erdil, Matthew Barnett, and Josh You re-examined this third method specifically [5]. Refitting the parametric loss function from data points extracted from the figures of the original paper, the Epoch AI team found that the published parameter estimates fit the extracted data poorly and that the reported confidence intervals were implausibly narrow.
Independent commentary, including Robin Pearce's notes on the replication, emphasized that the issue with Approach 3 was arithmetic and statistical rather than conceptual: Approaches 1 and 2 are unaffected, and the joint qualitative conclusion of equal scaling survives the correction [5]. Several DeepMind authors publicly acknowledged the corrections and credited Epoch AI with strengthening the analysis. The headline 20-tokens-per-parameter rule drifts upward to roughly 22 tokens per parameter under the corrected Approach 3 constants.
To test the prediction with real compute rather than only by extrapolation, the DeepMind team trained a single model, Chinchilla, at the recipe their scaling-law analysis prescribed: 70 billion parameters and 1.4 trillion tokens, using approximately the same total training compute as the 280-billion-parameter Gopher run (about 5.76 × 10^23 floating-point operations, or roughly 576 zettaFLOPs) [1].
Chinchilla is a decoder-only transformer with 80 layers, an internal model dimension of 8,192, 64 attention heads with key/value dimension 128, and the standard residual transformer block layout [1]. Training context length was 2,048 tokens; batch size was ramped from roughly 1.5 million tokens up to 3 million during training. The architecture follows Gopher, including RMSNorm instead of LayerNorm and relative rather than absolute positional encodings. The main recipe differences from Gopher are the AdamW optimizer in place of Adam (peak learning rate 1×10^−4 with cosine decay to 10 percent of peak), a modified SentencePiece tokenizer without NFKC normalization (~94 percent vocab overlap with Gopher), and a rebalanced subset of MassiveText as the training corpus; even at 1.4 trillion tokens, most subsets are seen only once, with only the highest-quality subsets repeated.
The Chinchilla weights were never publicly released. The artifact that propagated through the field was the paper, its scaling-law results, its tables of downstream benchmark scores, and the rule of thumb that compute-optimal training requires roughly 20 tokens per parameter [2].
| Property | Value |
|---|---|
| Parameters | 70 billion |
| Layers | 80 |
| Internal dimension | 8,192 |
| Attention heads | 64 |
| Key/value head dimension | 128 |
| Context length | 2,048 |
| Batch size (tokens) | 1.5M ramped to 3M |
| Tokenizer | Modified SentencePiece, ~32k vocab |
| Optimizer | AdamW |
| Peak learning rate | 1×10^−4 |
| Schedule | Linear warmup then cosine decay to 10% of peak |
| Normalization | RMSNorm |
| Positional encoding | Relative |
| Training tokens | 1.4 trillion |
| Training corpus | MassiveText (rebalanced) |
| Total training compute | ~5.76 × 10^23 FLOPs (matched Gopher) |
| Publicly released weights | No |
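The schedule row in the table can be sketched as a small function. This is a generic warmup-plus-cosine implementation, not DeepMind's training code: the warmup length and the step count are hypothetical (the step count is simply 1.4 trillion tokens divided by a 3-million-token batch, ignoring the batch-size ramp), while the peak rate and the 10 percent floor come from the table.

```python
import math

def learning_rate(step, total_steps, warmup_steps, peak=1e-4, final_fraction=0.1):
    """Linear warmup to `peak`, then cosine decay to `final_fraction * peak`."""
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    # The cosine cycle spans exactly the remaining steps of this run, so the
    # schedule must be rebuilt whenever the planned step count changes.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    floor = peak * final_fraction
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 466_667   # hypothetical: 1.4e12 tokens / 3e6 tokens per batch
WARMUP_STEPS = 1_000    # hypothetical warmup length

for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{learning_rate(step, TOTAL_STEPS, WARMUP_STEPS):.2e}")
```

Tying the cosine cycle length to the run's own step count is the detail the scaling study emphasized: a schedule sized for a longer run leaves the learning rate too high at the point where a shorter run stops.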
The single most widely repeated takeaway from the Chinchilla paper is the rule that compute-optimal training requires roughly 20 tokens per parameter. The number is striking because it is roughly an order of magnitude larger than what most contemporary models were doing; for example, Gopher had been trained at slightly more than one token per parameter.
Important caveats apply:
A planner with C FLOPs of compute can approximate the compute-optimal model size as N ≈ √(C / 120). The formula follows from the standard approximation that training costs about 6 floating-point operations per parameter per token (forward and backward pass combined), so C ≈ 6 × N × D; substituting D = 20 × N gives C ≈ 120 × N^2. Once N is fixed, the token count is simply 20 × N.
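As a quick check of the rule of thumb, the calculation can be run for the Gopher-scale budget of 5.76 × 10^23 FLOPs. This is a sketch; the function name `chinchilla_plan` and its keyword arguments are illustrative.

```python
import math

def chinchilla_plan(compute_flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Return (parameters, tokens) from C ≈ 6 * N * D with D = tokens_per_param * N."""
    n = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    return n, tokens_per_param * n

n, d = chinchilla_plan(5.76e23)
print(f"parameters ≈ {n / 1e9:.1f}B, tokens ≈ {d / 1e12:.2f}T")
```

For this budget the rule yields about 69 billion parameters and about 1.39 trillion tokens, close to the 70 billion and 1.4 trillion actually used.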
The table below summarizes the salient differences between Chinchilla and the most-discussed frontier language models of early 2022, with later Llama models added for reference. Training compute is shown in zettaFLOPs (10^21 FLOPs) and is estimated with the C ≈ 6 × N × D approximation, so the figures are indicative rather than exact; the paper's own accounting puts the shared Gopher and Chinchilla budget at about 576 zettaFLOPs. Chinchilla used one quarter the parameters of Gopher and 4.67× the tokens.
| Model | Lab | Released | Parameters | Training tokens | Tokens / param | Training compute (zFLOPs, ≈ 6ND) |
|---|---|---|---|---|---|---|
| GPT-3 | OpenAI | 2020 | 175B | 300B | 1.7 | ~314 |
| Jurassic-1 | AI21 Labs | 2021 | 178B | 300B | 1.7 | ~320 |
| Gopher | DeepMind | 2021 | 280B | 300B | 1.1 | ~504 |
| Megatron-Turing NLG | Microsoft / NVIDIA | 2022 | 530B | 270B | 0.5 | ~858 |
| Chinchilla | DeepMind | 2022 | 70B | 1.4T | 20.0 | ~588 |
| LLaMA 65B | Meta | 2023 | 65B | 1.4T | 21.5 | ~546 |
| Llama 2 70B | Meta | 2023 | 70B | 2.0T | 28.6 | ~840 |
| Llama 3 / 3.1 70B | Meta | 2024 | 70B | 15.0T | 214 | ~6,300 |
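The tokens-per-parameter column is a simple ratio, and dividing it by 20 shows how far each model sits from the Chinchilla rule of thumb. A minimal sketch using the parameter and token counts from the table; the variable names are illustrative.

```python
# (parameters, training tokens) as listed in the table above
models = {
    "GPT-3": (175e9, 300e9),
    "Jurassic-1": (178e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Megatron-Turing NLG": (530e9, 270e9),
    "Chinchilla": (70e9, 1.4e12),
    "LLaMA 65B": (65e9, 1.4e12),
    "Llama 2 70B": (70e9, 2.0e12),
    "Llama 3 / 3.1 70B": (70e9, 15.0e12),
}

CHINCHILLA_RATIO = 20.0  # rule-of-thumb tokens per parameter

ratios = {name: tokens / params for name, (params, tokens) in models.items()}
for name, ratio in ratios.items():
    multiple = ratio / CHINCHILLA_RATIO
    print(f"{name:22s} {ratio:7.1f} tokens/param  ({multiple:.2f}x the 20:1 rule)")
```

Values below 1.00x in the last column indicate undertraining relative to the rule; values above it indicate deliberate overtraining.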
On the downstream task suites reported in the Chinchilla paper, the 70B Chinchilla beat Gopher 280B in roughly 80 percent of cases and beat GPT-3 175B on essentially every benchmark where direct comparison was possible [1].
| Benchmark | Chinchilla 70B | Gopher 280B | GPT-3 175B |
|---|---|---|---|
| MMLU (5-shot, average) | 67.5% | 60.0% | 43.9% |
| BIG-bench (average) | 65.1% | 54.4% | not reported |
| WikiText-103 (perplexity, lower better) | 7.16 | 7.75 | not directly comparable |
| LAMBADA (zero-shot) | 77.4% | 74.5% | 76.2% |
| TriviaQA (1-shot) | 73.3% | 64.0% | 68.0% |
| HellaSwag (zero-shot) | 80.8% | 79.2% | 78.9% |
MMLU is the clearest single-number summary: Chinchilla scored 67.5 percent versus Gopher's 60.0 percent and GPT-3's 43.9 percent, a 7.5-point absolute improvement over Gopher despite using a quarter of the parameters and the same compute. On WikiText-103, Chinchilla achieved a perplexity of 7.16 versus Gopher's 7.75, and on The Pile it improves on Gopher across every individual subset rather than the gain being driven by a single domain. Because Chinchilla is a quarter the size of Gopher, it also costs roughly a quarter as much to fine-tune and to serve, a strict Pareto improvement that helps explain why the Chinchilla recommendations were absorbed by the field so rapidly.
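In April 2024, Tamay Besiroglu, Ege Erdil, Matthew Barnett, and Josh You at Epoch AI published Chinchilla Scaling: A Replication Attempt [5]. Refitting Approach 3 from data points extracted from the figures of the original paper, the replication identified three specific problems with the parametric fit: the reported estimates are inconsistent with the paper's own Approaches 1 and 2, they fit the extracted data poorly, and their confidence intervals are implausibly narrow for the number of training runs involved.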
Robin Pearce's informal notes on the replication emphasized that the corrections are confined to the parametric-fit method (Approach 3); the IsoFLOP-based Approach 2 and the variable-token Approach 1 are unaffected. The 20-tokens-per-parameter rule survives in essentially intact form, drifting upward to roughly 22 tokens per parameter under the corrected constants [5]. Several DeepMind authors publicly acknowledged the Epoch AI work as a useful correction to the literature.
Subsequent work has examined whether the Chinchilla recipe is robust under changes to training-recipe details that the original paper held fixed. Studies in 2024 and 2025 have generally found that the qualitative recommendation of equal scaling is robust across reasonable variations in optimizer, learning-rate schedule, tokenizer, and data mixture, but that the precise ratio depends meaningfully on these choices. The exponents drift as a function of architecture (mixture-of-experts and state-space models do not obey the same scaling law as dense transformers), tokenizer, and corpus quality [10].
The practical influence of Chinchilla on subsequent LLM training was enormous and almost immediate. Within twelve months of the paper's release, every major open-weight LLM training plan visibly cited the Chinchilla recipe as motivation, even when the team chose to deviate from it intentionally.
Meta's LLaMA family, released February 2023, was the first widely visible application of Chinchilla-style training in the open-weight world [11]. LLaMA 7B and 13B were trained on 1 trillion tokens; LLaMA 33B and 65B were trained on 1.4 trillion tokens. The 65B model lands almost exactly on the Chinchilla recipe (21.5 tokens per parameter); the smaller sizes were trained well past Chinchilla optimal. Hugo Touvron and colleagues explicitly cited Chinchilla as the rationale and noted that they intentionally trained past the compute-optimal point to lower inference cost.
Llama 2 extended the same philosophy. All Llama 2 sizes (7B, 13B, 70B) were trained on 2 trillion tokens, putting Llama 2 70B at roughly 28 tokens per parameter, about 40 percent past the Chinchilla recipe [4]. Meta's rationale, again articulated explicitly in the paper, was that the additional training cost is amortized over the inference lifetime of a widely deployed model.
The trend accelerated further with Llama 3 and Llama 3.1. The 8B and 70B Llama 3 models were trained on roughly 15 trillion tokens, putting Llama 3 70B at roughly 214 tokens per parameter, an order of magnitude beyond Chinchilla optimal [3]. The Llama 3.1 405B model was trained on a similar 15-trillion-token corpus (about 37 tokens per parameter), only modestly past Chinchilla optimal because at 405B the absolute number of training tokens becomes constrained by the supply of high-quality public web text [3]. Meta justified the aggressive overtraining of the smaller Llama 3 models by pointing to the falling cost of training compute relative to the deployment lifetime of the model.
The same template, choosing a parameter count for efficient deployment on commodity GPUs and then training on a token budget several times larger than the strict Chinchilla minimum, has dominated the open-weight world. Mistral 7B was trained on a token budget far in excess of the ~140 billion that Chinchilla would have recommended; DeepSeek, Qwen, and Falcon have followed broadly Chinchilla-influenced compute allocations with similar overtraining for inference economics. Closed labs do not publish full compute or data budgets, but credible public estimates suggest GPT-4-class models from OpenAI, Gemini-class models from Google, and the Claude family from Anthropic were trained on multiple trillions of tokens, well past the Chinchilla recommendation for their compute budgets.
Hoffmann et al. considered only the cost of training. In production, a model is then run on inference workloads that may consume orders of magnitude more compute than the training run itself. For widely deployed models, lifetime inference compute can dwarf training compute by ten to one hundred times. In this regime, choosing a model that is half the size and trained on more than twice the data can offer substantial savings even though it costs extra training compute to reach the same quality, because every inference request is roughly half as expensive forever after.
Nikhil Sardana, Jacob Portes, and colleagues at MosaicML formalized this argument in the January 2024 paper Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws [12]. They derived a modified scaling law that adds an inference-cost term to the optimization objective and showed that, for any anticipated inference demand greater than zero, the optimal model is smaller than the Chinchilla recipe would suggest, trained on more tokens. For a model expected to generate a trillion tokens of inference, the optimum shifts to several hundred tokens per parameter; for very heavy serving it can land at 1,000+ tokens per parameter.
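The shape of the inference-aware argument can be sketched as a small search: fix a target loss, find for each model size the training tokens needed to reach it, and pick the size with the lowest combined training and inference compute. This is an illustration under stated assumptions rather than the authors' method: it reuses the rounded Chinchilla loss constants quoted earlier in this article, approximates training cost as 6 × N × D and inference cost as 2 × N FLOPs per token, and uses a hypothetical grid of model sizes. Absolute numbers depend strongly on the constants; only the direction of the shift is the point.

```python
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28  # rounded loss constants

def loss(n, d):
    return E + A / n ** ALPHA + B / d ** BETA

def tokens_needed(n, target_loss):
    """Training tokens D with loss(n, D) == target_loss, or None if unreachable."""
    gap = target_loss - E - A / n ** ALPHA
    if gap <= 0:
        return None  # this model size cannot reach the target at any token count
    return (B / gap) ** (1.0 / BETA)

def cheapest_model(target_loss, inference_tokens, sizes):
    """Return (total_flops, n, d) minimizing training plus lifetime inference compute."""
    best = None
    for n in sizes:
        d = tokens_needed(n, target_loss)
        if d is None:
            continue
        total = 6 * n * d + 2 * n * inference_tokens  # assumed cost model
        if best is None or total < best[0]:
            best = (total, n, d)
    return best

sizes = [10 ** (9 + i / 50) for i in range(151)]  # hypothetical grid, 1e9 to 1e12
target = loss(70e9, 1.4e12)                       # quality of a Chinchilla-sized run

baseline = cheapest_model(target, 0.0, sizes)     # training cost only
served = cheapest_model(target, 1e12, sizes)      # plus a trillion inference tokens

for label, (_, n, d) in (("training only", baseline), ("with inference", served)):
    print(f"{label:15s} N = {n:.2e}, D = {d:.2e}, tokens/param = {d / n:.0f}")
```

Adding the inference term moves the optimum to a smaller model trained on more tokens, and the shift grows with the assumed inference volume.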
This framework explains a shift in industry practice that had already happened. Llama 2 70B at 28 tokens per parameter, Llama 3 70B at 214 tokens per parameter, and OpenAI's GPT-3.5 and GPT-4o (presumed to be heavily overtrained relative to their compute-optimal size for inference-economics reasons) all sit far past the Chinchilla recipe. The current consensus is that, for most production LLMs, the right training point is somewhere between two and one hundred times the Chinchilla recommendation, depending on how aggressively the model will be served. The corollary is that "Chinchilla optimal" is no longer the right design target for a deployed model; the right target is "inference-aware optimal," and the optimal training-token budget grows roughly linearly with expected lifetime inference volume.
The Chinchilla scaling laws are an empirical fit to a particular set of training experiments and inherit several limitations from the experimental setup.