The Chinchilla scaling laws are a set of empirical findings published by DeepMind researchers in 2022 that describe how to optimally allocate a fixed computational budget when training large language models. The central result, presented in the paper "Training Compute-Optimal Large Language Models" by Jordan Hoffmann and colleagues, overturned the prevailing wisdom about how to scale language models. Rather than investing most additional compute into larger models with relatively little extra data, Hoffmann et al. demonstrated that model size and training data should be scaled in roughly equal proportions. Their work produced a concrete guideline: the optimal number of training tokens is approximately 20 times the number of model parameters.
To validate these findings, the team trained a 70-billion-parameter model called Chinchilla on 1.4 trillion tokens. Despite being four times smaller than DeepMind's earlier Gopher model (280 billion parameters), Chinchilla outperformed Gopher on a wide range of benchmarks while using the same compute budget. This result sent shockwaves through the AI research community and reshaped how organizations approach the training of foundation models.
Between 2018 and 2022, the dominant strategy in language model development was to make models bigger. GPT-2 had 1.5 billion parameters, GPT-3 jumped to 175 billion, and models like Megatron-Turing NLG reached 530 billion parameters. This trend was supported by influential research from OpenAI, particularly the scaling laws published by Jared Kaplan and collaborators in January 2020.
Kaplan et al. studied the relationship between model performance (measured as cross-entropy loss) and three factors: the number of parameters (N), the size of the training dataset in tokens (D), and the total compute budget (C). They found smooth power-law relationships governing how loss decreases as each factor increases. Critically, their analysis suggested that when allocating additional compute, the vast majority should go toward increasing model size rather than training data.
Specifically, Kaplan et al. reported that the optimal number of parameters scales as N_opt proportional to C^0.73, while the optimal number of training tokens scales as D_opt proportional to C^0.27. This implied that for a tenfold increase in compute, model size should grow roughly 5.4-fold (10^0.73) while training data grows only about 1.9-fold (10^0.27). Their paper also suggested that models should be trained well short of convergence, meaning that large models trained on comparatively little data would be the most compute-efficient approach.
This philosophy directly shaped the development of GPT-3 (175 billion parameters trained on only 300 billion tokens, a ratio of about 1.7 tokens per parameter), Gopher (280 billion parameters trained on 300 billion tokens), Jurassic-1 (178 billion parameters), and Megatron-Turing NLG (530 billion parameters). These models all prioritized massive parameter counts over extensive training data.
Several researchers had begun questioning whether this parameter-heavy approach was truly optimal. The practical consequence of the Kaplan scaling laws was that organizations raced to build ever-larger models while keeping training datasets roughly the same size. Gopher, for instance, was trained on approximately 300 billion tokens from the MassiveText dataset, a 10.5 TB corpus comprising web pages, Wikipedia, GitHub code, books, and news articles. At 280 billion parameters, this gave Gopher a tokens-to-parameters ratio of roughly 1.07, far below what Chinchilla would later show to be optimal.
The DeepMind team suspected that existing models were significantly undertrained. Their hypothesis was straightforward: if you have a fixed compute budget, perhaps it is better to train a smaller model on much more data rather than a gigantic model on relatively little data.
"Training Compute-Optimal Large Language Models" was posted to arXiv on March 29, 2022 (arXiv:2203.15556) and subsequently presented at the 36th Conference on Neural Information Processing Systems (NeurIPS 2022). The paper was authored by Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre.
The researchers conducted a large-scale empirical study, training over 400 language models with parameter counts ranging from 70 million to 16 billion. Each model was a transformer-based autoregressive language model. The models were trained on varying amounts of data, from 5 billion to 500 billion tokens, all drawn from subsets of the MassiveText dataset.
The compute cost for training a model with N parameters on D tokens was approximated as:
C ≈ 6ND
where C is measured in floating-point operations (FLOPs). This approximation (6 FLOPs per parameter per token) accounts for both the forward and backward passes during training and is a standard estimate for transformer models.
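As a quick sanity check, the 6ND rule reproduces the budgets discussed in this article (a sketch; the parameter and token counts are the published figures):

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute via C ~ 6ND: roughly 2 FLOPs per
    parameter per token for the forward pass and 4 for the backward pass."""
    return 6.0 * n_params * n_tokens

gopher = training_flops(280e9, 300e9)      # Gopher: ~5.0e23 FLOPs
chinchilla = training_flops(70e9, 1.4e12)  # Chinchilla: ~5.9e23 FLOPs
# Both land near the ~5.76e23 FLOP budget quoted for Gopher,
# illustrating that the two models cost roughly the same to train.
```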
A distinctive feature of the paper is that the authors used three independent approaches to estimate the relationship between compute budget and the optimal balance of model size and training data. All three approaches yielded broadly consistent results, strengthening the conclusions.
In the first approach, the team trained models at several fixed sizes (70M, 150M, 400M, 1B, 10B, and larger) and varied the number of training tokens for each size. For each model size, they recorded the final training loss at different token counts and identified the minimum loss envelope across all runs at each FLOP budget. By fitting a power law to the set of optimal (N, D) pairs extracted from this envelope, they estimated how model size and data should scale with compute.
For a compute budget equivalent to that used for Gopher (approximately 5.76 x 10^23 FLOPs), this approach predicted an optimal model size of roughly 67 billion parameters trained on approximately 1.5 trillion tokens.
The second approach, called the IsoFLOP method, fixed the total compute budget at nine different levels (ranging from 10^18 to 10^21 FLOPs) and varied the model size at each budget. For each fixed compute budget, the researchers trained multiple models of different sizes, with the number of training tokens adjusted so that each model consumed exactly the specified amount of compute. They then identified which model size produced the lowest loss at each budget level.
This approach directly answers the question: "Given a fixed FLOP budget, what model size minimizes loss?" The resulting IsoFLOP curves showed that for every compute level, there was a clear optimum model size, and that optimum shifted to larger models as compute increased. For the Gopher-equivalent compute budget, Approach 2 predicted an optimal model size of about 63 billion parameters trained on approximately 1.4 trillion tokens.
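The IsoFLOP construction is straightforward to reproduce: for a fixed budget C, each candidate model size N is paired with D = C / (6N) tokens so that every run consumes exactly the same compute. A minimal sketch (the size sweep below is illustrative, not the paper's exact grid):

```python
def isoflop_points(budget_flops, model_sizes):
    """For a fixed compute budget C, pair each model size N with the
    token count D = C / (6N) that exhausts the budget exactly."""
    return [(n, budget_flops / (6.0 * n)) for n in model_sizes]

# One of the paper's budget levels, with an illustrative size sweep.
points = isoflop_points(1e21, [2.5e8, 5e8, 1e9, 2.5e9, 5e9])
for n, d in points:
    print(f"N = {n:9.2e} params -> D = {d:9.2e} tokens")
```

Training each (N, D) pair and plotting loss against N traces out one IsoFLOP curve; the minimum of that curve is the optimal model size for the budget.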
The third approach fitted a parametric model to the final loss values from all experiments conducted in Approaches 1 and 2. The proposed functional form was:
L(N, D) = E + A / N^alpha + B / D^beta
where L is the loss, N is the number of parameters, D is the number of training tokens, and E, A, B, alpha, and beta are fitted constants. The term E represents the irreducible entropy of natural language (the theoretical minimum loss that no model could beat). The terms A / N^alpha and B / D^beta capture the additional loss attributable to having a finite model and finite training data, respectively.
The original fitted values reported by Hoffmann et al. were approximately: E = 1.69, A = 406.4, B = 410.7, alpha = 0.34, and beta = 0.28. By minimizing L(N, D) subject to the constraint C = 6ND, one can solve for the optimal N and D at any given compute budget.
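Plugging the published constants into this functional form shows why Chinchilla's allocation wins: at roughly equal compute, the fitted loss for the 70B/1.4T configuration comes out lower than for the 280B/300B configuration. A sketch using the rounded constants above, so the values are only approximate:

```python
# Approach 3 loss fit with the rounded constants from Hoffmann et al.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def fitted_loss(n_params: float, n_tokens: float) -> float:
    """L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

gopher_loss = fitted_loss(280e9, 300e9)      # 280B params, 300B tokens
chinchilla_loss = fitted_loss(70e9, 1.4e12)  # 70B params, 1.4T tokens
# At similar compute, the fit predicts a lower loss for Chinchilla's
# allocation than for Gopher's.
```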
For the Gopher-equivalent budget, Approach 3 yielded a somewhat different prediction of about 40 billion parameters. This discrepancy between Approach 3 and the other two approaches became a point of later scrutiny (see the Criticisms section below).
The following table summarizes the optimal model sizes and training token counts predicted by each approach for the Gopher-equivalent compute budget (approximately 5.76 x 10^23 FLOPs).
| Approach | Method | Predicted optimal parameters | Predicted optimal tokens |
|---|---|---|---|
| Approach 1 | Fixed model sizes, varying tokens | ~67B | ~1.5T |
| Approach 2 | IsoFLOP profiles | ~63B | ~1.4T |
| Approach 3 | Parametric loss fitting | ~40B | ~2.3T |
Approaches 1 and 2 agreed closely, both suggesting a model in the 63-67 billion parameter range trained on 1.4 to 1.5 trillion tokens. Approach 3 predicted a smaller model with more data, though all three approaches agreed that existing models like Gopher (280B parameters, 300B tokens) were heavily over-parameterized and under-trained.
The central finding across all approaches was that the optimal number of parameters and the optimal number of training tokens should each scale roughly as the square root of the compute budget:
N_opt proportional to C^0.50
D_opt proportional to C^0.50
This means that doubling the compute budget should lead to roughly a 1.4x increase in both model size and training data. Equivalently, for every doubling of model size, the number of training tokens should also be doubled. This stands in stark contrast to the Kaplan et al. prediction, where most additional compute would go toward model size.
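The contrast with the Kaplan et al. allocation can be made concrete by applying the two sets of exponents to the same compute increase (a sketch using the exponents quoted in this article):

```python
def scale_up(compute_multiplier, n_exp, d_exp):
    """Growth in model size and data when compute grows by a given factor,
    assuming power laws N_opt ~ C^n_exp and D_opt ~ C^d_exp."""
    return compute_multiplier**n_exp, compute_multiplier**d_exp

# A 10x compute increase under each scaling law:
kaplan_n, kaplan_d = scale_up(10, 0.73, 0.27)         # ~5.4x model, ~1.9x data
chinchilla_n, chinchilla_d = scale_up(10, 0.50, 0.50) # ~3.2x model, ~3.2x data
```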
The disagreement between the Chinchilla and Kaplan scaling laws represents one of the most consequential debates in modern AI research. The following table highlights the key differences.
| Aspect | Kaplan et al. (2020) | Chinchilla / Hoffmann et al. (2022) |
|---|---|---|
| Affiliation | OpenAI | DeepMind |
| Parameter scaling with compute | N_opt proportional to C^0.73 | N_opt proportional to C^0.50 |
| Token scaling with compute | D_opt proportional to C^0.27 | D_opt proportional to C^0.50 |
| Recommended tokens-to-parameters ratio | ~1-2 tokens per parameter (implied) | ~20 tokens per parameter |
| Compute allocation philosophy | Favor larger models, modest data increase | Scale model size and data equally |
| Training strategy | Stop well before convergence | Train closer to convergence on more data |
| Models influenced | GPT-3, Gopher, Jurassic-1, MT-NLG | Chinchilla, LLaMA, PaLM 2 |
Several factors contributed to the discrepancy. Kaplan et al. used a fixed learning rate schedule that was not adjusted for different training durations, which may have biased their results toward favoring larger models trained for fewer steps. They also used a smaller range of model sizes and did not fully explore the space of possible token-to-parameter ratios. Additionally, the Kaplan team's analysis was conducted using models trained on data that was sometimes reused across epochs, whereas the Chinchilla analysis generally considered single-epoch training on fresh data.
A 2024 paper by Tomer Porian and colleagues, "Resolving Discrepancies in Compute-Optimal Scaling of Language Models," analyzed both sets of findings in detail and showed that methodological differences, chiefly in learning rate warmup and tuning and in how parameters and compute were counted, largely explain the divergent conclusions.
To validate their scaling predictions, the DeepMind team trained a model called Chinchilla. The design was straightforward: take the compute budget used for Gopher (approximately 5.76 x 10^23 FLOPs), but instead of building a 280-billion-parameter model trained on 300 billion tokens, build a 70-billion-parameter model trained on 1.4 trillion tokens.
Chinchilla used the same transformer architecture as Gopher, with appropriate adjustments to layer count, hidden dimension, and attention heads to reach the 70-billion-parameter target. It was trained on data from the MassiveText dataset, the same source used for Gopher, but with roughly four times more data sampled from it.
Chinchilla delivered consistently superior performance compared to Gopher and other contemporary large language models, despite being four times smaller. The following table shows selected benchmark results.
| Benchmark | Chinchilla (70B) | Gopher (280B) | GPT-3 (175B) |
|---|---|---|---|
| MMLU (average accuracy) | 67.5% | 60.0% | 43.9% (few-shot) |
| HellaSwag | 80.8% | 79.2% | 78.9% |
| PIQA | 81.8% | 81.8% | 81.0% |
| WinoGrande | 74.9% | 70.1% | 70.2% |
| BoolQ | 83.7% | 79.3% | 60.5% |
The MMLU result was particularly striking. Chinchilla's 67.5% average accuracy represented a 7.5 percentage point improvement over Gopher, despite using the same training compute. At the time of publication, this was a new state of the art on MMLU.
Chinchilla also outperformed Jurassic-1 (178B parameters, developed by AI21 Labs) and Megatron-Turing NLG (530B parameters, developed jointly by NVIDIA and Microsoft) on the majority of evaluated tasks.
Beyond raw performance, Chinchilla offered significant practical benefits. At 70 billion parameters, the model required roughly one quarter of the memory needed to serve Gopher. This translated directly into lower inference costs, faster response times, and the ability to run the model on fewer GPUs. For organizations deploying language models at scale, these savings were substantial. The result demonstrated that compute-optimal training was not just an academic curiosity but had immediate practical implications for model deployment.
The Chinchilla paper had a profound and lasting effect on how the AI research community approaches language model training. Its influence can be seen across multiple dimensions.
Perhaps the most visible response to the Chinchilla scaling laws came from Meta AI. In February 2023, Hugo Touvron and colleagues released LLaMA (Large Language Model Meta AI), a family of open-weight models ranging from 7 billion to 65 billion parameters. The paper explicitly cited the Chinchilla findings and designed its training regime accordingly.
LLaMA's training token counts were far above what the pre-Chinchilla Kaplan scaling laws would have recommended:
| Model | Parameters | Training tokens | Tokens-per-parameter ratio |
|---|---|---|---|
| LLaMA-7B | 7B | 1.0T | ~143 |
| LLaMA-13B | 13B | 1.0T | ~77 |
| LLaMA-33B | 33B | 1.4T | ~42 |
| LLaMA-65B | 65B | 1.4T | ~22 |
Notably, the LLaMA-65B model closely matched the Chinchilla-optimal ratio of ~20 tokens per parameter, while the smaller models were trained well beyond the Chinchilla-optimal point. The result was remarkable: LLaMA-13B outperformed GPT-3 (175B) on most benchmarks, and LLaMA-65B was competitive with Chinchilla-70B and PaLM-540B. This demonstrated that smaller, well-trained models could rival or exceed much larger ones.
Google's PaLM (Pathways Language Model), released in April 2022 with 540 billion parameters trained on 780 billion tokens (a ratio of about 1.4 tokens per parameter), was arguably designed before the Chinchilla findings became widely known. Later Google models, including PaLM 2 and the Gemini family, incorporated Chinchilla-informed training strategies with significantly more training data relative to model size.
Beyond Meta and Google, the Chinchilla scaling laws influenced model developers across the industry, and the approximately 20-tokens-per-parameter guideline became a standard reference point for planning training runs.
While the Chinchilla scaling laws established the concept of compute-optimal training, a clear trend emerged in 2023 and 2024 where many practitioners deliberately trained models far beyond the Chinchilla-optimal point. This practice, often called "over-training," involves using more training tokens than the scaling laws suggest is compute-optimal.
The key insight motivating over-training is that the Chinchilla scaling laws optimize only for training compute. They do not account for the cost of inference, which for widely deployed models can far exceed the one-time training cost. A smaller model trained on extra data uses its training FLOPs less efficiently, reaching a higher loss than a compute-optimal model at the same training budget, but it is cheaper to serve at inference time because it has fewer parameters.
Sardana and Frankle (2024) formalized this reasoning in their paper "Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws." They showed that when inference demand is factored in, the optimal strategy shifts toward training smaller models on more data. For example, they estimated that a developer expecting 2 trillion tokens of total inference demand could reduce total compute (training plus inference) by 17% by training a 7-billion-parameter model on extra data instead of using a compute-optimal 13-billion-parameter model of equivalent quality.
Their modified objective function accounts for both training and inference compute:
Total cost = 6N * D_train + 2N * D_inference
where D_train is the number of training tokens and D_inference is the total number of tokens processed during inference over the model's deployment lifetime.
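This objective is simple to evaluate. The sketch below compares a compute-optimal 13B model against a smaller 7B model over-trained on extra data, under heavy inference demand; the sizes and token counts are illustrative stand-ins, not Sardana and Frankle's fitted estimates:

```python
def lifetime_flops(n_params, train_tokens, inference_tokens):
    """Total compute: ~6 FLOPs/param/token for training plus
    ~2 FLOPs/param/token for inference (forward pass only)."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * inference_tokens

inference_demand = 2e12  # 2T tokens served over the deployment lifetime

# Illustrative comparison: a Chinchilla-optimal 13B model (20 tok/param)
# vs. a 7B model over-trained to 100 tok/param.
cost_13b = lifetime_flops(13e9, 260e9, inference_demand)
cost_7b = lifetime_flops(7e9, 700e9, inference_demand)
# The smaller model costs more to train but much less to serve,
# so its lifetime compute is lower at this inference volume.
```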
The most dramatic example of over-training is Meta's LLaMA 3, released in April 2024. The 8-billion-parameter model was trained on approximately 15 trillion tokens, giving it a tokens-to-parameters ratio of roughly 1,875. This is nearly 100 times the Chinchilla-optimal ratio of 20. Even the 70-billion-parameter variant was trained on 15 trillion tokens, yielding a ratio of about 214 tokens per parameter, more than 10 times the Chinchilla point.
Meta reported that both the 8B and 70B models continued to show log-linear improvements in performance even at these extreme token counts, with no evidence of a saturation point. The Chinchilla-optimal token count for an 8-billion-parameter model is roughly 160 billion tokens; LLaMA 3 8B was trained on nearly 100 times that amount.
| Model | Parameters | Training tokens | Chinchilla-optimal tokens | Over-training factor |
|---|---|---|---|---|
| Chinchilla | 70B | 1.4T | 1.4T | 1x |
| LLaMA-65B | 65B | 1.4T | 1.3T | ~1.1x |
| LLaMA 2-7B | 7B | 2.0T | 140B | ~14x |
| LLaMA 3-8B | 8B | 15T | 160B | ~94x |
| LLaMA 3-70B | 70B | 15T | 1.4T | ~11x |
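The over-training factors in the table follow directly from the 20-tokens-per-parameter guideline:

```python
def overtraining_factor(n_params, train_tokens, opt_ratio=20):
    """Ratio of actual training tokens to the Chinchilla-optimal
    count of opt_ratio tokens per parameter."""
    return train_tokens / (opt_ratio * n_params)

print(overtraining_factor(8e9, 15e12))   # LLaMA 3 8B  -> ~94x
print(overtraining_factor(70e9, 15e12))  # LLaMA 3 70B -> ~11x
```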
This trend reflects a pragmatic shift in the industry. For models that will serve billions of inference requests, the one-time cost of extended training is a worthwhile investment if it yields a smaller, faster, and cheaper model at deployment time.
The mathematical core of the Chinchilla paper is the parametric loss function from Approach 3. Understanding this function helps clarify what the scaling laws actually predict and where their limitations lie.
The proposed loss function takes the form:
L(N, D) = E + A / N^alpha + B / D^beta
Each component has a specific interpretation: E is the irreducible entropy of natural language, the A / N^alpha term is the additional loss due to finite model capacity, and the B / D^beta term is the additional loss due to finite training data.
Given the constraint C = 6ND, one can use Lagrange multipliers to minimize L(N, D) subject to this constraint. The solution yields the optimal model size and data size as functions of the compute budget:
N_opt proportional to C^(beta / (alpha + beta))
D_opt proportional to C^(alpha / (alpha + beta))
Using the fitted values alpha = 0.34 and beta = 0.28, this gives:
N_opt proportional to C^0.45
D_opt proportional to C^0.55
This is close to, but not exactly, the 50/50 split suggested by Approaches 1 and 2. The slight asymmetry means that data should grow somewhat faster than model size, so under Approach 3 the optimal tokens-to-parameters ratio is not a constant but rises slowly with the compute budget.
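The constrained optimum can also be computed in closed form in a few lines. This sketch minimizes the fitted loss subject to C = 6ND using the rounded published constants, so it only approximates Approach 3's ~40B prediction for the Gopher budget:

```python
# Fitted constants (rounded) from Hoffmann et al., Approach 3.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def compute_optimal(c_flops):
    """Minimize L(N, D) = E + A/N^alpha + B/D^beta subject to C = 6ND.
    Substituting D = C/(6N) and setting the derivative in N to zero gives
    N_opt = G * C^(beta / (alpha + beta)) with G as computed below."""
    g = (ALPHA * A / (BETA * B * 6**BETA)) ** (1 / (ALPHA + BETA))
    n_opt = g * c_flops ** (BETA / (ALPHA + BETA))
    d_opt = c_flops / (6.0 * n_opt)
    return n_opt, d_opt

# Gopher-equivalent budget:
n_opt, d_opt = compute_optimal(5.76e23)
# With these rounded constants the optimum lands in the tens of billions
# of parameters and a few trillion tokens, the same ballpark as the
# ~40B / ~2.3T figures reported for Approach 3.
```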
Despite its enormous influence, the Chinchilla paper has faced several important criticisms.
In April 2024, Besiroglu, Erdil, Barnett, and You from Epoch AI published "Chinchilla Scaling: A Replication Attempt." This paper systematically examined the Chinchilla results and identified several issues, particularly with Approach 3.
First, the reported parameter estimates for the Approach 3 loss function appeared inconsistent with the underlying data. When the Epoch team extracted the data from the original paper's figures and re-fitted the parametric model, they obtained significantly different coefficients. Their revised fit was:
L(N, D) = 1.82 + 514.0 / N^0.35 + 2115.2 / D^0.37
compared to the original:
L(N, D) = 1.69 + 406.4 / N^0.34 + 410.7 / D^0.28
The most notable difference is in the data term: the Epoch team found B = 2115.2 and beta = 0.37, compared to the original B = 410.7 and beta = 0.28. This suggests that data contributes more to reducing loss than the original paper indicated.
Second, the confidence intervals reported by Hoffmann et al. were implausibly narrow. For the exponent governing N_opt (approximately 0.45), they reported a confidence interval of 0.454 to 0.455. The Epoch team estimated that obtaining such tight intervals would require on the order of 600,000 experiments, whereas the actual study likely conducted fewer than 500.
Third, there was an internal inconsistency: the Approach 3 estimates implied an optimal ratio closer to 70 tokens per parameter, contradicting the approximately 20 tokens per parameter derived from Approaches 1 and 2 (and actually used for the Chinchilla model). The Epoch team's revised Approach 3 estimates, however, aligned well with Approaches 1 and 2, suggesting approximately 20 tokens per parameter.
The models used in the scaling study ranged from 70 million to 16 billion parameters, trained on up to 500 billion tokens. The predictions for much larger models (like the 70-billion-parameter Chinchilla itself) were extrapolations beyond the range of the fitted data. While Chinchilla's strong performance validated these extrapolations in one case, the reliability of the scaling laws at much larger compute budgets remains an open question.
The Chinchilla scaling laws treat all training tokens as equivalent. In practice, data quality varies enormously, and a token from a curated textbook is not equivalent to a token from a low-quality web scrape. The scaling laws do not capture the effects of data curation, deduplication, or domain-specific filtering, all of which can significantly affect model quality independent of raw token count.
The scaling laws were derived from experiments with standard transformer architectures. Whether the same relationships hold for alternative architectures (such as mixture-of-experts models, state-space models like Mamba, or retrieval-augmented models) is not established by the original work.
As discussed in the over-training section above, the Chinchilla scaling laws optimize for training compute alone. They do not model the total cost of developing and deploying a model, which includes inference costs, fine-tuning costs, and the engineering overhead of working with larger models. For many real-world applications, a smaller model trained beyond the Chinchilla-optimal point is a better economic choice.
The Epoch AI replication also noted that the definition of "model parameters" in the Chinchilla paper was ambiguous. Three different interpretations of what counts as a parameter (for instance, whether to include embedding parameters) are possible, with relative differences as high as 15.2%. This ambiguity complicates precise replication of the results.
The Chinchilla scaling laws fundamentally altered the economics of large language model development in several ways.
Before Chinchilla, the primary bottleneck for building better language models was compute: organizations needed more GPUs and more training time. The Chinchilla findings shifted attention to data as an equally important bottleneck. Training a 70-billion-parameter model requires 1.4 trillion tokens of high-quality text, and training a 400-billion-parameter model would require roughly 8 trillion tokens under Chinchilla-optimal scaling. This created intense competition for training data and spurred investment in data curation, synthetic data generation, and data licensing.
By demonstrating that a 70-billion-parameter model could outperform a 280-billion-parameter model, the Chinchilla paper showed that raw scale was not the only path to state-of-the-art performance. This lowered the barrier to entry for organizations with moderate compute budgets but access to large, high-quality datasets. Startups like Mistral AI and open-source efforts like LLaMA benefited directly from this insight.
A consequence of Chinchilla-optimal training is that model development quickly runs into the finite supply of high-quality text data on the internet. Estimates by Epoch AI suggest that the stock of publicly available text data suitable for language model training is on the order of a few trillion to tens of trillions of tokens. As models grow, the demand for training data under Chinchilla-optimal scaling grows linearly. This has been called the "data wall" and has motivated research into synthetic data, multi-epoch training, and curriculum learning strategies.
The realization that inference costs often dominate total deployment costs led to a new paradigm of inference-aware scaling. Instead of optimizing purely for training efficiency, developers now consider the total lifecycle cost of a model. This has driven the trend toward smaller, over-trained models (like LLaMA 3) that sacrifice training efficiency for inference efficiency.
The Chinchilla scaling laws remain one of the most cited and discussed results in modern AI research. While the specific numbers (20 tokens per parameter, equal scaling exponents) may be refined by future work, the core insight has proven durable: data matters as much as model size, and the most productive use of additional compute is not always building a bigger model.
The paper also established a methodological template for scaling law research. The use of multiple independent estimation approaches, large sweeps of training runs, and parametric loss fitting has been adopted by subsequent studies on scaling in areas such as reinforcement learning, multimodal models, and mixture-of-experts architectures.
More broadly, the Chinchilla paper demonstrated that careful empirical analysis of scaling behavior can yield actionable insights worth hundreds of millions of dollars in training efficiency. It showed that the AI research community's collective intuitions about scaling can be wrong and that systematic measurement can overturn even widely held beliefs.