Scaling Laws for Neural Language Models is a landmark research paper published by OpenAI in January 2020 (arXiv:2001.08361). Authored by Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei, the paper established that the performance of neural language models follows predictable power-law relationships with three key variables: model size (number of parameters), dataset size (number of training tokens), and the amount of compute used for training. These relationships span more than seven orders of magnitude and hold with remarkable consistency, providing the AI research community with a quantitative framework for predicting how language model performance scales. The paper's findings directly shaped the development of GPT-3 and influenced billions of dollars in compute investment decisions across the AI industry.
Before the publication of this paper, the deep learning community had observed that larger models tend to perform better, but this intuition lacked a rigorous mathematical basis. Researchers trained models at different scales and observed improvements, yet there was no systematic understanding of how performance would change as a function of scale. Training large language models is expensive, and without reliable predictions, organizations risked wasting significant computational resources on suboptimal configurations.
The paper's motivation was to answer a fundamental question: given a fixed compute budget, how should resources be allocated between model size, dataset size, and training duration to achieve the best possible performance? This question had enormous practical implications. If the relationships between these variables and performance could be quantified, researchers could plan training runs more efficiently, predict the returns from scaling, and make informed decisions about hardware investments.
The work built on prior observations from the machine learning community about the benefits of scale. In March 2019, Richard Sutton published "The Bitter Lesson," an influential essay arguing that seventy years of AI research demonstrated a consistent pattern: general methods that leverage computation ultimately outperform specialized methods that exploit human knowledge. Sutton's essay articulated the philosophical foundation, and the Kaplan et al. paper provided the empirical and mathematical specifics.
The paper was produced by a team of ten researchers, primarily affiliated with OpenAI. The lead author, Jared Kaplan, held a dual appointment as an associate professor of physics at Johns Hopkins University and a researcher at OpenAI. His background in theoretical physics, particularly quantum gravity and holography (AdS/CFT), brought a physicist's perspective to the study of neural network behavior. Kaplan later became a co-founder and chief science officer of Anthropic.
The remaining authors were all OpenAI researchers at the time of publication. The overlap with the GPT-3 paper (Brown et al., 2020), published five months later, was total: all ten scaling laws authors (Kaplan, McCandlish, Henighan, Brown, Chess, Child, Gray, Radford, Wu, and Amodei) also appeared on the GPT-3 author list. Dario Amodei, who served as VP of Research at OpenAI, later co-founded Anthropic. Alec Radford was a key architect of the GPT series of models. Tom B. Brown served as lead author on the GPT-3 paper. This overlap between the scaling laws and GPT-3 teams was not coincidental; the scaling laws research directly informed GPT-3's design and training decisions.
All experiments used the WebText2 dataset, an expanded version of the WebText corpus originally collected for training GPT-2. WebText2 consists of 20.3 million documents sourced from Reddit outbound links that received at least three karma points, covering the period from December 2017 through October 2018. The dataset contains approximately 96 GB of text, corresponding to 1.62 x 10^10 words and 2.29 x 10^10 tokens after byte pair encoding (BPE) tokenization with a vocabulary size of 50,257.
The test set contained 6.6 x 10^8 tokens. The researchers also evaluated models on several out-of-distribution datasets, including the Books Corpus, Common Crawl, Wikipedia, and Internet Books, to test transfer and generalization properties.
The experiments used Transformer decoder-only language models, following the architecture of the GPT-2 family. The researchers parameterized models using several hyperparameters: n_layer (number of layers), d_model (dimension of the residual stream), d_ff (dimension of the intermediate feed-forward layer), d_attn (dimension of the attention output), and n_heads (number of attention heads per layer).
Models ranged from 768 non-embedding parameters at the smallest scale to 1.5 billion non-embedding parameters at the largest. The researchers varied model shapes extensively, testing configurations with n_layer ranging from 6 to 207 and d_model ranging from 768 to 4,288. Most runs used a context length of 1,024 tokens and a batch size of 512 sequences (approximately 5 x 10^5 tokens per batch).
A critical methodological choice was that the paper measured model size N as the number of non-embedding parameters, excluding vocabulary and positional embeddings. This decision later proved to be a significant source of divergence from subsequent scaling law studies.
The primary metric was cross-entropy loss (negative log-likelihood per token) on the test set, measured in nats. This metric directly reflects how well a model predicts the next token in a sequence and is the standard objective for autoregressive language model training.
The paper presented a set of interconnected findings that, taken together, provided a comprehensive picture of how language model performance scales. The authors organized their results around several core observations.
The central discovery was that language model loss is primarily determined by three factors: the number of non-embedding parameters (N), the dataset size in tokens (D), and the compute budget (C). Within a wide range of reasonable architectural choices, other hyperparameters such as the ratio of depth to width, the number of attention heads, and the feed-forward layer dimension had minimal impact on performance.
This finding was practically significant because it meant that researchers could focus on scale rather than architecture search when seeking performance improvements. The specific shape of the Transformer mattered far less than its total size.
When performance was not bottlenecked by one of the other two variables, the test loss followed smooth power-law relationships with each individual scaling variable. These power laws held across more than six orders of magnitude with no signs of flattening at the upper end of the ranges tested.
The smoothness and consistency of these curves was striking. Unlike many empirical relationships in machine learning that exhibit noisy or irregular behavior, the scaling curves were remarkably clean and predictable.
The degree of overfitting depended predictably on the ratio of model size to dataset size. Specifically, the performance penalty from overfitting scaled with N^0.74 / D, meaning that an 8x increase in model size required only about 5x more data to maintain the same level of overfitting. This sublinear relationship was the basis for the claim that larger models are more sample-efficient.
Training curves followed predictable power-law trajectories with parameters that were roughly independent of model size. This enabled extrapolation: by observing the early portion of a training curve, researchers could predict the final loss with reasonable accuracy.
When models were evaluated on text distributions different from their training data, the results were strongly correlated with in-distribution performance but offset by a roughly constant amount. In other words, transfer to a different distribution incurred a fixed penalty but otherwise improved in line with training-set performance. Generalization depended almost exclusively on in-distribution validation loss rather than on training duration or proximity to convergence.
Larger models reached the same performance level as smaller models while using fewer optimization steps and fewer data points. This sample efficiency was one of the paper's most consequential findings, because it implied that for a fixed compute budget, training a larger model on less data could be more efficient than training a smaller model on more data.
Optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping well before the model has converged. The paper suggested that models should be trained to only about 10% above their converged loss level for optimal compute efficiency, a practice that was far from conventional at the time.
The paper quantified the scaling relationships through a set of fitted power-law equations. These formulas became widely referenced throughout the AI research community.
When only one variable is allowed to change (with the others effectively unlimited), the loss follows:
| Relationship | Formula | Exponent | Scale Constant |
|---|---|---|---|
| Loss vs. Parameters | L(N) = (N_c / N)^alpha_N | alpha_N = 0.076 | N_c = 8.8 x 10^13 |
| Loss vs. Data | L(D) = (D_c / D)^alpha_D | alpha_D = 0.095 | D_c = 5.4 x 10^13 |
| Loss vs. Compute | L(C_min) = (C_c / C_min)^alpha_C | alpha_C = 0.050 | C_c = 3.1 x 10^8 PF-days |
In these formulas, N refers to non-embedding parameters, D refers to the number of training tokens, and C_min refers to the compute budget in PetaFLOP-days when compute is allocated optimally between model size and training duration.
The exponents indicate diminishing but persistent returns from scaling. Each doubling of parameter count reduces loss by approximately 5.1% (a factor of 2^-0.076, about 0.949). Each doubling of data reduces loss by approximately 6.4%, and each doubling of optimally allocated compute by approximately 3.4%.
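These single-variable laws can be evaluated directly. The following sketch uses the fitted constants from the table above and verifies the per-doubling loss reductions:

```python
# Single-variable scaling laws from Kaplan et al. (2020).
# Loss is cross-entropy in nats; constants are the fitted values above.

def loss_vs_params(n):
    """L(N) = (N_c / N)^alpha_N, with N in non-embedding parameters."""
    return (8.8e13 / n) ** 0.076

def loss_vs_data(d):
    """L(D) = (D_c / D)^alpha_D, with D in training tokens."""
    return (5.4e13 / d) ** 0.095

def loss_vs_compute(c):
    """L(C_min) = (C_c / C_min)^alpha_C, with C_min in PF-days."""
    return (3.1e8 / c) ** 0.050

# Doubling each variable shrinks loss by a fixed factor of 2^-alpha:
for name, alpha in [("params", 0.076), ("data", 0.095), ("compute", 0.050)]:
    reduction = 1 - 2 ** -alpha
    print(f"doubling {name}: loss falls by {reduction:.1%}")
```

Because a power law turns multiplication of the input into a fixed multiplicative change in the output, the same percentage reduction applies at every scale, which is what makes extrapolation possible.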
When both model size and dataset size are varied simultaneously, the loss follows a combined formula:
L(N, D) = [(N_c / N)^(alpha_N / alpha_D) + D_c / D]^alpha_D
The fitted parameters for this combined formula are:
| Parameter | Value |
|---|---|
| alpha_N | 0.076 |
| alpha_D | 0.103 |
| N_c | 6.4 x 10^13 |
| D_c | 1.8 x 10^13 |
This formula captures the interaction between model size and data size, correctly predicting the onset of overfitting when a model is too large for its dataset.
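The interaction between the two terms can be seen numerically. This is a minimal sketch using the combined formula and the fitted constants above; the specific N and D values are illustrative:

```python
# Combined loss L(N, D) from Kaplan et al., with the fitted constants above.
ALPHA_N, ALPHA_D = 0.076, 0.103
N_C, D_C = 6.4e13, 1.8e13

def combined_loss(n, d):
    """L(N, D) = [(N_c / N)^(alpha_N / alpha_D) + D_c / D]^alpha_D."""
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D

# With ample data, a bigger model keeps helping; with fixed data, the
# D_c / D term eventually dominates and returns from model size flatten.
for n in [1e8, 1e9, 1e10]:
    print(f"N={n:.0e}: L={combined_loss(n, 1e10):.3f} at 10B tokens, "
          f"L={combined_loss(n, 1e12):.3f} at 1T tokens")
```

When N grows with D held fixed, the first bracketed term shrinks while the second stays constant, which is exactly the overfitting onset the formula is meant to predict.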
The paper derived a rule of thumb for the minimum dataset size needed to avoid significant overfitting:
D >= (5 x 10^3) x N^0.74
This means that dataset size should grow sublinearly with model size. An 8x increase in model parameters requires only about 5x more data to maintain the same overfitting level.
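The exponent 0.74 is what makes the "about 5x" figure concrete; a short calculation shows the exact ratio:

```python
# Minimum data to avoid significant overfitting: D >= 5e3 * N^0.74.
def min_tokens(n_params):
    """Rule-of-thumb minimum training tokens for N non-embedding parameters."""
    return 5e3 * n_params ** 0.74

# An 8x larger model needs only 8^0.74 times more data:
print(min_tokens(8e9) / min_tokens(1e9))  # ≈ 4.7
```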
The paper also identified a critical batch size that scales with loss:
B_crit(L) = B_* / L^(1/alpha_B)
with B_* = 2 x 10^8 tokens and alpha_B = 0.21. The critical batch size approximately doubles for every 13% decrease in loss. Training with a batch size near B_crit provides the most efficient use of both compute and wall-clock time.
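The doubling-per-13% claim follows directly from the formula, as a quick check confirms:

```python
# Critical batch size B_crit(L) = B_* / L^(1 / alpha_B), in tokens.
B_STAR, ALPHA_B = 2e8, 0.21

def b_crit(loss):
    """Critical batch size (tokens) at a given cross-entropy loss (nats)."""
    return B_STAR / loss ** (1 / ALPHA_B)

# A 13% drop in loss scales B_crit by (1 / 0.87)^(1 / 0.21):
print(b_crit(0.87 * 3.0) / b_crit(3.0))  # ≈ 1.94, i.e. roughly doubles
```

Because the ratio depends only on the relative change in loss, the doubling rule holds at any starting loss value.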
One of the paper's most influential contributions was its prescription for how to allocate a fixed compute budget. Given a total compute budget C (measured in FLOPs, where C = 6ND for training, with 6 FLOPs per parameter per token), the paper found that optimal allocation follows:
| Resource | Scaling with Compute | Exponent |
|---|---|---|
| Model parameters (N) | N_opt proportional to C^0.73 | 0.73 |
| Batch size (B) | B_opt proportional to C^0.24 | 0.24 |
| Training steps (S) | S_opt proportional to C^0.03 | 0.03 |
The exponents sum to approximately 1.0, reflecting the constraint C = 6ND = 6NBS.
The practical implication was clear: when scaling up compute, the overwhelming majority of the additional budget should go toward making the model larger. With a 10x increase in compute, the recommendation was to increase model size by approximately 5.5x, dataset size by approximately 1.8x, and training steps by a negligible amount. This was summarized as "scale models over data."
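The split of a compute increase follows directly from the table's exponents; this sketch evaluates it for a 10x budget increase:

```python
# Splitting a compute increase per the Kaplan allocation exponents.
def allocation(compute_factor):
    """Factors by which to scale N, B, and S when compute grows by compute_factor."""
    return {
        "model params": compute_factor ** 0.73,
        "batch size":   compute_factor ** 0.24,
        "train steps":  compute_factor ** 0.03,
    }

# For 10x compute: model params grow ~5.4x, batch size ~1.7x, steps ~1.07x.
# Dataset size (batch size * steps) grows by 10^0.27, roughly 1.9x.
print(allocation(10))
```

Note that the ~5.4x from the fitted exponent is slightly below the ~5.5x quoted in the text; small differences between the paper's fits account for the discrepancy, and the qualitative conclusion (nearly all new compute goes to model size) is unchanged.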
The tokens-per-parameter ratio under this optimal allocation worked out to approximately 1.7 to 4 tokens per parameter, a number that proved to be substantially lower than the ratio later recommended by the Chinchilla study.
The scaling laws paper and the GPT-3 paper were developed concurrently at OpenAI by largely overlapping teams. The scaling laws findings directly shaped GPT-3's training configuration.
Following the prescription to prioritize model size over data, GPT-3 was trained with 175 billion parameters on approximately 300 billion tokens. This produced a tokens-per-parameter ratio of about 1.7, closely matching the compute-optimal allocation predicted by the Kaplan scaling laws. The model was trained on a diverse dataset consisting of Common Crawl (filtered), WebText2, Books1, Books2, and Wikipedia.
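Applying the C = 6ND approximation to GPT-3's configuration gives a back-of-envelope estimate of its training compute:

```python
# Back-of-envelope training compute for GPT-3 using C = 6 * N * D.
N = 175e9  # parameters
D = 300e9  # training tokens

flops = 6 * N * D                   # total training FLOPs
pf_days = flops / (1e15 * 86400)    # 1 PF-day = 1e15 FLOP/s sustained for a day
print(f"{flops:.2e} FLOPs ≈ {pf_days:.0f} PF-days")
```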
GPT-3's success validated the scaling laws in several ways. The model demonstrated strong few-shot learning abilities that improved smoothly with scale, consistent with the power-law predictions. Performance on downstream benchmarks tracked the predicted improvement curves. The model's zero-shot and few-shot capabilities on tasks it was never explicitly trained for provided evidence that the scaling relationships captured something fundamental about language model behavior.
The GPT-3 paper explicitly referenced the scaling laws work and presented additional evidence for power-law scaling across its eight model sizes (ranging from 125 million to 175 billion parameters). The smooth improvement across these sizes reinforced the conclusion that scale, not architecture, was the primary driver of performance.
Several other large language models developed in 2020 and 2021 also followed the Kaplan scaling prescription. Models like Gopher (280B parameters, 300B tokens), Jurassic-1 (178B parameters), and Megatron-Turing NLG (530B parameters) all exhibited relatively low tokens-per-parameter ratios, consistent with the recommendation to prioritize model size.
In March 2022, Jordan Hoffmann and colleagues at DeepMind published "Training Compute-Optimal Large Language Models" (commonly known as the Chinchilla paper), which significantly revised the compute-optimal allocation recommendations from Kaplan et al.
The Chinchilla paper trained over 400 language models ranging from 70 million to 16 billion parameters on 5 to 500 billion tokens. Its central finding was that model size and training data should be scaled equally: for every doubling of compute, both the number of parameters and the number of training tokens should be doubled. This yielded a dramatically different allocation:
| Aspect | Kaplan et al. (2020) | Hoffmann et al. (2022) |
|---|---|---|
| Optimal N scaling | N_opt proportional to C^0.73 | N_opt proportional to C^0.50 |
| Optimal D scaling | D_opt proportional to C^0.27 | D_opt proportional to C^0.50 |
| Tokens per parameter | ~1.7 to 4 | ~20 |
| Core recommendation | Scale models over data | Scale models and data equally |
| Model size range tested | 768 to 1.5B parameters | 70M to 16B parameters |
| Practical example | GPT-3: 175B params, 300B tokens | Chinchilla: 70B params, 1.4T tokens |
The Chinchilla model (70 billion parameters trained on 1.4 trillion tokens) matched or exceeded the performance of Gopher (280 billion parameters trained on 300 billion tokens) despite using the same compute budget. This demonstrated that Gopher, GPT-3, and similar models were significantly undertrained relative to their size.
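The divergent exponents imply very different splits of the same budget. The sketch below contrasts the two prescriptions under the C = 6ND constraint; anchoring each curve at that paper's flagship model is an illustrative choice of proportionality constant, not either paper's actual fit:

```python
# Sketch: how one compute budget splits under each paper's prescription.
# Proportionality constants are anchored to flagship models (illustrative).

def split(compute, n_exp, n_ref, d_ref):
    """Scale a reference (n_ref, d_ref) to `compute` with N ~ C^n_exp, D ~ C^(1-n_exp)."""
    ratio = compute / (6 * n_ref * d_ref)
    return n_ref * ratio ** n_exp, d_ref * ratio ** (1 - n_exp)

budget = 6 * 70e9 * 1.4e12  # roughly Chinchilla's training budget in FLOPs
n_k, d_k = split(budget, 0.73, 175e9, 300e9)   # Kaplan-style, anchored at GPT-3
n_c, d_c = split(budget, 0.50, 70e9, 1.4e12)   # Chinchilla-style

print(f"Kaplan-style:     N={n_k:.1e}, D={d_k:.1e}, tokens/param={d_k / n_k:.1f}")
print(f"Chinchilla-style: N={n_c:.1e}, D={d_c:.1e}, tokens/param={d_c / n_c:.1f}")
```

At the same budget, the Kaplan-style allocation lands in the low single digits of tokens per parameter while the Chinchilla-style allocation sits at roughly 20, which is the entire practical difference between the two prescriptions.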
Subsequent research, notably "Reconciling Kaplan and Chinchilla Scaling Laws" (Porian et al., 2024), identified the primary reasons for the divergence between the two sets of recommendations.
The most significant factor was the parameter counting methodology. Kaplan et al. counted only non-embedding parameters, while Hoffmann et al. counted total parameters. At small model scales, embedding parameters (vocabulary embeddings and positional embeddings) constitute a meaningful fraction of total parameters. This creates a systematic bias: as models grow, the proportion of embedding parameters shrinks, changing the apparent scaling relationship. When the reconciliation study reanalyzed Chinchilla's data using non-embedding parameters over Kaplan's model range (768 to 1.5 billion parameters), they obtained scaling coefficients of 0.74 to 0.78, closely matching Kaplan's 0.73.
The second factor was the scale of experiments. Kaplan et al. used models up to 1.5 billion parameters, while Hoffmann et al. used models up to 16 billion parameters. The relationship between non-embedding and total parameters exhibits curvature, producing different local power-law fits depending on the model size range examined.
A third factor, though less significant, was the learning rate schedule. Kaplan et al. used a fixed cosine cycle length, while Hoffmann et al. argued that a cosine cycle that extends well beyond the target number of training steps leads to suboptimally trained models. However, ablation studies showed this had a smaller impact on scaling coefficients than the parameter counting issue.
The Chinchilla findings triggered a significant shift in how the AI industry approached model training. Instead of building ever-larger models trained on relatively small datasets, organizations began investing heavily in data curation and training smaller models on much larger datasets.
LLaMA (Meta, 2023) was explicitly designed following Chinchilla-optimal principles: the 65B parameter version was trained on 1.4 trillion tokens, and the smaller 7B version was trained on over 1 trillion tokens. This approach made inference significantly cheaper (smaller models require less memory and compute to run) while maintaining competitive performance.
Taken together, the Kaplan and Chinchilla studies established the concept of the compute-optimal frontier: for any given compute budget, there exists a specific combination of model size and training data that minimizes loss. Operating on this frontier means that neither the model nor the data is the bottleneck.
The parametric loss function underlying this frontier takes the general form:
L(N, D) = E + A / N^alpha + B / D^beta
where E represents the irreducible loss (the entropy of natural language itself), and the remaining terms capture the reducible contributions from finite model size and finite data. This additive form with an explicit irreducible term is the one fit by the Chinchilla paper; Kaplan et al. fit a related combined formula without an explicit E term. The two studies differ in the fitted exponents and, consequently, in the resulting optimal allocation.
The concept of a compute-optimal frontier has practical implications for AI labs planning training runs. Given an estimate of available compute, the frontier prescribes the ideal model size and dataset size. Deviation from this frontier in either direction (model too large and undertrained, or model too small and overtrained) wastes resources.
More recent work has noted that the compute-optimal frontier is relevant mainly for training-time efficiency. In practice, many organizations choose to "overtrain" smaller models (training them on more data than the compute-optimal ratio suggests), because the one-time training cost is less important than the ongoing inference cost of deploying a smaller, faster model.
Richard Sutton's 2019 essay "The Bitter Lesson" articulated a principle that the Kaplan scaling laws quantified empirically. Sutton observed that across seven decades of AI research, methods that leverage increasing computation consistently outperform methods that rely on human-engineered knowledge. Sophisticated hand-crafted features, domain-specific architectures, and expert systems all eventually lose to general approaches powered by more compute.
The Kaplan paper provided the mathematical scaffolding for this observation. By showing that performance follows smooth, predictable power laws with compute, the paper demonstrated that the returns from scaling are not random or architecture-dependent but follow reliable quantitative laws. The exponent alpha_C = 0.050 means that every 10x increase in optimally allocated compute reduces loss by roughly 11% (a factor of 10^-0.050, about 0.89).
This connection between the philosophical argument (Sutton) and the empirical evidence (Kaplan) helped solidify what became known as the scaling hypothesis: the proposition that the primary path to more capable AI systems runs through increasing scale rather than through algorithmic innovation. While the scaling hypothesis remains debated, the Kaplan paper gave it a quantitative foundation that proved influential.
The scaling laws paper fundamentally changed how the AI industry approached research and development planning. Before its publication, decisions about model size and training duration were made largely through intuition and limited experimentation. After the paper, these decisions could be guided by mathematical predictions.
AI labs could now estimate the compute required to reach a target performance level. If a current model achieves a certain loss, the power-law relationships predict how much additional compute, data, or model size is needed to reduce loss by a specific amount. This predictability enabled more accurate budgeting for training runs costing millions of dollars.
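This kind of budgeting amounts to inverting the compute power law. The sketch below uses the L(C) fit from earlier in the article; the current and target loss values are hypothetical:

```python
# Inverting L(C) = (C_c / C)^alpha_C to estimate compute for a target loss.
C_C, ALPHA_C = 3.1e8, 0.050  # PF-days, fitted exponent

def compute_for_loss(target_loss):
    """PF-days of optimally allocated compute predicted to reach target_loss (nats)."""
    return C_C / target_loss ** (1 / ALPHA_C)

current, target = 3.0, 2.7  # hypothetical losses in nats
factor = compute_for_loss(target) / compute_for_loss(current)
print(f"Reaching {target} from {current} nats needs ~{factor:.0f}x more compute")
```

The small exponent makes the inverse steep: because 1/alpha_C = 20, even a 10% loss reduction demands roughly an order of magnitude more compute, which is why these predictions mattered for multi-million-dollar budgeting.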
The finding that returns from scaling are smooth and predictable (with no plateau in sight at the ranges tested) provided justification for massive hardware investments. Companies like NVIDIA, Google, Microsoft, and Amazon expanded their GPU and TPU clusters based partly on the expectation that more compute would translate reliably into better models. The global spending on AI-specific computing hardware accelerated dramatically in the years following the paper.
The paper's implications contributed to an industry-wide race to train larger models. If performance scales predictably with compute, then the organization with the most compute can build the best model. This dynamic drove investments in increasingly large training clusters, competitive hiring of AI researchers, and partnerships between AI labs and cloud computing providers.
Perhaps most importantly, the scaling laws enabled prediction before training. Rather than spending millions on a training run and hoping for the best, labs could estimate the outcome in advance. This capability proved valuable for both research planning and business decisions.
Despite its influence, the paper has several known limitations.
Narrow model range. The experiments used models up to only 1.5 billion non-embedding parameters. Extrapolating the fitted power laws to 100x or 1,000x larger models carries inherent risk, as the relationships may change at scales not yet tested.
Single architecture family. All experiments used Transformer decoder-only language models. The degree to which the specific exponents generalize to other architectures (encoder-decoder models, mixture-of-experts models, state space models, etc.) was not established.
Single dataset. The primary experiments were conducted on WebText2. While the paper tested generalization to other text distributions, the fitted constants are specific to the training data.
Non-embedding parameter counting. As the Chinchilla revision showed, the choice to count non-embedding parameters introduced a systematic bias in the compute-optimal allocation recommendations, particularly at smaller scales.
Loss as the sole metric. The paper focused exclusively on cross-entropy loss rather than downstream task performance. While loss is correlated with task performance, the relationship is not always straightforward, and specific capabilities may emerge nonlinearly at certain scales.
Fixed learning rate schedule. The use of a fixed cosine cycle length may have caused some models to be suboptimally trained, potentially affecting the fitted scaling exponents.
The Kaplan et al. paper spawned an extensive body of follow-up research extending scaling laws to new domains and refining the original findings.
Scaling Laws for Transfer (Hernandez et al., 2021), co-authored by Kaplan, extended the framework to transfer learning and fine-tuning, showing that pre-training effectively multiplies the fine-tuning dataset size.
Chinchilla (Hoffmann et al., 2022) revised the compute-optimal allocation, as discussed above, and triggered a major shift in training practices.
Scaling Laws for Autoregressive Generative Modeling (Henighan et al., 2020), also from the OpenAI team, extended scaling laws to other modalities including images, video, math, and code.
Broken Neural Scaling Laws (Caballero et al., 2023) explored cases where the smooth power-law behavior breaks down, identifying functional forms that better capture transitions between scaling regimes.
Explaining Neural Scaling Laws (Bahri et al., 2024, published in PNAS) provided theoretical frameworks for understanding why power-law relationships emerge in neural network training.
Reconciling Kaplan and Chinchilla Scaling Laws (Porian et al., 2024) resolved the discrepancy between the two landmark papers, as discussed above.
The paper has accumulated over 3,000 citations as of 2025, placing it among the most influential AI research papers of its era. Its impact extends beyond academic citation counts: the paper's core ideas (predictable scaling, compute-optimal allocation, sample efficiency of larger models) became foundational assumptions guiding the development of GPT-4, Claude, Gemini, LLaMA, and other frontier language models.
The following table consolidates the primary equations from the paper for reference.
| Equation | Formula | Key Parameters |
|---|---|---|
| Loss vs. parameters | L(N) = (8.8 x 10^13 / N)^0.076 | N = non-embedding params |
| Loss vs. data | L(D) = (5.4 x 10^13 / D)^0.095 | D = tokens |
| Loss vs. compute | L(C) = (3.1 x 10^8 / C)^0.050 | C = PetaFLOP-days |
| Combined loss | L(N,D) = [(N_c/N)^(alpha_N/alpha_D) + D_c/D]^alpha_D | alpha_N=0.076, alpha_D=0.103 |
| Data requirement | D >= 5 x 10^3 x N^0.74 | To avoid significant overfitting |
| Critical batch size | B_crit = 2 x 10^8 / L^(1/0.21) | L = current loss |
| Optimal model size | N_opt proportional to C^0.73 | C = compute budget |
| Optimal data size | D_opt proportional to C^0.27 | C = compute budget |