Scaling Laws for Neural Language Models is a landmark research paper published by OpenAI in January 2020 (arXiv:2001.08361). Authored by Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei, the paper established that the performance of neural language models follows predictable power-law relationships with three key variables: model size (number of parameters), dataset size (number of training tokens), and the amount of compute used for training. These relationships span more than seven orders of magnitude and hold with remarkable consistency, providing the AI research community with a quantitative framework for predicting how language model performance scales. The paper's findings directly shaped the development of GPT-3 and influenced billions of dollars in compute investment decisions across the AI industry.
Before the publication of this paper, the deep learning community had observed that larger models tend to perform better, but this intuition lacked a rigorous mathematical basis. Researchers trained models at different scales and observed improvements, yet there was no systematic understanding of how performance would change as a function of scale. Training large language models is expensive, and without reliable predictions, organizations risked wasting significant computational resources on suboptimal configurations.
The paper's motivation was to answer a fundamental question: given a fixed compute budget, how should resources be allocated between model size, dataset size, and training duration to achieve the best possible performance? This question had enormous practical implications. If the relationships between these variables and performance could be quantified, researchers could plan training runs more efficiently, predict the returns from scaling, and make informed decisions about hardware investments.
The work built on prior observations from the machine learning community about the benefits of scale. In March 2019, Richard Sutton published "The Bitter Lesson," an influential essay arguing that seventy years of AI research demonstrated a consistent pattern: general methods that leverage computation ultimately outperform specialized methods that exploit human knowledge. Sutton's essay articulated the philosophical foundation, and the Kaplan et al. paper provided the empirical and mathematical specifics.
The paper was produced by a team of ten researchers, primarily affiliated with OpenAI. The lead author, Jared Kaplan, held a dual appointment as an associate professor of physics at Johns Hopkins University and a researcher at OpenAI. His background in theoretical physics, particularly quantum gravity and holography (AdS/CFT), brought a physicist's perspective to the study of neural network behavior. Kaplan later became a co-founder and chief science officer of Anthropic.
The remaining authors were all OpenAI researchers at the time of publication. The overlap with the GPT-3 paper (Brown et al., 2020), published five months later, was total: all ten scaling laws authors (Kaplan, McCandlish, Henighan, Brown, Chess, Child, Gray, Radford, Wu, and Amodei) also appeared on the GPT-3 author list. Dario Amodei, who served as VP of Research at OpenAI, later co-founded Anthropic. Alec Radford was a key architect of the GPT series of models. Tom B. Brown served as lead author on the GPT-3 paper. This overlap between the scaling laws and GPT-3 teams was not coincidental; the scaling laws research directly informed GPT-3's design and training decisions.
All experiments used the WebText2 dataset, an expanded version of the WebText corpus originally collected for training GPT-2. WebText2 consists of 20.3 million documents sourced from Reddit outbound links that received at least three karma points, covering the period from December 2017 through October 2018. The dataset contains approximately 96 GB of text, corresponding to 1.62 x 10^10 words and 2.29 x 10^10 tokens after byte pair encoding (BPE) tokenization with a vocabulary size of 50,257.
The test set contained 6.6 x 10^8 tokens. The researchers also evaluated models on several out-of-distribution datasets, including the Books Corpus, Common Crawl, Wikipedia, and Internet Books, to test transfer and generalization properties.
The experiments used Transformer decoder-only language models, following the architecture of the GPT-2 family. The researchers parameterized models using several hyperparameters: n_layer (number of layers), d_model (dimension of the residual stream), d_ff (dimension of the intermediate feed-forward layer), d_attn (dimension of the attention output), and n_heads (number of attention heads per layer).
Models ranged from 768 non-embedding parameters at the smallest scale to 1.5 billion non-embedding parameters at the largest. The researchers varied model shapes extensively, testing configurations with n_layer ranging from 6 to 207 and d_model ranging from 768 to 4,288. Most runs used a context length of 1,024 tokens and a batch size of 512 sequences (approximately 5 x 10^5 tokens per batch).
A critical methodological choice was that the paper measured model size N as the number of non-embedding parameters, excluding vocabulary and positional embeddings. This decision later proved to be a significant source of divergence from subsequent scaling law studies.
The primary metric was cross-entropy loss (negative log-likelihood per token) on the test set, measured in nats. This metric directly reflects how well a model predicts the next token in a sequence and is the standard objective for autoregressive language model training.
The paper presented a set of interconnected findings that, taken together, provided a comprehensive picture of how language model performance scales. The authors organized their results around several core observations.
The central discovery was that language model loss is primarily determined by three factors: the number of non-embedding parameters (N), the dataset size in tokens (D), and the compute budget (C). Within a wide range of reasonable architectural choices, other hyperparameters such as the ratio of depth to width, the number of attention heads, and the feed-forward layer dimension had minimal impact on performance.
This finding was practically significant because it meant that researchers could focus on scale rather than architecture search when seeking performance improvements. The specific shape of the Transformer mattered far less than its total size.
When performance was not bottlenecked by one of the other two variables, the test loss followed smooth power-law relationships with each individual scaling variable. These power laws held across more than six orders of magnitude with no signs of flattening at the upper end of the ranges tested.
The smoothness and consistency of these curves was striking. Unlike many empirical relationships in machine learning that exhibit noisy or irregular behavior, the scaling curves were remarkably clean and predictable.
The degree of overfitting depended predictably on the ratio of model size to dataset size. Specifically, the performance penalty from overfitting scaled with N^0.74 / D, meaning that an 8x increase in model size required only about 5x more data to maintain the same level of overfitting. This sublinear relationship was the basis for the claim that larger models are more sample-efficient.
Training curves followed predictable power-law trajectories with parameters that were roughly independent of model size. This enabled extrapolation: by observing the early portion of a training curve, researchers could predict the final loss with reasonable accuracy.
When models were evaluated on text distributions different from their training data, the results were strongly correlated with in-distribution performance but offset by a roughly constant amount. In other words, transfer to a different distribution incurred a fixed penalty but otherwise improved in line with training-set performance. Generalization depended almost exclusively on in-distribution validation loss rather than on training duration or proximity to convergence.
Larger models reached the same performance level as smaller models while using fewer optimization steps and fewer data points. This sample efficiency was one of the paper's most consequential findings, because it implied that for a fixed compute budget, training a larger model on less data could be more efficient than training a smaller model on more data.
Optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping well before the model has converged. The paper suggested that models should be trained to only about 10% above their converged loss level for optimal compute efficiency, a practice that was far from conventional at the time.
The paper quantified the scaling relationships through a set of fitted power-law equations. These formulas became widely referenced throughout the AI research community.
When only one variable is allowed to change (with the others effectively unlimited), the loss follows:
| Relationship | Formula | Exponent | Scale Constant |
|---|---|---|---|
| Loss vs. Parameters | L(N) = (N_c / N)^alpha_N | alpha_N = 0.076 | N_c = 8.8 x 10^13 |
| Loss vs. Data | L(D) = (D_c / D)^alpha_D | alpha_D = 0.095 | D_c = 5.4 x 10^13 |
| Loss vs. Compute | L(C_min) = (C_c / C_min)^alpha_C | alpha_C = 0.050 | C_c = 3.1 x 10^8 PF-days |
In these formulas, N refers to non-embedding parameters, D refers to the number of training tokens, and C_min refers to the compute budget in PetaFLOP-days when compute is allocated optimally between model size and training duration.
The exponents indicate diminishing but persistent returns from scaling. Each doubling of parameter count reduces loss by approximately 5.1% (a factor of 2^-0.076, about 0.949). Each doubling of data reduces loss by approximately 6.4%, and each doubling of optimally allocated compute by approximately 3.4%.
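These single-variable laws can be evaluated directly. The following sketch uses the fitted constants from the table above and verifies the per-doubling loss reductions:

```python
# Single-variable scaling laws from Kaplan et al. (2020).
# Loss is cross-entropy in nats; constants are the fitted values above.

def loss_vs_params(n):
    """L(N) = (N_c / N)^alpha_N, with N in non-embedding parameters."""
    return (8.8e13 / n) ** 0.076

def loss_vs_data(d):
    """L(D) = (D_c / D)^alpha_D, with D in training tokens."""
    return (5.4e13 / d) ** 0.095

def loss_vs_compute(c):
    """L(C_min) = (C_c / C_min)^alpha_C, with C_min in PF-days."""
    return (3.1e8 / c) ** 0.050

# Doubling each variable shrinks loss by a fixed factor of 2^-alpha:
for name, alpha in [("params", 0.076), ("data", 0.095), ("compute", 0.050)]:
    reduction = 1 - 2 ** -alpha
    print(f"doubling {name}: loss falls by {reduction:.1%}")
```

Because a power law turns multiplication of the input into a fixed multiplicative change in the output, the same percentage reduction applies at every scale, which is what makes extrapolation possible.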
When both model size and dataset size are varied simultaneously, the loss follows a combined formula:
L(N, D) = [(N_c / N)^(alpha_N / alpha_D) + D_c / D]^alpha_D
The fitted parameters for this combined formula are:
| Parameter | Value |
|---|---|
| alpha_N | 0.076 |
| alpha_D | 0.103 |
| N_c | 6.4 x 10^13 |
| D_c | 1.8 x 10^13 |
This formula captures the interaction between model size and data size, correctly predicting the onset of overfitting when a model is too large for its dataset.
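The interaction between the two terms can be seen numerically. This is a minimal sketch using the combined formula and the fitted constants above; the specific N and D values are illustrative:

```python
# Combined loss L(N, D) from Kaplan et al., with the fitted constants above.
ALPHA_N, ALPHA_D = 0.076, 0.103
N_C, D_C = 6.4e13, 1.8e13

def combined_loss(n, d):
    """L(N, D) = [(N_c / N)^(alpha_N / alpha_D) + D_c / D]^alpha_D."""
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D

# With ample data, a bigger model keeps helping; with fixed data, the
# D_c / D term eventually dominates and returns from model size flatten.
for n in [1e8, 1e9, 1e10]:
    print(f"N={n:.0e}: L={combined_loss(n, 1e10):.3f} at 10B tokens, "
          f"L={combined_loss(n, 1e12):.3f} at 1T tokens")
```

When N grows with D held fixed, the first bracketed term shrinks while the second stays constant, which is exactly the overfitting onset the formula is meant to predict.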
The paper derived a rule of thumb for the minimum dataset size needed to avoid significant overfitting:
D >= (5 x 10^3) x N^0.74
This means that dataset size should grow sublinearly with model size. An 8x increase in model parameters requires only about 5x more data to maintain the same overfitting level.
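The exponent 0.74 is what makes the "about 5x" figure concrete; a short calculation shows the exact ratio:

```python
# Minimum data to avoid significant overfitting: D >= 5e3 * N^0.74.
def min_tokens(n_params):
    """Rule-of-thumb minimum training tokens for N non-embedding parameters."""
    return 5e3 * n_params ** 0.74

# An 8x larger model needs only 8^0.74 times more data:
print(min_tokens(8e9) / min_tokens(1e9))  # ≈ 4.7
```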
The paper also identified a critical batch size that scales with loss:
B_crit(L) = B_* / L^(1/alpha_B)
with B_* = 2 x 10^8 tokens and alpha_B = 0.21. The critical batch size approximately doubles for every 13% decrease in loss. Training with a batch size near B_crit provides the most efficient use of both compute and wall-clock time.
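The doubling-per-13% claim follows directly from the formula, as a quick check confirms:

```python
# Critical batch size B_crit(L) = B_* / L^(1 / alpha_B), in tokens.
B_STAR, ALPHA_B = 2e8, 0.21

def b_crit(loss):
    """Critical batch size (tokens) at a given cross-entropy loss (nats)."""
    return B_STAR / loss ** (1 / ALPHA_B)

# A 13% drop in loss scales B_crit by (1 / 0.87)^(1 / 0.21):
print(b_crit(0.87 * 3.0) / b_crit(3.0))  # ≈ 1.94, i.e. roughly doubles
```

Because the ratio depends only on the relative change in loss, the doubling rule holds at any starting loss value.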
One of the paper's most influential contributions was its prescription for how to allocate a fixed compute budget. Given a total compute budget C (measured in FLOPs, where C = 6ND for training, with 6 FLOPs per parameter per token), the paper found that optimal allocation follows:
| Resource | Scaling with Compute | Exponent |
|---|---|---|
| Model parameters (N) | N_opt proportional to C^0.73 | 0.73 |
| Batch size (B) | B_opt proportional to C^0.24 | 0.24 |
| Training steps (S) | S_opt proportional to C^0.03 | 0.03 |
The exponents sum to approximately 1.0, reflecting the constraint C = 6ND = 6NBS.
The practical implication was clear: when scaling up compute, the overwhelming majority of the additional budget should go toward making the model larger. With a 10x increase in compute, the recommendation was to increase model size by approximately 5.5x, dataset size by approximately 1.8x, and training steps by a negligible amount. This was summarized as "scale models over data."
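The split of a compute increase follows directly from the table's exponents; this sketch evaluates it for a 10x budget increase:

```python
# Splitting a compute increase per the Kaplan allocation exponents.
def allocation(compute_factor):
    """Factors by which to scale N, B, and S when compute grows by compute_factor."""
    return {
        "model params": compute_factor ** 0.73,
        "batch size":   compute_factor ** 0.24,
        "train steps":  compute_factor ** 0.03,
    }

# For 10x compute: model params grow ~5.4x, batch size ~1.7x, steps ~1.07x.
# Dataset size (batch size * steps) grows by 10^0.27, roughly 1.9x.
print(allocation(10))
```

Note that the ~5.4x from the fitted exponent is slightly below the ~5.5x quoted in the text; small differences between the paper's fits account for the discrepancy, and the qualitative conclusion (nearly all new compute goes to model size) is unchanged.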
The tokens-per-parameter ratio under this optimal allocation worked out to approximately 1.7 to 4 tokens per parameter, a number that proved to be substantially lower than the ratio later recommended by the Chinchilla study.
The scaling laws paper and the GPT-3 paper were developed concurrently at OpenAI by largely overlapping teams. The scaling laws findings directly shaped GPT-3's training configuration.
Following the prescription to prioritize model size over data, GPT-3 was trained with 175 billion parameters on approximately 300 billion tokens. This produced a tokens-per-parameter ratio of about 1.7, closely matching the compute-optimal allocation predicted by the Kaplan scaling laws. The model was trained on a diverse dataset consisting of Common Crawl (filtered), WebText2, Books1, Books2, and Wikipedia.
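Applying the C = 6ND approximation to GPT-3's configuration gives a back-of-envelope estimate of its training compute:

```python
# Back-of-envelope training compute for GPT-3 using C = 6 * N * D.
N = 175e9  # parameters
D = 300e9  # training tokens

flops = 6 * N * D                   # total training FLOPs
pf_days = flops / (1e15 * 86400)    # 1 PF-day = 1e15 FLOP/s sustained for a day
print(f"{flops:.2e} FLOPs ≈ {pf_days:.0f} PF-days")
```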
GPT-3's success validated the scaling laws in several ways. The model demonstrated strong few-shot learning abilities that improved smoothly with scale, consistent with the power-law predictions. Performance on downstream benchmarks tracked the predicted improvement curves. The model's zero-shot and few-shot capabilities on tasks it was never explicitly trained for provided evidence that the scaling relationships captured something fundamental about language model behavior.
The GPT-3 paper explicitly referenced the scaling laws work and presented additional evidence for power-law scaling across its eight model sizes (ranging from 125 million to 175 billion parameters). The smooth improvement across these sizes reinforced the conclusion that scale, not architecture, was the primary driver of performance.
Several other large language models developed in 2020 and 2021 also followed the Kaplan scaling prescription. Models like Gopher (280B parameters, 300B tokens), Jurassic-1 (178B parameters), and Megatron-Turing NLG (530B parameters) all exhibited relatively low tokens-per-parameter ratios, consistent with the recommendation to prioritize model size.
In March 2022, Jordan Hoffmann and colleagues at DeepMind published "Training Compute-Optimal Large Language Models" (commonly known as the Chinchilla paper), which significantly revised the compute-optimal allocation recommendations from Kaplan et al.
The Chinchilla paper trained over 400 language models ranging from 70 million to 16 billion parameters on 5 to 500 billion tokens. Its central finding was that model size and training data should be scaled equally: for every doubling of compute, both the number of parameters and the number of training tokens should be doubled. This yielded a dramatically different allocation:
| Aspect | Kaplan et al. (2020) | Hoffmann et al. (2022) |
|---|---|---|
| Optimal N scaling | N_opt proportional to C^0.73 | N_opt proportional to C^0.50 |
| Optimal D scaling | D_opt proportional to C^0.27 | D_opt proportional to C^0.50 |
| Tokens per parameter | ~1.7 to 4 | ~20 |
| Core recommendation | Scale models over data | Scale models and data equally |
| Model size range tested | 768 to 1.5B parameters | 70M to 16B parameters |
| Practical example | GPT-3: 175B params, 300B tokens | Chinchilla: 70B params, 1.4T tokens |
The Chinchilla model (70 billion parameters trained on 1.4 trillion tokens) matched or exceeded the performance of Gopher (280 billion parameters trained on 300 billion tokens) despite using the same compute budget. This demonstrated that Gopher, GPT-3, and similar models were significantly undertrained relative to their size.
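The divergent exponents imply very different splits of the same budget. The sketch below contrasts the two prescriptions under the C = 6ND constraint; anchoring each curve at that paper's flagship model is an illustrative choice of proportionality constant, not either paper's actual fit:

```python
# Sketch: how one compute budget splits under each paper's prescription.
# Proportionality constants are anchored to flagship models (illustrative).

def split(compute, n_exp, n_ref, d_ref):
    """Scale a reference (n_ref, d_ref) to `compute` with N ~ C^n_exp, D ~ C^(1-n_exp)."""
    ratio = compute / (6 * n_ref * d_ref)
    return n_ref * ratio ** n_exp, d_ref * ratio ** (1 - n_exp)

budget = 6 * 70e9 * 1.4e12  # roughly Chinchilla's training budget in FLOPs
n_k, d_k = split(budget, 0.73, 175e9, 300e9)   # Kaplan-style, anchored at GPT-3
n_c, d_c = split(budget, 0.50, 70e9, 1.4e12)   # Chinchilla-style

print(f"Kaplan-style:     N={n_k:.1e}, D={d_k:.1e}, tokens/param={d_k / n_k:.1f}")
print(f"Chinchilla-style: N={n_c:.1e}, D={d_c:.1e}, tokens/param={d_c / n_c:.1f}")
```

At the same budget, the Kaplan-style allocation lands in the low single digits of tokens per parameter while the Chinchilla-style allocation sits at roughly 20, which is the entire practical difference between the two prescriptions.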
Subsequent research, notably "Reconciling Kaplan and Chinchilla Scaling Laws" (Porian et al., 2024), identified the primary reasons for the divergence between the two sets of recommendations.
The most significant factor was the parameter counting methodology. Kaplan et al. counted only non-embedding parameters, while Hoffmann et al. counted total parameters. At small model scales, embedding parameters (vocabulary embeddings and positional embeddings) constitute a meaningful fraction of total parameters. This creates a systematic bias: as models grow, the proportion of embedding parameters shrinks, changing the apparent scaling relationship. When the reconciliation study reanalyzed Chinchilla's data using non-embedding parameters over Kaplan's model range (768 to 1.5 billion parameters), they obtained scaling coefficients of 0.74 to 0.78, closely matching Kaplan's 0.73.
The second factor was the scale of experiments. Kaplan et al. used models up to 1.5 billion parameters, while Hoffmann et al. used models up to 16 billion parameters. The relationship between non-embedding and total parameters exhibits curvature, producing different local power-law fits depending on the model size range examined.
A third factor, though less significant, was the learning rate schedule. Kaplan et al. used a fixed cosine cycle length, while Hoffmann et al. argued that a cosine cycle that extends well beyond the target number of training steps leads to suboptimally trained models. However, ablation studies showed this had a smaller impact on scaling coefficients than the parameter counting issue.
The Chinchilla findings triggered a significant shift in how the AI industry approached model training. Instead of building ever-larger models trained on relatively small datasets, organizations began investing heavily in data curation and training smaller models on much larger datasets.
LLaMA (Meta, 2023) was explicitly designed following Chinchilla-optimal principles: the 65B parameter version was trained on 1.4 trillion tokens, and the smaller 7B version was trained on over 1 trillion tokens. This approach made inference significantly cheaper (smaller models require less memory and compute to run) while maintaining competitive performance.
Taken together, the Kaplan and Chinchilla studies established the concept of the compute-optimal frontier: for any given compute budget, there exists a specific combination of model size and training data that minimizes loss. Operating on this frontier means that neither the model nor the data is the bottleneck.
The parametric loss function underlying this frontier takes the general form:
L(N, D) = E + A / N^alpha + B / D^beta
where E represents the irreducible loss (the entropy of natural language itself), and the remaining terms capture the reducible contributions from finite model size and finite data. This additive form with an explicit irreducible term is the one fit by the Chinchilla paper; Kaplan et al. fit a related combined formula without an explicit E term. The two studies differ in the fitted exponents and, consequently, in the resulting optimal allocation.
The concept of a compute-optimal frontier has practical implications for AI labs planning training runs. Given an estimate of available compute, the frontier prescribes the ideal model size and dataset size. Deviation from this frontier in either direction (model too large and undertrained, or model too small and overtrained) wastes resources.
More recent work has noted that the compute-optimal frontier is relevant mainly for training-time efficiency. In practice, many organizations choose to "overtrain" smaller models (training them on more data than the compute-optimal ratio suggests), because the one-time training cost is less important than the ongoing inference cost of deploying a smaller, faster model.
Richard Sutton's 2019 essay "The Bitter Lesson" articulated a principle that the Kaplan scaling laws quantified empirically. Sutton observed that across seven decades of AI research, methods that leverage increasing computation consistently outperform methods that rely on human-engineered knowledge. Sophisticated hand-crafted features, domain-specific architectures, and expert systems all eventually lose to general approaches powered by more compute.
The Kaplan paper provided the mathematical scaffolding for this observation. By showing that performance follows smooth, predictable power laws with compute, the paper demonstrated that the returns from scaling are not random or architecture-dependent but follow reliable quantitative laws. The exponent alpha_C = 0.050 means that every 10x increase in optimally allocated compute reduces loss by roughly 11% (a factor of 10^-0.050, about 0.89).
This connection between the philosophical argument (Sutton) and the empirical evidence (Kaplan) helped solidify what became known as the scaling hypothesis: the proposition that the primary path to more capable AI systems runs through increasing scale rather than through algorithmic innovation. While the scaling hypothesis remains debated, the Kaplan paper gave it a quantitative foundation that proved influential.
The scaling laws paper fundamentally changed how the AI industry approached research and development planning. Before its publication, decisions about model size and training duration were made largely through intuition and limited experimentation. After the paper, these decisions could be guided by mathematical predictions.
AI labs could now estimate the compute required to reach a target performance level. If a current model achieves a certain loss, the power-law relationships predict how much additional compute, data, or model size is needed to reduce loss by a specific amount. This predictability enabled more accurate budgeting for training runs costing millions of dollars.
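This kind of budgeting amounts to inverting the compute power law. The sketch below uses the L(C) fit from earlier in the article; the current and target loss values are hypothetical:

```python
# Inverting L(C) = (C_c / C)^alpha_C to estimate compute for a target loss.
C_C, ALPHA_C = 3.1e8, 0.050  # PF-days, fitted exponent

def compute_for_loss(target_loss):
    """PF-days of optimally allocated compute predicted to reach target_loss (nats)."""
    return C_C / target_loss ** (1 / ALPHA_C)

current, target = 3.0, 2.7  # hypothetical losses in nats
factor = compute_for_loss(target) / compute_for_loss(current)
print(f"Reaching {target} from {current} nats needs ~{factor:.0f}x more compute")
```

The small exponent makes the inverse steep: because 1/alpha_C = 20, even a 10% loss reduction demands roughly an order of magnitude more compute, which is why these predictions mattered for multi-million-dollar budgeting.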
The finding that returns from scaling are smooth and predictable (with no plateau in sight at the ranges tested) provided justification for massive hardware investments. Companies like NVIDIA, Google, Microsoft, and Amazon expanded their GPU and TPU clusters based partly on the expectation that more compute would translate reliably into better models. The global spending on AI-specific computing hardware accelerated dramatically in the years following the paper.
The paper's implications contributed to an industry-wide race to train larger models. If performance scales predictably with compute, then the organization with the most compute can build the best model. This dynamic drove investments in increasingly large training clusters, competitive hiring of AI researchers, and partnerships between AI labs and cloud computing providers.
Perhaps most importantly, the scaling laws enabled prediction before training. Rather than spending millions on a training run and hoping for the best, labs could estimate the outcome in advance. This capability proved valuable for both research planning and business decisions.
Despite its influence, the paper has several known limitations.
Narrow model range. The experiments used models up to only 1.5 billion non-embedding parameters. Extrapolating the fitted power laws to 100x or 1,000x larger models carries inherent risk, as the relationships may change at scales not yet tested.
Single architecture family. All experiments used Transformer decoder-only language models. The degree to which the specific exponents generalize to other architectures (encoder-decoder models, mixture-of-experts models, state space models, etc.) was not established.
Single dataset. The primary experiments were conducted on WebText2. While the paper tested generalization to other text distributions, the fitted constants are specific to the training data.
Non-embedding parameter counting. As the Chinchilla revision showed, the choice to count non-embedding parameters introduced a systematic bias in the compute-optimal allocation recommendations, particularly at smaller scales.
Loss as the sole metric. The paper focused exclusively on cross-entropy loss rather than downstream task performance. While loss is correlated with task performance, the relationship is not always straightforward, and specific capabilities may emerge nonlinearly at certain scales.
Fixed learning rate schedule. The use of a fixed cosine cycle length may have caused some models to be suboptimally trained, potentially affecting the fitted scaling exponents.
The Kaplan et al. paper spawned an extensive body of follow-up research extending scaling laws to new domains and refining the original findings.
Scaling Laws for Transfer (Hernandez et al., 2021), co-authored by Kaplan, extended the framework to transfer learning and fine-tuning, showing that pre-training effectively multiplies the fine-tuning dataset size.
Chinchilla (Hoffmann et al., 2022) revised the compute-optimal allocation, as discussed above, and triggered a major shift in training practices.
Scaling Laws for Autoregressive Generative Modeling (Henighan et al., 2020), also from the OpenAI team, extended scaling laws to other modalities including images, video, math, and code.
Broken Neural Scaling Laws (Caballero et al., 2023) explored cases where the smooth power-law behavior breaks down, identifying functional forms that better capture transitions between scaling regimes.
Explaining Neural Scaling Laws (Bahri et al., 2024, published in PNAS) provided theoretical frameworks for understanding why power-law relationships emerge in neural network training.
Reconciling Kaplan and Chinchilla Scaling Laws (Porian et al., 2024) resolved the discrepancy between the two landmark papers, as discussed above.
The paper has accumulated over 3,000 citations as of 2025, placing it among the most influential AI research papers of its era. Its impact extends beyond academic citation counts: the paper's core ideas (predictable scaling, compute-optimal allocation, sample efficiency of larger models) became foundational assumptions guiding the development of GPT-4, Claude, Gemini, LLaMA, and other frontier language models.
The following table consolidates the primary equations from the paper for reference.
| Equation | Formula | Key Parameters |
|---|---|---|
| Loss vs. parameters | L(N) = (8.8 x 10^13 / N)^0.076 | N = non-embedding params |
| Loss vs. data | L(D) = (5.4 x 10^13 / D)^0.095 | D = tokens |
| Loss vs. compute | L(C) = (3.1 x 10^8 / C)^0.050 | C = PetaFLOP-days |
| Combined loss | L(N,D) = [(N_c/N)^(alpha_N/alpha_D) + D_c/D]^alpha_D | alpha_N=0.076, alpha_D=0.103 |
| Data requirement | D >= 5 x 10^3 x N^0.74 | To avoid significant overfitting |
| Critical batch size | B_crit = 2 x 10^8 / L^(1/0.21) | L = current loss |
| Optimal model size | N_opt proportional to C^0.73 | C = compute budget |
| Optimal data size | D_opt proportional to C^0.27 | C = compute budget |