Gopher (language model)
Last reviewed
May 3, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 · 3,260 words
Gopher is a 280-billion-parameter autoregressive transformer language model developed by DeepMind and described in a trio of companion papers released on December 8, 2021. The model was never released publicly as weights or as an API, though DeepMind's research papers documented its architecture, training data, and benchmark results in unusual depth. Gopher is best known today as the immediate predecessor to Chinchilla, the 2022 follow-up that demonstrated Gopher had been heavily undertrained, and as a key step in the lineage that eventually led to Google's Gemini family. The Gopher work also produced two influential companion publications: the RETRO retrieval-augmented model from Borgeaud et al., and the Weidinger et al. taxonomy of ethical and social risks from language models.
| Field | Value |
|---|---|
| Developer | DeepMind |
| Type | Autoregressive transformer language model |
| Parameters | 280 billion (largest in family) |
| Training tokens | 300 billion |
| Training data | MassiveText (10.5 TB after filtering) |
| Training hardware | TPU v3 (4 pods, 4096 chips) |
| Architecture | 80 layers, 16,384 hidden dim, 128 attention heads |
| Sequence length | 2,048 tokens |
| Vocabulary | SentencePiece, 32,000 tokens |
| Initial release | December 8, 2021 (paper only) |
| License | Closed weights (research access only) |
| Status | Superseded by Chinchilla (March 2022) |
In 2020 and 2021, the dominant trend in language model research was raw scale. GPT-3 had reached 175 billion parameters using a relatively simple recipe: a decoder-only transformer trained on a few hundred billion tokens of internet text. Megatron-Turing NLG hit 530 billion parameters, and Chinese labs were releasing models at similar scales. The implicit assumption, supported by the Kaplan et al. scaling laws from OpenAI, was that bigger almost always meant better.
DeepMind had not yet released a model at this scale. Its earlier language work focused on Compressive Transformers and other architectural research rather than headline parameter counts. The Gopher project was an effort to catch up, and to do so methodically: build a family of models spanning more than three orders of magnitude in size, train them all on the same dataset, and use that ladder to study how scale interacts with task difficulty. The 280-billion-parameter top-of-stack model was the headline number, but the real product was the analysis.
The lead author, Jack Rae, framed the work as an attempt to identify which capabilities improve smoothly with scale and which do not. The team trained models at six sizes, each on the same 300 billion tokens and the same data mixture, then evaluated all of them on 152 tasks. That gave them six points on the scaling curve for every benchmark.
The project sat alongside two companion efforts. One investigated what happens when you give a smaller language model access to a large external retrieval database, which became the RETRO paper. The other surveyed the ethical and social risks of large language models, intended as a taxonomy that would help DeepMind and the wider research community decide what to release and what not to. All three papers were posted to arXiv on the same day.
Gopher uses a decoder-only transformer with two modifications relative to the original 2017 design. The first is RMSNorm in place of LayerNorm, a simpler normalization that drops the mean-centering step and tends to give marginally better stability at scale. The second is relative positional encoding in the Transformer-XL style rather than absolute positional embeddings. The relative scheme lets the model evaluate on sequences longer than it was trained on, which mattered for some downstream tasks even though training itself used a fixed 2,048-token context.
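The difference between the two normalizations can be illustrated with a minimal sketch. This is a plain-Python illustration of the general LayerNorm and RMSNorm definitions, not DeepMind's implementation; the function names, the `eps` value, and the example vector are assumptions chosen for readability.

```python
import math

def layer_norm(x, gain, bias, eps=1e-6):
    """LayerNorm: subtract the mean, divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gain, bias)]

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide by the root mean square only (no mean-centering, no bias)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]

# Hypothetical activation vector, with gains of 1 and biases of 0.
x = [1.0, 2.0, 3.0, 4.0]
ones = [1.0] * len(x)
zeros = [0.0] * len(x)

print("LayerNorm:", layer_norm(x, ones, zeros))
print("RMSNorm:  ", rms_norm(x, ones))
```

LayerNorm output is centered on zero, while RMSNorm only rescales the vector so that its root mean square is close to one, which saves the mean computation and the bias parameters.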
Tokenization is SentencePiece with a 32,000-token vocabulary and byte-level fallback so that any UTF-8 input can be encoded. The optimizer is Adam with a cosine learning rate decay schedule.
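A cosine decay schedule can be sketched as follows. The article does not give Gopher's learning rates, step counts, or warmup settings, so every number below is a hypothetical placeholder, and the optional linear warmup is an assumption added for illustration.

```python
import math

def cosine_lr(step, total_steps, max_lr, min_lr, warmup_steps=0):
    """Learning rate at `step`: optional linear warmup, then cosine decay to min_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp up to max_lr over the warmup period (assumed, not from the source).
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, progress))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical schedule: 1,000 steps decaying from 1e-4 to 1e-5.
for step in (0, 250, 500, 750, 1000):
    print(step, cosine_lr(step, total_steps=1000, max_lr=1e-4, min_lr=1e-5))
```

The rate falls slowly at first, fastest in the middle of training, and flattens out as it approaches the minimum.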
The largest model is 80 layers deep, with a hidden dimension of 16,384 and 128 attention heads. Each attention head uses a key/value size of 128. The full model family scales these dimensions down together, sharing the same overall recipe but with smaller width, depth, and head count.
The Gopher paper trained six transformer language models with otherwise identical recipes, varying only the size. All six were trained on the same 300 billion tokens of MassiveText to keep the comparison clean.
| Model | Parameters | Layers | Heads | Key/value size | Hidden dim |
|---|---|---|---|---|---|
| Gopher-44M | 44 million | 8 | 16 | 32 | 512 |
| Gopher-117M | 117 million | 12 | 12 | 64 | 768 |
| Gopher-417M | 417 million | 12 | 12 | 128 | 1,536 |
| Gopher-1.4B | 1.4 billion | 24 | 16 | 128 | 2,048 |
| Gopher-7.1B | 7.1 billion | 32 | 32 | 128 | 4,096 |
| Gopher | 280 billion | 80 | 128 | 128 | 16,384 |
The naming convention got a little awkward in practice. "Gopher" without a suffix usually means the 280B model, while the smaller variants are referred to by parameter count. The 1.4B and 7.1B models were the workhorses for ablation studies in the paper, since training a single 280B run was prohibitively expensive to repeat.
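The parameter counts in the family table can be roughly reproduced from layer count and hidden dimension alone. The sketch below is a back-of-envelope estimate, not the paper's accounting. It assumes about 13·d² weights per layer (4·d² for the attention projections, 8·d² for a feed-forward block with an assumed 4x expansion, and one further d² assumed for a relative-position projection), plus a 32,000 × d embedding matrix; biases and normalization gains are ignored.

```python
VOCAB_SIZE = 32_000  # vocabulary size from the article

# name: (nominal parameter count, layers, hidden dim), copied from the family table
FAMILY = {
    "Gopher-44M": (44e6, 8, 512),
    "Gopher-117M": (117e6, 12, 768),
    "Gopher-417M": (417e6, 12, 1536),
    "Gopher-1.4B": (1.4e9, 24, 2048),
    "Gopher-7.1B": (7.1e9, 32, 4096),
    "Gopher": (280e9, 80, 16384),
}

def estimate_params(layers, d_model, vocab_size=VOCAB_SIZE, per_layer_coeff=13):
    """Rough count: per_layer_coeff * d^2 weights per layer plus the embedding table.

    per_layer_coeff=13 is an assumption made for this sketch (see lead-in).
    """
    return per_layer_coeff * layers * d_model ** 2 + vocab_size * d_model

for name, (nominal, layers, d_model) in FAMILY.items():
    est = estimate_params(layers, d_model)
    print(f"{name:12s} nominal={nominal:.3g} estimate={est:.3g} ratio={est / nominal:.3f}")
```

Under these assumptions the estimate lands within a few percent of the nominal size for every row, which suggests the published sizes count the embedding table along with the transformer layers.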
Gopher was trained on MassiveText, an in-house dataset assembled by DeepMind specifically for the project. After filtering and deduplication, MassiveText totals roughly 10.5 TB of text and around 2.35 billion documents. The full corpus contains far more tokens than any of the Gopher models were trained on; the 300-billion-token training run sampled only about 12.8% of the available data.
The dataset combines six sources, with sampling proportions chosen empirically by training smaller models on different mixtures and picking the configuration that performed best across downstream evaluations.
| Source | Tokens | Sampling proportion | Notes |
|---|---|---|---|
| MassiveWeb | ~506 billion | 48% | Filtered web pages from DeepMind's own crawl |
| Books | ~560 billion | 27% | Published written works |
| C4 | ~182 billion | 10% | Common Crawl subset used for T5 |
| News | ~676 billion | 10% | News article archives |
| GitHub | ~422 billion | 3% | Open source code |
| Wikipedia | ~4 billion | 2% | Encyclopedia text |
MassiveWeb is DeepMind's own web crawl, filtered for quality and language. The pipeline strips out explicit content, performs document-level deduplication, removes low-quality pages, and filters out documents that overlap with the evaluation test sets to avoid contamination. The Books portion is the second largest by raw token count, after News, and the small Wikipedia slice is heavily upweighted: it holds under 0.2% of the raw tokens but is sampled at 2%, roughly twelve times its raw share.
The sampling weights are notable for what they de-emphasize. GitHub code makes up around 18% of total tokens but only 3% of the training mix, and News is cut from roughly 29% of tokens to 10% of the mix. C4, by contrast, is sampled slightly above its raw share of about 8%. The mixture was tuned for general-purpose language modeling rather than for code or for any specific downstream task.
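The relationship between raw token share and sampling proportion can be checked directly from the MassiveText table. The sketch below uses only the token counts and sampling proportions listed there; the function and variable names are illustrative.

```python
# name: (tokens in billions, sampling proportion), copied from the MassiveText table
MASSIVETEXT = {
    "MassiveWeb": (506, 0.48),
    "Books": (560, 0.27),
    "C4": (182, 0.10),
    "News": (676, 0.10),
    "GitHub": (422, 0.03),
    "Wikipedia": (4, 0.02),
}

def mixture_weights(sources):
    """Return {name: (raw_share, sampling_proportion, sampling / raw_share)}."""
    total = sum(tokens for tokens, _ in sources.values())
    rows = {}
    for name, (tokens, sampled) in sources.items():
        raw_share = tokens / total
        rows[name] = (raw_share, sampled, sampled / raw_share)
    return rows

total_tokens = sum(tokens for tokens, _ in MASSIVETEXT.values())
print(f"total: {total_tokens}B tokens; a 300B-token run covers {300 / total_tokens:.1%}")

for name, (raw, sampled, ratio) in mixture_weights(MASSIVETEXT).items():
    print(f"{name:11s} raw={raw:6.1%} sampled={sampled:4.0%} weight_ratio={ratio:5.2f}")
```

A weight ratio above 1 means a source is sampled more often than its raw size would suggest, and a ratio below 1 means it is downweighted.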
DeepMind trained Gopher on 4,096 TPU v3 chips, organized as four pods of 1,024 chips each. The largest model combined model parallelism within each pod with data parallelism across pods. The 280B run additionally used pipeline parallelism, which added only about 10% overhead. The smaller family members fit on smaller slices of TPU.
The full 300-billion-token training run for the 280B model took several weeks of wall-clock time. The smaller models in the family were faster to train and were used to design the data mixture, set hyperparameters, and run ablation studies. The training cost has not been published as a dollar figure, but at TPU v3 cloud rates of the era a back-of-envelope estimate puts the 280B run in the range of tens of millions of dollars in compute.
The Gopher paper evaluated all six model sizes on a battery of 152 tasks. The benchmark suite included 62 tasks from BIG-bench, 57 from MMLU, plus a wide range of language modeling, reading comprehension, fact-checking, question answering, and common-sense reasoning evaluations. Out of the 124 tasks where a prior state-of-the-art result existed, Gopher beat the previous best on 100 of them.
The pattern of where Gopher won and where it lost was almost as interesting as the headline number. Gains were concentrated in knowledge-heavy tasks: reading comprehension, fact-checking, closed-book question answering, and topics that depend on memorized world knowledge such as humanities and social sciences. Logical and mathematical reasoning improved much less with scale; on some math benchmarks, the 280B model performed barely better than the 7.1B variant. The team's own write-up was blunt about this, noting that scale alone did not appear to be the answer to the reasoning gap.
A selection of representative results, drawn from the paper:
| Benchmark | Gopher (280B) | Prior SOTA at time of release |
|---|---|---|
| MMLU (57 subjects, average) | 60.0% | ~43.9% (GPT-3 5-shot) |
| LAMBADA (zero-shot) | 74.5% | 76.6% (Megatron-Turing NLG) |
| Winograd (WSC273) | 83.2% | 90.1% (PaLM, April 2022; a later result, not prior SOTA) |
| TriviaQA (zero-shot) | 52.8% | 28.5% (GPT-3) |
| Natural Questions (open) | 21.0% | 14.6% (GPT-3) |
| BIG-bench (subset) | best on majority of 62 tasks | varied by task |
MMLU was the most-discussed result. Gopher's 60% average put it well ahead of GPT-3's 43.9% and made it the leading model on the benchmark at the time of publication. That lead held for only a few months before being eclipsed by Chinchilla, but it was a significant marker. On TriviaQA in zero-shot, Gopher nearly doubled GPT-3's accuracy, a result the DeepMind team attributed largely to the scale and quality of the MassiveText training data.
The failure modes were also catalogued openly. Gopher tended to repeat itself in long generations, exhibited the same kinds of stereotypical biases as other large models, and could state false information confidently. The companion ethical risks paper picked up many of these threads.
The RETRO paper, "Improving language models by retrieving from trillions of tokens" by Sebastian Borgeaud and colleagues, was released the same day as the main Gopher paper. RETRO stands for Retrieval-Enhanced Transformer. The model architecture combines a comparatively small decoder-only transformer (the largest variant has 7.5 billion parameters) with a frozen BERT retriever and a 2-trillion-token retrieval database.
During inference, RETRO splits the input into chunks, encodes each chunk with the BERT retriever, and finds the nearest neighbors in the retrieval database. Those retrieved passages are then fed into the transformer through a chunked cross-attention mechanism. The effect is that the model has access at inference time to a corpus far larger than what it was trained on.
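The chunk-and-retrieve step described above can be sketched in a few lines. This is a toy illustration only: a bag-of-words count vector stands in for the frozen BERT retriever, a three-passage list stands in for the 2-trillion-token database, and the chunk size of 4 is arbitrary. The chunked cross-attention that consumes the retrieved passages is not shown.

```python
import math
from collections import Counter

def chunk(tokens, size):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def embed(tokens):
    """Toy stand-in for the frozen retriever: a bag-of-words count vector."""
    return Counter(tokens)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query_tokens, database, chunk_size=4, k=1):
    """For each input chunk, return its k most similar database passages."""
    results = []
    for piece in chunk(query_tokens, chunk_size):
        query_vec = embed(piece)
        ranked = sorted(database,
                        key=lambda passage: cosine(query_vec, embed(passage)),
                        reverse=True)
        results.append((piece, ranked[:k]))
    return results

# Hypothetical miniature retrieval database.
database = [
    "the gopher is a burrowing rodent".split(),
    "transformers use attention over tokens".split(),
    "tpu pods connect many accelerator chips".split(),
]
query = "attention over tokens in transformers and tpu chips in pods".split()

for piece, neighbours in retrieve(query, database):
    print(" ".join(piece), "->", [" ".join(n) for n in neighbours])
```

A real system would replace the exhaustive sort with an approximate nearest-neighbor index, since scoring every passage in a database of that size for every chunk would be impractical.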
The headline claim was that the 7.5B-parameter RETRO matched the performance of much larger conventional language models on the Pile, with roughly 25 times fewer parameters than GPT-3 and Jurassic-1 on some evaluations. The team also showed that retrieval helps even when the retrieval database mostly overlaps with the training corpus, suggesting that the explicit retrieval mechanism allows the model to look up rare facts on demand rather than memorizing everything in its weights.
RETRO was an early and unusually well-engineered example of retrieval-augmented language modeling. Its architectural choices, especially the chunked cross-attention design, influenced subsequent work in retrieval-augmented generation, though the specific RETRO recipe was not widely adopted at scale.
The third companion paper, "Ethical and social risks of harm from Language Models" by Laura Weidinger and colleagues, did not introduce a new model. It was a survey and taxonomy, intended as a reference document for researchers and policy teams trying to reason about what could go wrong with systems like Gopher.
The paper organized risks into six categories:
| Category | Examples |
|---|---|
| Discrimination, exclusion, and toxicity | Stereotyping, unfair bias, offensive output, disparate performance across groups |
| Information hazards | Leaking private data, inferring sensitive information, helping people find harmful instructions |
| Misinformation harms | Confidently false output, eroding trust in shared sources |
| Malicious uses | Targeted manipulation, scaled disinformation, helping with cyberattacks |
| Human-computer interaction harms | Unsafe usage, manipulation of users, anthropomorphism, over-reliance |
| Automation, access, and environmental harms | Labor displacement, unequal access, training compute footprint |
The paper analyzed 21 specific risks within these categories and laid out what mitigations existed and where current evaluation methods fell short. It was not framed as a release-decision document, but in practice it set part of the rationale for why Gopher itself was kept behind closed doors. The taxonomy went on to be widely cited in subsequent responsible-AI literature and was extended in later DeepMind publications.
In March 2022, just three months after the Gopher papers, a different DeepMind team led by Jordan Hoffmann published "Training Compute-Optimal Large Language Models" (arXiv:2203.15556). Their work, which used Gopher as the central reference point, argued that Gopher and other large models of the era had been trained badly, not because of any flaw in their architectures but because the field had been misallocating compute.
Hoffmann et al. trained more than 400 transformer language models at sizes from 70 million to 16 billion parameters and on token counts from 5 billion to 500 billion. They used three independent methodologies to estimate, for any given compute budget, the model size and training token count that minimize loss. All three methods agreed: the right answer was to train smaller models on much more data than the field had been doing.
Applied to the compute budget that had produced Gopher (roughly 5.76 × 10^23 FLOPs), the new analysis predicted an optimal model of around 63 to 67 billion parameters trained on 1.4 to 1.5 trillion tokens. That is a model roughly four times smaller than Gopher, trained on more than four times as much data; Gopher had used 280 billion parameters and 300 billion tokens. The implied tokens-to-parameters ratio was about 20, compared to roughly 1.07 for Gopher.
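The arithmetic behind these figures can be approximated with a short sketch. It assumes the common rule of thumb that training compute is about 6 × parameters × tokens, and it takes the ratio of 20 tokens per parameter from the text rather than fitting it as Hoffmann et al. did. The function names are illustrative.

```python
import math

def training_flops(params, tokens):
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token."""
    return 6.0 * params * tokens

def compute_optimal(flops, tokens_per_param=20.0):
    """Solve 6 * N * D = flops with D = tokens_per_param * N for N and D."""
    params = math.sqrt(flops / (6.0 * tokens_per_param))
    return params, tokens_per_param * params

gopher_params, gopher_tokens = 280e9, 300e9
print(f"Gopher tokens per parameter: {gopher_tokens / gopher_params:.2f}")
print(f"Gopher compute, 6ND estimate: {training_flops(gopher_params, gopher_tokens):.2e} FLOPs")

# Compute budget quoted in the article for the Gopher run.
params, tokens = compute_optimal(5.76e23)
print(f"compute-optimal size: {params / 1e9:.1f}B parameters on {tokens / 1e12:.2f}T tokens")
```

The 6ND estimate comes out somewhat below the compute figure quoted above, which is expected from an approximation that ignores some per-token costs. Solving for the optimum at the quoted budget gives a model of roughly 70 billion parameters trained on about 1.4 trillion tokens.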
To verify the prediction, the team trained Chinchilla, a 70-billion-parameter model on 1.4 trillion tokens of MassiveText, using the same compute budget Gopher had consumed. Chinchilla outperformed Gopher across nearly every benchmark, often by significant margins. On MMLU, Chinchilla reached 67.5% average accuracy, 7.5 percentage points above Gopher. On reasoning, knowledge, and commonsense tasks, the smaller model was consistently better.
The implication was uncomfortable for the broader field. Gopher had not failed because the architecture was wrong or the data was bad; it had failed (in the sense of being beaten by a four-times-smaller model on the same compute) because the training recipe was off. The same was true, by extension, of GPT-3, Megatron-Turing NLG, and Jurassic-1, all of which had been trained with similarly skewed parameter-to-token ratios.
The principles derived from this work became known as the Chinchilla scaling laws, or compute-optimal scaling, and they reshaped how almost everyone trained large language models afterward. The general guideline (model size and training tokens should scale roughly equally with compute, at a ratio of about 20 tokens per parameter) became the default for designs like Meta's LLaMA family and many subsequent open and closed models.
Gopher itself had a short half-life as a frontier model. By mid-2022 it had been outperformed by Chinchilla (DeepMind's own follow-up), PaLM (Google Brain's 540B model), and other systems. The 280B weights were never released, the model was never offered as a public API, and direct access was limited to DeepMind researchers and selected collaborators. As a deployed system, Gopher's footprint outside DeepMind was effectively zero.
Its impact on the field came through the research output rather than the model itself. The Gopher paper provided the most detailed published analysis at that time of how transformer language models scale across more than three orders of magnitude. The MMLU results helped establish that benchmark as the standard knowledge-and-reasoning test for large language models. The accompanying taxonomy of ethical risks became a reference point for responsible-AI work, and the RETRO paper seeded a long line of retrieval-augmented language model research.
The most consequential legacy is that Gopher was the model Chinchilla measured itself against. Without the 280B Gopher run as a concrete reference point, the Chinchilla scaling laws would have been a much harder argument to make. The two models together produced the most influential single-year revision of training practice in the history of large language models.
The lineage continues through DeepMind's later work. After the 2023 reorganization that combined DeepMind and Google Brain into Google DeepMind, much of the team that had built Gopher and Chinchilla moved on to the Gemini project. Gemini's training recipes inherit the compute-optimal scaling philosophy that came directly out of comparing Chinchilla to Gopher. The path from Gopher to Chinchilla to Gemini is one of the cleaner research-to-product lineages in modern AI.
Gopher itself, meanwhile, sits in a strange position in retrospective accounts. It was a serious effort that pushed the state of the art on dozens of benchmarks, was documented in unusual depth, and produced two important companion papers. It was also the canonical example, three months later, of how not to train a large language model. Both things are true.