Gopher (language model)

Updated Jun 24, 2026

Gopher is a 280-billion-parameter autoregressive transformer language model developed by DeepMind and described in a trio of companion papers released on December 8, 2021. Across a family of six models spanning tens of millions to 280 billion parameters, Gopher was evaluated on 152 diverse tasks and set a new state of the art on 100 of the 124 tasks where a prior best result existed.^[1]^[5] The model was never released publicly as weights or as an API, though DeepMind's research papers documented its architecture, training data, and benchmark results in unusual depth. Gopher is best known today as the immediate predecessor to Chinchilla, the 2022 follow-up that demonstrated Gopher had been heavily undertrained, and as a key step in the lineage that eventually led to Google's Gemini family. The Gopher work also produced two influential companion publications: the RETRO retrieval-augmented model from Borgeaud et al., and the Weidinger et al. taxonomy of ethical and social risks from language models.^[2]^[3]

Infobox

Field	Value
Developer	DeepMind
Type	Autoregressive transformer language model
Parameters	280 billion (largest in family)
Training tokens	300 billion
Training data	MassiveText (10.5 TB after filtering)
Training hardware	TPU v3 (4 pods, 4096 chips)
Architecture	80 layers, 16,384 hidden dim, 128 attention heads
Sequence length	2,048 tokens
Vocabulary	SentencePiece, 32,000 tokens
Initial release	December 8, 2021 (paper only)
License	Closed weights (research access only)
Status	Superseded by Chinchilla (March 2022)

Why did DeepMind build Gopher?

In 2020 and 2021, the dominant trend in language model research was raw scale. GPT-3 had reached 175 billion parameters using a relatively simple recipe: a decoder-only transformer trained on a few hundred billion tokens of internet text. Megatron-Turing NLG hit 530 billion parameters, and Chinese labs were releasing models at similar scales. The implicit assumption, supported by the Kaplan et al. scaling laws from OpenAI, was that bigger almost always meant better.^[6]

DeepMind had not yet released a model at this scale. Its earlier language work focused on Compressive Transformers and other architectural research rather than headline parameter counts. The Gopher project was an effort to catch up, and to do so methodically: build a family of models spanning more than three orders of magnitude in size, train them all on the same dataset, and use that ladder to study how scale interacts with task difficulty. The 280-billion-parameter top-of-stack model was the headline number, but the real product was the analysis.^[1]

The lead author, Jack Rae, framed the work as an attempt to identify which capabilities improve smoothly with scale and which do not. The team trained models at six sizes, each for 300 billion tokens on the same data mixture, then evaluated all of them on 152 tasks. That gave them six points on the scaling curve for every benchmark.^[1]

The project sat alongside two companion efforts. In the words of DeepMind's announcement, the team released "three papers on language models that reflect this interdisciplinary approach": the Gopher study itself, "a study of ethical and social risks associated with large language models, and a paper investigating a new architecture with better training efficiency."^[5] All three papers were posted to arXiv on the same day, December 8, 2021.^[1]^[2]^[3]

Model architecture

Gopher uses a decoder-only transformer with two modifications relative to the original 2017 design. The first is RMSNorm in place of LayerNorm, a simpler normalization that drops the mean-centering step and tends to give marginally better stability at scale.^[10] The second is relative positional encoding in the Transformer-XL style rather than absolute positional embeddings.^[9] The relative scheme lets the model evaluate on sequences longer than it was trained on, which mattered for some downstream tasks even though training itself used a fixed 2,048-token context.^[1]
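
The difference between the two normalizations can be illustrated with a minimal sketch in plain Python. This is not DeepMind's implementation: the function names, the `eps` value, and the list-based vectors are assumptions chosen for readability.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide by the root mean square of x, then apply a learned gain.
    There is no mean subtraction and no bias term."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]

def layer_norm(x, gain, bias, eps=1e-6):
    """LayerNorm, for comparison: subtract the mean, divide by the standard
    deviation, then apply a learned gain and bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gain, bias)]

# Illustrative input only; gains of 1 and biases of 0 leave the
# normalized values unchanged.
x = [1.0, 2.0, 3.0, 4.0]
ones = [1.0] * len(x)
zeros = [0.0] * len(x)
print("RMSNorm:  ", rms_norm(x, ones))
print("LayerNorm:", layer_norm(x, ones, zeros))
```

The RMSNorm output keeps the sign and relative offset of the input, whereas the LayerNorm output is re-centered around zero. Skipping the mean computation is what makes RMSNorm the simpler of the two.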

Tokenization is SentencePiece with a 32,000-token vocabulary and byte-level fallback so that any UTF-8 input can be encoded. The optimizer is Adam with a cosine learning rate decay schedule.^[1]

The largest model is 80 layers deep, with a hidden dimension of 16,384 and 128 attention heads. Each attention head uses a key/value size of 128. The full model family scales these dimensions down together, sharing the same overall recipe but with smaller width, depth, and head count.^[1]

What sizes does the Gopher family include?

The Gopher paper trained six transformer language models with otherwise identical recipes, varying only the size. All six were trained on the same 300 billion tokens of MassiveText to keep the comparison clean.^[1]

Model	Parameters	Layers	Heads	Key/value size	Hidden dim
Gopher-44M	44 million	8	16	32	512
Gopher-117M	117 million	12	12	64	768
Gopher-417M	417 million	12	12	128	1,536
Gopher-1.4B	1.4 billion	24	16	128	2,048
Gopher-7.1B	7.1 billion	32	32	128	4,096
Gopher	280 billion	80	128	128	16,384

The naming convention got a little awkward in practice. "Gopher" without a suffix usually means the 280B model, while the smaller variants are referred to by parameter count. The 1.4B and 7.1B models were the workhorses for ablation studies in the paper, since training a single 280B run was prohibitively expensive to repeat.^[1]
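
The table's dimensions can be sanity-checked with a rough parameter estimate. The sketch below is a back-of-envelope calculation, not the paper's accounting. It assumes a feed-forward block four times the hidden width, one extra hidden-by-hidden projection per layer for the relative positional encoding, and no biases or normalization gains. The helper name `estimate_params` is hypothetical.

```python
VOCAB_SIZE = 32_000  # SentencePiece vocabulary size given in the article

# (name, layers, heads, key/value size, hidden dim), copied from the table above
FAMILY = [
    ("Gopher-44M", 8, 16, 32, 512),
    ("Gopher-117M", 12, 12, 64, 768),
    ("Gopher-417M", 12, 12, 128, 1536),
    ("Gopher-1.4B", 24, 16, 128, 2048),
    ("Gopher-7.1B", 32, 32, 128, 4096),
    ("Gopher", 80, 128, 128, 16384),
]

def estimate_params(layers, d_model, vocab=VOCAB_SIZE, blocks_per_layer=13):
    """Rough parameter count under the stated assumptions.

    Per layer: 4 * d^2 for the attention projections, 8 * d^2 for a 4x-wide
    feed-forward block, and (assumed) 1 * d^2 for a relative-position
    projection, giving 13 * d^2. Token embeddings add vocab * d.
    """
    return blocks_per_layer * layers * d_model ** 2 + vocab * d_model

for name, layers, heads, kv_size, d_model in FAMILY:
    # In every row of the table, heads * key/value size equals the hidden dim.
    assert heads * kv_size == d_model
    print(f"{name}: ~{estimate_params(layers, d_model) / 1e9:.3f}B parameters")
```

Under these assumptions the estimate lands within a few percent of each model's nominal size, which suggests the table is internally consistent. The exact split of parameters in the real models may differ.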

What is Gopher trained on?

Gopher was trained on MassiveText, an in-house dataset assembled by DeepMind specifically for the project. After filtering and deduplication, MassiveText totals roughly 10.5 TB of text and around 2.35 billion documents.^[1] The full corpus contains far more tokens than any of the Gopher models were trained on; the 300-billion-token training run sampled only about 12.8% of the available data.^[1]

The dataset combines six sources, with sampling proportions chosen empirically by training smaller models on different mixtures and picking the configuration that performed best across downstream evaluations.^[1]

Source	Tokens	Sampling proportion	Notes
MassiveWeb	~506 billion	48%	DeepMind's in-house web crawl, filtered for quality
Books	~560 billion	27%	Published written works
C4	~182 billion	10%	Common Crawl subset used for T5
News	~676 billion	10%	News article archives
GitHub	~422 billion	3%	Open source code
Wikipedia	~4 billion	2%	Encyclopedia text

MassiveWeb is DeepMind's own web crawl, filtered for quality and language. The pipeline strips out explicit content, performs document-level deduplication, removes low-quality pages, and filters out documents that overlap with the evaluation test sets to avoid contamination.^[1] Books is the second-largest source by raw token count, after News. The small Wikipedia slice is heavily upweighted: it holds about 0.17% of the raw tokens but 2% of the training mix, so it is sampled at roughly twelve times its raw token share.

The sampling weights are notable for what they de-emphasize. GitHub code makes up around 18% of total tokens but only 3% of the training mix, and News is cut from roughly 29% of raw tokens to 10%. C4, by contrast, is sampled slightly above its raw share of about 8%. The mixture was tuned for general-purpose language modeling rather than for code or for any specific downstream task.^[1]
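
The relationship between raw token counts and sampling proportions can be worked out directly from the table. The sketch below uses only the approximate figures listed there and the 300-billion-token run length. The helper name `mixture_stats` is illustrative, and the results inherit the rounding of the table.

```python
# (source, tokens in billions, sampling proportion), copied from the table above
MIX = [
    ("MassiveWeb", 506, 0.48),
    ("Books", 560, 0.27),
    ("C4", 182, 0.10),
    ("News", 676, 0.10),
    ("GitHub", 422, 0.03),
    ("Wikipedia", 4, 0.02),
]
TRAIN_TOKENS_B = 300  # length of the training run, in billions of tokens

TOTAL_B = sum(tokens for _, tokens, _ in MIX)

def mixture_stats(tokens_b, proportion):
    """Return (raw share, up/down-weighting factor, epochs over the subset)."""
    raw_share = tokens_b / TOTAL_B
    weight = proportion / raw_share                  # >1 means upweighted
    epochs = TRAIN_TOKENS_B * proportion / tokens_b  # passes over the subset
    return raw_share, weight, epochs

print(f"Total: ~{TOTAL_B}B tokens; the run covers {TRAIN_TOKENS_B / TOTAL_B:.1%}")
for name, tokens_b, proportion in MIX:
    raw_share, weight, epochs = mixture_stats(tokens_b, proportion)
    print(f"{name:<10} raw {raw_share:6.2%}  sampled {proportion:4.0%}  "
          f"weight x{weight:5.2f}  epochs {epochs:4.2f}")
```

On these figures, Wikipedia is the only source seen more than once during training (about 1.5 passes), while GitHub and News are sampled well below their raw share.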

How was Gopher trained?

DeepMind trained Gopher on 4,096 TPU v3 chips, organized as four pods of 1,024 chips each. The 280B run combined data and model parallelism within each pod with pipeline parallelism across pods; the paper reports that the within-pod parallelism incurred only about 10% overhead.^[1] The smaller family members fit on smaller slices of TPU.

The full 300-billion-token training run for the 280B model took several weeks of wall-clock time. The smaller models in the family were faster to train and were used to design the data mixture, set hyperparameters, and run ablation studies.^[1] The training cost has not been published as a dollar figure, but at TPU v3 cloud rates of the era a back-of-envelope estimate puts the 280B run in the range of tens of millions of dollars in compute.

How did Gopher perform on benchmarks?

The Gopher paper evaluated all six model sizes on a battery of 152 tasks. The benchmark suite included 62 tasks from BIG-bench, 57 from MMLU, plus a wide range of language modeling, reading comprehension, fact-checking, question answering, and common-sense reasoning evaluations. Out of the 124 tasks where a prior state-of-the-art result existed, Gopher beat the previous best on 100 of them.^[1]^[5]

The pattern of where Gopher won and where it lost was almost as interesting as the headline number. As the paper's abstract put it, "Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit."^[1] Gains were concentrated in knowledge-heavy tasks: reading comprehension, fact-checking, closed-book question answering, and topics that depend on memorized world knowledge such as humanities and social sciences. Logical and mathematical reasoning improved much less with scale; on some math benchmarks, the 280B model performed barely better than the 7.1B variant. The team's own write-up was blunt about this, noting that scale alone did not appear to be the answer to the reasoning gap.^[5]

A selection of representative results, drawn from the paper:^[1]

Benchmark	Gopher (280B)	Comparison result (model)
MMLU (57 subjects, average)	60.0%	~43.9% (GPT-3 5-shot)
LAMBADA (zero-shot)	74.5%	76.6% (Megatron-Turing NLG)
Winograd (WSC273)	83.2%	90.1% (PaLM later)
TriviaQA (zero-shot)	52.8%	28.5% (GPT-3)
Natural Questions (open)	21.0%	14.6% (GPT-3)
BIG-bench (subset)	best on majority of 62 tasks	varied by task

MMLU was the most-discussed result. Gopher's 60.0% average put it well ahead of GPT-3's 43.9% and made it the leading model on the benchmark at the time of publication; DeepMind noted the result "almost halves the accuracy gap from GPT-3 to human expert performance," estimated at roughly 89.8%.^[1] That lead held for only a few months before being eclipsed by Chinchilla, but it was a significant marker. On TriviaQA in zero-shot, Gopher nearly doubled GPT-3's accuracy, a result the DeepMind team attributed largely to the scale and quality of the MassiveText training data.^[1]

The failure modes were also catalogued openly. Gopher tended to repeat itself in long generations, exhibited the same kinds of stereotypical biases as other large models, and could state false information confidently. The companion ethical risks paper picked up many of these threads.^[3]

Companion work

RETRO (retrieval-augmented Gopher)

The RETRO paper, "Improving language models by retrieving from trillions of tokens" by Sebastian Borgeaud and colleagues, was released the same day as the main Gopher paper.^[2] RETRO stands for Retrieval-Enhanced Transformer. The model architecture combines a comparatively small decoder-only transformer (the largest variant has 7.5 billion parameters) with a frozen BERT retriever and a 2-trillion-token retrieval database.^[2]

During inference, RETRO splits the input into chunks, encodes each chunk with the BERT retriever, and finds the nearest neighbors in the retrieval database. Those retrieved passages are then fed into the transformer through a chunked cross-attention mechanism. The effect is that the model has access at inference time to a corpus far larger than what it was trained on.^[2]
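
The chunk-then-retrieve step described above can be sketched with a toy example. Everything here is a stand-in: the bag-of-words `embed` function replaces the frozen BERT encoder, the three-passage `database` replaces the 2-trillion-token store, and the chunk size of 4 is arbitrary. The chunked cross-attention that consumes the neighbors is not shown.

```python
import math
from collections import Counter

def chunk(tokens, size):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def embed(tokens):
    """Toy bag-of-words vector; a hypothetical stand-in for the frozen encoder."""
    return Counter(tokens)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query_chunk, database, k=2):
    """Return the k database passages most similar to the query chunk."""
    query = embed(query_chunk)
    ranked = sorted(database, key=lambda p: cosine(query, embed(p)), reverse=True)
    return ranked[:k]

# Hypothetical miniature retrieval database.
database = [
    "the gopher model has 280 billion parameters".split(),
    "chinchilla was trained on more tokens".split(),
    "tpu pods contain many chips".split(),
]

prompt = "how many parameters does the gopher model have".split()
for piece in chunk(prompt, 4):
    neighbors = retrieve(piece, database, k=1)
    print(" ".join(piece), "->", " ".join(neighbors[0]))
```

A production system would replace the linear scan with an approximate nearest-neighbor index. The brute-force sort here only works because the toy database is tiny.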

The headline claim was that the 7.5B-parameter RETRO "obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25x fewer parameters."^[2] The team also showed that retrieval helps even when the retrieval database mostly overlaps with the training corpus, suggesting that the explicit retrieval mechanism allows the model to look up rare facts on demand rather than memorizing everything in its weights.

RETRO was an early and unusually well-engineered example of retrieval-augmented language modeling. Its architectural choices, especially the chunked cross-attention design, influenced subsequent work in retrieval-augmented generation, though the specific RETRO recipe was not widely adopted at scale.

Ethical and social risks of harm

The third companion paper, "Ethical and social risks of harm from Language Models" by Laura Weidinger and colleagues, did not introduce a new model.^[3] It was a survey and taxonomy, intended as a reference document for researchers and policy teams trying to reason about what could go wrong with systems like Gopher.

The paper organized risks into six categories:^[3]

Category	Examples
Discrimination, exclusion, and toxicity	Stereotyping, unfair bias, offensive output, disparate performance across groups
Information hazards	Leaking private data, inferring sensitive information, helping people find harmful instructions
Misinformation harms	Confidently false output, eroding trust in shared sources
Malicious uses	Targeted manipulation, scaled disinformation, helping with cyberattacks
Human-computer interaction harms	Unsafe usage, manipulation of users, anthropomorphism, over-reliance
Automation, access, and environmental harms	Labor displacement, unequal access, training compute footprint

The paper analyzed 21 specific risks within these categories and laid out what mitigations existed and where current evaluation methods fell short.^[3] It was not framed as a release-decision document, but in practice it set part of the rationale for why Gopher itself was kept behind closed doors. The taxonomy went on to be widely cited in subsequent responsible-AI literature and was extended in later DeepMind publications.

How does Gopher differ from Chinchilla?

In March 2022, just three months after the Gopher papers, a different DeepMind team led by Jordan Hoffmann published "Training Compute-Optimal Large Language Models" (arXiv:2203.15556).^[4] Their work, which used Gopher as the central reference point, argued that Gopher and other large models of the era had been trained badly, not because of any flaw in their architectures but because the field had been misallocating compute.

Hoffmann et al. trained more than 400 transformer language models at sizes from 70 million to 16 billion parameters and on token counts from 5 billion to 500 billion. They used three independent methodologies to estimate, for any given compute budget, the model size and training token count that minimize loss. All three methods agreed: the right answer was to train smaller models on much more data than the field had been doing.^[4]

Applied to the compute budget that had produced Gopher (roughly 5.76 x 10^23 FLOPs), the new analysis predicted an optimal model of around 63 to 67 billion parameters trained on 1.4 to 1.5 trillion tokens. That is roughly four times smaller and four times more data than Gopher, which had used 280 billion parameters and 300 billion tokens. The implied tokens-to-parameters ratio was about 20, compared to roughly 1.07 for Gopher.^[4]

To verify the prediction, the team trained Chinchilla, a 70-billion-parameter model on 1.4 trillion tokens of MassiveText, using the same compute budget Gopher had consumed. Chinchilla outperformed Gopher across nearly every benchmark, often by significant margins. On MMLU, Chinchilla reached 67.5% average accuracy, 7.5 percentage points above Gopher. On reasoning, knowledge, and commonsense tasks, the smaller model was consistently better.^[4]
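
The arithmetic behind these figures can be approximated in a few lines. The sketch below assumes the common rule of thumb that training compute is about 6 FLOPs per parameter per token (C ≈ 6·N·D). That rule is an approximation introduced here for illustration, not the paper's exact FLOP accounting. Combined with a fixed tokens-per-parameter ratio, it gives a closed-form model size. The function name `compute_optimal` is illustrative.

```python
import math

def compute_optimal(flops, tokens_per_param=20.0):
    """Solve C = 6 * N * D with D = r * N for N (parameters) and D (tokens).

    Assumes the 6*N*D approximation for training FLOPs; r is the target
    tokens-per-parameter ratio.
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

GOPHER_FLOPS = 5.76e23           # compute budget quoted in the article
GOPHER_PARAMS, GOPHER_TOKENS = 280e9, 300e9

n, d = compute_optimal(GOPHER_FLOPS)
print(f"Compute-optimal at ratio 20: ~{n / 1e9:.0f}B parameters, "
      f"~{d / 1e12:.2f}T tokens")
print(f"Gopher tokens per parameter: {GOPHER_TOKENS / GOPHER_PARAMS:.2f}")
```

With a ratio of 20, this rough calculation gives a model of about 69 billion parameters trained on about 1.4 trillion tokens, close to the configuration used for Chinchilla. Gopher's own ratio comes out at about 1.07.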

The implication was uncomfortable for the broader field. Gopher had not failed because the architecture was wrong or the data was bad; it had failed (in the sense of being beaten by a four-times-smaller model on the same compute) because the training recipe was off. The same was true, by extension, of GPT-3, Megatron-Turing NLG, and Jurassic-1, all of which had been trained with similarly skewed parameter-to-token ratios.^[4]

The principles derived from this work became known as the Chinchilla scaling laws, or compute optimal scaling, and they reshaped how almost everyone trained large language models afterward. The general guideline (model size and training tokens should scale roughly equally with compute, at a ratio of about 20 tokens per parameter) became the default for designs like Meta's LLaMA family and many subsequent open and closed models.^[4]

Was Gopher ever released?

Gopher itself had a short half-life as a frontier model. By mid-2022 it had been outperformed by Chinchilla (DeepMind's own follow-up), PaLM (Google Brain's 540B model), and other systems. The 280B weights were never released, the model was never offered as a public API, and direct access was limited to DeepMind researchers and selected collaborators. As a deployed system, Gopher's footprint outside DeepMind was effectively zero.

Its impact on the field came through the research output rather than the model itself. The Gopher paper provided the most detailed published analysis at that time of how transformer language models scale across more than three orders of magnitude.^[1] The MMLU results helped establish that benchmark as the standard knowledge-and-reasoning test for large language models. The accompanying taxonomy of ethical risks became a reference point for responsible-AI work, and the RETRO paper seeded a long line of retrieval-augmented language model research.^[2]^[3]

The most consequential legacy is that Gopher was the model Chinchilla measured itself against. Without the 280B Gopher run as a concrete reference point, the Chinchilla scaling laws would have been a much harder argument to make. The two models together produced the most influential single-year revision of training practice in the history of large language models.

The lineage continues through DeepMind's later work. After the 2023 reorganization that combined DeepMind and Google Brain into Google DeepMind, much of the team that had built Gopher and Chinchilla moved on to the Gemini project. Gemini's training recipes inherit the compute-optimal scaling philosophy that came directly out of comparing Chinchilla to Gopher. The path from Gopher to Chinchilla to Gemini is one of the cleaner research-to-product lineages in modern AI.

Gopher itself, meanwhile, sits in a strange position in retrospective accounts. It was a serious effort that pushed the state of the art on dozens of benchmarks, was documented in unusual depth, and produced two important companion papers. It was also the canonical example, three months later, of how not to train a large language model. Both things are true.

References

1. Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., et al. (2021). "Scaling Language Models: Methods, Analysis & Insights from Training Gopher." arXiv:2112.11446. https://arxiv.org/abs/2112.11446
2. Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., van den Driessche, G., Lespiau, J.-B., Damoc, B., Clark, A., et al. (2021). "Improving language models by retrieving from trillions of tokens." arXiv:2112.04426. https://arxiv.org/abs/2112.04426
3. Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.-S., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A., et al. (2021). "Ethical and social risks of harm from Language Models." arXiv:2112.04359. https://arxiv.org/abs/2112.04359
4. Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., et al. (2022). "Training Compute-Optimal Large Language Models." arXiv:2203.15556. https://arxiv.org/abs/2203.15556
5. DeepMind (2021). "Language modelling at scale: Gopher, ethical considerations, and retrieval." DeepMind Blog, December 8, 2021. https://deepmind.google/blog/language-modelling-at-scale-gopher-ethical-considerations-and-retrieval/
6. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). "Scaling Laws for Neural Language Models." arXiv:2001.08361. https://arxiv.org/abs/2001.08361
7. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). "Language Models are Few-Shot Learners." Advances in Neural Information Processing Systems, 33 (NeurIPS 2020). arXiv:2005.14165.
8. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). "Measuring Massive Multitask Language Understanding." Proceedings of the International Conference on Learning Representations (ICLR 2021). arXiv:2009.03300.
9. Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., & Salakhutdinov, R. (2019). "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context." Proceedings of ACL 2019. arXiv:1901.02860.
10. Zhang, B., & Sennrich, R. (2019). "Root Mean Square Layer Normalization." Advances in Neural Information Processing Systems, 32 (NeurIPS 2019). arXiv:1910.07467.
11. Epoch AI. "Gopher (280B)." Notable AI Models database. Accessed 2026.
12. InfoQ (2022). "Google Trains 280 Billion Parameter AI Language Model Gopher." January 2022.
