Atlas (language model)
Last reviewed
Jun 3, 2026
Sources
4 citations
Review status
Source-backed
Revision
v1 · 1,723 words
Atlas is a retrieval-augmented language model developed by researchers at Meta AI (the group then known as Facebook AI Research, or FAIR). It was introduced in the paper "Atlas: Few-shot Learning with Retrieval Augmented Language Models," first posted to arXiv in August 2022 and later published in the Journal of Machine Learning Research in 2023 [1][2]. The model was built to test a specific hypothesis: that the knowledge a language model needs for tasks like question answering and fact checking does not have to be memorized in the model's weights, and can instead be supplied at inference time by a retrieval component reading from an external corpus. Under this design a model with relatively few parameters can match or beat far larger models that store everything internally, and its knowledge can be edited by swapping the document index rather than retraining [1].
Atlas should not be confused with the humanoid robot of the same name from Boston Dynamics, nor with other unrelated systems called "Atlas." This article concerns only the language model.
Few-shot learning, in which a model adapts to a new task from a handful of examples, became a prominent capability of large language models such as GPT-3, Gopher, Chinchilla, and PaLM. For tasks where success depends on world knowledge, the prevailing approach was to scale the parameter count so the model could memorize more facts during pretraining [1]. Retrieval-augmented models, including REALM, RAG, and RETRO, had already shown that an external memory could improve knowledge-intensive tasks, but it was not clear whether such models could be strong few-shot learners. The authors of Atlas set out to close that gap, arguing that memorization can be decoupled from generalization: a neural retriever handles the knowledge, while a smaller language model handles reasoning and generation [1].
Atlas has two components that are pretrained and fine-tuned together [1].
The retriever is based on Contriever, a dense information-retrieval method built on continuous embeddings [1][3]. It uses a dual-encoder architecture with a transformer encoder (BERT base) that embeds the query and each document independently; average pooling over the final layer produces one vector per query or document, and relevance is the dot product between the two embeddings [1]. The base Contriever model is pretrained with the MoCo contrastive loss on unlabeled data only, which means both encoders can be trained without any human-annotated query-document pairs [1].
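The scoring rule described above can be illustrated with a minimal sketch. It is not the Contriever implementation: the token vectors are hand-written stand-ins for encoder outputs, and the padding mask and the helper names (`mean_pool`, `rank`) are assumptions made for the example.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(token_states: Sequence[Sequence[float]], mask: Sequence[int]) -> Vector:
    """Average final-layer token vectors, skipping padded positions (mask == 0)."""
    kept = [vec for vec, m in zip(token_states, mask) if m]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Relevance score: dot product of a query and a document embedding."""
    return sum(x * y for x, y in zip(a, b))


def rank(query_vec: Vector, doc_vecs: Sequence[Vector], k: int) -> List[int]:
    """Return the indices of the k highest-scoring documents (ties: lower index first)."""
    scored = sorted(
        range(len(doc_vecs)),
        key=lambda i: (-dot(query_vec, doc_vecs[i]), i),
    )
    return scored[:k]


# Toy "encoder outputs": rows are tokens, columns are hidden dimensions.
query_states = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]  # last row is padding
query_mask = [1, 1, 0]
query_vec = mean_pool(query_states, query_mask)

# Document embeddings are computed independently of the query (dual encoder).
doc_vecs = [[0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
top = rank(query_vec, doc_vecs, k=2)
print(query_vec, top)
```

Because documents are embedded without reference to the query, their vectors can be computed once and stored in an index, leaving only the query to be encoded at search time.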
The reader, called the language model in the paper, is a sequence-to-sequence model based on the T5 architecture, specifically the T5 1.1 "lm-adapt" variants [1][4]. It uses the Fusion-in-Decoder scheme: each retrieved document is concatenated with the query and processed independently in the encoder, the encoder outputs for all documents are concatenated, and the decoder attends jointly over that combined sequence to generate the answer [1]. Processing documents separately in the encoder avoids the quadratic cost that a single long concatenated input would incur, so the reader scales to many retrieved passages.
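The data flow of this scheme, and the cost argument behind it, can be sketched as follows. The encoder here is a toy stand-in rather than T5, whitespace splitting replaces real tokenization, and the figure of 512 tokens per passage is an assumption chosen only to make the arithmetic concrete.

```python
from typing import List


def toy_encode(tokens: List[str]) -> List[List[float]]:
    """Stand-in for the reader's encoder: one small vector per input token."""
    return [[float(len(tok)), float(pos)] for pos, tok in enumerate(tokens)]


def fusion_in_decoder_memory(query: str, passages: List[str]) -> List[List[float]]:
    """Encode each (query, passage) pair separately, then concatenate the outputs.

    The decoder would cross-attend over the returned sequence as a whole.
    """
    memory: List[List[float]] = []
    for passage in passages:
        tokens = (query + " " + passage).split()
        memory.extend(toy_encode(tokens))  # this pass never sees other passages
    return memory


def encoder_attention_pairs(num_passages: int, tokens_per_passage: int, separate: bool) -> int:
    """Count the token pairs scored by encoder self-attention."""
    if separate:
        return num_passages * tokens_per_passage ** 2
    return (num_passages * tokens_per_passage) ** 2


query = "who wrote hamlet"
passages = [
    "hamlet is a tragedy by shakespeare",
    "the globe theatre opened in 1599",
]
memory = fusion_in_decoder_memory(query, passages)

# Assumed sizes: 20 passages of 512 tokens each.
separate_cost = encoder_attention_pairs(20, 512, separate=True)
joint_cost = encoder_attention_pairs(20, 512, separate=False)
print(len(memory), joint_cost // separate_cost)
```

Under these assumptions, encoding passages separately scores fewer token pairs than one long input by a factor equal to the number of passages, so encoder cost grows linearly rather than quadratically with the number of passages.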
The whole system follows a text-to-text framework: every task, whether question answering, fact checking, or entity linking, is cast as mapping a text query to a text output [1].
Atlas was released in four sizes, defined by the size of the T5-based reader. The retriever (Contriever, BERT base, about 110M parameters) is shared across all of them [1][4].
| Variant | Reader parameters | Retriever parameters |
|---|---|---|
| Atlas-base | 220M | ~110M |
| Atlas-large | 770M | ~110M |
| Atlas-XL | 3B | ~110M |
| Atlas-XXL | 11B | ~110M |
The headline results are reported for the 11B configuration, usually referred to as Atlas-11B [1][4].
A central finding of the paper is that jointly pretraining the retriever and reader, rather than fixing a pretrained retriever, is important for few-shot performance [1]. The authors evaluated four ways to use the language model's own signal to train the retriever, so that no document annotations are needed: Attention Distillation (ADist), EMDR2, Likelihood Distillation, and LOOL, a leave-one-out variant [1].
The four objectives gave broadly similar downstream results; the authors adopted Likelihood Distillation for the main models because it was more stable than EMDR2 or ADist and cheaper than LOOL [1]. For the self-supervised pretext task used during joint pretraining they compared prefix language modeling, masked language modeling, and a title-to-section generation task, and settled on masked language modeling [1].
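A distillation objective of this kind can be sketched as follows. The sketch assumes the loss is a KL divergence between a target distribution, obtained by applying a softmax to the reader's log-likelihood of the correct output given each retrieved document, and the retriever's own softmax over its relevance scores. The temperature parameter and all function names are illustrative assumptions, not taken from the released code.

```python
import math
from typing import List, Sequence


def softmax(values: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    retriever_scores: Sequence[float],
    reader_log_likelihoods: Sequence[float],
    temperature: float = 1.0,
) -> float:
    """KL(target || retriever) over the retrieved documents.

    target[k]    is proportional to exp(log p_reader(answer | query, doc_k))
    retriever[k] is proportional to exp(score(query, doc_k) / temperature)
    In training the target would be treated as a constant (no gradient).
    """
    target = softmax(reader_log_likelihoods)
    predicted = softmax(retriever_scores, temperature)
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0.0)


# The reader finds the third document most helpful.
reader_ll = [1.0, 2.0, 3.0]
aligned_loss = distillation_loss([1.0, 2.0, 3.0], reader_ll)      # retriever agrees
misaligned_loss = distillation_loss([3.0, 2.0, 1.0], reader_ll)   # retriever disagrees
print(aligned_loss, misaligned_loss)
```

The loss is zero when the retriever's ranking distribution matches the reader-derived target and grows as the two diverge, which is what lets the reader supervise the retriever without relevance labels.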
The pretraining and fine-tuning corpus, which also serves as the retrieval index, combined a December 2021 Wikipedia dump (about 37 million passages, averaging 78 words each, with lists and infoboxes linearized) and the 2020-10 Common Crawl dump processed with the CCNet pipeline (about 350 million passages) [1]. During pretraining the passage being denoised is excluded from its own retrieval results so the model cannot trivially copy the answer. The final models were pretrained for 10,000 iterations with AdamW, retrieving 20 documents per step [1]. To keep the index from going stale as the retriever updates, the authors used strategies including periodic full index refresh, retrieve-then-rerank, and query-side fine-tuning (training only the query encoder so the document embeddings stay fixed) [1].
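Of the strategies listed for coping with a stale index, retrieve-then-rerank is the simplest to illustrate. The sketch below assumes a two-stage procedure: fetch a larger candidate pool using the stored (possibly outdated) embeddings, then re-embed only that pool with the current document encoder and keep the best few. The identifiers, the pool size, and the lookup-table "encoder" are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve_then_rerank(
    query_vec: Vector,
    stale_index: Dict[str, Vector],
    texts: Dict[str, str],
    encode_doc: Callable[[str], Vector],
    pool_size: int,
    k: int,
) -> List[str]:
    """Return the ids of the top-k documents after re-scoring a candidate pool."""
    # Stage 1: cheap lookup against embeddings that may be out of date.
    pool = sorted(stale_index, key=lambda d: (-dot(query_vec, stale_index[d]), d))[:pool_size]
    # Stage 2: re-embed only the pool with the current encoder and re-score.
    fresh = {d: encode_doc(texts[d]) for d in pool}
    return sorted(pool, key=lambda d: (-dot(query_vec, fresh[d]), d))[:k]


texts = {"a": "passage a", "b": "passage b", "c": "passage c"}
stale_index = {"a": [1.0, 0.0], "b": [0.9, 0.0], "c": [0.0, 1.0]}

# Hypothetical current encoder: its view of the documents has drifted from the index.
current_embeddings = {"passage a": [0.2, 0.0], "passage b": [1.0, 0.0], "passage c": [0.0, 1.0]}


def encode_doc(text: str) -> Vector:
    return current_embeddings[text]


result = retrieve_then_rerank([1.0, 0.0], stale_index, texts, encode_doc, pool_size=2, k=1)
print(result)
```

Re-embedding a small pool costs far less than recomputing the whole index, at the price of missing any document the stale embeddings fail to place in the pool.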
Atlas was evaluated in both few-shot settings (commonly 64 training examples) and full-dataset fine-tuning. Its central claim is parameter efficiency: with 11B parameters it reached 42.4% accuracy on Natural Questions using 64 examples (and 45.1% with a Wikipedia-only index), outperforming the 540B-parameter PaLM by roughly 3 points despite having about 50 times fewer parameters [1].
On the open-domain versions of Natural Questions and TriviaQA, evaluated by exact match, Atlas set state-of-the-art results in the 64-shot setting and on the full training set [1].
| Benchmark | Atlas-11B (64-shot) | Atlas-11B (full) | Prior comparison |
|---|---|---|---|
| Natural Questions | 42.4 | 60.4 | GPT-3 29.9, PaLM 39.6 (64-shot); R2-D2 55.9 (full) |
| TriviaQA (filtered) | 74.5 | 79.8 | Chinchilla 64.6 (64-shot) |
| TriviaQA (unfiltered) | 84.7 | 89.4 | PaLM 81.4 (64-shot) |
On the full Natural Questions training set, Atlas improved the previous best exact match from 55.9% to 60.4% [1].
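Exact match, the metric used in these comparisons, counts a prediction as correct only if it equals one of the reference answers after normalization. The sketch below assumes the normalization conventionally used for open-domain question answering (lowercasing, removing punctuation and English articles, collapsing whitespace); the exact rules used in the paper's evaluation are not stated here and may differ.

```python
import re
import string
from typing import List, Sequence

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: Sequence[str]) -> bool:
    """True if the normalized prediction equals any normalized reference."""
    normalized = normalize_answer(prediction)
    return any(normalized == normalize_answer(ref) for ref in references)


def exact_match_score(predictions: Sequence[str], references: Sequence[Sequence[str]]) -> float:
    """Percentage of questions answered with an exact match."""
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * hits / len(predictions)


predictions = ["The Eiffel Tower.", "Paris, France"]
references: List[List[str]] = [["Eiffel Tower"], ["Paris"]]
score = exact_match_score(predictions, references)
print(score)
```

The metric gives no partial credit: in the toy data the second prediction contains the reference answer but still counts as wrong.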
On the three-class FEVER fact-checking task, Atlas reached 64.3% accuracy with 64 examples. In a 15-shot setting (5 examples per class) it scored 56.2%, beating Gopher by 5.1 points. With full fine-tuning it reached 78.0%, rising to 80.1% when given an index built from the FEVER Wikipedia corpus, which set a new state of the art [1].
On MMLU, a 57-domain multiple-choice benchmark, Atlas-11B with de-biased inference scored 47.9% in the 5-shot setting, ahead of GPT-3 (43.9%) while using about 15 times fewer parameters and roughly 10 times less pretraining compute [1]. With multitask 5-shot training it reached 56.6%, and with additional auxiliary training data it reached 66.0%, approaching the strongest reported systems [1].
| MMLU setting | Atlas-11B | GPT-3 (175B) | Chinchilla (70B) |
|---|---|---|---|
| Zero-shot | 47.1 | N/A | N/A |
| 5-shot | 47.9 | 43.9 | 67.5 |
| 5-shot multitask | 56.6 | N/A | N/A |
| Full / transfer | 66.0 | 53.9 | N/A |
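The term "de-biased inference" is not defined above. One way to remove a model's preference for a particular answer letter, assumed here for illustration, is to present the options under every cyclic permutation, credit each probability back to the original option, and pick the option with the highest total. The scorer below is a hypothetical stand-in for a model that favours the first position.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[str, Sequence[str]], List[float]]


def debiased_choice(question: str, options: Sequence[str], score_positions: Scorer) -> int:
    """Return the index of the option with the highest probability summed over rotations."""
    n = len(options)
    totals = [0.0] * n
    for shift in range(n):
        # Position j shows the option whose original index is (j + shift) % n.
        shown = [options[(j + shift) % n] for j in range(n)]
        probs = score_positions(question, shown)
        for j, p in enumerate(probs):
            totals[(j + shift) % n] += p
    return max(range(n), key=lambda i: totals[i])


def biased_scorer(question: str, shown: Sequence[str]) -> List[float]:
    """Hypothetical model: strongly prefers position 0, mildly prefers the right answer."""
    raw = [0.4 if j == 0 else 0.1 for j in range(len(shown))]
    raw = [r + (0.2 if opt == "Paris" else 0.0) for r, opt in zip(raw, shown)]
    total = sum(raw)
    return [r / total for r in raw]


question = "What is the capital of France?"
options = ["London", "Berlin", "Paris", "Rome"]

single_pass = biased_scorer(question, options)
naive = max(range(len(options)), key=lambda i: single_pass[i])
debiased = debiased_choice(question, options, biased_scorer)
print(options[naive], options[debiased])
```

Each option occupies each position exactly once across the rotations, so a position preference contributes equally to every option and cancels out of the comparison, at the cost of one forward pass per option.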
On the KILT benchmark, which bundles 11 datasets across five knowledge-intensive task types, Atlas in the 64-shot setting was competitive with several fully fine-tuned leaderboard entries. After full fine-tuning it set new state-of-the-art results on five KILT datasets, including AIDA CoNLL-YAGO entity linking (90.6 accuracy), FEVER (93.5), Natural Questions (61.3 exact match), and HotpotQA (50.6 exact match) [1].
Because Atlas stores knowledge in an external index, that knowledge can be changed after training. The authors tested this with TempLAMA, a set of time-sensitive cloze questions whose answers change between 2017 and 2020 [1]. After fine-tuning on 2017 answers with a 2017 Wikipedia index, swapping in a 2020 index (without any retraining) raised 2020 accuracy to 53.1%, close to the model's 2017 performance, while a closed-book T5 had no comparable mechanism to update its facts [1].
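The mechanism can be illustrated with a toy system in which the "model" is held fixed and only the index changes. Everything in the sketch is hypothetical: the word-overlap retriever, the reader that copies the last word of the top passage, and the invented club and names stand in for the trained components and real TempLAMA data.

```python
from typing import Dict, List

Index = Dict[str, str]  # passage id -> passage text


def retrieve(query: str, index: Index, k: int = 1) -> List[str]:
    """Toy retriever: rank passages by word overlap with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        index,
        key=lambda pid: (-len(query_words & set(index[pid].lower().split())), pid),
    )
    return [index[pid] for pid in ranked[:k]]


def toy_reader(query: str, passages: List[str]) -> str:
    """Stand-in for the trained reader: copies the last word of the top passage."""
    return passages[0].rstrip(".").split()[-1]


def answer(query: str, index: Index) -> str:
    """Same retriever and reader every time; only the index argument varies."""
    return toy_reader(query, retrieve(query, index))


index_2017: Index = {
    "p1": "The head coach of Example FC is Alvarez.",
    "p2": "Example FC plays in a blue kit.",
}
index_2020: Index = {
    "p1": "The head coach of Example FC is Baker.",
    "p2": "Example FC plays in a blue kit.",
}

query = "Who is the head coach of Example FC"
answer_2017 = answer(query, index_2017)
answer_2020 = answer(query, index_2020)
print(answer_2017, answer_2020)
```

No function or parameter changes between the two calls. The updated answer comes entirely from the replaced passage, which is the property a closed-book model lacks.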
The paper also examined interpretability and data leakage. Inspecting retrieved passages showed that for the MMLU multitask model about 85% of retrieved passages came from Common Crawl and roughly 15% from Wikipedia, and that accuracy rose with how often the correct answer appeared in retrieved text [1]. The authors estimated MMLU leakage in their Common Crawl corpus at about 2.8% of questions, and noted that filtering out potentially leaked passages reduced the MMLU score only slightly, from 56.4% to 55.8% [1]. They also showed that compressing the index with product quantization gave comparable accuracy while cutting memory roughly fivefold [1].
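Product quantization compresses each embedding by splitting it into sub-vectors and storing, for each one, only the index of its nearest centroid in a small codebook. The sketch below shows encoding, decoding, and the resulting storage ratio. The codebooks are hand-written rather than learned, and the sizes are illustrative assumptions that are not meant to reproduce the memory saving reported above.

```python
from typing import List, Sequence

Vector = List[float]
Codebook = List[Vector]  # the centroids for one sub-space


def split(vec: Sequence[float], num_subvectors: int) -> List[Vector]:
    """Cut a vector into equal-length contiguous sub-vectors."""
    if len(vec) % num_subvectors:
        raise ValueError("dimension must be divisible by the number of sub-vectors")
    size = len(vec) // num_subvectors
    return [list(vec[i * size:(i + 1) * size]) for i in range(num_subvectors)]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vec: Sequence[float], codebooks: Sequence[Codebook]) -> List[int]:
    """Replace each sub-vector by the index of its nearest centroid."""
    codes = []
    for sub, book in zip(split(vec, len(codebooks)), codebooks):
        codes.append(min(range(len(book)), key=lambda c: sq_dist(sub, book[c])))
    return codes


def pq_decode(codes: Sequence[int], codebooks: Sequence[Codebook]) -> Vector:
    """Approximate the original vector by concatenating the chosen centroids."""
    out: Vector = []
    for code, book in zip(codes, codebooks):
        out.extend(book[code])
    return out


def compression_ratio(dim: int, num_subvectors: int,
                      bytes_per_float: int = 4, bytes_per_code: int = 1) -> float:
    """Raw storage divided by coded storage, ignoring the codebooks themselves."""
    return (dim * bytes_per_float) / (num_subvectors * bytes_per_code)


codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],  # sub-space 0
    [[0.0, 1.0], [1.0, 0.0]],  # sub-space 1
]
vector = [0.9, 1.2, 0.1, 0.8]
codes = pq_encode(vector, codebooks)
approx = pq_decode(codes, codebooks)
print(codes, approx, compression_ratio(dim=4, num_subvectors=2))
```

The decoded vector only approximates the original, so retrieval scores computed from codes are approximate as well. The trade-off between memory and accuracy is set by the number of sub-vectors and the codebook size.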
The paper lists ten authors: Gautier Izacard and Patrick Lewis (joint first authors), Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave [1][2]. The work was done at Meta AI; by the time of the JMLR version several authors had moved on, with Izacard listed at Inflection AI and Lewis at Cohere, and affiliations also spanning ENS, PSL University and Inria, and University College London [2]. The JMLR edition appeared in Volume 24 (2023), edited by Ivan Titov [2].
Code and pretrained Atlas checkpoints were released on GitHub under the facebookresearch organization, with the code under a CC-BY-NC license; the repository is no longer actively maintained [4]. Atlas is frequently cited as an example of retrieval-augmented generation applied to few-shot, knowledge-intensive tasks, and as evidence that retrieval can substitute for raw parameter scale on such tasks [1].