Atlas (language model)

Updated Jul 16, 2026

Atlas is a retrieval-augmented language model developed by researchers at Meta AI (the group formerly known as Facebook AI Research, or FAIR). It was introduced in the paper "Atlas: Few-shot Learning with Retrieval Augmented Language Models," first posted to arXiv in August 2022 and published in the Journal of Machine Learning Research in 2023 ^[1]^[2]. The model was built to test a specific hypothesis: that the knowledge a language model needs for tasks like question answering and fact checking does not have to be memorized in the model's weights, and can instead be supplied at inference time by a retrieval component reading from an external corpus. Under this design, a model with relatively few parameters can match or beat far larger models that store everything internally, and its knowledge can be edited by swapping the document index rather than retraining ^[1].

Atlas should not be confused with the humanoid robot of the same name from Boston Dynamics, nor with other unrelated systems called "Atlas." This article concerns only the language model.

Background

Few-shot learning, in which a model adapts to a new task from a handful of examples, became a prominent capability of large language models such as GPT-3, Gopher, Chinchilla, and PaLM. For tasks where success depends on world knowledge, the prevailing approach was to scale the parameter count so the model could memorize more facts during pretraining ^[1]. Retrieval-augmented models, including REALM, RAG, and RETRO, had already shown that an external memory could improve knowledge-intensive tasks, but it was not clear whether such models could be strong few-shot learners. The authors of Atlas set out to close that gap, arguing that memorization can be decoupled from generalization: a neural retriever handles the knowledge, while a smaller language model handles reasoning and generation ^[1].

Architecture

Atlas has two components that are pretrained and fine-tuned together ^[1].

The retriever is based on Contriever, a dense information-retrieval method built on continuous embeddings ^[1]^[3]. It uses a dual-encoder architecture in which a transformer encoder (BERT base) embeds the query and each document independently; average pooling over the final layer's outputs produces one vector per query or document, and relevance is the dot product between the two embeddings ^[1]. The base Contriever model is pretrained with the MoCo contrastive loss on unlabeled data only, so the retriever can be trained without any human-annotated query-document pairs ^[1].
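The scoring rule described above (average pooling, then a dot product) can be illustrated with a minimal sketch. The token vectors here are toy values standing in for the encoder's final-layer outputs, and the function names are hypothetical; this is not the Contriever implementation.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into a single vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_documents(query_tokens, docs_tokens):
    """Score each document against the query and return indices, best first."""
    q = mean_pool(query_tokens)
    scores = [dot(q, mean_pool(d)) for d in docs_tokens]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy "final-layer" token vectors (assumed values, 2 dimensions for readability).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]          # pools to [0.5, 0.5]
docs_tokens = [
    [[2.0, 2.0]],                                # pools to [2.0, 2.0]
    [[-1.0, 0.0], [1.0, 0.0]],                   # pools to [0.0, 0.0]
    [[0.0, 4.0], [0.0, 0.0]],                    # pools to [0.0, 2.0]
]
order, scores = rank_documents(query_tokens, docs_tokens)
```

Because query and document vectors are computed independently, document embeddings can be precomputed and stored in an index, leaving only the query to be encoded at search time.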

The reader, called the language model in the paper, is a sequence-to-sequence model based on the T5 architecture, specifically the T5 1.1 "lm-adapt" variants ^[1]^[4]. It uses the Fusion-in-Decoder scheme: each retrieved document is concatenated with the query and processed independently in the encoder, the encoder outputs for all documents are concatenated, and the decoder attends jointly over that combined sequence to generate the answer ^[1]. Because encoder self-attention runs over each query-document pair separately, encoder cost grows linearly with the number of retrieved passages rather than quadratically, as it would for a single long concatenated input, so the reader scales to many retrieved passages.
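The cost argument can be made concrete with a small sketch. It counts token pairs as a rough proxy for self-attention cost and uses whitespace word counts in place of real tokenization; the input format and helper names are illustrative assumptions, not taken from the Atlas code.

```python
def build_encoder_inputs(query, passages):
    """One encoder input per passage: the query followed by that passage.
    The exact template is an assumption made for illustration."""
    return [f"{query} {passage}" for passage in passages]


def self_attention_pairs(lengths):
    """Token-pair count when self-attention runs separately over each sequence."""
    return sum(n * n for n in lengths)


query = "who wrote hamlet"
passages = [
    "hamlet is a tragedy by william shakespeare",
    "shakespeare was born in stratford",
]

inputs = build_encoder_inputs(query, passages)
lengths = [len(text.split()) for text in inputs]     # word counts as a token proxy

separate = self_attention_pairs(lengths)             # Fusion-in-Decoder style encoding
joint = self_attention_pairs([sum(lengths)])         # one long concatenated input
decoder_context = sum(lengths)                       # decoder attends over all outputs
```

With n passages of equal length L, the separate encoding costs about n·L² pairs against (n·L)² for the concatenated input, a factor of n fewer.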

The whole system follows a text-to-text framework: every task, whether question answering, fact checking, or entity linking, is cast as mapping a text query to a text output ^[1].

Model sizes

Atlas was released in four sizes, defined by the size of the T5-based reader. The retriever (Contriever, BERT base, about 110M parameters) is shared across all of them ^[1]^[4].

Variant	Reader parameters	Retriever parameters
Atlas-base	220M	~110M
Atlas-large	770M	~110M
Atlas-XL	3B	~110M
Atlas-XXL	11B	~110M

The headline results are reported for the 11B configuration, usually referred to as Atlas-11B ^[1]^[4].

Training

A central finding of the paper is that jointly pretraining the retriever and reader, rather than fixing a pretrained retriever, is important for few-shot performance ^[1]. The authors evaluated several ways to use the language model's own signal to train the retriever, so that no document annotations are needed:

Attention Distillation (ADist), which distills the reader's cross-attention scores over documents into the retriever by minimizing a KL divergence.
End-to-end training of Multi-Document Reader and Retriever (EMDR2), an expectation-maximization style objective that treats retrieved documents as latent variables.
Perplexity Distillation (PDist), which trains the retriever to predict how much each document improves the reader's likelihood of producing the correct output (equivalently, how much it lowers the reader's perplexity).
Leave-one-out Perplexity Distillation (LOOP), which scores a document by how much the reader's prediction degrades when that document is removed ^[1].

The four objectives gave broadly similar downstream results; the authors adopted Perplexity Distillation for the main models because it was more stable than EMDR2 or ADist and cheaper than LOOP ^[1]. For the self-supervised pretext task used during joint pretraining they compared prefix language modeling, masked language modeling, and a title-to-section generation task, and settled on masked language modeling ^[1].
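The likelihood-based distillation objective adopted for the main models can be sketched as a KL divergence between two distributions over the retrieved documents: a target derived from the reader's log-likelihood of the correct output given each document, and the retriever's own softmax over its relevance scores. The sketch below assumes a uniform prior over documents and treats the target as a constant; the function names and the temperature parameter are illustrative, not the paper's code.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(retriever_scores, reader_log_likelihoods, temperature=1.0):
    """KL(target || retriever) over the retrieved documents.

    retriever_scores: query-document dot products, one per document.
    reader_log_likelihoods: log p(correct output | query, document), one per document.
    Toy inputs are assumed to be small enough that no probability underflows to zero.
    """
    p_retriever = softmax([s / temperature for s in retriever_scores])
    p_target = softmax(reader_log_likelihoods)
    return sum(
        t * (math.log(t) - math.log(r))
        for t, r in zip(p_target, p_retriever)
        if t > 0.0
    )


# The retriever already ranks documents the way the reader finds them useful.
agree = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
# The retriever prefers the document that helps the reader least.
disagree = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

The loss is zero when the two distributions coincide and grows as the retriever's ranking departs from the reader's, which is what lets the reader supervise the retriever without document annotations.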

The pretraining and fine-tuning corpus, which also serves as the retrieval index, combined a December 2021 Wikipedia dump (about 37 million passages, averaging 78 words each, with lists and infoboxes linearized) and the 2020-10 Common Crawl dump processed with the CCNet pipeline (about 350 million passages) ^[1]. During pretraining the passage being denoised is excluded from its own retrieval results so the model cannot trivially copy the answer. The final models were pretrained for 10,000 iterations with AdamW, retrieving 20 documents per step ^[1]. To keep the index from going stale as the retriever updates, the authors used strategies including periodic full index refresh, retrieve-then-rerank, and query-side fine-tuning (training only the query encoder so the document embeddings stay fixed) ^[1].
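The exclusion rule used during pretraining can be illustrated with a toy top-k search. The index contents, passage identifiers, and the `exclude_id` parameter are hypothetical; a real system would use an approximate nearest-neighbor index rather than a full scan.

```python
def retrieve(query_vec, index, k, exclude_id=None):
    """Return the ids of the k highest-scoring passages by dot product,
    optionally skipping one passage (the one the query was built from)."""
    scored = [
        (sum(q * d for q, d in zip(query_vec, vec)), passage_id)
        for passage_id, vec in index.items()
        if passage_id != exclude_id
    ]
    scored.sort(reverse=True)
    return [passage_id for _, passage_id in scored[:k]]


# Toy passage embeddings (assumed values).
index = {
    "p1": [1.0, 0.0],
    "p2": [0.9, 0.1],
    "p3": [0.0, 1.0],
}

# A masked version of p1 is the query, so p1 itself is its own best match.
query_vec = index["p1"]
with_source = retrieve(query_vec, index, k=2)
without_source = retrieve(query_vec, index, k=2, exclude_id="p1")
```

Without the exclusion, the source passage would be retrieved first and the masked spans could simply be copied from it.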

Results

Atlas was evaluated in both few-shot settings (commonly 64 training examples) and full-dataset fine-tuning. Its central claim is parameter efficiency: with 11B parameters it reached 42.4% accuracy on Natural Questions using 64 examples (and 45.1% with a Wikipedia-only index), outperforming the 540B-parameter PaLM by roughly 3 points despite having about 50 times fewer parameters ^[1].
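The parameter-efficiency comparison is simple arithmetic on the figures quoted above, reproduced here as a check. The variable names are illustrative, and the 11B figure counts the reader only.

```python
atlas_params_billion = 11
palm_params_billion = 540

atlas_nq_64_shot = 42.4
palm_nq_64_shot = 39.6

# How many times larger PaLM is than the Atlas reader.
param_ratio = palm_params_billion / atlas_params_billion

# Accuracy margin in percentage points on 64-shot Natural Questions.
margin_points = round(atlas_nq_64_shot - palm_nq_64_shot, 1)
```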

Open-domain question answering

On the open-domain versions of Natural Questions and TriviaQA, evaluated by exact match, Atlas set state-of-the-art results in the 64-shot setting and on the full training set ^[1].

Benchmark (exact match, %)	Atlas-11B (64-shot)	Atlas-11B (full)	Prior comparison
Natural Questions	42.4	60.4	GPT-3 29.9, PaLM 39.6 (64-shot); R2-D2 55.9 (full)
TriviaQA (filtered)	74.5	79.8	Chinchilla 64.6 (64-shot)
TriviaQA (unfiltered)	84.7	89.4	PaLM 81.4 (64-shot)

On the full Natural Questions training set, Atlas improved the previous best exact match from 55.9% to 60.4% ^[1].
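Exact match counts a prediction as correct only if it equals one of the reference answers after normalization. The sketch below assumes the SQuAD-style normalization common in open-domain question answering (lowercasing, stripping punctuation and English articles, collapsing whitespace); the evaluation script used for Atlas may differ in detail.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, references):
    """True if the normalized prediction equals any normalized reference."""
    target = normalize(prediction)
    return any(target == normalize(ref) for ref in references)


def exact_match_score(predictions, references_per_question):
    """Percentage of questions answered with an exact match."""
    hits = sum(
        exact_match(pred, refs)
        for pred, refs in zip(predictions, references_per_question)
    )
    return 100.0 * hits / len(predictions)


# Hypothetical predictions and reference answers.
score = exact_match_score(
    ["The Eiffel Tower.", "1990"],
    [["Eiffel Tower"], ["1989"]],
)
```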

Fact checking (FEVER)

On the three-class FEVER fact-checking task, Atlas reached 64.3% accuracy with 64 examples. In a 15-shot setting (5 examples per class) it scored 56.2%, beating Gopher by 5.1 points. With full fine-tuning it reached 78.0%, and 80.1% when given an index built from the FEVER Wikipedia corpus, a new state of the art ^[1].

MMLU

On MMLU, a 57-domain multiple-choice benchmark, Atlas-11B with de-biased inference scored 47.9% in the 5-shot setting, ahead of GPT-3 (43.9%) while using about 15 times fewer parameters and roughly 10 times less pretraining compute ^[1]. With multitask 5-shot training it reached 56.6%, and with additional auxiliary training data it reached 66.0%, approaching the strongest reported systems ^[1].

MMLU setting (accuracy, %)	Atlas-11B	GPT-3 (175B)	Chinchilla (70B)
Zero-shot	47.1	N/A	N/A
5-shot	47.9	43.9	67.5
5-shot multitask	56.6	N/A	N/A
Full / transfer	66.0	53.9	N/A
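De-biased inference, as the paper describes it, counters a model's preference for particular answer letters by scoring each question under every cyclic permutation of the answer options and summing the probability assigned to each option across permutations. The sketch below is a minimal illustration under that reading; `biased_scorer` is a hypothetical stand-in for the reader, with a built-in preference for the first letter.

```python
def debiased_predict(question, options, score_letters):
    """Sum each option's probability over all cyclic shifts of the option order.

    score_letters(question, shown_options) must return one probability per
    letter position, in the order the options are shown.
    """
    n = len(options)
    totals = [0.0] * n
    for shift in range(n):
        shown = options[shift:] + options[:shift]    # shown[pos] == options[(pos + shift) % n]
        probs = score_letters(question, shown)
        for pos, prob in enumerate(probs):
            totals[(pos + shift) % n] += prob
    best = max(range(n), key=lambda i: totals[i])
    return best, totals


def biased_scorer(question, shown_options):
    """Hypothetical reader: strongly favors letter A, mildly favors "Paris"."""
    probs = [0.4, 0.2, 0.2, 0.2]
    probs = [p + (0.15 if opt == "Paris" else 0.0) for p, opt in zip(probs, shown_options)]
    total = sum(probs)
    return [p / total for p in probs]


options = ["London", "Berlin", "Paris", "Rome"]
single_pass = biased_scorer("Capital of France?", options)
single_best = max(range(len(options)), key=lambda i: single_pass[i])
best, totals = debiased_predict("Capital of France?", options, biased_scorer)
```

In this toy case a single pass picks the option shown at letter A, while summing over the four rotations cancels the positional preference and recovers the option favored on content. The cost is one forward pass per answer option.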

KILT

On the KILT benchmark, which bundles 11 datasets across five knowledge-intensive task types, Atlas in the 64-shot setting was competitive with several fully fine-tuned leaderboard entries. After full fine-tuning it set new state-of-the-art results on five KILT datasets, including AIDA CoNLL-YAGO entity linking (90.6 accuracy), FEVER (93.5), Natural Questions (61.3 exact match), and HotpotQA (50.6 exact match) ^[1].

Updateability and analysis

Because Atlas stores knowledge in an external index, that knowledge can be changed after training. The authors tested this with TempLAMA, a set of time-sensitive cloze questions whose answers change between 2017 and 2020 ^[1]. After fine-tuning on 2017 answers with a 2017 Wikipedia index, swapping in a 2020 index (without any retraining) raised 2020 accuracy to 53.1%, close to the model's 2017 performance, while a closed-book T5 had no comparable mechanism to update its facts ^[1].
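The index-swap idea can be shown with a deliberately simple stand-in: a fixed "reader" function that never changes, paired with two different indexes. The entities, passages, and word-overlap retrieval are all invented for illustration and do not come from TempLAMA or the Atlas code.

```python
def read_answer(query, index):
    """Frozen toy reader: pick the passage sharing the most words with the
    query and return its final word as the answer."""
    query_words = set(query.lower().split())
    best_passage = max(
        index,
        key=lambda passage: len(query_words & set(passage.lower().split())),
    )
    return best_passage.split()[-1]


# Two hypothetical snapshots of the same corpus.
index_2017 = [
    "alex example plays for rovers",
    "sam sample plays for city",
]
index_2020 = [
    "alex example plays for united",
    "sam sample plays for city",
]

query = "alex example plays for"
answer_2017 = read_answer(query, index_2017)
answer_2020 = read_answer(query, index_2020)
```

The function is identical in both calls; only the documents it reads from differ, which is the property the TempLAMA experiment measures at scale.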

The paper also examined interpretability and data leakage. Inspecting retrieved passages showed that for the MMLU multitask model about 85% of retrieved passages came from Common Crawl and roughly 15% from Wikipedia, and that accuracy rose with how often the correct answer appeared in retrieved text ^[1]. The authors estimated MMLU leakage in their Common Crawl corpus at about 2.8% of questions, and noted that filtering out potentially leaked passages reduced the MMLU score only slightly, from 56.4% to 55.8% ^[1]. They also showed that compressing the index with product quantization gave comparable accuracy while cutting memory roughly fivefold ^[1].

Authorship and release

The paper lists ten authors: Gautier Izacard and Patrick Lewis (joint first authors), Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave ^[1]^[2]. The work was done at Meta AI; by the time of the JMLR version several authors had moved on, with Izacard listed at Inflection AI and Lewis at Cohere, and affiliations also spanning ENS, PSL University and Inria, and University College London ^[2]. The JMLR edition appeared in Volume 24 (2023), edited by Ivan Titov ^[2].

Code and pretrained Atlas checkpoints were released on GitHub under the facebookresearch organization, with the code under a CC-BY-NC license; the repository is no longer actively maintained ^[4]. Atlas is frequently cited as an example of retrieval-augmented generation applied to few-shot, knowledge-intensive tasks, and as evidence that retrieval can substitute for raw parameter scale on such tasks ^[1].

References

1. Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., and Grave, E. "Atlas: Few-shot Learning with Retrieval Augmented Language Models." arXiv:2208.03299, 2022. https://arxiv.org/abs/2208.03299
2. Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., and Grave, E. "Atlas: Few-shot Learning with Retrieval Augmented Language Models." Journal of Machine Learning Research, Volume 24 (2023). https://www.jmlr.org/papers/v24/23-0037.html
3. Izacard, G., Caron, M., Hosseini, L., Riedel, S., Bojanowski, P., Joulin, A., and Grave, E. "Unsupervised Dense Information Retrieval with Contrastive Learning" (Contriever). arXiv:2112.09118, 2021. https://arxiv.org/abs/2112.09118
4. facebookresearch/atlas. GitHub repository. https://github.com/facebookresearch/atlas
