Galactica (language model)

Updated Jun 28, 2026

Galactica is a large language model for science, built by the Papers with Code team at Meta AI and released on 15 November 2022. The model was trained on a curated corpus of roughly 106 billion tokens of scientific text and data, including more than 48 million papers, and was offered in five sizes up to 120 billion parameters ^[1]^[6]. Galactica is best known for its public demo, which Meta took offline after only about three days following criticism that it confidently generated authoritative-sounding but false scientific text and fabricated citations ^[2]^[3].

The accompanying paper, "Galactica: A Large Language Model for Science" by Ross Taylor and colleagues, was posted to arXiv on 16 November 2022 and describes the model in its first line as "a large language model that can store, combine and reason about scientific knowledge" ^[1]. The episode became an early, widely cited example of the risks of hallucination in scientific language models, and it landed two weeks before OpenAI released ChatGPT on 30 November 2022, which faced similar problems but was received very differently ^[4].

What was Galactica designed to do?

The paper frames Galactica as a response to information overload: the volume of scientific literature and data has grown faster than any individual's ability to read it, and search engines index documents without organizing the knowledge inside them. The authors argued that a language model trained on a high-quality scientific corpus could serve as a different kind of interface to that knowledge, one that can recall facts, work through reasoning, and connect related material across papers, code, and reference sources ^[1].

A central design choice distinguished Galactica from contemporaries such as OPT (Open Pre-trained Transformer), GPT-3, and BLOOM. Where those models trained on large, mostly uncurated text scraped from the web, Galactica trained on a comparatively small and deliberately curated corpus. The authors described this as a "normative" approach to dataset selection, closer in spirit to expert systems than to the prevailing scale-everything paradigm, and posed it as a research question: can a useful model be built from a corpus where you understand exactly what goes in ^[1]?

The project came out of Papers with Code, a Meta AI team that maintained a popular machine-learning resource of the same name. Ross Taylor was the research lead, and Robert Stojnic, a Papers with Code co-founder, was among the authors ^[1]^[5].

What was Galactica trained on?

Galactica was trained on a corpus of 106 billion tokens drawn from open-access scientific sources. It included more than 48 million papers, textbooks and lecture notes, along with millions of compounds and proteins, scientific websites, encyclopedias, reference material, knowledge bases, and source code ^[1]^[6]. The corpus is small by the standards of general-purpose models of the era, which is the point: the authors traded raw scale for curation and quality.

The composition reported in the paper (Table 2) is as follows ^[1]:

Data source	Documents	Tokens	Share
Papers	48 million	88 billion	83.0%
Code	2 million	7 billion	6.9%
Reference material	8 million	7 billion	6.5%
Knowledge bases	2 million	2 billion	2.0%
Filtered CommonCrawl	0.9 million	1 billion	1.0%
Prompts	1.3 million	0.4 billion	0.3%
Other	0.02 million	0.2 billion	0.2%
Total		106 billion

Because the corpus was high quality and relatively small, the team trained on it for multiple passes. Galactica was trained for about 450 billion tokens, roughly 4.25 epochs over the corpus. The authors reported that validation loss kept falling through four epochs and that the largest model only began to overfit at the start of the fifth, which ran against the then-common assumption that repeating tokens harms performance ^[1].
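The epoch figure follows directly from the two token counts quoted above. A minimal sketch of that arithmetic, using only the rounded numbers given in the text (the function and constant names are illustrative, not from the paper):

```python
# Rounded figures quoted in the text above.
CORPUS_TOKENS = 106e9     # size of the curated corpus, in tokens
TRAINING_TOKENS = 450e9   # tokens processed over the whole training run


def epochs_over_corpus(training_tokens: float, corpus_tokens: float) -> float:
    """Number of passes over the corpus implied by the training budget."""
    return training_tokens / corpus_tokens


epochs = epochs_over_corpus(TRAINING_TOKENS, CORPUS_TOKENS)
print(f"{epochs:.2f} epochs")
```

Because both inputs are rounded, the result is only as precise as the "about 450 billion" figure allows.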

How did Galactica represent scientific structure?

A distinctive feature of Galactica is its use of task-specific and modality-specific tokens, so that scientific structure is represented directly in the token stream rather than left implicit in prose. The tokenization scheme included ^[1]:

Citations: references are wrapped in [START_REF] and [END_REF] tokens. This lets the model represent the citation graph inside the text and predict a citation given surrounding context.
Step-by-step reasoning: chains of reasoning are wrapped in a working-memory token, <work> ... </work>. The authors note that a transformer has no explicit working memory, so the <work> block acts as an external scratchpad. Where a step requires a computation the network cannot do reliably in a single forward pass, the model can write a short Python program inside the block and offload the arithmetic to a classical computer.
Molecular structure: SMILES chemical formulas are wrapped in [START_SMILES] and [END_SMILES] (with [START_I_SMILES]/[END_I_SMILES] for isomeric SMILES) and tokenized character by character.
Protein sequences: amino-acid sequences are wrapped in [START_AMINO] and [END_AMINO], with each residue treated as a single token.
DNA sequences: nucleotide sequences are wrapped in [START_DNA] and [END_DNA], again with character-level tokenization.
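The wrapping scheme in the list above can be sketched in a few lines. This is an illustration only, not the project's tokenizer: the helper names are hypothetical, the example sequences are made up, and details such as whitespace inside the markers are assumptions.

```python
import re


def wrap(kind: str, content: str) -> str:
    """Wrap content in [START_X] ... [END_X] markers, e.g. kind='REF'."""
    return f"[START_{kind}]{content}[END_{kind}]"


def char_level_tokens(kind: str, sequence: str) -> list[str]:
    """Opening marker, then one token per character, then the closing marker."""
    return [f"[START_{kind}]", *sequence, f"[END_{kind}]"]


def extract_work(text: str) -> list[str]:
    """Return the contents of every <work> ... </work> scratchpad block."""
    blocks = re.findall(r"<work>(.*?)</work>", text, flags=re.DOTALL)
    return [block.strip() for block in blocks]


# Leaving the reference open asks the model to predict a citation from context.
citation_prompt = "The Transformer architecture [START_REF]"

smiles_tokens = char_level_tokens("SMILES", "CCO")   # ethanol
protein_tokens = char_level_tokens("AMINO", "MKV")   # made-up three-residue sequence
scratchpad = extract_work("Question: 2 + 2? <work> 2 + 2 = 4 </work> Answer: 4")
```

Treating each character or residue as its own token keeps the sequence vocabulary small, at the cost of longer token sequences for large molecules and proteins.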

Mathematics and numbers also received special handling: digits were split into individual tokens (for example, 737612.62 becomes 7,3,7,6,1,2,.,6,2) and ASCII math operators were split into separate characters ^[1]. To train the <work> behaviour, the team included reasoning datasets in pre-training, among them the GSM8k training split and Khan Academy problems ^[1].
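The digit-splitting rule can be approximated with a simple pre-tokenizer. The sketch below is an assumption-laden illustration: `split_digits` is a hypothetical helper, and the real pipeline applies byte-pair encoding on top of a rule like this rather than returning raw pieces.

```python
import re


def split_digits(text: str) -> list[str]:
    """Split text so that every digit is its own piece.

    Runs of non-digit, non-space characters are kept together, so the
    decimal point between two digits comes out as a separate piece.
    """
    return re.findall(r"\d|[^\d\s]+", text)


pieces = split_digits("737612.62")
print(pieces)  # ['7', '3', '7', '6', '1', '2', '.', '6', '2']
```

Splitting numbers this way means the model never has to learn an arbitrary multi-digit token such as "7376"; it always sees the same ten digit symbols.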

What architecture and model sizes did Galactica use?

Galactica uses a transformer architecture in a decoder-only setup, with several modifications: GeLU activations, no biases in the dense layers or layer norms (following PaLM), learned positional embeddings, and a 2,048-token context window across all sizes. The vocabulary contained 50,000 tokens built with byte-pair encoding from a 2% sample of the training data ^[1].

Five model sizes were trained (paper Table 5) ^[1]:

Model	Parameters	Layers	Hidden dim (d_model)	Heads
GAL 125M (mini)	125M	12	768	12
GAL 1.3B (base)	1.3B	24	2,048	32
GAL 6.7B (standard)	6.7B	32	4,096	32
GAL 30B (large)	30B	48	7,168	56
GAL 120B (huge)	120B	96	10,240	80
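As a sanity check, the parameter counts in the table can be roughly reproduced from the layer count and hidden dimension. The sketch below uses the common 12 · layers · d_model² approximation for a decoder-only transformer, which assumes a feed-forward width of 4 × d_model (an assumption, not a figure from the table), and adds the token and learned positional embeddings. It ignores layer-norm parameters.

```python
# (layers, d_model) for each model, taken from the table above.
CONFIGS = {
    "GAL 125M": (12, 768),
    "GAL 1.3B": (24, 2_048),
    "GAL 6.7B": (32, 4_096),
    "GAL 30B": (48, 7_168),
    "GAL 120B": (96, 10_240),
}
VOCAB_SIZE = 50_000      # vocabulary size stated in the text
CONTEXT_LENGTH = 2_048   # learned positional embeddings, one per position


def approx_params(layers: int, d_model: int) -> int:
    """Rough parameter count for a bias-free decoder-only transformer."""
    # Per layer: 4*d^2 for attention projections + 8*d^2 for the MLP,
    # assuming the MLP hidden width is 4*d_model.
    blocks = 12 * layers * d_model**2
    embeddings = (VOCAB_SIZE + CONTEXT_LENGTH) * d_model
    return blocks + embeddings


estimates = {name: approx_params(*cfg) for name, cfg in CONFIGS.items()}
for name, count in estimates.items():
    print(f"{name}: ~{count / 1e9:.2f}B parameters")
```

Under these assumptions the estimates land within about two percent of the nominal sizes, which suggests the table's dimensions and the stated vocabulary size are mutually consistent.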

The largest model was deliberately capped at 120B parameters so that it would run inference on a single NVIDIA A100 node, for accessibility. Training the 120B model used 128 NVIDIA A100 80GB nodes and Meta's metaseq library. Galactica is a stand-alone base model and was not instruction-tuned, so getting good results required carefully formatted prompts ^[1]^[6].
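A back-of-the-envelope calculation shows why a 120B cap is compatible with single-node inference. The numbers below rest on assumptions that the text does not state: 16-bit weights, eight GPUs per node, and 80 GB of memory per GPU. Activations and the attention cache are not counted.

```python
PARAMS = 120e9          # GAL 120B parameter count
BYTES_PER_PARAM = 2     # assumption: weights stored in 16-bit precision
GPUS_PER_NODE = 8       # assumption: node size is not given in the text
GPU_MEMORY_GB = 80      # assumption: A100 80GB cards, as used for training

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9   # decimal gigabytes
node_gb = GPUS_PER_NODE * GPU_MEMORY_GB
fits = weights_gb < node_gb

print(f"weights: ~{weights_gb:.0f} GB, node memory: {node_gb} GB, fits: {fits}")
```

Under the same assumptions, a model several times larger would need its weights sharded across more than one node.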

How well did Galactica perform on benchmarks?

On the benchmarks reported in the paper, Galactica performed well relative to larger general-purpose models. On technical knowledge probes such as recalling LaTeX equations it scored 68.2% against 49.0% for the latest GPT-3. On mathematical MMLU it averaged 41.3% versus 35.7% for Chinchilla, and on the MATH benchmark it reached 20.4% against 8.8% for PaLM 540B, a model roughly 4.5 times larger. It also set state-of-the-art results on the downstream medical question-answering tasks PubMedQA (77.6%) and MedMCQA dev (52.9%), and, despite not training on a general corpus, it outperformed BLOOM and OPT-175B on a 57-task subset of BIG-bench used as an out-of-domain probe ^[1].

The paper also reported that Galactica was less toxic and less biased than comparison models on benchmarks including RealToxicityPrompts, CrowS-Pairs, and StereoSet, which the authors attributed in part to the scientific corpus having a lower incidence of stereotyped or hateful content. It scored higher than other models on TruthfulQA as well ^[1]. These results coexisted, in the same document, with explicit warnings about hallucination, discussed below.

Why was the Galactica demo taken down?

The interactive demo at galactica.org let anyone type a prompt and have the model generate scientific text, complete with formulas and references. Within hours of the 15 November launch, researchers were posting examples of confident, well-formatted output that was simply wrong ^[2]^[3].

Reported failures included fabricated papers and citations attributed to real scientists, plausible-looking articles on subjects that do not exist (one widely shared example was a wiki-style article on the history of bears in space), and entries for made-up concepts. Michael Black, director of the Max Planck Institute for Intelligent Systems, summarized his tests by writing, "In all cases, it was wrong or biased but sounded right and authoritative. I think it's dangerous," and warned that such output could pollute the scientific record ^[2]. The model also returned confident answers to pseudo-scientific or harmful prompts in some cases, while declining others. Cognitive scientist Gary Marcus and other researchers, including Princeton astrophysicist Miles Cranmer, amplified the criticism ^[2]^[3].

The core objection was not that Galactica made mistakes, which all language models do, but how it was presented. Meta's chief AI scientist Yann LeCun had promoted the demo with the description, "Type a text and Galactica will generate a paper with relevant references, formulas, and everything" ^[2]. That kind of promotional language suggested the model could generate literature reviews, wiki articles, and papers, which invited users to treat a research demo as a finished product and to trust output that looked like vetted science but was not. LeCun defended the work as the backlash grew, but on 17 November the team took the demo offline. The Papers with Code account said it had paused the demo and appreciated the community's feedback ^[2]^[3]^[7]. The paper, the model weights, and the code stayed available ^[6].

What happened to Galactica afterward?

The model weights were released on the Hugging Face Hub in all five sizes under a non-commercial license (CC BY-NC 4.0), and the inference code (the galai library) was open-sourced under Apache 2.0 ^[5]^[6]. Third parties later produced fine-tuned variants, such as instruction-tuned versions of the 30B model.

Galactica's reception influenced how Meta released later models. Members of the team, including Ross Taylor, have said the experience shaped the more cautious, research-framed launch of LLaMA in early 2023, with clearer documentation of limitations and a gated release to researchers rather than an open public demo ^[4]^[5]. The episode is now commonly cited in discussions of scientific LLMs as a case study in two things at once: the technical limits of factual reliability in generative models, and the gap between how a model is presented and how the public will use it.

References

1. Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Andrew Poulton, Viktor Kerkez, Robert Stojnic. "Galactica: A Large Language Model for Science." arXiv:2211.09085, 16 November 2022. https://arxiv.org/abs/2211.09085
2. Will Douglas Heaven. "Why Meta's latest large language model survived only three days online." MIT Technology Review, 18 November 2022. https://www.technologyreview.com/2022/11/18/1063487/meta-large-language-model-ai-only-survived-three-days-gpt-3-science/
3. "Meta's 'biased' science-writing AI demo gets pulled after three days." Silicon Republic, 21 November 2022. https://www.siliconrepublic.com/machines/galactica-meta-ai-large-language-model
4. Sharon Goldman. "What Meta learned from Galactica, the doomed model launched two weeks before ChatGPT." VentureBeat, 30 November 2022. https://venturebeat.com/ai/what-meta-learned-from-galactica-the-doomed-model-launched-two-weeks-before-chatgpt
5. Papers with Code. "galai: Model API for GALACTICA." GitHub. https://github.com/paperswithcode/galai
6. Meta AI / Papers with Code. "GALACTICA 120B model card." Hugging Face. https://huggingface.co/facebook/galactica-120b
7. "Meta's Galactica AI Pulled After Spreading Misinformation and Harmful Content." OECD.AI Incidents, 22 November 2022. https://oecd.ai/en/incidents/2022-11-22-6a50
