Galactica (language model)
Last reviewed
Jun 3, 2026
Sources
7 citations
Review status
Source-backed
Revision
v1 · 1,776 words
Galactica is a large language model for science, built by the Papers with Code team at Meta AI and released in November 2022. It was trained on a curated corpus of scientific text and data rather than the general web, with the stated goal of helping researchers store, combine, and reason about scientific knowledge. The accompanying paper, "Galactica: A Large Language Model for Science" by Ross Taylor and colleagues, was posted to arXiv on 16 November 2022 [1].
Galactica is best remembered for what happened to its public demonstration. Meta put an interactive web demo online on 15 November 2022 and took it down within three days, on 17 November, after researchers showed that it readily produced authoritative-sounding but false, fabricated, or biased scientific text [2][3]. The episode became an early, widely cited example of the risks of hallucination in scientific language models, and it landed about two weeks before OpenAI released ChatGPT (30 November 2022), which faced similar problems but was received very differently [4].
The paper frames Galactica as a response to information overload: the volume of scientific literature and data has grown faster than any individual's ability to read it, and search engines index documents without organizing the knowledge inside them. The authors argued that a language model trained on a high-quality scientific corpus could serve as a different kind of interface to that knowledge, one that can recall facts, work through reasoning, and connect related material across papers, code, and reference sources [1].
A central design choice distinguished Galactica from contemporaries such as OPT (Open Pre-trained Transformer), GPT-3, and BLOOM. Where those models trained on large, mostly uncurated text scraped from the web, Galactica trained on a comparatively small and deliberately curated corpus. The authors described this as a "normative" approach to dataset selection, closer in spirit to expert systems than to the prevailing scale-everything paradigm, and posed it as a research question: can a useful model be built from a corpus where you understand exactly what goes in [1]?
The project came out of Papers with Code, a Meta AI team that maintained a popular machine-learning resource of the same name. Ross Taylor was the research lead, and Robert Stojnic, a Papers with Code co-founder, was among the authors [1][5].
Galactica was trained on a corpus of 106 billion tokens drawn from open-access scientific sources. It included more than 48 million papers, textbooks and lecture notes, along with millions of compounds and proteins, scientific websites, encyclopedias, reference material, knowledge bases, and source code [1][6]. The corpus is small by the standards of general-purpose models of the era, which is the point: the authors traded raw scale for curation and quality.
The composition reported in the paper (Table 2) is as follows [1]:
| Data source | Documents | Tokens | Share |
|---|---|---|---|
| Papers | 48 million | 88 billion | 83.0% |
| Code | 2 million | 7 billion | 6.9% |
| Reference material | 8 million | 7 billion | 6.5% |
| Knowledge bases | 2 million | 2 billion | 2.0% |
| Filtered CommonCrawl | 0.9 million | 1 billion | 1.0% |
| Prompts | 1.3 million | 0.4 billion | 0.3% |
| Other | 0.02 million | 0.2 billion | 0.2% |
| Total | | 106 billion | |
Because the corpus was high quality and relatively small, the team trained on it for multiple passes. Galactica was trained for about 450 billion tokens, roughly 4.25 epochs over the corpus. The authors reported that validation loss kept falling through four epochs and that the largest model only began to overfit at the start of the fifth, which ran against the then-common assumption that repeating tokens harms performance [1].
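The epoch figure follows directly from the two token counts quoted above. A minimal check, using only those two numbers (the variable names are illustrative):

```python
# Token counts as quoted in the text.
corpus_tokens = 106e9    # size of the curated corpus
training_tokens = 450e9  # total tokens seen during training

# Number of passes over the corpus implied by those two figures.
epochs = training_tokens / corpus_tokens
print(f"passes over the corpus: {epochs:.2f}")
```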
A distinctive feature of Galactica is its use of task-specific and modality-specific tokens, so that scientific structure is represented directly in the token stream rather than left implicit in prose. The tokenization scheme included [1]:
- Citations: wrapped in [START_REF] and [END_REF] tokens. This lets the model represent the citation graph inside the text and predict a citation given surrounding context.
- Step-by-step reasoning: wrapped in <work> ... </work>. The authors note that a transformer has no explicit working memory, so the <work> block acts as an external scratchpad. Where a step requires a computation the network cannot do reliably in a single forward pass, the model can write a short Python program inside the block and offload the arithmetic to a classical computer.
- SMILES chemical formulas: wrapped in [START_SMILES] and [END_SMILES] (with [START_I_SMILES]/[END_I_SMILES] for isomeric SMILES) and tokenized character by character.
- Amino acid sequences: wrapped in [START_AMINO] and [END_AMINO], with each residue treated as a single token.
- DNA sequences: wrapped in [START_DNA] and [END_DNA], again with character-level tokenization.

Mathematics and numbers also received special handling: digits were split into individual tokens (for example, 737612.62 becomes 7,3,7,6,1,2,.,6,2) and ASCII math operators were split into separate characters [1]. To train the <work> behaviour, the team included reasoning datasets in pre-training, among them the GSM8k training split and Khan Academy problems [1].
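The preprocessing ideas described above can be illustrated with a small sketch. It is not the tokenizer used by the authors: the helper names `split_digits`, `wrap`, and `char_tokens` are hypothetical, and the real pipeline applies byte-pair encoding on top of steps like these.

```python
import re


def split_digits(text: str) -> list[str]:
    """Split text so that every digit becomes its own token.

    Runs of non-digit characters are kept whole here; a real tokenizer
    would go on to apply byte-pair encoding to them.
    """
    return re.findall(r"\d|[^\d]+", text)


def wrap(sequence: str, kind: str) -> str:
    """Surround a sequence with [START_<kind>] and [END_<kind>] markers."""
    return f"[START_{kind}]{sequence}[END_{kind}]"


def char_tokens(sequence: str) -> list[str]:
    """Character-level tokenization: one token per character."""
    return list(sequence)


# The number from the text, split digit by digit.
print(split_digits("737612.62"))

# Ethanol as a SMILES string, wrapped and tokenized character by character.
print(wrap("CCO", "SMILES"), char_tokens("CCO"))

# A short, made-up amino acid sequence, one token per residue.
print(wrap("MIRLG", "AMINO"), char_tokens("MIRLG"))
```

Splitting digits individually means the model never has to learn that an arbitrary multi-digit chunk is a number, and the wrapper tokens tell it which modality it is reading.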
Galactica uses a transformer architecture in a decoder-only setup, with several modifications: GeLU activations, no biases in the dense layers or layer norms (following PaLM), learned positional embeddings, and a 2,048-token context window across all sizes. The vocabulary contained 50,000 tokens built with byte-pair encoding from a 2% sample of the training data [1].
Five model sizes were trained (paper Table 5) [1]:
| Model | Parameters | Layers | Hidden dim (d_model) | Heads |
|---|---|---|---|---|
| GAL 125M (mini) | 125M | 12 | 768 | 12 |
| GAL 1.3B (base) | 1.3B | 24 | 2,048 | 32 |
| GAL 6.7B (standard) | 6.7B | 32 | 4,096 | 32 |
| GAL 30B (large) | 30B | 48 | 7,168 | 56 |
| GAL 120B (huge) | 120B | 96 | 10,240 | 80 |
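The nominal sizes can be roughly reproduced from the layer counts and hidden dimensions in the table. The sketch below uses the common rule of thumb of about 12 · layers · d_model² weights in the transformer blocks, which assumes a feed-forward width of four times d_model (an assumption, not a figure from this article), plus token and learned positional embeddings for the 50,000-token vocabulary and 2,048-token context. Layer-norm parameters are ignored.

```python
# (layers, d_model) for each size, copied from the table above.
SIZES = {
    "GAL 125M": (12, 768),
    "GAL 1.3B": (24, 2_048),
    "GAL 6.7B": (32, 4_096),
    "GAL 30B": (48, 7_168),
    "GAL 120B": (96, 10_240),
}

VOCAB_SIZE = 50_000  # vocabulary size quoted in the text
CONTEXT_LEN = 2_048  # context window quoted in the text


def approx_params(layers: int, d_model: int) -> int:
    """Rough parameter count for a decoder-only transformer.

    Per layer: about 4*d^2 weights for attention and 8*d^2 for the MLP,
    assuming a feed-forward width of 4*d (an assumption of this sketch).
    """
    blocks = 12 * layers * d_model**2
    embeddings = (VOCAB_SIZE + CONTEXT_LEN) * d_model
    return blocks + embeddings


for name, (layers, d_model) in SIZES.items():
    print(f"{name}: ~{approx_params(layers, d_model) / 1e9:.2f}B parameters")
```

Under these assumptions each estimate lands within about 2% of the nominal size, which suggests the table's dimensions and the stated parameter counts are consistent with each other.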
The largest model was deliberately capped so that it would run inference on a single NVIDIA A100 node, for accessibility. Training the 120B model used 128 NVIDIA A100 80GB nodes and Meta's metaseq library. Galactica is a stand-alone base model and was not instruction-tuned, so getting good results required carefully formatted prompts [1][6].
On the benchmarks reported in the paper, Galactica performed well relative to larger general-purpose models. On technical knowledge probes such as recalling LaTeX equations it scored 68.2% against 49.0% for the latest GPT-3. On mathematical MMLU it averaged 41.3% versus 35.7% for Chinchilla, and on the MATH benchmark it reached 20.4% against 8.8% for PaLM 540B, a model roughly 4.5 times larger. It also set state-of-the-art results on the downstream medical question-answering tasks PubMedQA (77.6%) and MedMCQA dev (52.9%), and, despite not training on a general corpus, it outperformed BLOOM and OPT-175B on a 57-task subset of BIG-bench used as an out-of-domain probe [1].
The paper also reported that Galactica was less toxic and less biased than comparison models on benchmarks including RealToxicityPrompts, CrowS-Pairs, and StereoSet, which the authors attributed in part to the scientific corpus having a lower incidence of stereotyped or hateful content. It scored higher than other models on TruthfulQA as well [1]. These results coexisted, in the same document, with explicit warnings about hallucination, discussed below.
The interactive demo at galactica.org let anyone type a prompt and have the model generate scientific text, complete with formulas and references. Within hours of the 15 November launch, researchers were posting examples of confident, well-formatted output that was simply wrong [2][3].
Reported failures included fabricated papers and citations attributed to real scientists, plausible-looking articles on subjects that do not exist, and "wiki" entries for made-up concepts. Michael Black, director of the Max Planck Institute for Intelligent Systems, wrote that the model produced text that was "wrong or biased but sounded right and authoritative," and called it dangerous because such output could pollute the scientific record [2]. The model also returned confident answers to pseudo-scientific or harmful prompts in some cases, while declining others. Cognitive scientist Gary Marcus and several other researchers amplified the criticism [2][3].
The core objection was not that Galactica made mistakes (all language models do) but how it was framed. Promotional language suggested the model could generate literature reviews, wiki articles, and papers, which invited users to treat a research demo as a finished product and to trust output that looked like vetted science but was not. Meta's chief AI scientist Yann LeCun, who had promoted the demo, defended the work, but on 17 November the team took the demo offline. The Papers with Code account said it had paused the demo and appreciated the community's feedback [2][3][7]. The paper, the model weights, and the code stayed available [6].
The model weights were released on the Hugging Face Hub in all five sizes under a non-commercial license (CC BY-NC 4.0), and the inference code (the galai library) was open-sourced under Apache 2.0 [6][5]. Third parties later produced fine-tuned variants, such as instruction-tuned versions of the 30B model.
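Because Galactica is a base model, a prompt usually has to end in the special token that cues the desired output. The sketch below shows one way to do that with the Hugging Face `transformers` library rather than galai. It assumes the smallest checkpoint is published under the model ID `facebook/galactica-125m` and loads with `OPTForCausalLM`; both are assumptions to check against the model card, and the helper names are hypothetical.

```python
def citation_prompt(context: str) -> str:
    """End the prompt with [START_REF] so the next tokens form a citation."""
    return f"{context.rstrip()} [START_REF]"


def generate(prompt: str,
             model_id: str = "facebook/galactica-125m",  # assumed model ID
             max_new_tokens: int = 60) -> str:
    """Generate a continuation of the prompt with a Galactica checkpoint."""
    # Imported lazily so citation_prompt works without the heavy dependency.
    from transformers import AutoTokenizer, OPTForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = OPTForCausalLM.from_pretrained(model_id)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0])


# Example (downloads the weights on first use):
# print(generate(citation_prompt("The Transformer architecture")))
```

Any citation produced this way should be verified against the literature before use, since fabricated references were among the failures reported for the demo.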
Galactica's reception influenced how Meta released later models. Members of the team, including Ross Taylor, have said the experience shaped the more cautious, research-framed launch of LLaMA in early 2023, with clearer documentation of limitations and a gated release to researchers rather than an open public demo [4][5]. The episode is now commonly cited in discussions of scientific LLMs as a case study in two things at once: the technical limits of factual reliability in generative models, and the gap between how a model is presented and how the public will use it.