OPT (Open Pre-trained Transformer)

Updated Jun 28, 2026

OPT (Open Pre-trained Transformer) is a suite of decoder-only large language models released by Meta AI in May 2022, ranging from 125 million to 175 billion parameters and built to reproduce the scale and capabilities of OpenAI's GPT-3 while opening the weights, training code, and a detailed training logbook to the research community. The largest model, OPT-175B, was the first dense 175-billion-parameter language model whose weights were made broadly available to outside researchers, and the paper reports it is "comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop." Meta released it on 3 May 2022 under a restrictive non-commercial license rather than a fully open one. ^[1]^[2]

The project was described in the technical report "OPT: Open Pre-trained Transformer Language Models," led by Susan Zhang with co-authors including Stephen Roller, Naman Goyal, Mikel Artetxe, Mona Diab, Xian Li, Xi Victoria Lin, Myle Ott, Luke Zettlemoyer, and others at Meta AI. The paper was posted to arXiv (2205.01068) on 2 May 2022, with a final revision on 21 June 2022. ^[1]

Quick facts	Detail
Developer	Meta AI
Released	3 May 2022
Paper	arXiv:2205.01068 (Zhang et al.)
Model sizes	125M, 350M, 1.3B, 2.7B, 6.7B, 13B, 30B, 66B, 175B
Architecture	Decoder-only transformer, 2048-token context
Training data	~180B tokens (~800 GB English text)
OPT-175B hardware	992 NVIDIA A100 80GB GPUs, ~2 months
Carbon footprint (final run)	~75 t CO2e, vs ~500 t estimated for GPT-3
License	Non-commercial (research)

What is OPT?

OPT (Open Pre-trained Transformer) is Meta AI's open suite of decoder-only large language models, spanning 125M to 175B parameters, that aimed to match GPT-3-class capability while making the full models, training code, and an unusually candid training logbook available to researchers. ^[1] The largest model, OPT-175B, was the first dense language model of that size whose weights were made broadly available to outside researchers, though under a restrictive non-commercial license rather than a fully open one. ^[1]^[2]

Why was OPT built?

By 2022, the most capable language models, including GPT-3, Gopher, and PaLM, were trained at enormous cost and were accessible only internally or through paid APIs. The OPT authors argued that this restricted access limited the ability of the wider research community to study how and why such models work, particularly questions of robustness, bias, and toxicity. The abstract frames the problem directly: "Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study." ^[1] A small number of open efforts existed at the time, including EleutherAI's GPT-NeoX-20B and the BigScience workshop that would later produce BLOOM, but no openly available dense model approached the 175 billion parameter scale of GPT-3. OPT was Meta's attempt to close that gap and, in the words of the abstract, to "fully and responsibly share" models of this size with interested researchers. ^[1]

What model sizes does OPT come in?

OPT is a set of transformer decoder models whose architecture and hyperparameters largely follow GPT-3, with most variation in batch size to improve compute efficiency. The paper's prose describes "eight" models, while the architecture table lists nine sizes from 125M to 175B; the discrepancy appears to reflect how the authors grouped the smaller baselines. The 350M model is an outlier in that its width does not scale smoothly with the rest of the family. All models use a sequence length of 2048, ReLU activations, and the GPT-2 byte-level BPE tokenizer. ^[1]

Model	Layers	Attention heads	Hidden dim (d_model)	Peak learning rate	Batch size (tokens)
OPT-125M	12	12	768	6.0e-4	0.5M
OPT-350M	24	16	1024	3.0e-4	0.5M
OPT-1.3B	24	32	2048	2.0e-4	1M
OPT-2.7B	32	32	2560	1.6e-4	1M
OPT-6.7B	32	32	4096	1.2e-4	2M
OPT-13B	40	40	5120	1.0e-4	4M
OPT-30B	48	56	7168	1.0e-4	4M
OPT-66B	64	72	9216	0.8e-4	2M
OPT-175B	96	96	12288	1.2e-4	2M
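
As a rough sanity check on the table, the sketch below estimates each model's parameter count from its depth and width, using the common approximation of about 12·L·d² weights in the transformer blocks plus the token and position embeddings. The vocabulary size of 50,272 is an assumption that does not come from the table, and the estimate ignores biases and layer norms, so it is an approximation rather than an exact count.

```python
# (layers, attention heads, d_model) copied from the table above.
OPT_CONFIGS = {
    "OPT-125M": (12, 12, 768),
    "OPT-350M": (24, 16, 1024),
    "OPT-1.3B": (24, 32, 2048),
    "OPT-2.7B": (32, 32, 2560),
    "OPT-6.7B": (32, 32, 4096),
    "OPT-13B": (40, 40, 5120),
    "OPT-30B": (48, 56, 7168),
    "OPT-66B": (64, 72, 9216),
    "OPT-175B": (96, 96, 12288),
}

VOCAB_SIZE = 50_272  # assumption: not stated in the table
CONTEXT_LEN = 2048   # sequence length given in the text


def approx_params(layers, d_model, vocab=VOCAB_SIZE, context=CONTEXT_LEN):
    """Estimate parameters as ~12*L*d^2 block weights plus embeddings."""
    block_weights = 12 * layers * d_model ** 2
    embeddings = (vocab + context) * d_model
    return block_weights + embeddings


def head_dim(heads, d_model):
    """Per-head width: d_model split evenly across attention heads."""
    return d_model // heads


if __name__ == "__main__":
    for name, (layers, heads, d_model) in OPT_CONFIGS.items():
        billions = approx_params(layers, d_model) / 1e9
        print(f"{name:9s} ~{billions:7.2f}B params, head dim {head_dim(heads, d_model)}")
```

Under these assumptions the estimates land close to the nominal sizes, and the per-head width works out to 128 for every model from 6.7B upward.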

For weight initialization the team followed the settings in the Megatron-LM codebase, using a zero-mean normal distribution with standard deviation 0.006 and scaling the standard deviation of output-layer weights by 1/sqrt(2L), where L is the number of layers. They used the AdamW optimizer with betas of (0.9, 0.95) and weight decay of 0.1, with a learning rate that warms up linearly to its peak and then decays to 10 percent of the peak over 300 billion tokens. ^[1]
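
The schedule described above can be sketched as a function of tokens seen. This is a minimal illustration, not the metaseq implementation: the linear shape of the decay, the `warmup_tokens` parameter, and the choice to measure the 300 billion tokens from the start of training are all assumptions made for the example.

```python
def learning_rate(tokens_seen, peak_lr, warmup_tokens,
                  decay_end_tokens=300e9, floor_fraction=0.1):
    """Linear warmup to peak_lr, then decay to floor_fraction * peak_lr.

    warmup_tokens must be positive and smaller than decay_end_tokens.
    The linear decay shape is an assumption of this sketch.
    """
    if tokens_seen < warmup_tokens:
        # Warmup: ramp from 0 up to the peak rate.
        return peak_lr * tokens_seen / warmup_tokens
    floor_lr = floor_fraction * peak_lr
    if tokens_seen >= decay_end_tokens:
        # After the decay window the rate stays at the floor.
        return floor_lr
    progress = (tokens_seen - warmup_tokens) / (decay_end_tokens - warmup_tokens)
    return peak_lr - (peak_lr - floor_lr) * progress


if __name__ == "__main__":
    # Peak rate for OPT-175B from the table; the 4e9 warmup is hypothetical.
    for tokens in (0, 2e9, 4e9, 150e9, 300e9):
        print(f"{tokens:>14,.0f} tokens -> lr {learning_rate(tokens, 1.2e-4, 4e9):.3e}")
```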

What data was OPT trained on?

OPT was trained on a corpus of roughly 180 billion tokens (about 800 GB of text) assembled mostly from previously used English-language datasets. The mixture concatenated data from three sources: the corpus used to train RoBERTa, a subset of the Pile, and a portion of the PushShift.io Reddit dataset. ^[1]^[3]

The RoBERTa component included BookCorpus and the CC-Stories subset of CommonCrawl, plus an updated CCNewsV2 collection of English news crawled through 28 September 2021. From the Pile, the team kept subsets such as Pile-CC, OpenWebText2, USPTO, Project Gutenberg, OpenSubtitles, Wikipedia, DM Mathematics, and HackerNews, while dropping other Pile subsets that increased the risk of training instabilities. The Reddit data was converted to a document format by extracting the longest comment chain in each thread, which discarded roughly two thirds of the corpus. The authors removed near-duplicate documents across datasets using MinHashLSH and noted that the Pile in particular was full of duplicate content. The corpus is predominantly English, with a small amount of non-English text present via CommonCrawl. ^[1]^[3]
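
To illustrate the idea behind MinHash-based near-duplicate removal, the sketch below compares two documents by their MinHash signatures. It is a simplified stand-in, not the authors' pipeline: it omits the LSH banding step that avoids all-pairs comparison, and the shingle size, number of hash functions, and 0.95 threshold are assumptions chosen for the example.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles; short texts yield one shingle."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(item, seed):
    """Deterministic 64-bit hash of item, varied by seed via the salt."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over the set."""
    return [min(_seeded_hash(item, seed) for item in items)
            for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def is_near_duplicate(doc_a, doc_b, threshold=0.95):
    """Flag a pair whose estimated similarity meets the (assumed) threshold."""
    sig_a = minhash_signature(shingles(doc_a))
    sig_b = minhash_signature(shingles(doc_b))
    return estimated_jaccard(sig_a, sig_b) >= threshold
```

In a corpus-scale setting the signatures would be bucketed with locality-sensitive hashing so that only documents sharing a bucket are compared.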

How was OPT-175B trained?

OPT-175B was trained on 992 NVIDIA A100 80GB GPUs, using Fully Sharded Data Parallel combined with Megatron-LM tensor parallelism. The team reported reaching up to 147 TFLOP/s per GPU, which Meta described as roughly 17 percent higher utilization than figures NVIDIA had published for similar hardware. Adam optimizer state was kept in FP32 and sharded across hosts while the model weights stayed in FP16, with dynamic loss scaling used to avoid numerical underflows. The run took on the order of two months. ^[1]^[2]
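
Dynamic loss scaling can be sketched in a few lines. The class below is a hypothetical, framework-free illustration of the general technique, not the code used for OPT: the loss is multiplied by a scale factor before the backward pass so small FP16 gradients do not underflow, the scale is cut when an overflow is detected, and it is raised again after a run of clean steps. All default values are assumptions.

```python
import math


class DynamicLossScaler:
    """Hypothetical loss scaler; defaults are illustrative assumptions."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def unscale(self, scaled_grads):
        """Divide out the scale before the optimizer consumes the gradients."""
        return [g / self.scale for g in scaled_grads]

    def update(self, scaled_grads):
        """Adjust the scale; return True if the optimizer step should run."""
        if not all(math.isfinite(g) for g in scaled_grads):
            # Overflow: shrink the scale and skip this step.
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # A long clean streak: try a larger scale again.
            self.scale *= self.growth_factor
            self._good_steps = 0
        return True
```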

A notable part of the release was Meta's candor about how difficult the training was. The team published a logbook documenting day-to-day problems, and the paper describes frequent hardware failures that contributed to at least 35 manual restarts and the cycling of more than 100 hosts over the two-month run, plus an estimated 70 or more automatic restarts. The team also fought loss divergences, which they addressed by lowering the learning rate and restarting from earlier checkpoints, reducing gradient clipping from 1.0 to 0.3, briefly switching the optimizer, and moving to a newer version of Megatron. The metaseq codebase used for the run was, per the authors, the only known open-source implementation at the time capable of training a decoder-only transformer of at least 175B parameters without pipeline parallelism on NVIDIA GPUs. ^[1]^[4]
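
The effect of tightening gradient clipping from 1.0 to 0.3 can be seen in a small sketch. It assumes clipping by the global L2 norm of the gradient, which is a common convention but is not confirmed by the text above, and it operates on a flat list of numbers rather than real model tensors.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the original norm.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm or total_norm == 0.0:
        return list(grads), total_norm
    factor = max_norm / total_norm
    return [g * factor for g in grads], total_norm


if __name__ == "__main__":
    grads = [3.0, 4.0]  # toy gradient with L2 norm 5.0
    for threshold in (1.0, 0.3):
        clipped, norm = clip_by_global_norm(grads, threshold)
        print(f"max_norm={threshold}: norm {norm} -> {clipped}")
```

A lower threshold shrinks every large update by a bigger factor, which limits how far a single noisy batch can move the weights.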

How much carbon did OPT use compared to GPT-3?

A central claim of the project was efficiency. The paper estimates the carbon footprint of developing OPT-175B at about 75 tons of CO2 equivalent, compared with an estimated 500 tons for GPT-3 (citing Patterson et al.) and 380 tons for Gopher. That comparison is the basis for the abstract's statement that OPT-175B is "comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop." ^[1] The authors were careful to note that carbon accounting methods are not standardized and are not always reported, and that the 75-ton figure covers only the final training run. In a footnote they add that, once ablations, smaller baselines, and downtime are included, their own estimate of total cost is roughly twice as high, around 150 tons. ^[1]
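
The arithmetic behind the comparison is simple enough to check directly. The snippet below uses only the figures quoted above; the variable and function names are illustrative.

```python
# Estimated emissions in tons of CO2 equivalent, as quoted in the text.
OPT_FINAL_RUN = 75
OPT_TOTAL = 150   # authors' rough estimate including ablations and downtime
GPT3 = 500
GOPHER = 380


def footprint_ratio(ours, theirs):
    """Fraction of the reference footprint that our estimate represents."""
    return ours / theirs


if __name__ == "__main__":
    print(f"final run vs GPT-3:  {footprint_ratio(OPT_FINAL_RUN, GPT3):.2f}")
    print(f"total vs GPT-3:      {footprint_ratio(OPT_TOTAL, GPT3):.2f}")
    print(f"final run vs Gopher: {footprint_ratio(OPT_FINAL_RUN, GOPHER):.2f}")
```

The final-run ratio of 0.15 is about 1/6.7, which the abstract rounds to 1/7; using the authors' broader 150-ton estimate, the ratio rises to 0.30.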

How does OPT perform versus GPT-3?

The team evaluated OPT on the same prompts and experimental setup that GPT-3 used, across 16 standard NLP tasks including HellaSwag, PIQA, ARC, WinoGrande, and SuperGLUE. In zero-shot settings, OPT's average performance tracked the trend of GPT-3 reasonably well, roughly matching GPT-3 on 10 of the tasks while underperforming on a few such as ARC Challenge and MultiRC. In one-shot and few-shot settings, OPT lagged GPT-3 more noticeably, and per-task results varied widely. The authors could not reproduce some of GPT-3's published numbers using the OpenAI API within their own evaluation harness, which they took as evidence of differences in evaluation methodology. ^[1]

The paper is unusually frank about the model's weaknesses. In its limitations section, the authors note that OPT-175B does not handle declarative instructions or point-blank interrogatives well, tends to be repetitive and can get stuck in loops, and, like other models of its era, has a high propensity to generate toxic language and reinforce stereotypes even from innocuous prompts. The team chose not to apply mitigation techniques in this first release, since the goal was to replicate GPT-3, and concluded that the technology was "still premature for commercial deployment." ^[1]

Is OPT open source?

Meta staged the release. The smaller models, from 125M up to 66B parameters, were made available for download alongside the metaseq training and inference code and the full logbook. Access to the full OPT-175B weights was granted on request to academic researchers, people affiliated with government, civil society, and academia, and industry research laboratories, under a non-commercial use license. This was a more open posture than GPT-3's API-only access, but it stopped short of a permissive open-source license, and the same OPT-175B license terms were later applied to derivative models. ^[1]^[2]

The release drew a mixed response. The transparency of the code and logbook was widely praised as a step toward reproducible research at scale. At the same time, commentators pointed out that "transparency and openness is not the equivalent of democratizing large language models": the hardware needed to run OPT-175B (on the order of hundreds of thousands of dollars in GPUs) and the technical staff required to operate it kept the model out of reach for most labs, and the non-commercial license barred deployment in products. ^[5]

What is OPT-IML?

In December 2022, Meta released OPT-IML (OPT Instruction Meta-Learning), instruction-tuned versions of OPT-30B and OPT-175B. The accompanying paper, "OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization" (Iyer et al.), introduced OPT-IML Bench, a benchmark of roughly 2000 NLP tasks drawn from eight existing collections. The authors used this benchmark to study instruction-tuning decisions such as task scaling and held-out task selection, then trained the OPT-IML models on the results. OPT-IML outperformed the base OPT models across evaluation suites including PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG, reflecting the broader 2022 shift toward instruction tuning seen in models like InstructGPT and FLAN. The OPT-IML models were released under the same OPT-175B license. ^[6]^[7]

Why was OPT significant?

OPT was an important early entry in the wave of openly available large language models in 2022, alongside GPT-NeoX-20B and BLOOM, and it remains widely used as a research baseline and in studies of model behavior, quantization, and efficient inference because its full weights and training details are available. Within Meta, the engineering work on metaseq and the lessons recorded in the OPT logbook fed directly into the company's later and more influential open-weight effort, the LLaMA family, released in early 2023. The models are distributed through the Hugging Face Hub and are supported in the Transformers library and in PyTorch. ^[1]^[2]^[3]

References

1. Susan Zhang et al., "OPT: Open Pre-trained Transformer Language Models," arXiv:2205.01068, 2022. https://arxiv.org/abs/2205.01068
2. Meta AI, "Democratizing access to large-scale language models with OPT-175B," 3 May 2022. https://ai.meta.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/
3. Hugging Face, "facebook/opt-350m" model card. https://huggingface.co/facebook/opt-350m
4. facebookresearch/metaseq, GitHub repository. https://github.com/facebookresearch/metaseq
5. Ben Dickson, "Can large language models be democratized?," TechTalks, 16 May 2022. https://bdtechtalks.com/2022/05/16/opt-175b-large-language-models/
6. Srinivasan Iyer et al., "OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization," arXiv:2212.12017, 2022. https://arxiv.org/abs/2212.12017
7. Hugging Face, "facebook/opt-iml-max-1.3b" license file. https://huggingface.co/facebook/opt-iml-max-1.3b/blob/main/LICENSE.md
