OPT (Open Pre-trained Transformer)
Last reviewed
Jun 3, 2026
Sources
7 citations
Review status
Source-backed
Revision
v1 · 1,797 words
OPT (Open Pre-trained Transformer) is a suite of decoder-only large language models released by Meta AI in May 2022. The family ranges from 125 million to 175 billion parameters, and was built explicitly to reproduce the scale and capabilities of OpenAI's GPT-3 while making the models, the training code, and a detailed account of the training process available to the research community. The largest model, OPT-175B, was the first dense language model of that size whose weights were made broadly available to outside researchers, though under a restrictive non-commercial license rather than a fully open one. [1][2]
The project was described in the technical report "OPT: Open Pre-trained Transformer Language Models," led by Susan Zhang with co-authors including Stephen Roller, Naman Goyal, Mikel Artetxe, Mona Diab, Xian Li, Xi Victoria Lin, Myle Ott, Luke Zettlemoyer, and others at Meta AI. The paper was posted to arXiv on 2 May 2022, with a final revision in June 2022. [1]
By 2022, the most capable language models, including GPT-3, Gopher, and PaLM, were trained at enormous cost and were accessible only internally or through paid APIs. The OPT authors argued that this restricted access limited the ability of the wider research community to study how and why such models work, particularly questions of robustness, bias, and toxicity. A small number of open efforts existed at the time, including EleutherAI's GPT-NeoX-20B and the BigScience workshop that would later produce BLOOM, but no openly available dense model approached the 175 billion parameter scale of GPT-3. OPT was Meta's attempt to close that gap and, in the authors' words, to share models of this size "fully and responsibly" with interested researchers. [1]
OPT is a set of transformer decoder models whose architecture and hyperparameters largely follow GPT-3, with most variation in batch size to improve compute efficiency. The paper's prose describes "eight" models, while the architecture table actually lists nine sizes from 125M to 175B; the discrepancy reflects how the authors grouped the smaller baselines. The 350M model is an outlier in that its width does not scale smoothly with the rest of the family. All models use a sequence length of 2048, ReLU activations, and the GPT-2 byte-level BPE tokenizer. [1]
| Model | Layers | Attention heads | Hidden dim (d_model) | Peak learning rate | Batch size (tokens) |
|---|---|---|---|---|---|
| OPT-125M | 12 | 12 | 768 | 6.0e-4 | 0.5M |
| OPT-350M | 24 | 16 | 1024 | 3.0e-4 | 0.5M |
| OPT-1.3B | 24 | 32 | 2048 | 2.0e-4 | 1M |
| OPT-2.7B | 32 | 32 | 2560 | 1.6e-4 | 1M |
| OPT-6.7B | 32 | 32 | 4096 | 1.2e-4 | 2M |
| OPT-13B | 40 | 40 | 5120 | 1.0e-4 | 4M |
| OPT-30B | 48 | 56 | 7168 | 1.0e-4 | 4M |
| OPT-66B | 64 | 72 | 9216 | 0.8e-4 | 2M |
| OPT-175B | 96 | 96 | 12288 | 1.2e-4 | 2M |
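As a rough sanity check on the table, the parameter count of a GPT-style decoder can be estimated from its depth and width alone. The sketch below uses the common rule of thumb of 12 · L · d² weights for the transformer blocks (about 4d² for attention and 8d² for a feed-forward layer four times the model width) plus token and position embeddings. The helper name `approx_params` is hypothetical, the vocabulary size of 50,272 is an assumption taken from the published model configurations, and biases and layer norms are ignored, so the result is an estimate rather than the paper's own accounting.

```python
def approx_params(n_layers: int, d_model: int,
                  vocab_size: int = 50272, seq_len: int = 2048) -> int:
    """Estimate decoder-only transformer parameters (biases and norms ignored)."""
    # Each block: ~4*d^2 for attention projections + ~8*d^2 for a 4x-wide MLP.
    blocks = 12 * n_layers * d_model ** 2
    # Token embeddings plus learned position embeddings (assumed sizes).
    embeddings = (vocab_size + seq_len) * d_model
    return blocks + embeddings


# (layers, d_model) pairs copied from the architecture table above.
OPT_SIZES = {
    "OPT-125M": (12, 768),
    "OPT-1.3B": (24, 2048),
    "OPT-175B": (96, 12288),
}

for name, (layers, width) in OPT_SIZES.items():
    print(f"{name}: ~{approx_params(layers, width) / 1e9:.2f}B parameters")
```

Under these assumptions the estimates land close to the nominal model names, which suggests the layer and width columns are internally consistent.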
For weight initialization the team followed the settings in the Megatron-LM codebase, drawing weights from a zero-mean normal distribution with standard deviation 0.006 and scaling the standard deviation of output-layer weights by 1/sqrt(2L), where L is the number of layers. They used the AdamW optimizer with betas of (0.9, 0.95) and weight decay of 0.1, together with a linear learning-rate schedule that warms up to the peak rate and then decays to 10 percent of it over 300 billion tokens. [1]
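The learning-rate schedule described above can be sketched as a function of tokens seen. This is a minimal illustration, not the metaseq implementation: the function name `lr_at`, the linear shape of the decay, and the warmup length used in the example are assumptions, since the text here only fixes the 10 percent floor and the 300 billion token horizon.

```python
def lr_at(tokens_seen: float, peak_lr: float, warmup_tokens: float,
          decay_tokens: float = 300e9, floor_frac: float = 0.10) -> float:
    """Linear warmup from 0 to peak_lr, then linear decay to floor_frac * peak_lr."""
    if tokens_seen < warmup_tokens:
        # Warmup phase: scale linearly from 0 up to the peak rate.
        return peak_lr * tokens_seen / warmup_tokens
    # Decay phase: progress runs from 0 at the end of warmup to 1 at decay_tokens.
    progress = (tokens_seen - warmup_tokens) / (decay_tokens - warmup_tokens)
    progress = min(1.0, progress)
    return peak_lr * (1.0 - (1.0 - floor_frac) * progress)


# Example with the OPT-175B peak rate from the table; the 4e9-token warmup
# is a hypothetical value chosen only for illustration.
for tokens in (0, 2e9, 4e9, 150e9, 300e9):
    print(f"{tokens:>14,.0f} tokens -> lr = {lr_at(tokens, 1.2e-4, 4e9):.2e}")
```

Expressing the schedule in tokens rather than steps keeps it comparable across the family, since batch size varies from 0.5M to 4M tokens between models.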
OPT was trained on a corpus of roughly 180 billion tokens (about 800 GB of text) assembled mostly from previously used English-language datasets. The mixture concatenated data from three sources: the corpus used to train RoBERTa, a subset of the Pile, and a portion of the PushShift.io Reddit dataset. [1][3]
The RoBERTa component included BookCorpus and the CC-Stories subset of CommonCrawl, plus an updated CCNewsV2 collection of English news crawled through 28 September 2021. From the Pile, the team kept subsets such as Pile-CC, OpenWebText2, USPTO, Project Gutenberg, OpenSubtitles, Wikipedia, DM Mathematics, and HackerNews, while dropping other Pile subsets that increased the risk of training instabilities. The Reddit data was converted to a document format by extracting the longest comment chain in each thread, which discarded roughly two thirds of the corpus. The authors removed near-duplicate documents across datasets using MinHashLSH and noted that the Pile in particular was full of duplicate content. The corpus is predominantly English, with a small amount of non-English text present via CommonCrawl. [1][3]
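To make the deduplication step more concrete, here is a small pure-Python sketch of the MinHash idea: each document is reduced to a fixed-length signature, and the fraction of matching signature positions estimates the Jaccard similarity of the documents' shingle sets. It is an illustration under stated assumptions, not the pipeline the authors ran. The function names, the three-word shingles, the 128 hash functions, and the 0.95 threshold are all illustrative choices, and the locality-sensitive hashing (banding) step that makes the method scale to a full corpus is omitted.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set:
    """Split text into overlapping k-word shingles."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items: set, num_perm: int = 128) -> list:
    """For each salted hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of positions where two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def is_near_duplicate(doc_a: str, doc_b: str, threshold: float = 0.95) -> bool:
    """Flag a pair whose estimated similarity meets the (assumed) threshold."""
    sig_a = minhash_signature(shingles(doc_a))
    sig_b = minhash_signature(shingles(doc_b))
    return estimate_jaccard(sig_a, sig_b) >= threshold
```

Comparing every pair of signatures is quadratic in the number of documents, which is why production pipelines bucket signatures with LSH and compare only documents that share a bucket.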
OPT-175B was trained on 992 NVIDIA A100 80GB GPUs, using Fully Sharded Data Parallel combined with Megatron-LM tensor parallelism. The team reported reaching up to 147 TFLOP/s per GPU, which Meta described as roughly 17 percent higher utilization than figures NVIDIA had published for similar hardware. Adam optimizer state was kept in FP32 and sharded across hosts while the model weights stayed in FP16, with dynamic loss scaling used to avoid numerical underflows. The run took on the order of two months. [1][2]
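Dynamic loss scaling, mentioned above, can be illustrated with a toy scaler. The idea is to multiply the loss by a large factor so that small FP16 gradients do not underflow to zero, divide the gradients by the same factor before the optimizer step, and shrink the factor whenever a non-finite gradient appears. The class below is a hypothetical sketch operating on plain Python floats; the initial scale, growth interval, and growth and backoff factors are assumed defaults, not values reported for OPT.

```python
import math


class DynamicLossScaler:
    """Toy dynamic loss scaler: back off on overflow, grow after clean steps."""

    def __init__(self, init_scale: float = 2.0 ** 16, growth_factor: float = 2.0,
                 backoff_factor: float = 0.5, growth_interval: int = 2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._clean_steps = 0

    def step(self, scaled_grads):
        """Return unscaled gradients, or None if this step must be skipped."""
        if any(not math.isfinite(g) for g in scaled_grads):
            # Overflow: reduce the scale and skip the optimizer update.
            self.scale *= self.backoff_factor
            self._clean_steps = 0
            return None
        # Unscale using the factor that was applied to this step's loss.
        grads = [g / self.scale for g in scaled_grads]
        self._clean_steps += 1
        if self._clean_steps >= self.growth_interval:
            # Enough clean steps in a row: try a larger scale again.
            self.scale *= self.growth_factor
            self._clean_steps = 0
        return grads
```

In a real training loop the caller would multiply the loss by `scaler.scale` before the backward pass and apply the optimizer update only when `step` returns gradients.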
A notable part of the release was Meta's candor about how difficult the training was. The team published a logbook documenting day-to-day problems, and the paper describes frequent hardware failures that contributed to at least 35 manual restarts and the cycling of more than 100 hosts over the two-month run, plus an estimated 70 or more automatic restarts. The team also fought loss divergences, which they addressed by lowering the learning rate and restarting from earlier checkpoints, reducing gradient clipping from 1.0 to 0.3, briefly switching the optimizer, and moving to a newer version of Megatron. The metaseq codebase used for the run was, per the authors, the only known open-source implementation at the time capable of training a decoder-only transformer of at least 175B parameters without pipeline parallelism on NVIDIA GPUs. [1][4]
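One of the stabilization measures above, lowering the gradient clipping threshold from 1.0 to 0.3, is easy to show in isolation. The sketch below implements clipping by global norm on a flat list of gradient values; the function name is hypothetical and a real training loop would operate on tensors across all parameters, but the arithmetic is the same.

```python
import math


def clip_by_global_norm(grads, max_norm: float):
    """Rescale grads so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


grads = [3.0, 4.0]  # toy gradient with L2 norm 5.0
for threshold in (1.0, 0.3):
    clipped, norm_before = clip_by_global_norm(grads, threshold)
    norm_after = math.sqrt(sum(g * g for g in clipped))
    print(f"threshold={threshold}: norm {norm_before:.2f} -> {norm_after:.2f}")
```

A tighter threshold caps the size of any single update, which limits how far one bad batch can push the weights at the cost of slower progress when gradients are legitimately large.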
A central claim of the project was efficiency. The paper estimates the carbon emissions footprint of developing OPT-175B at about 75 tons of CO2 equivalent, compared with an estimated 500 tons for GPT-3 (citing Patterson et al.) and 380 tons for Gopher. That comparison (75 / 500 = 0.15, or roughly one seventh) is the basis for the abstract's statement that OPT-175B is "comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop." The authors were careful to note that these accounting methods are not standardized and not always reported, and that the 75-ton figure covers only the final training run. In a footnote they add that, once ablations, smaller baselines, and downtime are included, their own estimate of total cost is roughly twice as high, around 150 tons. [1]
The team evaluated OPT on the same prompts and experimental setup that GPT-3 used, across 16 standard NLP tasks including HellaSwag, PIQA, ARC, WinoGrande, and SuperGLUE. In zero-shot settings, OPT's average performance tracked the trend of GPT-3 reasonably well, roughly matching GPT-3 on 10 tasks while underperforming on a few such as ARC Challenge and MultiRC. In one-shot and few-shot settings, OPT lagged GPT-3 more noticeably, and per-task results varied widely. The authors could not reproduce GPT-3's published numbers using the OpenAI API within their own evaluation harness, which they took as evidence of differences in evaluation methodology. [1]
The paper is unusually frank about the model's weaknesses. In its limitations section, the authors note that OPT-175B does not handle declarative instructions or point-blank interrogatives well, tends to be repetitive and can get stuck in loops, and, like other models of its era, has a high propensity to generate toxic language and reinforce stereotypes even from innocuous prompts. The team chose not to apply mitigation techniques in this first release, since the goal was to replicate GPT-3, and concluded that the technology was still "premature for commercial deployment." [1]
Meta staged the release. The smaller models, from 125M up to 66B parameters, were made available for download alongside the metaseq training and inference code and the full logbook. Access to the full OPT-175B weights was granted on request to academic researchers, people affiliated with government, civil society, and academia, and industry research laboratories, under a non-commercial use license. This was a more open posture than GPT-3's API-only access, but it stopped short of a permissive open-source license, and the same OPT-175B license terms were later applied to derivative models. [1][2]
The release drew a mixed response. The transparency of the code and logbook was widely praised as a step toward reproducible research at scale. At the same time, commentators pointed out that "transparency and openness is not the equivalent of democratizing large language models": the hardware needed to run OPT-175B (on the order of hundreds of thousands of dollars in GPUs) and the technical staff required to operate it kept the model out of reach for most labs, and the non-commercial license barred deployment in products. [5]
In December 2022, Meta released OPT-IML (OPT Instruction Meta-Learning), instruction-tuned versions of OPT-30B and OPT-175B. The accompanying paper, "OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization" (Iyer et al.), introduced OPT-IML Bench, a benchmark of roughly 2000 NLP tasks drawn from eight existing collections. The authors used this benchmark to study instruction-tuning decisions such as task scaling and held-out task selection, then trained the OPT-IML models on the results. OPT-IML outperformed the base OPT models across evaluation suites including PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG, reflecting the broader 2022 shift toward instruction tuning seen in models like InstructGPT and FLAN. The OPT-IML models were released under the same OPT-175B license. [6][7]
OPT was an important early entry in the wave of openly available large language models in 2022, alongside GPT-NeoX-20B and BLOOM, and it remains widely used as a research baseline and in studies of model behavior, quantization, and efficient inference because its full weights and training details are available. Within Meta, the engineering work on metaseq and the lessons recorded in the OPT logbook fed directly into the company's later and more influential open-weight effort, the LLaMA family, released in early 2023. The models are distributed through the Hugging Face Hub and are supported in the Transformers library and in PyTorch. [1][2][3]
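For readers who want to try one of the smaller checkpoints, the following sketch loads a model through the Transformers library and generates a short greedy continuation. It assumes `transformers` and `torch` are installed, that the smallest checkpoint is published on the Hub under the name `facebook/opt-125m`, and that network access is available for the first download; the helper name `generate_text` is hypothetical.

```python
MODEL_ID = "facebook/opt-125m"  # assumed Hub name of the smallest OPT checkpoint


def generate_text(prompt: str, max_new_tokens: int = 20) -> str:
    """Greedy-decode a short continuation of `prompt` with a small OPT model."""
    # Imported lazily so defining this function does not require the libraries.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (downloads the weights on first use):
# print(generate_text("Large language models are"))
```

Because these are base models without instruction tuning, a prompt phrased as the start of a passage usually works better than a direct question, consistent with the limitations the authors describe.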