gpt-oss is a family of open-weight large language models released by OpenAI on August 5, 2025. The family consists of two variants: gpt-oss-120b, with 117 billion total parameters and 5.1 billion active per token, and gpt-oss-20b, with 21 billion total parameters and 3.6 billion active per token. Both models use a mixture of experts architecture and are distributed under the Apache 2.0 license, making them freely available for commercial and research use without royalties or output-level restrictions [1][2].
The release ended a six-year gap during which OpenAI kept the weights of every new model proprietary. The previous open-weight release was GPT-2, in February 2019, which had 1.5 billion parameters and was rolled out in stages because of "misuse" concerns that look quaint by 2025 standards. Between those two events the industry shifted around OpenAI: Meta's Llama made open weights normal, DeepSeek R1 showed in early 2025 that an open-weight reasoning model could match proprietary peers on hard benchmarks, and the AI Action Plan released by the Trump administration in July 2025 explicitly called for American companies to lead in open models [1][10][11].
The two variants were aimed at developers who could not, or did not want to, send data to OpenAI's API: gpt-oss-120b runs on a single 80 GB datacenter GPU; gpt-oss-20b runs on a 16 GB consumer GPU. Both ship in MXFP4 quantization format from day one, both support a configurable reasoning effort setting, and both share a 128k context window. They are text-only and English-centric, with weaker multilingual coverage than Qwen 3 and weaker world knowledge than the proprietary o-series. Sam Altman framed the release as a return to OpenAI's roots; in practice the move responded to a pile of competitive and political pressure that had built up over 2024 and 2025. Reception was warm but qualified: the models are good at math and coding, weaker on writing and general knowledge, and were quickly characterized as "math models" by reviewers who tried them on broader tasks [9][11].
OpenAI was founded in December 2015 with an explicit charter committing the lab to publishing research "in a way that benefits humanity as a whole." In practice that meant releasing model weights alongside papers. The original GPT (2018) was published with weights. GPT-2 (February 2019) was held back briefly out of concern over potential misuse, then released in stages over the year, with full 1.5B weights public in November 2019 [12].
After that, OpenAI's strategy changed. GPT-3 (2020) was released as a paper only, accessible through a closed API. The shift was justified at the time on safety grounds: a more capable model could produce more harmful content, and an API allowed inference-time filtering. Critics argued the safety framing was inseparable from commercial interest, since closed weights are required to monetize API access, and OpenAI's pivot to a capped-profit subsidiary in 2019 made API revenue strategically important. GPT-3.5 (2022), GPT-4 (March 2023), GPT-4o (May 2024), OpenAI o1 (September 2024), and OpenAI o3 (December 2024) were all closed. The gap stretched longer than most observers expected. Within OpenAI, the open-weight question came up periodically; several researchers had argued internally for releasing smaller models, partly to maintain research access. Those arguments did not prevail until 2025 [12].
When GPT-2 was released, open-weight language models were a hobbyist subgenre. By 2025, the picture had inverted.
Meta's Llama (February 2023) demonstrated that a 7B-parameter open model could spawn an entire derivative ecosystem. Llama 2 (July 2023), Llama 3 (April 2024), and Llama 4 (April 2025) continued the trajectory. Mistral released Mixtral 8x7B in 2023 under Apache 2.0. Alibaba's Qwen series, continuing through Qwen 3 in April 2025, became one of the most-downloaded model families on Hugging Face [11][13].
DeepSeek changed the conversation in early 2025. DeepSeek R1 (January 20, 2025) was a 671B-parameter MoE reasoning model scoring close to OpenAI's o1 on math and coding benchmarks while being released under MIT terms. The release triggered a brief stock market panic, with NVIDIA shares dropping nearly 17% in a single day on January 27, 2025, as traders absorbed the implication that capable reasoning models could be trained for what DeepSeek claimed was around six million dollars. Whether that figure accurately reflected the full training cost or only the final run is debated, but the perception that the open-weight tier had caught up had real consequences for OpenAI's positioning [11][13].
The technical case for closing weights also weakened. By 2025, enough capable open-weight models existed that the marginal harm from releasing one more was hard to quantify. If a determined misuser could already get a 70B model from Llama or a 671B MoE from DeepSeek, the safety argument for keeping a smaller OpenAI model proprietary was difficult to make in good faith.
The political environment shifted as well. In January 2025, the second Trump administration revoked the Biden executive order on AI and signaled a deregulatory, pro-industry approach. On July 23, 2025, the White House Office of Science and Technology Policy published America's AI Action Plan, a document titled Winning the Race that explicitly called for American companies to lead in open-weight models. The plan named "powerful and reliable open-weight foundation models" as a national priority and tied that priority to competition with China [10].
The gpt-oss release came thirteen days later. OpenAI's announcement post and press materials did not mention the AI Action Plan by name, but the timing was widely noted. Sam Altman had spent early 2025 publicly hedging on open weights, telling employees in February that the company was "on the wrong side of history" by not publishing weights, then walking the comment back. The administration's policy shift gave OpenAI cover to do something the company had been internally debating anyway. By August, Mistral AI was still releasing open-weight models, Anthropic was keeping weights closed, Google had Gemma alongside its closed Gemini line, and Meta's Llama 4 rollout in April had been widely considered a setback. gpt-oss arrived into a market hungry for a credible American open-weight reasoning model [10][11].
gpt-oss was announced on August 5, 2025 through a post at openai.com/index/introducing-gpt-oss/, a model card published as arXiv preprint 2508.10925, and a system safety paper from Eric Wallace and colleagues at OpenAI. Weights for both variants were made available simultaneously on Hugging Face at openai/gpt-oss-120b and openai/gpt-oss-20b. The repository at github.com/openai/gpt-oss was published the same day with reference inference code, fine-tuning examples, and a developer cookbook [1][2][3][4][15].
Day-one inference partners included Ollama, vLLM, LM Studio, Together AI, Fireworks AI, Groq, Cerebras Systems, Microsoft Azure, and Amazon Web Services. Hardware partners included NVIDIA (with Blackwell optimizations), AMD (Instinct MI300X), and Cerebras (CS-3 wafer-scale systems). The breadth was itself a signal: every major inference platform had been brought in, suggesting OpenAI had been working on partner enablement for some time before the announcement [1][5].
Sam Altman marked the launch on Twitter with a thread emphasizing accessibility: gpt-oss-20b would be runnable on a phone, gpt-oss-120b on a single GPU, and both would be Apache 2.0 with no usage restrictions. The phone claim was technically defensible (gpt-oss-20b in MXFP4 fits in the memory of high-end Android devices), although in practice few users immediately deployed it that way [11].
The announcement post described the models as designed for "strong real-world performance at low cost" and emphasized suitability for agentic tasks, tool use, and deployment in latency- or privacy-sensitive environments. Documentation was substantial: a 49-page model card, a separate safety paper documenting the adversarial fine-tuning procedure, a developer cookbook, and optimized inference kernels. Notably absent: training data documentation, intermediate checkpoints, and full post-training description. OpenAI characterized this as "open-weight" rather than "open source," consistent with the Open Source Initiative's 2024 AI definition, which requires data and methodology disclosure for the "open source" label [2][7][11][18].
The contrast between the gpt-oss launch and the GPT-2 launch six years earlier is illustrative. GPT-2 was released in stages over nine months because OpenAI worried about misuse from a 1.5-billion-parameter model. gpt-oss-120b is roughly 78 times larger and was released all at once.
| Aspect | GPT-2 (Feb 2019) | gpt-oss (Aug 2025) |
|---|---|---|
| Largest variant | 1.5B parameters (dense) | 117B parameters (MoE, 5.1B active) |
| Architecture | Dense transformer | MoE transformer with reasoning |
| Release schedule | Staged over 9 months | All variants on day one |
| License | MIT | Apache 2.0 |
| Reasoning capability | None | Configurable (low/medium/high) |
| Context window | 1,024 tokens | 131,072 tokens |
| Quantization | None | Native MXFP4 |
| Inference partners at launch | Hugging Face | NVIDIA, AMD, Cerebras, Groq, Together, Fireworks, Ollama, vLLM, Azure, AWS |
| Stated rationale | Research access, public good | Competitive parity, AI Action Plan, developer demand |
| Controversy | Allegedly too dangerous to release | Too late; not open enough |
The inversion is striking. In 2019 the worry was that an open-weight model would be misused. In 2025 the worry was that an open-weight release would be too capability-restrained, too closed, or too late.
Both gpt-oss variants are autoregressive transformer models using a token-choice mixture of experts design with SwiGLU activations. In a dense transformer, every parameter participates in processing every token; in an MoE model, parameters are divided into discrete expert networks, and only a small subset is activated per token, lowering compute cost while preserving capacity [2].
gpt-oss-120b contains 128 experts and activates 4 per token (5.1B active out of 117B total). gpt-oss-20b contains 32 experts and activates 4 per token (3.6B active out of 21B total). Routing is learned during training via a top-k expert selection mechanism, with auxiliary losses encouraging balanced expert utilization [2].
This architecture is similar in principle to DeepSeek V3 and Mixtral. The token-choice design contrasts with expert-choice MoE; OpenAI chose token-choice to maintain compatibility with standard inference frameworks.
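The routing mechanics can be made concrete with a short sketch. The following is a minimal token-choice top-4 router with a simplified load-balancing auxiliary loss; the dimensions, expert MLPs, and loss formulation are illustrative assumptions rather than the actual gpt-oss training code:

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router_w, experts, k=4):
    """Token-choice MoE layer: each token selects its own top-k experts.

    x        : (tokens, d_model) activations
    router_w : (d_model, n_experts) learned router weights
    experts  : list of small expert networks
    """
    logits = x @ router_w                        # (tokens, n_experts)
    topk_logits, topk_idx = torch.topk(logits, k, dim=-1)
    weights = F.softmax(topk_logits, dim=-1)     # normalize over the k chosen experts

    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = topk_idx[:, slot] == e        # tokens that picked expert e in this slot
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])

    # Simplified load-balancing auxiliary loss: penalizes uneven average routing.
    probs = F.softmax(logits, dim=-1)
    load = probs.mean(dim=0)                     # mean routing probability per expert
    aux_loss = len(experts) * (load * load).sum()
    return out, aux_loss

# Toy usage with made-up sizes (not the real gpt-oss dimensions).
d_model, n_experts, n_tokens = 64, 8, 16
experts = [torch.nn.Sequential(torch.nn.Linear(d_model, 4 * d_model),
                               torch.nn.SiLU(),
                               torch.nn.Linear(4 * d_model, d_model))
           for _ in range(n_experts)]
y, aux = moe_forward(torch.randn(n_tokens, d_model), torch.randn(d_model, n_experts), experts)
```

Only the selected experts run for a given token, which is why per-token compute tracks the active parameter count (5.1B or 3.6B) rather than the total.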
The models use an attention mechanism that alternates between full-context (dense) attention layers and locally banded sparse attention layers with a sliding window of 128 tokens. Dense layers let the model attend to any position in its context; sparse layers reduce memory and compute requirements for long sequences. This alternating pattern is described in the model card as similar to GPT-3 [2].
For positional encoding, the models use Rotary Positional Embedding (RoPE) with YaRN interpolation to extend the context window to 131,072 tokens. The tokenizer is compatible with GPT-4o and includes additional tokens for the Responses API format. The models use grouped multi-query attention with a group size of 8 (gpt-oss-120b has 64 query heads and 8 key-value heads) and incorporate learned attention sinks, a per-head learned scalar added to the softmax denominator to improve numerical stability on long contexts [2].
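A minimal sketch of one sparse layer's attention, with a causal 128-token band and a learned sink folded into the softmax, illustrates both mechanisms. The single-head shapes and the trick of appending the sink as an extra softmax column (mathematically equivalent to adding it to the denominator) are simplifications for illustration, not the production kernel:

```python
import torch

def banded_attention_with_sink(q, k, v, sink_logit, window=128):
    """Single-head attention over a causal sliding window with a learned sink.

    q, k, v    : (seq, head_dim) tensors for one head
    sink_logit : learned scalar; its softmax column absorbs probability mass
    window     : how far back each query may attend (128 in gpt-oss sparse layers)
    """
    seq, d = q.shape
    scores = (q @ k.T) / d ** 0.5                          # (seq, seq)

    # Causal sliding-window mask: query i attends to keys in [i - window + 1, i].
    pos = torch.arange(seq)
    dist = pos[:, None] - pos[None, :]
    scores = scores.masked_fill(~((dist >= 0) & (dist < window)), float("-inf"))

    # Append the sink as an extra column before the softmax, then drop it:
    # each row's attention weights sum to less than one, stabilizing long contexts.
    sink_col = sink_logit.reshape(1, 1).expand(seq, 1)
    attn = torch.softmax(torch.cat([scores, sink_col], dim=-1), dim=-1)[:, :seq]
    return attn @ v

# Toy usage with made-up sizes.
out = banded_attention_with_sink(torch.randn(256, 64), torch.randn(256, 64),
                                 torch.randn(256, 64), sink_logit=torch.tensor(0.0))
```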
gpt-oss-120b has 36 transformer layers with a residual stream dimension of 2880. gpt-oss-20b has 24 layers. Both models use top-4 expert routing. Total checkpoint sizes are 60.8 GiB for the 120b model and 12.8 GiB for the 20b model in their default MXFP4 distribution format [2].
The weights are distributed in MXFP4 (microscaling 4-bit floating point) format applied to the MoE expert weights. MXFP4 is a block-scaled quantization format developed by NVIDIA, AMD, Intel, and Microsoft through the Open Compute Project's MX specification. Unlike per-tensor or per-channel quantization, MXFP4 applies a shared exponent to small blocks of values, preserving more dynamic range than naive INT4 quantization while achieving similar memory savings [2][5].
Native MXFP4 distribution means the models do not require a separate quantization step before deployment. Non-expert layers are stored in BF16 or U8; the MoE weights themselves are stored in MXFP4. This design allows gpt-oss-120b to run on a single 80 GB GPU (NVIDIA H100, H200, AMD MI300X) and gpt-oss-20b to run on a 16 GB GPU including consumer cards like the RTX 3090, 4090, or 5080. Running gpt-oss-120b in BF16 without quantization would require multiple GPUs, so the MXFP4 checkpoint materially changes the hardware tier required for deployment.
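The block-scaling idea behind MXFP4 can be sketched in a few lines. The toy quantizer below shares one power-of-two scale across each 32-value block and rounds elements to a small 4-bit grid; the grid approximates the E2M1 element values, but the real specification and hardware kernels differ, so this is purely an illustration of why block scaling preserves dynamic range:

```python
import numpy as np

# Toy 4-bit grid standing in for the MXFP4 element format: +/-{0, .5, 1, 1.5, 2, 3, 4, 6}.
POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([-POS[::-1], POS])

def quantize_block(block):
    """Quantize one 32-value block with a single shared power-of-two scale."""
    max_abs = np.abs(block).max()
    if max_abs == 0:
        return block.copy()
    # Smallest power-of-two scale that brings the whole block inside the grid.
    scale = 2.0 ** np.ceil(np.log2(max_abs / POS.max()))
    idx = np.abs((block / scale)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return FP4_GRID[idx] * scale                  # dequantized values

weights = np.random.randn(1024).astype(np.float32)
deq = np.concatenate([quantize_block(b) for b in weights.reshape(-1, 32)])
print("mean abs error:", np.abs(weights - deq).mean())
```

Because an outlier only inflates the scale of its own 32-value block, the rest of the tensor keeps finer resolution than a single per-tensor 4-bit scale would allow.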
Both models are trained with the "harmony" chat format, OpenAI's multi-channel response structure. Model outputs are divided into an analysis channel and a final channel. The analysis channel carries chain-of-thought reasoning, labeled with <|channel|>analysis<|message|> in the tokenizer. The final channel contains the user-facing response. Developers who surface the analysis channel to end users are advised to filter it before display, since OpenAI deliberately left the chain-of-thought less strictly aligned than the final output to preserve reasoning quality [2][7].
The harmony format is distinct from the chat templates used by Llama or Qwen 3, and it required updates to inference frameworks at launch. Hugging Face Transformers, vLLM, and Ollama all received patches before August 5 to support the format natively.
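A serving layer typically separates the two channels before anything reaches the user. The sketch below assumes completions are delimited by the channel and message markers mentioned above; the exact token sequence is defined by OpenAI's harmony specification and its reference tooling, so the regular expression here is a simplification for illustration:

```python
import re

CHANNEL_RE = re.compile(
    r"<\|channel\|>(?P<channel>\w+)<\|message\|>(?P<body>.*?)(?=<\|channel\|>|<\|end\|>|$)",
    re.DOTALL,
)

def split_channels(raw: str) -> dict:
    """Split a raw harmony-style completion into per-channel text (simplified)."""
    channels: dict = {}
    for m in CHANNEL_RE.finditer(raw):
        channels[m.group("channel")] = channels.get(m.group("channel"), "") + m.group("body")
    return channels

raw = ("<|channel|>analysis<|message|>User asks for X; check the edge case first.<|end|>"
       "<|channel|>final<|message|>Here is the answer to X.<|end|>")
parts = split_channels(raw)
print(parts["final"])   # user-facing answer
# parts["analysis"] holds the chain of thought; filter or drop it before display.
```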
Both models support configurable reasoning effort via a parameter that can be set to low, medium, or high. Higher reasoning effort increases latency but improves performance on complex tasks. The models were trained with reinforcement learning on chain-of-thought reasoning in a manner informed by techniques from OpenAI o3, which applies extended thinking to reasoning-heavy tasks. The same parameter governs both how long the model thinks and which strategies it considers; high effort is most useful on competition mathematics, complex coding tasks, and agentic scenarios [2].
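In practice the effort level is set per request. The sketch below assumes an OpenAI-compatible server (such as vLLM or Ollama) running locally and uses a "Reasoning: high" line in the system prompt, which is how the harmony format conventionally carries the setting; the URL, model name string, and the exact way any given host exposes the knob are assumptions for illustration:

```python
from openai import OpenAI

# Point the client at a local OpenAI-compatible endpoint (URL is an assumption).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        # Harmony-style effort hint; some hosts expose a dedicated request field instead.
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "Prove that the sum of two odd integers is even."},
    ],
)
print(response.choices[0].message.content)
```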
gpt-oss-120b is the larger of the two models. With 117 billion total parameters and 5.1 billion active per token, it is positioned for production deployments where accuracy and reasoning capability are priorities. OpenAI benchmarked it as matching or exceeding o4-mini on most reasoning tasks, including competition mathematics, health-related queries, and agentic tool-calling evaluations [1][2].
The model fits on a single 80 GB GPU in MXFP4 format. Fine-tuning is supported on a single H100 node. It achieves a Codeforces Elo of 2622 in high reasoning mode, placing it within the range of competitive programmers. OpenAI trained gpt-oss-120b using approximately 2.1 million H100 GPU-hours for pretraining, according to the model card. The knowledge cutoff is June 2024 [2].
gpt-oss-20b is the smaller variant, with 21 billion total parameters and 3.6 billion active per token. It is aimed at latency-sensitive deployments, on-device inference, and use cases where GPU memory is constrained. It runs on a 16 GB consumer GPU and achieves roughly 45 to 50 tokens per second on an Apple M4 MacBook, according to reports from early users [11].
Fine-tuning is supported on consumer hardware, making gpt-oss-20b accessible for research groups and individual developers who want to adapt the model to specific domains without access to data center hardware. gpt-oss-20b achieves a Codeforces Elo of 2516, an MMLU score of 85.3%, and GPQA Diamond of 71.5% (with tools, high reasoning effort). Its SWE-bench Verified score at high reasoning is 60.7%, close to the 120b model's 62.4% [2].
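As a rough illustration of what local deployment looks like, the checkpoint can be pulled directly from the Hugging Face repository named earlier. Whether the MXFP4 expert weights stay quantized in memory depends on the installed kernel support and Transformers version, so the snippet below is a sketch rather than a guaranteed memory recipe:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" lets the library place layers on whatever GPU memory is available.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Summarize the Apache 2.0 patent grant in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```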
| Property | gpt-oss-120b | gpt-oss-20b |
|---|---|---|
| Total parameters | 117 billion | 21 billion |
| Active parameters per token | 5.1 billion | 3.6 billion |
| Number of experts | 128 | 32 |
| Experts activated per token | 4 (top-4) | 4 (top-4) |
| Transformer layers | 36 | 24 |
| Residual dimension | 2880 | 2880 |
| Context window | 131,072 tokens | 131,072 tokens |
| Default precision | MXFP4 (experts), BF16 (other) | MXFP4 (experts), BF16 (other) |
| Checkpoint size | 60.8 GiB | 12.8 GiB |
| Minimum GPU for inference | 80 GB (H100, H200, MI300X) | 16 GB (RTX 3090, 4090, 5080) |
| Pretraining compute | ~2.1M H100-hours | not disclosed |
| Codeforces Elo (high) | 2622 | 2516 |
| MMLU | 90.0% | 85.3% |
| Knowledge cutoff | June 2024 | June 2024 |
The model card for gpt-oss is unusually thin on training methodology. OpenAI confirmed several high-level facts and was deliberately vague about others. Confirmed: training used a primarily English text corpus with strong STEM, code, and general-knowledge representation; pretraining took approximately 2.1 million H100 GPU-hours for the 120b variant; post-training included reinforcement learning on chain-of-thought reasoning and instruction hierarchy training; pretraining data was filtered to remove content on CBRN topics that could provide meaningful uplift to actors seeking mass casualties [2].
Not confirmed: the exact composition of the training corpus, the size of the synthetic data fraction, the specific RL algorithms used, or whether post-training distilled from a specific proprietary OpenAI model. The model card phrasing is that gpt-oss was "informed by techniques from o3," which is consistent with several interpretations: distillation from o3 outputs, training on o3-generated reasoning traces, or merely using o3-style RL formulations without copying outputs. OpenAI did not clarify which interpretation is correct.
Multiple third-party analyses, including those by Artificial Analysis and arXiv 2508.12461, argue that gpt-oss appears heavily optimized for verifiable tasks. The model performs unusually well on benchmarks where correctness can be checked automatically (competition mathematics, coding with unit tests, formal logic) and weakly on benchmarks requiring broad knowledge or open-ended judgment. This pattern is consistent with heavy use of RL on synthetic data with verifiable rewards. The phenomenon researchers have called "benchmark-only intelligence" is visible in particularly stark form: on AIME 2025 with tools, gpt-oss-20b scores 98.7%, higher than gpt-oss-120b's 97.9%. The smaller model outperforming the larger model on a competition mathematics benchmark is most easily explained by post-training optimization choices that favored the smaller variant on that specific evaluation [8][11].
Whether gpt-oss was distilled from openai o3 or a related proprietary model is one of the more interesting open questions about the release. OpenAI did not say explicitly. The evidence is circumstantial: gpt-oss inherits the same chain-of-thought reasoning style o3 introduced; the reasoning effort levels mirror o3's reasoning budget design; gpt-oss often produces outputs that pattern-match against o3 outputs on identical prompts. None of that is dispositive. OpenAI could have reproduced these design choices without distilling, or could have used a smaller in-house model as the teacher.
The arXiv 2508.10925 model card discusses post-training in general terms and credits techniques from "deliberative alignment" research, which is the public name for OpenAI's approach to teaching models to reason about guidelines before generating sensitive outputs. The same paper notes that some pretraining data filtering was inherited from work originally done for OpenAI o4-mini, which suggests pipeline overlap with the proprietary o-series even if the gpt-oss models are not literal distillations [2].
The following table shows benchmark results for gpt-oss-120b and gpt-oss-20b at high reasoning effort, alongside selected comparison models. All gpt-oss scores use tool access where available. Asterisked values are taken from comparative tables in the gpt-oss model card; other values are from primary sources for those models.
| Benchmark | gpt-oss-120b | gpt-oss-20b | o4-mini | o3 |
|---|---|---|---|---|
| MMLU | 90.0% | 85.3% | 85.2%* | 88.8%* |
| GPQA Diamond (with tools) | 80.9% | 74.2% | -- | -- |
| AIME 2024 (with tools) | 96.6% | 96.0% | -- | -- |
| AIME 2025 (with tools) | 97.9% | 98.7% | -- | -- |
| SWE-bench Verified (high) | 62.4% | 60.7% | -- | -- |
| Codeforces Elo | 2622 | 2516 | -- | -- |
| HealthBench | 57.6% | 42.5% | -- | 58.7%* |
| MMMLU Average | 81.3% | 75.7% | 85.2%* | 88.8%* |
| Humanity's Last Exam | 14.9% | 10.9% | -- | -- |
| Tau-bench Retail | 67.8% | 54.8% | -- | -- |
Additional SWE-bench Verified scores across reasoning levels:
| Reasoning level | gpt-oss-120b | gpt-oss-20b |
|---|---|---|
| Low | 47.9% | 37.4% |
| Medium | 52.6% | 53.2% |
| High | 62.4% | 60.7% |
GPQA Diamond scores by reasoning level:
| Setting | gpt-oss-120b | gpt-oss-20b |
|---|---|---|
| Low | -- | 56.8% |
| Low + tools | -- | 58.0% |
| Medium | -- | 66.0% |
| Medium + tools | -- | 67.1% |
| High + tools | 80.9% | 74.2% |
Key takeaways from the benchmark table: the 120b variant approximately matches o4-mini on the cross-section of benchmarks in the model card, stronger on competition math and weaker on health-related queries. Against o3, gpt-oss-120b is meaningfully behind, but the gap is smaller than the parameter ratio would suggest.
gpt-oss-20b's results are striking. A 3.6-billion active-parameter model scoring 98.7% on AIME 2025 with tools is unusual by any standard. Independent reproduction by Artificial Analysis confirmed the AIME numbers within margin of error, although on broader benchmarks the 20b trails the 120b by larger margins than the headline figures suggest. The "with tools" qualifier does substantial work: with tool access, gpt-oss can use a Python execution environment to verify intermediate steps, compressing the gap with proprietary peers [8][11].
Both gpt-oss variants are released under the Apache 2.0 license, with no commercial restrictions and no usage policy governing generated outputs. The license includes an explicit patent grant, something MIT, BSD, and similar permissive licenses lack. Developers can use, modify, distribute, and commercialize the models without paying royalties or returning derivatives [1][2].
Apache 2.0 is genuinely permissive. It is not OpenRAIL (which adds usage restrictions on output content), not the Llama Community License (which adds a 700-million-MAU cap), and not the Falcon-180B TII license (which historically required commercial users to register). Apache 2.0 imposes only standard requirements: include a copy of the license, retain copyright notices, and state changes to original files. There are no acceptable use restrictions, no monthly active user caps, no additional patent retaliation clauses beyond the standard one, and no requirement to share fine-tuned weights [11][14].
The choice over a more restrictive open-weight license was deliberate. OpenAI's announcement post and Sam Altman's launch tweet both highlighted the lack of restrictions. Dustin Carr, CTO of Darkviolet.ai, described it as a "maximally permissive license" at the time of release. The contrast with Meta's Llama license was emphasized: Llama 3 and 4 require companies with more than 700 million monthly active users to obtain a separate license, which excludes the largest cloud providers from using Llama freely. gpt-oss imposes no such cap [11].
The limit of Apache 2.0 is that it covers the weights, not the training process. Critics noted that gpt-oss is "open-weight" rather than "open source" under the Open Source Initiative's 2024 AI definition, because OpenAI did not release training data, intermediate checkpoints, or full methodology documentation sufficient for third-party reproduction [11].
The day-one distribution surface was unusually broad for an OpenAI release:
| Platform | Model availability | Pricing | Notes |
|---|---|---|---|
| Hugging Face | Both variants | Free download | openai/gpt-oss-120b and openai/gpt-oss-20b repos |
| Ollama | Both variants | Free local | One-line install, automatic quantization handling |
| vLLM | Both variants | Free self-host | Day-one MXFP4 support |
| LM Studio | Both variants | Free desktop | GUI for local inference |
| Together AI | Both variants | Per-token API | Hosted inference at competitive pricing |
| Fireworks AI | Both variants | Per-token API | Optimized inference, fine-tuning service |
| Groq | Both variants | Per-token API | Highest-throughput hosted option at launch |
| Cerebras Inference | gpt-oss-120b | Per-token API | Wafer-scale inference, low latency |
| Microsoft Azure AI | Both variants | Azure pricing | Foundry catalog deployment |
| Amazon Bedrock | Both variants | Bedrock pricing | First open-weight model from OpenAI on Bedrock |
| OpenAI Platform | Both variants | Per-token API | Same API, opt-in for hosted gpt-oss |
| GitHub | Reference code | Free | github.com/openai/gpt-oss |
The simultaneous availability across competing inference platforms was the result of pre-launch enablement work. NVIDIA, AMD, Cerebras, Groq, Together, and Fireworks all received early access to weights and worked on optimizations before the public release. The result was that on August 5, 2025, developers could pick from at least eight hosted API providers and three local inference options on day one [1][5].
The distinction between Apache 2.0 and other licenses sometimes called "open" is worth being precise about:
| License | Commercial use | Output restrictions | User base cap | Patent grant | Reproducible |
|---|---|---|---|---|---|
| Apache 2.0 (gpt-oss) | Yes | None | None | Explicit | No (weights only) |
| MIT (DeepSeek R1) | Yes | None | None | Implicit only | No (weights only) |
| Llama Community License | Yes | Acceptable use policy | 700M MAU | None | No |
| OpenRAIL (BLOOM, original Stable Diffusion) | Conditional | Usage restrictions enumerated | None | Implicit | No |
| Gemma Terms (Google) | Yes | Prohibited use policy | None | Limited | No |
| Open Source AI Definition (OSI 2024) | Yes | None | None | Required | Yes (data + weights + code) |
Under this taxonomy, gpt-oss and DeepSeek R1 sit at the most permissive end among large model releases, with Apache 2.0 offering a stronger patent grant than MIT. Neither qualifies as "open source" under the OSI definition because neither releases training data. The labels in casual press coverage often blur these categories; the actual license matters for production deployment.
NVIDIA was the most heavily involved hardware partner. Day-one optimizations included MXFP4 native support on Hopper (H100, H200) and Blackwell (B100, B200, GB200) GPUs, Flash Attention 3 integration, and TensorRT-LLM deployment paths. NVIDIA published guides showing roughly 1.5x throughput improvements on Blackwell compared to Hopper for the 120b variant, and emphasized that gpt-oss-120b's single-GPU footprint was one of the strongest demonstrations to date of MXFP4 efficiency on production hardware. CEO Jensen Huang issued a statement supporting the release. NVIDIA also collaborated on the MXFP4 specification through the Open Compute Project, so the format choice aligned with hardware roadmaps the company had been developing for several years [5].
AMD provided launch-day support on the Instinct MI300X and MI325X GPUs, which have 192 GB and 256 GB of HBM3e memory respectively. The 120b variant fits comfortably on a single MI300X, and AMD's ROCm software stack received updates to support MXFP4 inference. AMD published benchmarks showing competitive throughput against Hopper, although Blackwell's MXFP4-native design held a per-chip advantage on the 120b model [5].
Cerebras deployed gpt-oss-120b on its CS-3 wafer-scale system at launch, offering hosted inference at what the company described as the highest single-system throughput in the industry. The wafer-scale architecture, which fits the entire model on a single chip, allows for very low latency. Cerebras claimed inference speeds in the range of multiple thousands of tokens per second, exceeding anything achievable on current GPU systems for a model of comparable capability. Groq deployed gpt-oss-120b on its Language Processing Units (LPUs), continuing the company's pattern of being among the fastest hosts for new open-weight models, with low first-token latency and high steady-state throughput [11].
| Hardware | gpt-oss-120b | gpt-oss-20b | MXFP4 native | Flash Attention 3 |
|---|---|---|---|---|
| NVIDIA Hopper (H100/H200) | Yes | Yes | Yes | Yes |
| NVIDIA Blackwell (B100/B200/GB200) | Yes | Yes | Yes | Yes |
| NVIDIA Ada/Ampere (consumer) | Limited | Yes | Partial | No |
| AMD MI300X/MI325X | Yes | Yes | Software emulation | No |
| Cerebras CS-3 | Yes | No (not deployed) | Custom | N/A |
| Groq LPU | Yes | Yes | Custom | N/A |
| Apple Silicon (M3/M4) | No | Yes | Software emulation | No |
For on-device inference, gpt-oss-20b on Apple Silicon (M3 and M4 generations) became one of the most popular configurations in the days following launch. Apple's Metal Performance Shaders backend was updated to handle the harmony chat format, and the model fit comfortably within the unified memory of a 32 GB or 64 GB MacBook Pro.
During pretraining, OpenAI filtered training data to remove content related to Chemical, Biological, Radiological, and Nuclear (CBRN) topics that could provide meaningful uplift to actors seeking to cause mass casualties. The model card notes that the primary concern is not that training data itself is dangerous, but that exposure to certain technical content could improve the model's ability to synthesize actionable instructions in those domains [2][7].
During post-training, OpenAI applied instruction hierarchy training to teach the model to follow safety-relevant directives from system prompts and resist prompt injection attacks. The models were also trained with deliberative alignment, a technique in which the model is taught to reason about its guidelines before generating potentially sensitive outputs [2][7].
OpenAI evaluated gpt-oss-120b under its Preparedness Framework, which classifies models into capability tiers for biological/chemical, cybersecurity, and other risk categories. A model that reaches the "High" capability level for biological/chemical risk under the framework would not be released as open weights, because the combination of high capability and unrestricted weight access creates a risk profile that cannot be managed through inference-time filtering [7].
To test whether the model could be pushed to dangerous capability levels through fine-tuning, OpenAI ran adversarial fine-tuning experiments. gpt-oss-120b was fine-tuned on specialized biology and cybersecurity data using OpenAI's internal training infrastructure, then evaluated by internal and external evaluators. The conclusion was that the adversarially fine-tuned model did not reach the "High" capability threshold for biological/chemical risk or cybersecurity risk under the Preparedness Framework. On Collegiate CTF cybersecurity benchmarks, gpt-oss-120b ranked second behind o3 among OpenAI models, but stayed below the threshold for restricted release [7].
The gpt-oss safety methodology shares structural features with Anthropic's Responsible Scaling Policy, which also defines capability thresholds tied to deployment decisions. Both frameworks operationalize the question of when a model is too capable to release in a given way. The gpt-oss approach extends this to the open-weight case specifically, where adversarial fine-tuning becomes the relevant capability elicitation method because users can modify the model freely. The two frameworks differ in execution: the Preparedness Framework uses what the model card describes as a "holistic process," while the RSP attempts more explicit numerical specification, although in practice both frameworks involve substantial judgment [7].
The safety methodology was reviewed by METR (Model Evaluation and Threat Research), an external AI safety organization. METR submitted 17 recommendations, of which 6 were classified as high-urgency. OpenAI implemented 9 of the 17 recommendations, and METR confirmed that all 6 high-urgency items were at least partially addressed [6].
METR's review focused on the adversarial fine-tuning evaluation procedure rather than the model's default safety behavior. Key concerns included the criteria used to define capability elicitation, the choice of benchmarks, and the operationalization of the "High" capability threshold. METR noted that OpenAI described its threshold determination as a "holistic process" rather than specifying explicit numerical cutoffs, which limits reproducibility. METR's public summary, published October 23, 2025, concluded that the methodology represented a step forward relative to prior open-weight releases, which typically included no systematic adversarial capability evaluation before publication [6].
The model card identifies several safety limitations that developers should account for in production deployments.
The chain-of-thought analysis channel is intentionally less constrained than the final response channel. If surfaced to end users without filtering, reasoning traces may contain content that would be blocked in a final response. OpenAI's guidance is that developers should not expose the analysis channel directly to users without review [2][7].
Instruction hierarchy compliance is weaker than in OpenAI's proprietary models. On system-prompt extraction tasks, gpt-oss-120b scores 83.2%, compared to 99.3% for o4-mini. This means the model is more susceptible to prompt injection attacks designed to extract or override system prompts [2].
Hallucination rates are higher than in proprietary frontier models on most fact-heavy benchmarks. On SimpleQA, gpt-oss-120b shows a hallucination rate of 78.2%, roughly comparable to o4-mini's 75% on the same benchmark; the model fares somewhat better on focused factual QA than on broad world knowledge, where it lags substantially.
Reaction to gpt-oss at release was broadly positive within the developer community, with significant caveats. The reaction roughly split into three camps: enthusiastic adopters, skeptical reviewers, and critics of the openness claims.
Hugging Face published a blog post titled "Welcome GPT OSS, the new open-source model family from OpenAI!" emphasizing private and on-device deployment use cases. The post received more than 500 upvotes and was one of the most-shared on the platform that month [5].
Developer Simon Willison called the release "really impressive" for matching OpenAI's smaller proprietary models while running on local hardware, and published demonstrations of gpt-oss-20b on a MacBook within hours of release. IEEE Spectrum described it as potentially the most significant competitive threat to Meta's previous dominance in open-weight models, particularly after Llama 4's troubled rollout [17].
The ecosystem moved fast. Within 48 hours, dozens of fine-tunes appeared on Hugging Face, including domain-specific variants for medicine, law, and code. Within a week, independent inference benchmarks confirmed the official numbers. Within a month, gpt-oss-20b was one of the top ten most-downloaded models on Hugging Face.
Not everyone was convinced. Pseudonymous AI researcher Teknium dismissed the release as "a legitimate nothing burger" and predicted it would be quickly surpassed. The argument was that the 120b variant did not advance the open-weight frontier in any direction other than being from OpenAI; comparable performance was already available from DeepSeek R1 and similar models [11].
Several reviewers characterized the models as essentially "math models." Strong on STEM, coding, and logical reasoning; weaker on creative writing, world knowledge, multilingual tasks, and anything requiring broad cultural context. Critics attributed this to heavy use of synthetic training data optimized for verifiable tasks. The pattern is visible in benchmark numbers: AIME and GPQA scores are at the frontier, MMLU is competitive, but Humanity's Last Exam (a benchmark designed to test broad expert knowledge) shows gpt-oss-120b at only 14.9%, well behind proprietary frontier models [8][11].
The arXiv 2508.12461 third-party evaluation also documented an inverse scaling phenomenon, with gpt-oss-20b outperforming gpt-oss-120b on several general benchmarks. The authors suggested this reflected post-training optimization rather than a fundamental architecture issue, but the existence of inverse scaling at this scale is unusual and gave skeptics ammunition [8].
Hanna Hajishirzi of the Allen Institute for AI argued that meaningful openness requires more than weight release. The Open Source Initiative's 2024 Open Source AI Definition requires data and methodology disclosure for the "open source" label, and gpt-oss does not meet that standard. The argument is that calling gpt-oss "open" without qualification overstates what was released and undermines efforts to push for meaningful openness in the field [11].
The criticism has weight. If "open" comes to mean "weights are downloadable," labs can release weights while retaining all the actually informative parts of the training pipeline. A researcher quoted by VentureBeat turned gpt-oss-20b into a non-reasoning base model by removing the post-training alignment, generating outputs without the RLHF-trained refusal behaviors. This kind of modification is possible precisely because the weights are openly available, and serves as a demonstration of why the safety evaluation focused on adversarial fine-tuning [11].
Altman's public framing evolved over the following weeks. In the launch tweet thread, he positioned gpt-oss as "a really good open-source model" and emphasized accessibility. In later interviews, he described open-weight releases as a continuing strategy, although he was careful not to commit to specific timelines for future releases. Whether subsequent OpenAI open-weight releases would actually arrive at scheduled intervals was, as of late 2025, an open question [11].
Within 30 days of release, gpt-oss-20b crossed one million downloads on Hugging Face, placing it in the top tier of recent open-weight model launches. The 120b variant accumulated downloads more slowly because of its larger checkpoint size and higher hardware bar, but still saw substantial adoption among hosted-API providers and enterprise users. Ollama reported that gpt-oss-20b was the most-installed new model in August 2025; the one-line install and consumer-hardware support made it the default choice for hobbyists and small teams who had been waiting for an OpenAI-branded local model.
Fine-tuning activity was substantial. Within weeks, public fine-tunes appeared for medical reasoning, legal document analysis, multilingual extension (particularly Chinese, where base coverage was weak), creative writing (where users aimed to undo the math-heavy post-training), and uncensored variants. The variety of derivatives is consistent with how the Llama and Mistral ecosystems developed in 2023 and 2024.
Enterprise adoption was more cautious than hobbyist adoption, as is typical for any new model. Several factors drove interest: the Apache 2.0 license simplified procurement compared to Llama's user-base cap, the data-residency story addressed regulatory concerns, and the ability to fine-tune on proprietary data without API charges changed the cost calculus for high-volume use cases.
Several early enterprise deployments were documented. Healthcare organizations used gpt-oss-120b for internal clinical-text processing, with the model's HealthBench score (57.6%) cited as adequate for triage and document summarization tasks. Financial services firms deployed the 20b variant for research summarization and document review, keeping confidential data on-premises. Legal technology firms experimented with fine-tuned versions for document review workflows.
The trade-off was straightforward: gpt-oss imposed lower marginal costs per query but higher integration and operations costs than a proprietary API. For organizations with steady high-volume use, the math favored gpt-oss. For sporadic or experimental use, proprietary APIs remained more cost-effective.
The open weights enabled a class of research not possible on closed API models. Activation analysis, attention pattern visualization, mechanistic interpretability experiments, and adversarial robustness studies all require model weights and gradient access. Within months of release, research papers began appearing that used gpt-oss as the substrate for interpretability work, capability elicitation studies, and comparisons to proprietary models on tasks where leaked outputs could not establish equivalence [8][11].
The arXiv 2508.12461 evaluation by Junhao Song and colleagues was an early example. It compared gpt-oss to other open-weight reasoning models across 47 benchmarks, documenting the inverse scaling between 20b and 120b, strong competition-math performance, and the weakness on multilingual tasks. Researchers also used gpt-oss to study MoE routing behavior, reasoning-effort scaling dynamics, and the relationship between chain-of-thought traces and final outputs. The availability of the analysis channel as a separate tokenized stream made certain interpretability experiments substantially easier than on models that hide chain-of-thought from users.
| Metric | gpt-oss-20b | gpt-oss-120b |
|---|---|---|
| Hugging Face downloads (first 30 days) | ~1M+ | ~250K+ |
| Public fine-tunes (first 30 days) | Hundreds | Dozens |
| Hosted API providers at launch | 8+ | 8+ |
| Major enterprise deployments documented | Yes | Yes |
| Inference frameworks supported at launch | All major | All major |
Figures are approximate and based on contemporary tracking; precise official numbers were not published.
OpenAI o4-mini, released April 16, 2025, is the most direct proprietary comparison for gpt-oss-120b. Both are reasoning models with chain-of-thought training. The model card positions gpt-oss-120b as approximately matching o4-mini on most benchmarks; in practice, the comparison varies by domain.
On AIME and Codeforces, gpt-oss-120b is comparable to or slightly behind o4-mini. On HealthBench, o4-mini is meaningfully ahead. On general MMLU, o4-mini holds an edge in non-tool-augmented settings. On instruction hierarchy compliance, o4-mini scores 99.3% versus gpt-oss-120b's 83.2%, reflecting stronger post-training on safety behaviors.
The practical trade-off is operational. o4-mini is available only through OpenAI's API, with per-token pricing and inference-time content filtering controlled by OpenAI. gpt-oss-120b can be self-hosted, fine-tuned freely, and run in environments where data cannot be sent to external APIs. For most production use cases, the choice between them is determined more by deployment constraints than by raw capability differences.
OpenAI o3, whose mini variant was released on January 31, 2025 and whose full version rolled out later in 2025, is meaningfully more capable than gpt-oss-120b on the benchmarks where direct comparison is available. On GPQA Diamond and Humanity's Last Exam, o3 holds a substantial lead. On HealthBench, o3 outperforms gpt-oss-120b. On Codeforces, o3's Elo is approximately 2727 versus gpt-oss-120b's 2622.
The gap reflects that o3 represents OpenAI's frontier reasoning capability, while gpt-oss-120b is positioned a tier below: a strong open-weight model, not the strongest possible model OpenAI could release. This positioning is consistent with how Meta and Google handled their own open-weight releases, releasing models several months behind their proprietary frontier rather than at the frontier itself.
GPT-5, released August 7, 2025, two days after gpt-oss, is OpenAI's flagship proprietary model with unified reasoning and conversational capabilities. The two-day gap was almost certainly intentional. GPT-5 is meaningfully ahead of gpt-oss-120b on essentially every benchmark, with 94.6% on AIME 2025 and 74.9% on SWE-bench Verified.
The near-simultaneous release served a clear strategic purpose: it allowed OpenAI to position gpt-oss as a community-oriented release while maintaining a clear capability gap with the proprietary frontier. Developers who needed maximum capability could use GPT-5 through the API. Developers who needed open weights or local deployment could use gpt-oss. The two products did not directly compete for the same use cases.
| Model | Release date | Open? | MMLU | AIME 2025 | SWE-bench V | Codeforces |
|---|---|---|---|---|---|---|
| gpt-oss-120b | Aug 5, 2025 | Yes (Apache 2.0) | 90.0% | 97.9% (tools) | 62.4% | 2622 |
| gpt-oss-20b | Aug 5, 2025 | Yes (Apache 2.0) | 85.3% | 98.7% (tools) | 60.7% | 2516 |
| o4-mini | Apr 16, 2025 | No | 85.2% | -- | -- | -- |
| o3 | Jan 31, 2025+ | No | 88.8% | -- | -- | ~2727 |
| GPT-5 | Aug 7, 2025 | No | -- | 94.6% | 74.9% | -- |
| GPT-4o | May 13, 2024 | No | 88.7% | -- | -- | -- |
The table is not strictly apples-to-apples because evaluation conditions varied across releases and tool access was not uniformly applied, but it gives a rough sense of the positioning. gpt-oss-120b sits in the o4-mini neighborhood, gpt-oss-20b is somewhat below, and OpenAI's frontier proprietary models remain meaningfully ahead.
DeepSeek R1, released January 20, 2025 under MIT license, is the most direct open-weight comparison for gpt-oss. R1 has 671 billion total parameters with 37 billion active, much larger in both total and active terms than gpt-oss-120b. On most benchmarks the two are competitive, with R1 having an edge on coding agentic tasks and gpt-oss-120b ahead on competition mathematics with tools.
The operational comparison favors gpt-oss-120b for single-machine deployment because of the smaller parameter count and MXFP4 quantization. R1 requires multi-GPU or distributed infrastructure to run at full precision, while gpt-oss-120b fits on a single 80 GB GPU. R1's openness story is somewhat stronger, though: DeepSeek released the full R1-Zero training procedure and made cold-start data publicly available, and the DeepSeek RL methodology has been widely studied and reproduced by other labs.
Llama 4, released April 5, 2025, is Meta's first MoE model and was widely considered to have rolled out poorly. Initial benchmarks and user reports were mixed, and several Reddit and Hacker News threads documented gaps between Meta's claimed numbers and independent evaluations. Llama 4 Scout underperformed expectations on most tasks; Llama 4 Maverick performed adequately but did not advance the open-weight frontier.
By August 2025, gpt-oss-120b's positioning was stronger than Llama 4 Maverick's on most reasoning benchmarks. On MMLU, gpt-oss-120b scores 90.0% versus Llama 4 Maverick's reported scores in the mid-80s. On SWE-bench Verified, gpt-oss-120b's 62.4% was meaningfully ahead. The license comparison favors gpt-oss as well: Apache 2.0 has no MAU cap, while the Llama Community License requires a separate agreement from Meta for organizations above 700 million monthly active users.
Qwen 3, released April 29, 2025 by Alibaba, represents the primary multilingual open-weight competition to gpt-oss. The Qwen 3 family includes dense and MoE variants ranging from 0.6B to 235B parameters, with hybrid thinking modes within a single model.
On STEM benchmarks, Qwen 3 Thinking and gpt-oss-120b are roughly comparable. On GPQA Diamond, both are around 81%. On AIME 2025 with tools, Qwen 3 Thinking achieves approximately 92.3% versus gpt-oss-120b's 97.9%. The story flips on multilingual and Chinese-language tasks. The arXiv 2508.12461 evaluation found both gpt-oss variants scoring below 45% on Chinese-language tasks, while Qwen 3 was designed with extensive multilingual coverage and scores substantially higher. For deployments where multilingual capability matters, Qwen 3 is the stronger choice [8].
Llama 3.3 70B is Meta's 70-billion-parameter dense model, released in late 2024 under Meta's custom Llama license. On MMLU, it scores approximately 84%, compared to gpt-oss-120b's 90.0% and gpt-oss-20b's 85.3%. On HumanEval coding tasks, it scores approximately 83%, competitive with gpt-oss. The comparison illustrates that MoE models can match or exceed dense models several times their active parameter count: gpt-oss-20b activates 3.6 billion parameters versus Llama 3.3 70B's full 70 billion.
| Model | Release | License | Total params | Active params | MMLU | AIME 2025 (tools) |
|---|---|---|---|---|---|---|
| gpt-oss-120b | Aug 5, 2025 | Apache 2.0 | 117B | 5.1B | 90.0% | 97.9% |
| gpt-oss-20b | Aug 5, 2025 | Apache 2.0 | 21B | 3.6B | 85.3% | 98.7% |
| DeepSeek R1 | Jan 20, 2025 | MIT | 671B | 37B | ~85.0% | ~87.5% |
| DeepSeek V3 | Dec 2024 | Custom (open) | 671B | 37B | ~88.5% | -- |
| Llama 4 Maverick | Apr 5, 2025 | Llama 4 Community | 400B | 17B | ~80% | -- |
| Llama 3.3 70B | Dec 2024 | Llama 3.3 Community | 70B | 70B (dense) | ~84.0% | -- |
| Qwen 3 Thinking | Apr 29, 2025 | Apache 2.0 | up to 235B | varies | ~84.4% | ~92.3% |
| Mixtral 8x22B | Apr 2024 | Apache 2.0 | 141B | 39B | ~77.8% | -- |
Numbers for non-gpt-oss models are drawn from third-party comparisons and may reflect different evaluation conditions. The table gives a rough sense of where gpt-oss sits in the open-weight landscape: competitive with DeepSeek R1 on reasoning, ahead of Llama 4 on most benchmarks, behind Qwen 3 on multilingual tasks, and ahead of older Mixtral models on essentially everything.
| Date | Event |
|---|---|
| Aug 5, 2025 | Initial release on Hugging Face, Apache 2.0; day-one partners include Ollama, vLLM, Together, Fireworks, Groq, Cerebras, Azure, AWS |
| Aug 7, 2025 | GPT-5 launch, positioning the open-weight tier below the proprietary frontier |
| Aug 13, 2025 | Initial third-party benchmarks (Artificial Analysis) confirm model card numbers within margin of error |
| Aug 22, 2025 | arXiv 2508.12461 published; documents inverse scaling between 20b and 120b |
| Sep 2025 | First fine-tuned variants gain traction (medical, legal, multilingual) |
| Oct 23, 2025 | METR public summary of safety methodology review |
| Nov 2025 | Inference framework updates from vLLM, Hugging Face, Ollama |
| Dec 2025 | Community GGUF and AWQ conversions widely available |
| Q1 2026 | Enterprise deployments expand into healthcare, finance, legal verticals |
| Apr 2026 | Public fine-tunes total in the thousands; gpt-oss remains a default open-weight choice for English-language reasoning tasks |
The absence of a major version 2.0 release through early 2026 is notable. OpenAI did not commit publicly to a specific cadence for open-weight refreshes, and as of mid-2026 the original August 2025 weights remained the primary distribution.
Several limitations are documented in the official model card and confirmed by third-party evaluation: