Step-3
Last reviewed
Jun 8, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,708 words
Step-3 is an open-weight large multimodal mixture of experts (MoE) model released in July 2025 by StepFun, the Shanghai-based Chinese artificial intelligence startup also known as Jieyue Xingchen. It is a vision-language model with roughly 321 billion total parameters and about 38 billion activated per token. It is distinguished less by raw benchmark leadership than by its central design goal: minimizing the cost of inference decoding so that a frontier-scale model can be served cheaply at high throughput.[1][2] StepFun pursued this goal through a model-system co-design that pairs two techniques, Multi-Matrix Factorization Attention (MFA) and Attention-FFN Disaggregation (AFD), to reduce attention cost and raise GPU utilization. The accompanying research paper, titled "Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding," was submitted to arXiv on 25 July 2025, and the model weights were open-sourced on 31 July 2025 under the Apache License 2.0.[1][3][4]
Step-3 was positioned as StepFun's flagship foundation model for 2025 and as a cost-efficient challenger among Chinese open-weight frontier models such as DeepSeek-V3, Qwen, Kimi, and MiniMax. The model accepts both image and text inputs and is aimed at multimodal reasoning tasks, including mathematics, science, and code, alongside general visual understanding.[2][5]
The defining thesis of the project, captured in the paper's title, is that a large model need not be expensive to run. Rather than shrinking the model to cut serving costs, StepFun argued that decoding cost is governed by the interaction of three factors: attention arithmetic intensity, MoE sparsity, and the way attention and feed-forward computation are placed on hardware. On this view, a model co-designed around the economics of real accelerators can activate more parameters per token than rivals while still costing less to serve.[1] Step-3 activates 38 billion parameters per token, more than DeepSeek-V3 or Qwen3 MoE 235B activate, yet StepFun reports lower theoretical decoding cost on the hardware it studied.[1]
StepFun (Shanghai Jieyue Xingchen Intelligent Technology Co., Ltd.) was founded on 6 April 2023 by former Microsoft researchers, including Jiang Daxin, a former Microsoft vice president and an expert in search and natural language processing, who serves as chief executive.[6] The company is widely described as one of China's "AI Tiger" startups, a group of well-funded large-model developers, and its investors have included Tencent, Qiming Venture Partners, and Shanghai state-backed capital.[6]
StepFun has emphasized multimodal foundation models across text, image, audio, and video. At the World Artificial Intelligence Conference in July 2024 it launched Step-2, a trillion-parameter MoE language model, together with the Step-1.5V multimodal model and the Step-1X image-generation model.[6] In February 2025 it open-sourced the Step-Video-T2V text-to-video model and the Step-Audio speech model.[6] Step-3 followed in July 2025 as the company's next-generation flagship, and StepFun continued the line afterward with smaller, faster MoE variants such as Step-3.5-Flash (a 196-billion-parameter MoE with about 11 billion active parameters) released in February 2026.[6]
Step-3 is built on a sparse mixture of experts transformer design. According to the technical report, the vision-language model totals about 321 billion parameters; the language-model component comprises 316 billion parameters with 38 billion activated for each text token, and there is an additional vision encoder of roughly 5 billion parameters that handles image inputs.[1] The released model card lists 61 layers (5 of them dense), a hidden dimension of 7,168, a maximum context length of 65,536 tokens, and a reuse of the DeepSeek-V3 tokenizer.[2]
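The parameter accounting above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are hypothetical, the values are the rounded figures quoted from the technical report and model card, and it assumes that every layer not listed as dense uses an MoE feed-forward block.

```python
# Rounded figures quoted above, in billions of parameters.
# All names here are illustrative, not taken from StepFun's code.
LANGUAGE_MODEL_B = 316
VISION_ENCODER_B = 5
ACTIVE_PER_TOKEN_B = 38
TOTAL_LAYERS = 61
DENSE_LAYERS = 5

# Total size of the vision-language model.
total_b = LANGUAGE_MODEL_B + VISION_ENCODER_B

# Fraction of the language model's parameters used for each text token.
active_share = ACTIVE_PER_TOKEN_B / LANGUAGE_MODEL_B

# Assumption: every non-dense layer is an MoE layer.
moe_layers = TOTAL_LAYERS - DENSE_LAYERS

print(f"total parameters: ~{total_b}B")
print(f"share of language-model parameters active per token: {active_share:.1%}")
print(f"layers assumed to use MoE feed-forward blocks: {moe_layers}")
```

Under these rounded figures, roughly one eighth of the language model's parameters are active for any given text token.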
The MoE feed-forward layers use 48 routed experts with 3 experts selected per token plus 1 shared expert, a relatively fine-grained sparsity pattern.[2] The model card distributes weights in both bf16 and block-FP8 formats and recommends serving through inference engines such as vLLM and SGLang.[2] During pretraining the model processed more than 20 trillion text tokens and 4 trillion image-text mixed tokens spanning over ten languages, per StepFun.[5]
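The routing pattern described above (3 of 48 routed experts plus 1 shared expert) can be sketched as follows. This is a toy illustration, not StepFun's implementation: the softmax over the selected scores, the scalar stand-in "experts", and all function names are assumptions made for the example.

```python
import math
import random

NUM_ROUTED = 48  # routed experts per MoE layer (model card)
TOP_K = 3        # routed experts selected per token (model card)


def route(scores, k=TOP_K):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected scores only; this normalization
    is an assumption made for illustration.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    norm = sum(exps)
    return [(i, e / norm) for i, e in zip(top, exps)]


def moe_forward(x, scores, routed_experts, shared_expert):
    """Add the always-on shared expert to the weighted routed experts."""
    out = shared_expert(x)
    for idx, weight in route(scores):
        out += weight * routed_experts[idx](x)
    return out


# Toy usage: each "expert" is a scalar function standing in for an FFN.
rng = random.Random(0)
routed_experts = [lambda x, s=rng.uniform(0.5, 1.5): s * x for _ in range(NUM_ROUTED)]
shared_expert = lambda x: 0.1 * x
scores = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED)]

selected = route(scores)
y = moe_forward(2.0, scores, routed_experts, shared_expert)
print(f"selected experts: {[i for i, _ in selected]}")
print(f"routed experts used per token: {len(selected)} of {NUM_ROUTED}")
```

Only the selected experts are evaluated for a token, which is why total parameter count and per-token compute can diverge so widely in an MoE model.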
Multi-Matrix Factorization Attention is the attention mechanism at the core of Step-3's efficiency design. MFA applies low-rank matrix factorization to the query-key circuit, which lets StepFun scale both the number and the dimensionality of attention heads in a parameter-efficient way while keeping the KV cache small.[1][5] The reported configuration uses 64 query heads with a head dimension of 256 and a low-rank query dimension of 2,048.[2] StepFun states that this design reduces both KV-cache size and attention compute while preserving attention expressiveness, and reports that Step-3 uses roughly 22 percent of DeepSeek-V3's per-token attention cost.[1][5]
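To see why factorizing the query projection is parameter-efficient, the sketch below compares a full-rank query projection with a two-matrix factorization using the dimensions quoted above. Treating the low-rank query dimension as the bottleneck of a down-projection followed by an up-projection is an assumption for illustration, not a statement of StepFun's exact layer layout, and the variable names are hypothetical.

```python
# Dimensions quoted from the model card.
HIDDEN = 7168    # hidden dimension
Q_HEADS = 64     # query heads
HEAD_DIM = 256   # dimension per head
Q_RANK = 2048    # low-rank query dimension

# Combined width of all query heads.
q_width = Q_HEADS * HEAD_DIM

# A single dense projection from the hidden state to all query heads.
full_rank_params = HIDDEN * q_width

# Assumed factorization: hidden -> Q_RANK, then Q_RANK -> all query heads.
factorized_params = HIDDEN * Q_RANK + Q_RANK * q_width

ratio = factorized_params / full_rank_params
print(f"query width: {q_width}")
print(f"full-rank query projection: {full_rank_params:,} parameters")
print(f"factorized query projection: {factorized_params:,} parameters")
print(f"factorized / full-rank: {ratio:.1%}")
```

Under this assumed layout, the factorized projection needs well under half the parameters of a dense projection of the same output width, which is the sense in which wide, numerous heads become affordable.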
Attention-FFN Disaggregation is a distributed-inference system, rather than a change to the model weights, that decouples the attention layers and the feed-forward (FFN) layers into separate, specialized subsystems running on different hardware.[1] Because attention and the MoE feed-forward layers have very different compute and memory profiles, executing them together forces compromises in batching and hardware utilization. By disaggregating the two, AFD lets each subsystem be sized and scheduled independently, which StepFun reports raises decoding throughput, particularly when attention and FFN are mapped onto different accelerator types in a heterogeneous setup.[1] The paper presents AFD as the system half of a co-design in which MFA and the MoE sparsity pattern are the model half.
The central contribution of Step-3 is its analysis of decoding cost, the cost of generating output tokens, which dominates serving expense for reasoning and long-output workloads. StepFun reports a theoretical decoding-cost analysis across several accelerators, including NVIDIA H800, H20, and A800 and Huawei Ascend 910B, expressed in US dollars per million decoded tokens.[1] These per-token cost figures are theoretical estimates derived from the model and system design, not list prices, and comparisons based on them should be read as estimates.
In that analysis, StepFun reports that Step-3 has lower theoretical decoding cost than both DeepSeek-V3 and Qwen3 MoE 235B, with the advantage widening at longer context. At an 8K context (using AFD on H800 and H20), the paper cites about 0.055 USD per million decoded tokens for Step-3 versus 0.068 for DeepSeek-V3 and 0.062 for Qwen3 MoE 235B. At 32K context the gap grows to roughly 0.129 for Step-3 versus 0.211 for DeepSeek-V3 and 0.193 for Qwen3 MoE 235B. Over those context lengths, these figures correspond to cost reductions of about 19 to 39 percent against DeepSeek-V3 and about 11 to 33 percent against Qwen3 MoE 235B.[1] StepFun emphasizes that Step-3 attains this lower cost despite activating more parameters per token than either comparison model, which it presents as evidence that hardware-aligned attention arithmetic intensity, MoE sparsity, and AFD jointly drive cost-effectiveness.[1]
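The percentage reductions follow directly from the per-million-token figures. The short sketch below recomputes them from the values quoted in this article; the dictionary layout and function name are illustrative, and the inputs remain StepFun's theoretical estimates.

```python
# Theoretical decoding cost in USD per million decoded tokens,
# as quoted in this article from StepFun's paper.
COST_USD_PER_M_TOKENS = {
    "8K": {"Step-3": 0.055, "DeepSeek-V3": 0.068, "Qwen3 MoE 235B": 0.062},
    "32K": {"Step-3": 0.129, "DeepSeek-V3": 0.211, "Qwen3 MoE 235B": 0.193},
}


def reduction_pct(baseline, candidate):
    """Percentage by which candidate is cheaper than baseline."""
    return 100.0 * (baseline - candidate) / baseline


reductions = {}
for context, costs in COST_USD_PER_M_TOKENS.items():
    step3 = costs["Step-3"]
    for model, cost in costs.items():
        if model == "Step-3":
            continue
        reductions[(context, model)] = reduction_pct(cost, step3)
        print(f"{context} vs {model}: {reductions[(context, model)]:.1f}% lower")
```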
Beyond the theoretical analysis, StepFun reports a measured result on Hopper-class GPUs. Under a 50-millisecond time-per-output-token service level, Step-3 reaches a decoding throughput of up to 4,039 tokens per second per GPU in a peak minute (with FP8 attention), with a long-term average near 3,910. DeepSeek-V3 is reported at about 2,324 tokens per second per GPU under comparable 4K-context, FP8 conditions, which makes the Step-3 peak roughly 74 percent higher.[1] All of these figures are StepFun's own.
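The relative throughput gain can be recomputed from the quoted figures. The sketch below is simple arithmetic on StepFun's reported numbers; the variable names are illustrative, and the average-based comparison is derived here rather than quoted from the paper.

```python
# Decoding throughput in tokens per second per GPU, as quoted above.
STEP3_PEAK_TPS = 4039    # peak minute, FP8 attention
STEP3_AVG_TPS = 3910     # long-term average
DEEPSEEK_V3_TPS = 2324   # reported under comparable 4K-context, FP8 conditions


def gain_pct(candidate, baseline):
    """Percentage by which candidate throughput exceeds baseline."""
    return 100.0 * (candidate / baseline - 1.0)


peak_gain = gain_pct(STEP3_PEAK_TPS, DEEPSEEK_V3_TPS)
avg_gain = gain_pct(STEP3_AVG_TPS, DEEPSEEK_V3_TPS)

print(f"peak-minute gain over DeepSeek-V3: {peak_gain:.1f}%")
print(f"long-term-average gain over DeepSeek-V3: {avg_gain:.1f}%")
```

The 74 percent figure corresponds to the peak-minute throughput; measured against the long-term average, the same arithmetic gives a somewhat smaller gain.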
StepFun reports that Step-3 delivers competitive multimodal and reasoning performance among open models, while noting that proprietary systems such as OpenAI's o3 and Google's Gemini 2.5 Pro score higher on some tasks.[5] The company positions Step-3 ahead of several open vision-language models, including Llama 4 Maverick, QvQ-72B, GLM-4.1V, and MiMo-VL, across many of its reported metrics.[5] The following self-reported scores are drawn from StepFun's published evaluation and have not been independently verified.
| Benchmark | Step-3 score (StepFun-reported) |
|---|---|
| MMMU (multimodal understanding) | 74.2 |
| MathVision | 64.8 |
| AIME 2025 (math) | 73.0 |
| HMMT 2025 (math) | 70.0 |
| CNMO 2024 (math) | 82.9 |
| GPQA-Diamond (science) | 67.1 |
| LiveCodeBench (Aug 2024 to May 2025) | 83.7 |
| SimpleVQA | 62.2 |
| HallusionBench | 64.2 |
| DynaMath | 50.1 |
Source: StepFun published evaluation.[5] As with all vendor-reported benchmarks, these results reflect the developer's own testing conditions and should be treated with appropriate caution.
| Attribute | Detail |
|---|---|
| Developer | StepFun (Jieyue Xingchen), Shanghai, China |
| Model type | Multimodal (vision-language) mixture of experts |
| Total parameters | About 321 billion (vision-language model); language model 316 billion |
| Active parameters | About 38 billion per token |
| Vision encoder | About 5 billion parameters |
| Experts | 48 routed (3 active per token) plus 1 shared |
| Layers | 61 (5 dense) |
| Hidden size | 7,168 |
| Context length | 65,536 tokens |
| Tokenizer | DeepSeek-V3 tokenizer |
| Precision formats | bf16, block-FP8 |
| Key techniques | Multi-Matrix Factorization Attention (MFA); Attention-FFN Disaggregation (AFD) |
| Pretraining data | More than 20T text tokens, 4T image-text tokens (StepFun-reported) |
| Release date | 31 July 2025 |
| License | Apache License 2.0 |
Step-3 was released as an open-weight model on 31 July 2025 under the permissive Apache License 2.0, with weights distributed on Hugging Face (as stepfun-ai/step3), GitHub, and ModelScope, allowing developers to download and self-host it.[2][4] The release was accompanied by support for the vLLM and SGLang serving frameworks.[2]
The significance of Step-3 lies primarily in its argument that frontier-scale capability and low serving cost are not in tension. By foregrounding decoding economics as a first-class design objective and co-designing the model (MFA, MoE sparsity) with the serving system (AFD) around the arithmetic of real accelerators, StepFun offered a counterpoint to the assumption that cheaper inference requires smaller models.[1] The work also fits the broader 2025 trend of Chinese laboratories releasing capable open-weight MoE models, alongside DeepSeek-V3, Qwen, Kimi, and MiniMax, and it extended StepFun's earlier Step-1, Step-2, and Step-1X line into a more efficiency-focused flagship. Its quantitative cost and benchmark claims, however, originate with StepFun and, as of this writing, await broad independent replication.