Step-3

Updated Jun 8, 2026

Step-3 is an open-weight, large multimodal mixture-of-experts (MoE) model released in July 2025 by StepFun, the Shanghai-based Chinese artificial intelligence startup also known as Jieyue Xingchen. It is a vision-language model with roughly 321 billion total parameters, of which about 38 billion are activated per token. The model is distinguished less by raw benchmark leadership than by its central design goal: minimizing the cost of inference decoding so that a frontier-scale model can be served cheaply at high throughput.^[1]^[2] StepFun pursued this goal through a model-system co-design that pairs two techniques, Multi-Matrix Factorization Attention (MFA) and Attention-FFN Disaggregation (AFD), to reduce attention cost and raise GPU utilization. The accompanying research paper, "Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding," was submitted to arXiv on 25 July 2025, and the model weights were open-sourced on 31 July 2025 under the Apache License 2.0.^[1]^[3]^[4]

Overview

Step-3 was positioned as StepFun's flagship foundation model for 2025 and as a cost-efficient challenger among Chinese open-weight frontier models such as DeepSeek-V3, Qwen, Kimi, and MiniMax. The model accepts both image and text inputs and is aimed at multimodal AI reasoning tasks, including mathematics, science, and code, alongside general visual understanding.^[2]^[5]

The defining thesis of the project, captured in the paper's title, is that a large model need not be expensive to run. Rather than shrinking the model to cut serving costs, StepFun argued that decoding cost is governed by the interaction of three factors: attention arithmetic intensity, MoE sparsity, and the way attention and feed-forward computation are placed on hardware. On this view, a model co-designed around the economics of real accelerators can activate more parameters per token than its rivals while still costing less to serve.^[1] Step-3 activates 38 billion parameters per token, more than either DeepSeek-V3 or Qwen3 MoE 235B, yet StepFun reports lower theoretical decoding cost on the hardware it studied.^[1]

StepFun

StepFun (Shanghai Jieyue Xingchen Intelligent Technology Co., Ltd.) was founded on 6 April 2023 by former Microsoft researchers, including Jiang Daxin, a former Microsoft vice president and an expert in search and natural language processing, who serves as chief executive.^[6] The company is widely described as one of China's "AI Tiger" startups, a group of well-funded large-model developers, and its investors have included Tencent, Qiming Venture Partners, and Shanghai state-backed capital.^[6]

StepFun has emphasized multimodal foundation models across text, image, audio, and video. At the World Artificial Intelligence Conference in July 2024 it launched Step-2, a trillion-parameter MoE language model, together with the Step-1.5V multimodal model and the Step-1X image-generation model.^[6] In February 2025 it open-sourced the Step-Video-T2V text-to-video model and the Step-Audio speech model.^[6] Step-3 followed in July 2025 as the company's next-generation flagship, and StepFun continued the line afterward with smaller, faster MoE variants such as Step-3.5-Flash (a 196-billion-parameter MoE with about 11 billion active parameters) released in February 2026.^[6]

Architecture

Step-3 is built on a sparse mixture-of-experts transformer design. According to the technical report, the vision-language model totals about 321 billion parameters: the language-model component comprises 316 billion parameters, with 38 billion activated for each text token, and an additional vision encoder of roughly 5 billion parameters handles image inputs.^[1] The released model card lists 61 layers (5 of them dense), a hidden dimension of 7,168, and a maximum context length of 65,536 tokens, and notes that the model reuses the DeepSeek-V3 tokenizer.^[2]

The MoE feed-forward layers use 48 routed experts, with 3 selected per token, plus 1 shared expert, a relatively fine-grained sparsity pattern.^[2] The weights are distributed in both bf16 and block-FP8 formats, and the model card recommends serving through inference engines such as vLLM and SGLang.^[2] According to StepFun, pretraining covered more than 20 trillion text tokens and 4 trillion image-text mixed tokens spanning over ten languages.^[5]
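The routing pattern described above (48 routed experts, 3 selected per token, plus 1 shared expert) can be illustrated with a toy NumPy layer. This is a minimal sketch, not StepFun's implementation: the function name `moe_forward`, the reduced hidden width, the single-matrix experts, and the softmax over the selected logits are all assumptions made for illustration.

```python
import numpy as np

NUM_ROUTED = 48   # routed experts per MoE layer (from the model card)
TOP_K = 3         # routed experts selected per token (from the model card)
HIDDEN = 16       # toy width (assumption); the real hidden size is 7,168


def moe_forward(x, gate_w, routed, shared):
    """Toy MoE layer in which every expert is one linear map (an assumption)."""
    logits = x @ gate_w                                  # (tokens, NUM_ROUTED)
    top_idx = np.argsort(logits, axis=-1)[:, -TOP_K:]    # TOP_K best experts per token
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    # Softmax over the selected logits only; the real gating function is assumed.
    weights = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    out = x @ shared                                     # shared expert sees every token
    for t in range(x.shape[0]):
        for k in range(TOP_K):
            out[t] += weights[t, k] * (x[t] @ routed[top_idx[t, k]])
    return out, top_idx


rng = np.random.default_rng(0)
x = rng.normal(size=(4, HIDDEN))                         # 4 toy tokens
gate_w = rng.normal(size=(HIDDEN, NUM_ROUTED))
routed = rng.normal(size=(NUM_ROUTED, HIDDEN, HIDDEN))
shared = rng.normal(size=(HIDDEN, HIDDEN))

out, top_idx = moe_forward(x, gate_w, routed, shared)
print(out.shape, top_idx.shape)
```

In this sketch each token touches only 3 of the 48 routed expert matrices plus the shared one, which is the sense in which a sparse layer activates a small fraction of its parameters per token.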

Multi-Matrix Factorization Attention (MFA)

Multi-Matrix Factorization Attention is the attention mechanism at the core of Step-3's efficiency design. MFA applies low-rank matrix factorization to the query-key circuit, which lets StepFun scale both the number and the dimensionality of attention heads in a parameter-efficient way while keeping the KV cache small.^[1]^[5] The reported configuration uses 64 query heads with a head dimension of 256 and a low-rank query dimension of 2,048.^[2] StepFun states that this design reduces both KV-cache size and attention compute while preserving attention expressiveness, and reports that Step-3 uses roughly 22 percent of DeepSeek-V3's per-token attention cost.^[1]^[5]
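To give a feel for why a low-rank query path is parameter-efficient, the following sketch counts the weights of a query projection with and without factorization, using the reported figures (hidden size 7,168, 64 heads of dimension 256, query rank 2,048). The two-matrix layout and the helper name `query_projection_params` are assumptions for illustration; the sketch covers only the query projection and says nothing about the rest of the MFA design.

```python
HIDDEN = 7168      # hidden dimension (from the model card)
NUM_HEADS = 64     # query heads (from the model card)
HEAD_DIM = 256     # per-head dimension (from the model card)
QUERY_RANK = 2048  # low-rank query dimension (from the model card)


def query_projection_params(hidden, heads, head_dim, rank=None):
    """Weight count for a query projection; `rank` enables a two-matrix factorization."""
    out_dim = heads * head_dim
    if rank is None:
        return hidden * out_dim               # one dense hidden -> heads*head_dim matrix
    return hidden * rank + rank * out_dim     # down-projection followed by up-projection


full = query_projection_params(HIDDEN, NUM_HEADS, HEAD_DIM)
factored = query_projection_params(HIDDEN, NUM_HEADS, HEAD_DIM, QUERY_RANK)

print(f"dense:    {full:,}")        # 117,440,512
print(f"factored: {factored:,}")    # 48,234,496
print(f"ratio:    {factored / full:.3f}")  # 0.411
```

Under these assumptions the factorized query path needs about 41 percent of the weights of an equivalent dense projection per layer, which is what allows the head count and head dimension to grow without a proportional growth in parameters.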

Attention-FFN Disaggregation (AFD)

Attention-FFN Disaggregation is a distributed-inference system, rather than a change to the model weights, that decouples the attention layers and the feed-forward (FFN) layers into separate, specialized subsystems running on different hardware.^[1] Because attention and the MoE feed-forward layers have very different compute and memory profiles, executing them together forces compromises in batching and hardware utilization. By disaggregating the two, AFD lets each subsystem be sized and scheduled independently, which StepFun reports raises decoding throughput, particularly when attention and FFN are mapped onto different accelerator types in a heterogeneous setup.^[1] The paper presents AFD as the system half of a co-design in which MFA and the MoE sparsity pattern are the model half.
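The data flow behind this separation can be sketched in a single process. The toy below is only an illustration of the idea that attention is stateful per sequence (it owns the KV cache) while the FFN side is stateless and can batch activations gathered from several attention workers. The class names, the averaging "attention", and the scaling "FFN" are all hypothetical stand-ins, not StepFun's system.

```python
from dataclasses import dataclass, field


@dataclass
class AttentionWorker:
    """Stateful side: keeps a per-sequence cache, standing in for the KV cache."""
    kv_cache: dict = field(default_factory=dict)

    def step(self, seq_id, hidden):
        history = self.kv_cache.setdefault(seq_id, [])
        history.append(hidden)
        return sum(history) / len(history)    # toy "attention": mean over cached states


@dataclass
class FFNWorker:
    """Stateless side: processes whatever batch it is handed."""
    scale: float = 2.0

    def step(self, batch):
        return [h * self.scale for h in batch]  # toy "FFN": elementwise scaling


def decode_step(attn_workers, ffn, hidden_by_seq):
    """One layer of one decode step: attention per sequence, then one pooled FFN batch."""
    attn_out = {}
    for seq_id, hidden in hidden_by_seq.items():
        worker = attn_workers[seq_id % len(attn_workers)]  # sequences are pinned to a worker
        attn_out[seq_id] = worker.step(seq_id, hidden)
    seq_ids = list(attn_out)
    ffn_out = ffn.step([attn_out[s] for s in seq_ids])     # batch gathered across workers
    return dict(zip(seq_ids, ffn_out))


attn_workers = [AttentionWorker(), AttentionWorker()]
ffn = FFNWorker()
first = decode_step(attn_workers, ffn, {0: 1.0, 1: 3.0})
second = decode_step(attn_workers, ffn, {0: 3.0, 1: 5.0})
print(first, second)
```

Because the FFN worker holds no per-sequence state, the number of attention workers and the FFN batch size can be chosen independently in this sketch, which mirrors the sizing freedom the paragraph above attributes to AFD.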

Decoding-cost efficiency

The central contribution of Step-3 is its analysis of decoding cost, the cost of generating output tokens, which dominates serving expense for reasoning and long-output workloads. StepFun reports a theoretical decoding-cost analysis across several accelerators, including NVIDIA H800, H20, and A800 and Huawei Ascend 910B, expressed in US dollars per million decoded tokens.^[1] These per-token cost figures are theoretical estimates derived from the model and system design, not list prices; the comparisons should be read as such.

In that analysis, StepFun reports that Step-3 has a lower theoretical decoding cost than both DeepSeek-V3 and Qwen3 MoE 235B, with the advantage widening at longer context. At an 8K context (using AFD on H800 and H20), the paper cites about 0.055 USD per million decoded tokens for Step-3, versus 0.068 for DeepSeek-V3 and 0.062 for Qwen3 MoE 235B. At 32K context the gap grows to roughly 0.129 for Step-3, versus 0.211 for DeepSeek-V3 and 0.193 for Qwen3 MoE 235B. Across those two context lengths, this corresponds to cost reductions of about 19 to 39 percent against DeepSeek-V3 and about 11 to 33 percent against Qwen3 MoE 235B.^[1] StepFun emphasizes that Step-3 attains this lower cost despite activating more parameters per token than either comparison model, which it presents as evidence that hardware-aligned attention arithmetic intensity, MoE sparsity, and AFD jointly drive cost-effectiveness.^[1]
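The percentage ranges quoted above follow directly from the per-million-token figures. The short check below recomputes them; the dictionary layout and the helper name `reduction_pct` are illustrative choices, and the inputs are the StepFun-reported theoretical costs, not independently measured prices.

```python
# StepFun-reported theoretical decoding cost, USD per million decoded tokens.
COST = {
    "8K":  {"Step-3": 0.055, "DeepSeek-V3": 0.068, "Qwen3 MoE 235B": 0.062},
    "32K": {"Step-3": 0.129, "DeepSeek-V3": 0.211, "Qwen3 MoE 235B": 0.193},
}


def reduction_pct(step3_cost, other_cost):
    """Percentage by which Step-3's cost is lower than the comparison model's."""
    return 100.0 * (1.0 - step3_cost / other_cost)


reductions = {
    (context, model): reduction_pct(costs["Step-3"], cost)
    for context, costs in COST.items()
    for model, cost in costs.items()
    if model != "Step-3"
}

for (context, model), pct in reductions.items():
    print(f"{context:>3} vs {model}: {pct:.1f}% lower")
```

The results are about 19.1 and 38.9 percent against DeepSeek-V3 and about 11.3 and 33.2 percent against Qwen3 MoE 235B, at 8K and 32K respectively.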

Beyond the theoretical analysis, StepFun reports a measured result. On Hopper-class GPUs, Step-3 reaches a decoding throughput of up to 4,039 tokens per second per GPU in a peak minute (with FP8 attention) under a 50-millisecond time-per-output-token service level, with a long-term average near 3,910. The figure reported for DeepSeek-V3 under comparable 4K-context, FP8 conditions is about 2,324 tokens per second per GPU, so the peak result represents an increase of roughly 74 percent.^[1] All of these figures are StepFun's own.
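The throughput comparison can be checked with simple arithmetic, and the service level gives a rough sense of the batch sizes involved. The sketch below uses only the reported numbers; the concurrency estimate assumes every sequence emits exactly one token per 50 milliseconds, which is an interpretation for illustration rather than a figure from the paper.

```python
STEP3_PEAK = 4039      # tokens/s/GPU, peak minute (StepFun-reported)
STEP3_AVERAGE = 3910   # tokens/s/GPU, long-term average (StepFun-reported)
DEEPSEEK_V3 = 2324     # tokens/s/GPU, reported comparison figure
TPOT_SECONDS = 0.050   # time-per-output-token service level


def increase_pct(new, baseline):
    """Relative throughput increase over the baseline, in percent."""
    return 100.0 * (new / baseline - 1.0)


peak_gain = increase_pct(STEP3_PEAK, DEEPSEEK_V3)
average_gain = increase_pct(STEP3_AVERAGE, DEEPSEEK_V3)

# Assumption: each sequence produces one token per TPOT_SECONDS, so the number
# of sequences decoding at once per GPU is throughput multiplied by TPOT.
concurrent_sequences = STEP3_PEAK * TPOT_SECONDS

print(f"peak gain:    {peak_gain:.1f}%")
print(f"average gain: {average_gain:.1f}%")
print(f"sequences in flight per GPU at peak: {concurrent_sequences:.0f}")
```

Under that assumption the peak figure corresponds to roughly 200 sequences decoding concurrently per GPU, and the long-term average corresponds to a gain of about 68 percent rather than 74.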

Benchmarks

StepFun reports that Step-3 delivers competitive multimodal and reasoning performance among open models, while noting that proprietary systems such as OpenAI's o3 and Google's Gemini 2.5 Pro score higher on some tasks.^[5] The company positions Step-3 ahead of several open vision-language models, including Llama 4 Maverick, QvQ-72B, GLM-4.1V, and MiMo-VL, across many of its reported metrics.^[5] The following self-reported scores are drawn from StepFun's published evaluation and have not been independently verified.

Benchmark	Step-3 score (StepFun-reported)
MMMU (multimodal understanding)	74.2
MathVision	64.8
AIME 2025 (math)	73.0
HMMT 2025 (math)	70.0
CNMO 2024 (math)	82.9
GPQA-Diamond (science)	67.1
LiveCodeBench (Aug 2024 to May 2025)	83.7
SimpleVQA	62.2
HallusionBench	64.2
DynaMath	50.1

Source: StepFun published evaluation.^[5] As with all vendor-reported benchmarks, these results reflect the developer's own testing conditions and should be treated with appropriate caution.

Specifications

Attribute	Detail
Developer	StepFun (Jieyue Xingchen), Shanghai, China
Model type	Multimodal (vision-language) mixture of experts
Total parameters	About 321 billion (vision-language model); language model 316 billion
Active parameters	About 38 billion per token
Vision encoder	About 5 billion parameters
Experts	48 routed (3 active per token) plus 1 shared
Layers	61 (5 dense)
Hidden size	7,168
Context length	65,536 tokens
Tokenizer	DeepSeek-V3 tokenizer
Precision formats	bf16, block-FP8
Key techniques	Multi-Matrix Factorization Attention (MFA); Attention-FFN Disaggregation (AFD)
Pretraining data	More than 20T text tokens, 4T image-text tokens (StepFun-reported)
Release date	31 July 2025
License	Apache License 2.0

Availability and significance

Step-3 was released as an open-weight model on 31 July 2025 under the permissive Apache License 2.0, with weights distributed on Hugging Face (as stepfun-ai/step3), GitHub, and ModelScope, allowing developers to download and self-host it.^[2]^[4] The release was accompanied by support for the vLLM and SGLang serving frameworks.^[2]

The significance of Step-3 lies primarily in its argument that frontier-scale capability and low serving cost are not in tension. By foregrounding decoding economics as a first-class design objective and co-designing the model (MFA, MoE sparsity) with the serving system (AFD) around the arithmetic of real accelerators, StepFun offered a counterpoint to the assumption that cheaper inference requires smaller models.^[1] The work also fits the broader 2025 trend of Chinese laboratories releasing capable open-weight MoE models, alongside DeepSeek-V3, Qwen, Kimi, and MiniMax, and it extended StepFun's earlier Step-1, Step-2, and Step-1X line into a more efficiency-focused flagship. Its quantitative cost and benchmark claims, however, originate with StepFun and, as of this writing, await broad independent replication.

References

1. Wang, B., et al. (StepFun). "Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding." arXiv:2507.19427, 25 July 2025. https://arxiv.org/abs/2507.19427
2. "stepfun-ai/step3." Hugging Face model card. https://huggingface.co/stepfun-ai/step3
3. "[2507.19427] Step-3 is Large yet Affordable." arXiv listing. https://arxiv.org/abs/2507.19427
4. StepFun (@StepFun_ai). "Step 3 will be open-sourced on July 31st!" Announcement, July 2025. https://x.com/StepFun_ai/status/1948954102127624531
5. "Step3: Cost-Effective Multimodal Intelligence." StepFun official research page. https://stepfun.ai/research/en/step3
6. "StepFun." Wikipedia. https://en.wikipedia.org/wiki/StepFun
