StepFun (Chinese: 阶跃星辰, pinyin: Jiēyuè Xīngchén), formally known as Shanghai Jieyue Xingchen Intelligent Technology Co., Ltd., is a Chinese artificial intelligence startup headquartered in Shanghai. Founded on April 6, 2023, by Jiang Daxin, a former Global Vice President and Chief Scientist at Microsoft Software Technology Center Asia, the company focuses on building large language models and multimodal AI systems. StepFun is one of China's "Six Little Tigers" (六小虎) of AI, a group of six prominent AI startups that also includes Zhipu AI, Moonshot AI, MiniMax, Baichuan Intelligence, and 01.AI.
Since its founding, StepFun has released a series of foundation models spanning language, vision, video, audio, and multimodal domains. In July 2024, it launched Step-2, which at the time was the first trillion-parameter Mixture of Experts (MoE) language model built by a Chinese startup. The company has raised over $1 billion across multiple funding rounds and was reportedly exploring a Hong Kong IPO as of early 2026.
StepFun was founded on April 6, 2023, in Shanghai's Xuhui District. The catalyst for its creation was the release of ChatGPT by OpenAI in November 2022. According to Jiang Daxin, who spent 16 years at Microsoft before departing, the launch of ChatGPT convinced him that he could build something comparable or better on his own. He recruited two fellow Microsoft alumni to join the venture: Jiao Binxing, who took on responsibility for search-related systems, and Zhu Yibo, who was brought on to lead engineering and infrastructure.
Within two months of starting operations, StepFun trained its first 100-billion-parameter model, Step-1, on what the company described as its first attempt. This rapid development attracted attention from investors, and StepFun became the only company among the "Six Little Tigers" to achieve unicorn status (a valuation exceeding $1 billion) in its initial funding round. Early backers included HongShan (formerly Sequoia China), Qiming Venture Partners, and IDG Capital.
In March 2024, StepFun began publicly releasing models in its Step series. Over the next 10 months, the company released 11 self-developed foundation models covering language, multimodal understanding, image generation, video generation, and speech.
The most significant release came in July 2024 at the World Artificial Intelligence Conference (WAIC) in Shanghai, where StepFun officially launched three models simultaneously: Step-2, its trillion-parameter MoE language model; Step-1.5V, an upgraded multimodal understanding model; and Step-1X, an image generation model based on the DiT (Diffusion Transformer) architecture.
By November 2024, Step-2 had climbed to the top of the LiveBench benchmark rankings among Chinese models, placing fifth globally. It trailed only models from OpenAI and Google DeepMind, scoring 58.67 in reasoning, 54.86 in data analysis, and 86.57 in instruction following.
In February 2025, StepFun and Geely Auto Group jointly announced the open-sourcing of two models: Step-Video-T2V (a 30-billion-parameter text-to-video model) and Step-Audio (a 130-billion-parameter voice interaction framework). Both were released under permissive open-source licenses and made available through the Yuewen app, Hugging Face, and GitHub.
In April 2025, StepFun released Step1X-Edit, an open-source image editing model designed to rival closed-source systems like GPT-4o and Gemini 2 Flash, under the Apache 2.0 license.
In July 2025, at the 2025 WAIC, StepFun and Geely jointly launched Agent OS, a next-generation intelligent cockpit operating system for vehicles. The Geely Galaxy M9 became the first mass-produced vehicle to feature Agent OS, with nearly 40,000 units sold within four months of launch.
Later in 2025, StepFun released Step-3, its next-generation multimodal reasoning model with 321 billion total parameters and 38 billion active parameters. The model introduced two architectural innovations: Multi-Matrix Factorization Attention (MFA), which reduces KV-cache demands to roughly 22% of DeepSeek V3's per-token attention cost, and Attention-FFN Disaggregation (AFD), which decouples attention and feed-forward network layers into specialized subsystems for improved hardware utilization.
In January 2026, StepFun closed its Series B+ funding round, raising over RMB 5 billion (approximately $700 million). Alongside this milestone, the company appointed Yin Qi as chairman. Yin is the co-founder and former CEO of Megvii (Face++), one of China's first-generation computer vision companies, and also serves as chairman of Qianli Technology, a Geely-backed autonomous driving firm. The appointment reinforced StepFun's strategy of combining AI models with physical-world deployment in devices and vehicles.
In February 2026, StepFun released Step 3.5 Flash, an open-source MoE model with 196 billion total parameters and 11 billion active parameters per token. The model supports a 262K context window and achieves 100 to 300 tokens per second in generation throughput via 3-way Multi-Token Prediction. On elite mathematics benchmarks, Step 3.5 Flash scored 99.8 on AIME 2025 and 98.0 on HMMT 2025.
Also in February 2026, Bloomberg reported that StepFun was exploring a Hong Kong initial public offering that could raise approximately $500 million. Fellow Six Little Tigers members MiniMax and Zhipu AI had already listed on the Hong Kong Stock Exchange by that time.
| Name | Role | Background |
|---|---|---|
| Jiang Daxin (姜大昕) | Co-founder and CEO | PhD in Computer Science from University at Buffalo. Spent 16 years at Microsoft (2007-2023), rising to Global Vice President and Chief Scientist of Microsoft Software Technology Center Asia. Led development of Bing, Cortana, Azure Cognitive Services, and Microsoft 365 NLU systems. |
| Zhu Yibo (朱逸博) | Co-founder and CTO | PhD from UC Santa Barbara, bachelor's from Tsinghua University. Microsoft Research PhD Fellow (2015). Former Director at ByteDance, where he led AI infrastructure. Previously worked at Google. Specializes in distributed systems and large-scale GPU clusters. |
| Zhang Xiangyu (张祥雨) | Co-founder and Chief Scientist | Co-author of ResNet ("Deep Residual Learning for Image Recognition"), the most cited paper across all fields published in the 21st century. Co-author of the Helmholtz Prize-winning paper on surpassing human-level ImageNet classification. |
| Jiao Binxing (焦斌星) | Co-founder and VP | PhD from University of Science and Technology of China (2007-2012). Former Microsoft employee responsible for search-related systems. |
| Yin Qi (印奇) | Chairman (appointed January 2026) | Co-founder and former CEO of Megvii (Face++). Chairman of Qianli Technology (Geely-backed autonomous driving). Prominent figure from China's first wave of computer vision startups. |
StepFun has released a broad portfolio of foundation models across language, vision, video, audio, and multimodal domains. The table below summarizes the major releases.
| Model | Release | Type | Parameters | Key Details |
|---|---|---|---|---|
| Step-1 | 2023 | Language | 100B | First model, trained within two months of the company's founding. Dense architecture. |
| Step-1V | Early 2024 | Multimodal | 100B+ | Multimodal large language model supporting text, image, and video understanding. |
| Step-1.5V | July 2024 | Multimodal | Not disclosed | Upgraded multimodal understanding model, launched at WAIC 2024. |
| Step-1X | July 2024 | Image generation | Not disclosed | Based on DiT (Diffusion Transformer) architecture. Shown at WAIC 2024. |
| Step-2 | July 2024 | Language | 1T+ (MoE) | First trillion-parameter MoE model from a Chinese startup. Ranked 5th on LiveBench globally (November 2024). |
| GOT-OCR 2.0 | September 2024 | OCR | 580M | Unified end-to-end OCR model handling plain text, tables, math formulas, sheet music, and more. Open source (Apache 2.0). |
| Step-Video-T2V | February 2025 | Video generation | 30B | Text-to-video model generating videos up to 204 frames. 16x16 spatial and 8x temporal compression. Open source (MIT). |
| Step-Audio | February 2025 | Audio/Speech | 130B | First production-grade open-source voice interaction framework. Supports multilingual speech, emotional tones, dialects, and rap generation. |
| Step1X-Edit | April 2025 | Image editing | Not disclosed | Open-source image editing model combining Qwen-VL and DiT. Comparable to GPT-4o and Gemini 2 Flash for image editing tasks. Apache 2.0 license. |
| Step-3 | 2025 | Multimodal reasoning | 321B total, 38B active (MoE) | Trained on 20T+ text tokens and 4T image-text tokens. Introduced MFA and AFD architectural innovations. Supports 800K context. |
| Step-Video-TI2V | March 2025 | Image-to-video | Based on T2V | Extension of Step-Video-T2V adding image-conditioned video generation. |
| Step-Audio-TTS-3B | 2025 | Text-to-speech | 3B | First TTS model capable of generating rap and humming. Trained on large-scale synthetic data. |
| Step-Audio-EditX | November 2025 | Audio editing | 3B | LLM-based reinforcement learning model for expressive audio editing (emotion, style, paralinguistics). |
| NextStep-1 | 2025 | Image generation | 14B | Autoregressive image generation using continuous tokens with a 157M-parameter flow-matching head. ICLR 2026 Oral paper. |
| Step3-VL-10B | 2025 | Vision-language | 10B | Compact model matching or surpassing open-source models 10-20x its size. |
| Step 3.5 Flash | February 2026 | Multimodal reasoning | 196B total, 11B active (MoE) | 262K context window. 3-way Multi-Token Prediction. AIME 2025: 99.8, HMMT 2025: 98.0. Open source (Apache 2.0). |
| Step-Audio 2 Mini | August 2025 | Speech-to-speech | 8B | End-to-end speech conversation model. Reported to surpass GPT-4o-Audio on benchmarks. Open source (Apache 2.0). |
Yuewen (pinyin: Yuèwèn, meaning "Leap Ask") is StepFun's consumer-facing AI assistant application, available on iOS, Android, and the web. The app functions as a multimodal chatbot, supporting text-based conversation, document understanding, image and video generation, voice interaction, and task execution. As of early 2026, Yuewen is powered by Step 3.5 Flash and also integrates DeepSeek-R1 for certain reasoning tasks.
StepFun operates a developer platform at platform.stepfun.com (and platform.stepfun.ai for international access), offering API access to its model family. The platform provides OpenAI-compatible and Anthropic-compatible API endpoints, making integration straightforward for developers already using those ecosystems.
The platform offers tiered subscription plans, starting at $6.99 for individual developers, with a flagship $99 plan that provides rolling limits of 5,000 prompts every five hours. Models are also available through third-party platforms including OpenRouter, NVIDIA NIM, and SiliconFlow.
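Because the endpoints follow the OpenAI convention, a standard chat-completions request shape applies. The sketch below assembles one; the base URL and the model identifier "step-3.5-flash" are assumptions for illustration, and the network call itself is left commented out:

```python
# Minimal sketch of a request against an OpenAI-compatible chat endpoint.
# The base URL and model identifier below are assumptions for illustration,
# not confirmed values from StepFun's documentation.
import json
import os

BASE_URL = "https://platform.stepfun.ai/v1"  # assumed international base URL

def build_chat_request(model, user_message):
    """Assemble an OpenAI-compatible /chat/completions request."""
    return {
        "url": f"{BASE_URL}/chat/completions",
        "headers": {
            "Authorization": f"Bearer {os.environ.get('STEPFUN_API_KEY', '')}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": user_message}],
        }),
    }

req = build_chat_request("step-3.5-flash", "Summarize this document.")
# To send for real: requests.post(req["url"], headers=req["headers"], data=req["body"])
print(req["url"])
```

Because the request shape matches OpenAI's, existing client libraries can typically be pointed at the platform simply by overriding their base URL and API key.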
StepFun has raised over $1 billion in total funding across multiple rounds since its founding in April 2023.
| Round | Date | Amount | Lead Investor(s) | Notable Co-investors |
|---|---|---|---|---|
| Series A | 2023 | Not publicly disclosed | HongShan (Sequoia China) | Qiming Venture Partners, IDG Capital |
| Series B | December 2024 | "Several hundred million dollars" | Fortera Capital (Shanghai state-owned Capital Investment Co.) | Tencent, Qiming Venture Partners, Xiaomi |
| Series B+ | January 2026 | RMB 5B+ (~$700M) | Multiple (including state-backed funds) | Tencent, 5Y Capital, Qiming Venture Partners, Pudong Venture Capital, China Life Private Equity, Hong Kong Investment Corporation, Shanghai State-owned Capital Investment Leading Fund |
StepFun achieved unicorn status (valuation exceeding $1 billion) during its earliest funding rounds, making it among the fastest Chinese AI startups to reach that milestone. The Series B round in December 2024 reportedly valued the company at approximately $2 billion. The Series B+ round in January 2026 was one of the largest single funding rounds by any private AI startup in China.
StepFun's most prominent commercial partnership is with Geely Auto Group, one of China's largest automakers. The relationship is reinforced through chairman Yin Qi, who also leads Qianli Technology, Geely's autonomous driving subsidiary. The two companies jointly developed Agent OS, an intelligent cockpit operating system that integrates StepFun's multimodal models and voice AI. The Geely Galaxy M9 was the first mass-produced vehicle to ship with Agent OS, and StepFun has set a target of surpassing one million vehicle integrations by the end of 2026.
StepFun's models power AI features on smartphones from multiple major Chinese manufacturers, with OPPO among the confirmed partners. As of late 2025, StepFun's models were deployed on over 42 million shipped devices, reaching approximately 20 million daily active users, and its partners reportedly included around 60% of China's leading smartphone brands. Terminal API calls grew nearly 170% quarter-over-quarter for three consecutive quarters through the end of 2025.
StepFun has consistently focused on multimodal AI as its core differentiator among the Six Little Tigers. While other members of the group initially concentrated on text-only language models, StepFun invested early in building models that could process and generate text, images, video, and audio within unified architectures.
From its founding, StepFun has been a proponent of the scaling law hypothesis, which holds that model performance improves predictably as model size, training data, and compute increase. Jiang Daxin has stated publicly that Chinese AI can benefit significantly from pursuing bigger models trained on more data, a philosophy reflected in StepFun's progression from the 100-billion-parameter Step-1 to the trillion-parameter Step-2 and beyond.
With Step-3, StepFun introduced two notable architectural innovations:

- Multi-Matrix Factorization Attention (MFA), which reduces KV-cache demands to roughly 22% of DeepSeek V3's per-token attention cost.
- Attention-FFN Disaggregation (AFD), which decouples the attention and feed-forward network layers into specialized subsystems for improved hardware utilization.
Step-3 achieves roughly 4,039 tokens per GPU per second under 50 ms latency (4K context, FP8), more than double DeepSeek V3's reported throughput of 1,850 tokens per GPU per second.
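The scale of such KV-cache savings can be illustrated with the standard per-token cache-size formula: 2 (for K and V) × layers × KV heads × head dimension × bytes per element. The configurations below are hypothetical, chosen only to show how shrinking the effective key-value width reduces the cache; they are not Step-3's or DeepSeek V3's actual dimensions.

```python
# Per-token KV-cache size: 2 (K and V) x layers x kv_heads x head_dim x bytes/elem.
# The two configs are hypothetical, purely to illustrate how reducing the
# effective key-value width (as factorized attention schemes do) cuts the cache.

def kv_cache_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

baseline = kv_cache_bytes_per_token(layers=61, kv_heads=128, head_dim=128)
factorized = kv_cache_bytes_per_token(layers=61, kv_heads=16, head_dim=256)

print(f"baseline:   {baseline / 1024:.0f} KiB/token")
print(f"factorized: {factorized / 1024:.0f} KiB/token "
      f"({100 * factorized / baseline:.0f}% of baseline)")
```

Because the cache grows linearly with sequence length and batch size, a cut of this kind directly raises the batch sizes and context lengths that fit on a given GPU during decoding.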
StepFun has released numerous models under permissive open-source licenses (Apache 2.0 and MIT), including Step-Video-T2V, Step-Audio, Step1X-Edit, GOT-OCR 2.0, Step 3.5 Flash, Step-Audio 2 Mini, NextStep-1, and Step3-VL-10B. The company's GitHub organization (github.com/stepfun-ai) hosts the source code and model weights for these releases.
The "Six Little Tigers of Large Models" (大模型六小虎) is a collective designation coined by Chinese media and investors for six AI startups, founded between 2019 and 2023, that emerged as the leading domestic challengers to global AI companies like OpenAI and Anthropic. The term is modeled after the "Four Asian Tigers" economic metaphor. All six reached unicorn status by early 2024.
| Company | Founded | Founder | Notable Backing | Primary Focus |
|---|---|---|---|---|
| Zhipu AI | June 2019 | Zhang Peng | Tencent, government funds | GLM language models, code generation |
| MiniMax | 2021 | Yan Junjie | Alibaba | Consumer chatbots (Talkie), video generation (Hailuo) |
| Baichuan Intelligence | March 2023 | Wang Xiaochuan | Alibaba, Tencent | Open-source language models |
| Moonshot AI | March 2023 | Yang Zhilin | Alibaba | Long-context models, Kimi Chat |
| StepFun | April 2023 | Jiang Daxin | Tencent, HongShan, state funds | Multimodal models, automotive AI |
| 01.AI | July 2023 | Kai-Fu Lee | Various | Yi language models |
By 2025 and 2026, the trajectories of these companies began to diverge. MiniMax and Zhipu AI pursued public listings on the Hong Kong Stock Exchange, while StepFun explored a potential IPO. Others like Moonshot AI focused on consumer applications. The competitive dynamics within the group are further shaped by the involvement of China's largest technology companies as investors, with Tencent backing both StepFun and Zhipu AI, and Alibaba backing MiniMax and Moonshot AI.
StepFun is headquartered at Lane 315, Fenggu Road, Xuhui District, Shanghai. As of 2025, the company employed approximately 400 people. The workforce is concentrated in research and engineering, reflecting StepFun's identity as a foundation model company. The company's official website is stepfun.com (Chinese) and stepfun.ai (international), while its open-source repositories are hosted at github.com/stepfun-ai.
The company name "阶跃星辰" (Jiēyuè Xīngchén) translates loosely to "Step Stars" or "Stepping Through the Stars," reflecting the company's ambition in AI research. The English brand name "StepFun" combines "Step" (from the step function concept in mathematics and the idea of incremental progress) with "Fun," signaling the company's consumer-facing aspirations alongside its research agenda.
Beyond its commercial model releases, StepFun has contributed several notable research papers and open-source tools to the broader AI community.
In September 2024, StepFun released GOT-OCR 2.0 (General OCR Theory), a unified end-to-end optical character recognition model with 580 million parameters. Unlike traditional OCR systems that rely on multi-stage pipelines, GOT-OCR 2.0 treats all artificial optical signals as "characters" and processes them through a single model. The system handles plain text, formatted documents, tables, charts, mathematical formulas, molecular structures, geometric shapes, and even sheet music. The model was released under the Apache 2.0 license and is available on Hugging Face.
NextStep-1 is a 14-billion-parameter autoregressive image generation model that works directly with continuous image tokens rather than quantizing them into discrete visual words. The model uses a causal Transformer backbone with a lightweight 157-million-parameter flow-matching head to predict the next continuous image token. This approach demonstrates that an LLM-style transformer can serve as the primary engine for image generation without relying on vector quantization or heavyweight external diffusion modules. The paper was accepted as an Oral presentation at ICLR 2026. A follow-up version, NextStep-1.1, was released in December 2025 with improved output quality through extended training and a flow-based reinforcement learning post-training paradigm.
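The training objective behind such a flow-matching head can be sketched in a few lines: the transformer's hidden state conditions a small network that regresses the velocity of a linear noise-to-data path toward the next continuous token. The dimensions and the one-layer head below are illustrative stand-ins, not NextStep-1's architecture.

```python
# Toy sketch of a flow-matching head for predicting the next *continuous*
# image token, conditioned on a transformer hidden state. All sizes and the
# one-layer velocity network are illustrative; this is not NextStep-1's code.
import numpy as np

rng = np.random.default_rng(0)
d_hidden, d_token = 32, 8

# Random "trained" weights for a one-layer velocity head v(x_t, t, h).
W = rng.normal(0, 0.1, size=(d_token + d_hidden + 1, d_token))

def velocity_head(x_t, t, h):
    """Predict the path velocity from the noisy token, time, and hidden state."""
    inp = np.concatenate([x_t, h, [t]])
    return inp @ W

def flow_matching_loss(x1, h):
    """One training term: linear path x_t = (1-t)*x0 + t*x1, target velocity x1 - x0."""
    x0 = rng.normal(size=d_token)   # noise sample
    t = rng.uniform()               # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    v_pred = velocity_head(x_t, t, h)
    return float(np.mean((v_pred - v_target) ** 2))

h = rng.normal(size=d_hidden)       # stand-in for the transformer hidden state
x1 = rng.normal(size=d_token)       # the "next" continuous image token
print(f"flow-matching loss: {flow_matching_loss(x1, h):.3f}")
```

At inference time, the head would instead integrate the learned velocity field from noise to a sample, yielding a continuous token without vector quantization or an external diffusion model.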
StepFun also released Step-DeepResearch, an open-source deep research agent built on its foundation models that automates multi-step information gathering and synthesis.
StepFun competes on multiple fronts. Within China, its closest competitors include the other members of the Six Little Tigers as well as established technology companies like Baidu (with its Ernie series), Alibaba (with Qwen), ByteDance (with Doubao/Seed), and DeepSeek. Internationally, StepFun benchmarks its models against those from OpenAI, Anthropic, and Google DeepMind.
StepFun's multimodal focus and automotive partnerships give it a distinct positioning. While most Chinese AI startups initially competed on language model benchmarks, StepFun's early investment in video, audio, and image generation, combined with its deployment in Geely vehicles and smartphones, gives the company a differentiated commercial strategy centered on "AI plus terminal devices."
The competitive landscape has been shaped by several factors. U.S. export controls on advanced AI chips have constrained the compute available to Chinese AI companies, pushing them to develop more efficient training and inference techniques. StepFun's MFA and AFD innovations in Step-3 are partly a response to these hardware constraints. Meanwhile, the entry of DeepSeek as a formidable competitor in early 2025, with its open-source DeepSeek-R1 reasoning model, intensified pressure on all Chinese AI startups to differentiate through product deployment rather than benchmark scores alone.